Artificial Intelligence or Audio Illusion?

Lawo mc²96 grand production console
Lawo's mc²96 grand production consoles were used to create the immersive audio mix at the International Broadcast Centre (IBC) in Moscow at the 2018 FIFA World Cup. (Image credit: Lawo)

I will never forget the day I brought an Avid Pro Tools system into our studio and my partner remarked that there was “no way a computer could be faster than the old-fashioned razor blade edit.” There were two computers in the studio: one for accounting, and the other a crude device that controlled the capstan motor on the 24-track tape machine to synchronize it to a video machine and timecode.

This was the early 1980s; there were no computers in the OB vans and every single piece of equipment was analog. Videotape editing was machine-to-machine with an operator, the video going through a switcher and the audio going through the mixing desk. Music was played off NAB carts, a magnetic-tape sound-recording format. I guess the first “computer” I remember in an OB van was the DigiCart instant playback system, which played audio from hard drives.

After several decades of computerization and the implementation of IP throughout broadcast ecosystems, innovation has put us in a place where everything is computerized and we are already seeing computers controlling computers. That in itself is nothing new, but machine learning is, and to me it is a haunting reminder of Kubrick’s “2001: A Space Odyssey.”

SOUND AS AN INDICATOR

Artificial intelligence (AI) has been used in sports for a while. At Wimbledon, for example, the computer listens to and watches the tennis match and identifies exciting moments by applying a variety of metrics. The metrics guide the computer in learning how to recognize significant points of interest and what makes a good highlight or replay.

Interestingly, sound is a leading and reliable indicator. For example, pandemonium in the crowd after a long quiet pause is a good indicator of a memorable highlight moment. My own logic metrics would also include the duration of the crowd burst, as well as the amplitude, threshold, attack and sustain of the sounds during the interesting moment.

Additionally, the voice inflection of the crowd (sustained screaming, as opposed to a sigh of dismay that dies out quickly) is another valuable and identifiable metric. From these simple learning indicators, the computer will be able to accurately predict a good highlight moment within a dozen or even a hundred repetitions.
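To make those metrics concrete, here is a minimal sketch of a burst-after-lull detector, assuming a mono crowd-microphone feed as a NumPy array; the frame size, thresholds and timings are illustrative guesses, not values from any production system.

```python
import numpy as np

def detect_highlights(samples, rate=48000, frame_ms=50,
                      quiet_db=-45.0, burst_db=-20.0,
                      min_quiet_s=3.0, min_burst_s=1.5):
    """Return start times (in seconds) of sustained crowd bursts
    that follow a long lull -- the pattern described above."""
    hop = int(rate * frame_ms / 1000)
    n = len(samples) // hop
    frames = samples[:n * hop].reshape(n, hop).astype(np.float64)
    # Short-term RMS envelope in dB relative to full scale.
    level = 20 * np.log10(np.maximum(np.sqrt((frames ** 2).mean(axis=1)), 1e-9))

    sec = frame_ms / 1000.0
    highlights, quiet, burst, burst_start = [], 0, 0, None
    for i, db in enumerate(level):
        if burst_start is None:
            if db < quiet_db:
                quiet += 1                      # the lull is building
            elif db > burst_db and quiet * sec >= min_quiet_s:
                burst_start, burst = i, 1       # loud onset after a long lull
            else:
                quiet = 0                       # ordinary crowd noise resets the lull
        elif db > burst_db:
            burst += 1                          # the burst is sustaining
        else:
            if burst * sec >= min_burst_s:      # long scream, not a brief sigh
                highlights.append(burst_start * sec)
            quiet, burst, burst_start = 0, 0, None
    return highlights
```

A short sigh of dismay never reaches min_burst_s and is discarded, which captures exactly the inflection distinction above.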

I would argue that in 2018 we were close to something with AI. Lawo had developed a mixing system that takes ball-position data (or that of any other interesting follow target) and translates it into capturing the best possible sound from the optimal microphone or combination of microphones, and determines at what levels to mix and blend them together. Tracking the ball is done optically, and in a sport like football the focus of the game is the ball: basically, you tell the computer to follow the ball.
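Lawo has not published its mixing logic, so the following is only a sketch of the general idea: weight the microphones nearest the tracked ball by inverse distance and blend them. The mic positions, three-mic blend and rolloff below are assumptions for illustration.

```python
import numpy as np

# Hypothetical pitch-side shotgun-mic positions in metres (x, y)
# on a 105 x 68 m pitch.
MICS = np.array([[0.0, 34.0], [26.0, 0.0], [26.0, 68.0], [52.5, 0.0],
                 [52.5, 68.0], [79.0, 0.0], [79.0, 68.0], [105.0, 34.0]])

def mic_gains(ball_xy, n_active=3, rolloff=2.0):
    """Blend the n_active microphones closest to the ball,
    weighted by inverse distance and normalized to unity."""
    d = np.linalg.norm(MICS - np.asarray(ball_xy, dtype=float), axis=1)
    nearest = np.argsort(d)[:n_active]
    w = 1.0 / np.maximum(d[nearest], 1.0) ** rolloff
    gains = np.zeros(len(MICS))
    gains[nearest] = w / w.sum()    # per-mic fader levels, 0..1
    return gains

# e.g. ball near the right penalty spot: mic_gains((94.0, 34.0))
# favors the goal-line mic and the two far-end touchline mics.
```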

Undeniably, World Cup 2018 was the best-sounding football event I have ever heard. Praise goes to HBS’ Christian Gobbel, Felix Kruckles and the Lawo team for implementing a true paradigm shift in the world of sound for broadcasting, but I think Philipp Lawo is on to something else.

THE SALSA ALGORITHM

An alternative and interesting method of advancing automation is “Spatial Automated Live Sports Audio” (SALSA), which uses the existing shotgun microphones around the pitch to detect ball kicks. The system looks not only for overall level intensity, but also at the envelope across a range of frequency bands for each sound-event type that a sound mixer might want to capture. The SALSA algorithm is capable of detecting ball kicks that are virtually inaudible on the microphone feeds and is more reliable at recognizing sound events than our ears.
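SALSA’s learned envelope matching is not public in code form; the toy sketch below substitutes a much simpler per-band spectral-flux onset detector just to show the shape of the approach (energy envelopes in a few frequency bands, flagging a sharp rise), and should not be read as the actual algorithm.

```python
import numpy as np

def kick_onsets(samples, rate=48000, frame_ms=10,
                bands=((80, 300), (300, 1200), (1200, 4000)), k=4.0):
    """Toy onset detector: flag frames whose summed band-energy rise
    stands well above the running median of the feed."""
    hop = int(rate * frame_ms / 1000)
    n = len(samples) // hop
    frames = samples[:n * hop].reshape(n, hop).astype(np.float64)
    spec = np.abs(np.fft.rfft(frames * np.hanning(hop), axis=1))
    freqs = np.fft.rfftfreq(hop, 1.0 / rate)

    # One energy envelope per frequency band, one row per frame.
    env = np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                    for lo, hi in bands], axis=1)
    flux = np.maximum(np.diff(env, axis=0), 0.0).sum(axis=1)  # positive rise only
    floor = np.median(flux) + 1e-9
    onsets = np.flatnonzero(flux > k * floor) + 1
    return onsets * frame_ms / 1000.0   # candidate kick times in seconds
```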

During live production, SALSA uses one of two approaches: It can automate a mixing console’s faders to capture each on-pitch sound event, or it can use the frequency/envelope information of the ball kick to trigger pre-produced samples. These sounds can be added to the on-pitch sounds or can replace the game sounds entirely if you want it to sound like an EA Sports game or a Saturday afternoon match on Sky. It is up to you as the sound designer and consumer.
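A rough sketch of that two-mode dispatch might look like the following; the console and sample-player interfaces (ride_fader, trigger) are invented stand-ins for real broadcast control APIs.

```python
import random
from dataclasses import dataclass

@dataclass
class KickEvent:
    mic: int          # pitch-mic channel that best heard the kick
    kind: str         # detected character of the kick, e.g. "thump"
    level_db: float   # detected intensity

# Hypothetical pre-produced kick samples, keyed by detected character.
SAMPLES = {"thump": ["kick_thump_01.wav", "kick_thump_02.wav"],
           "ping": ["kick_ping_01.wav"]}

def handle_kick(event, mode, console=None, player=None):
    """Dispatch one detected on-pitch sound event in one of two modes."""
    if mode == "fader":
        # Approach 1: briefly ride the nearest pitch mic up to catch
        # the real sound, then let it fall back into the bed.
        console.ride_fader(channel=event.mic, gain_db=6.0,
                           attack_ms=5, hold_ms=120, release_ms=250)
    else:
        # Approach 2: trigger a pre-produced sample matched to the
        # kick's character and level -- the "EA Sports" option.
        name = random.choice(SAMPLES.get(event.kind, SAMPLES["thump"]))
        player.trigger(name, level_db=event.level_db)
```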

Now let’s look at another possibility for AI in sports coverage. Artificial intelligence comes into play when a computer analyzes the switching patterns of a sample of directing styles and compares the director’s commands to the position of the ball within the field of view of the broadcast cameras. The computer archives the director’s selections for future learning.

Within a short period of time, patterns will be detected, examined and programmed into event cycles to take over the direction of the cameras. A basic “follow-the-ball” pattern is learned first; however, it seems possible that you could modify the production by blending and altering production styles. I once worked with a director who had a rhythm and repetition to his cutting style and literally repeated a dozen or so patterns over the course of a three-hour game.
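The simplest stand-in for this kind of pattern learning is a frequency table of which camera a director chose in each game situation. The sketch below assumes a log of (situation, camera) pairs pulled from the archived selections; a real system would learn far richer features.

```python
from collections import Counter, defaultdict

def learn_cut_patterns(log):
    """Tally director choices: situation -> {camera: probability}."""
    counts = defaultdict(Counter)
    for situation, camera in log:
        counts[situation][camera] += 1
    return {situation: {cam: n / sum(tally.values())
                        for cam, n in tally.items()}
            for situation, tally in counts.items()}

# Ten archived goal kicks from the sample of directing styles:
patterns = learn_cut_patterns([("goal_kick", "wide")] * 7 +
                              [("goal_kick", "coach_cam")] * 3)
# -> {"goal_kick": {"wide": 0.7, "coach_cam": 0.3}}
```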

I can clearly envision the day when bots and droid computers capture, direct and produce live sporting events with little human intervention. Let’s follow the flow: camera robotics support systems have been around for a while, and there is no reason the cameras and audio cannot follow the electronic commands of a computer that is following the play action.

Imagine this possible scenario: The computer calculates that, after a goal kick, seven out of 10 directors would cut to a wide shot, while optical position tracking continually sends the “directoid” mapping data of the field of play. The directoid directs cameras X, Y and Z to follow the ball while simultaneously directing cameras A and B to track the coaches.

Additionally, cameras A and B capture the audio from the coaches and send that information to the directoid, which learns the coaches’ patterns and when to cut to them. The directoid has a library of possibilities for each ball position and makes comparisons.
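Continuing the hypothetical, the directoid’s cut decision could be as simple as sampling the learned distribution for the current situation; the sketch below builds on the learn_cut_patterns() table above and is speculation in code form.

```python
import random

def directoid_cut(situation, patterns, rng=random.Random()):
    """Pick the on-air camera by sampling the learned distribution;
    sampling (rather than always taking the top shot) preserves some
    of the human director's variety."""
    dist = patterns.get(situation)
    if not dist:
        return "wide"   # safe default when the library has no match
    cams, probs = zip(*dist.items())
    return rng.choices(cams, weights=probs)[0]

# After a goal kick, roughly seven calls in 10 return the wide shot:
# directoid_cut("goal_kick", patterns)
```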

Real-time action coverage could include speech interpretation played out from a computer that has ingested all the data and artificially created the commentary track. Speech synthesis has existed for a while, and once you have optical tracking it becomes conceivable that you could create droid commentators that interpret the play-by-play action, with sound resynthesis completing the entire experience: an alternative reality.
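As a toy illustration of that last step, tracking-derived events could be turned into text by simple templates before being handed to a speech synthesizer; every field and template below is invented.

```python
# Hypothetical templates keyed by tracking-derived event type.
TEMPLATES = {
    "goal": "{player} scores for {team}!",
    "shot": "{player} shoots from {distance_m:.0f} meters.",
    "save": "A fine save by {keeper}.",
}

def commentary_line(event):
    """Render one play-by-play line; a TTS engine would voice it."""
    return TEMPLATES[event["type"]].format(**event)

# commentary_line({"type": "shot", "player": "the No. 10",
#                  "distance_m": 22.4})
# -> "the No. 10 shoots from 22 meters."
```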

My vision of the future paints a different picture of the science, art and practice of audio as I/we know it, but I believe my speculation could become reality.

Dennis Baxter has spent over 35 years in live broadcasting contributing to hundreds of live events including sound design for nine Olympic Games. He has earned multiple Emmy Awards and is the author of “A Practical Guide to Television Sound Engineering,” published in both English and Chinese. He is currently working on a book about immersive sound practices and production. He can be reached at dbaxter@dennisbaxtersound.com or at www.dennisbaxtersound.com.
