Audio ingest considerations for multichannel delivery

Video offers many options when ingested into a facility’s media asset management (MAM) system. Although its data rates and storage requirements are significantly lower than those of video, audio presents its own production and distribution workflow challenges. This is especially true when audio content is repurposed for multiple platforms.

Audio source formats can be AES/EBU, Dolby E or MPEG. Quality will range from pristine, studio multichannel to user-generated, barely intelligible cell phone conversations. It may be embedded in an SDI signal or maintained as discrete streams while being distributed about the production infrastructure. For finished DTV programs, audio will be compressed to Dolby AC-3, usually in either 5.1 surround sound or 2.0 stereo speaker formats. As the audio leaves the plant, it must be appropriately compressed and/or transcoded for a particular delivery channel.

Similar to video conversion, a “least-conversion” workflow philosophy, where zero conversion is ideal, is also applicable to audio. Distortion created by multiple audio format conversions may not be as striking as when viewing transcoded video, but will diminish the audio quality and overall program experience nevertheless.

Determining appropriate storage workflows and formats for audio is also an issue. How should audio be stored with associated video? As with video, potential future repurposing scenarios must be considered. Will an audio clip be used as a sound bite on a news show, be aired on a radio show or made available on the Web?

Audio content must be easily accessible, independent of the associated video. Access issues should be resolved with the design and implementation of an asset management system. The more sources and possible delivery platforms, the more complicated the asset management design challenge becomes.

Audio data rates

Audio does not occupy as much space as video, but the amounts must be carefully calculated. There are many possible formats that can be used depending on production workflows.

For PCM (pulse code modulation) AES3 audio, the sample rate is 48kHz, while audio word length can vary. With 16-bit audio words, a single channel produces 768kb/s, which equates to a write rate of 96KB/s and 345.6MB of storage per hour. With 24-bit words, a channel requires 1.152Mb/s (144KB/s) and 518.4MB/h; stereo doubles both the data rate and the storage requirement. For 5.1 surround applications, six discrete channels of 24-bit audio require 3.110GB/h. At three hours a day for a full year, about 3.406TB of storage is necessary.
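The PCM arithmetic can be checked with a short calculation. The following is a minimal sketch, assuming decimal units (1KB = 1000 bytes, 1GB = 10^9 bytes) and a year of 365 days at three hours a day; the function and constant names are illustrative only.

```python
SAMPLE_RATE_HZ = 48_000          # AES3 sample rate used in the text
HOURS_PER_YEAR = 3 * 365         # assumption: three hours a day, 365 days


def pcm_rates(bits_per_sample, channels=1, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (bits per second, bytes per second, bytes per hour) for PCM audio."""
    bits_per_second = bits_per_sample * sample_rate_hz * channels
    bytes_per_second = bits_per_second / 8
    bytes_per_hour = bytes_per_second * 3600
    return bits_per_second, bytes_per_second, bytes_per_hour


# One 16-bit channel, one 24-bit channel, and six 24-bit channels (5.1).
for bits, channels in [(16, 1), (24, 1), (24, 6)]:
    bps, byps, byph = pcm_rates(bits, channels)
    yearly_tb = byph * HOURS_PER_YEAR / 1e12
    print(
        f"{bits}-bit x {channels} ch: {bps / 1e3:.0f} kb/s, "
        f"{byps / 1e3:.0f} KB/s, {byph / 1e6:.1f} MB/h, "
        f"{yearly_tb:.3f} TB/year"
    )
```

Using binary units (1KB = 1024 bytes) instead would lower each storage figure by a few percent, so the convention should be fixed before sizing an archive.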

When working with compressed audio, repeated decoding and encoding cycles can degrade audio until it is no longer broadcast quality. One solution is to use a light audio compression format, such as Dolby E, that is designed to support audio editing. With 20-bit audio words, up to eight channels are encoded as a single, two-channel AES stream at 48kHz, resulting in a data rate of 1.92Mb/s, or 240KB/s and 864MB/h. At three hours a day for a year, 946.08GB of storage is needed, less than one-third the storage for six channels of uncompressed 24-bit audio.

Finished audio content used in DTV transmission will be in Dolby Digital AC-3 5.1 or 2.0 speaker presentation format. Dolby Digital at its maximum data rate of 640kb/s equates to 80KB/s and 288MB per hour. Archiving three hours of primetime a day at this rate requires 315.36GB a year.

An emerging ATSC audio extension, E-AC-3, also known as Dolby Digital Plus, consumes 768KB/s and 2.765GB/h at its maximum bit rate of 6.144Mb/s. At three hours per day, 365 days a year, it requires 3.027TB. This is nearly 10 times as much storage as the equivalent in AC-3.
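The compressed formats can be compared the same way, starting from bit rate alone. This is a rough sketch, assuming decimal units and the same three-hours-a-day archive; the Dolby E rate is taken as 20 bits × 48kHz × 2 channels = 1920kb/s, and the dictionary and function names are illustrative.

```python
HOURS_PER_YEAR = 3 * 365  # assumption: three hours a day, 365 days

# Bit rates in kb/s for the formats discussed in this section.
FORMATS_KBPS = {
    "Dolby E (20-bit, two-channel AES)": 1_920,
    "AC-3 (maximum)": 640,
    "E-AC-3 (maximum)": 6_144,
}


def storage_gb(bit_rate_kbps, hours):
    """Storage in decimal gigabytes for a stream at bit_rate_kbps over hours."""
    total_bytes = bit_rate_kbps * 1_000 / 8 * 3_600 * hours
    return total_bytes / 1e9


for name, kbps in FORMATS_KBPS.items():
    per_hour = storage_gb(kbps, 1)
    per_year = storage_gb(kbps, HOURS_PER_YEAR)
    print(f"{name}: {per_hour:.3f} GB/h, {per_year:.2f} GB/year")

# Ratio of E-AC-3 to AC-3 storage at their maximum rates.
ratio = storage_gb(6_144, 1) / storage_gb(640, 1)
print(f"E-AC-3 / AC-3 storage ratio: {ratio:.1f}")
```

These are worst-case figures; streams encoded below the maximum bit rate will need proportionally less storage.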

There is also the issue of user-generated content (UGC). MP3 and AAC sources from UGC further complicate the storage and transcoding workflow. Content quality can vary widely, and with DTV in particular, audio quality matters.

A possible workflow solution is to fix audio as it is ingested. This will ensure acceptable and consistent quality whenever the audio clip is used. It will also alleviate the need to repeatedly fix a clip, saving time and improving workflow efficiency.

Linking to video

With the storage of audio and video as separate files, an appropriate method of linking the two must be implemented. The audio/video association can be accomplished by logically wrapping the two files together. The wrapper consists of metadata that contains the locations of associated audio and video and includes other relevant technical and content descriptive information.

Two phases of the content life cycle have differing content management and tracking requirements. During production, a trail of edits and links to source content must be maintained. The Advanced Authoring Format (AAF) is finding increased use in asset management during production. For a finished program, a flat file that can stand on its own without external content references is sufficient. The Material Exchange Format (MXF) has been standardized by the SMPTE and is used to wrap finished programs. Each of these wrapper structures will have to be supported by the MAM system.

Because of the different technical requirements of audio — less demanding bit rates, lower storage requirements and simpler transcoding — it will generally be sufficient to keep audio content in one format in a single file in the MAM system. Because video exists in more than one format, this single copy of audio must be linked to each video file so multiple workflows will all access the same audio file. Care must be taken to design a media network that can support maximum audio data access and transfer demands.
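The single-audio, multiple-video relationship can be pictured as a small metadata record. The sketch below is purely illustrative: it is not the AAF or MXF data model, and every class name, field name and file path in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MediaFile:
    """Location and format of one essence file (hypothetical structure)."""
    path: str
    essence_format: str


@dataclass
class AssetWrapper:
    """Metadata that logically wraps one audio file with its video versions."""
    asset_id: str
    audio: MediaFile
    videos: List[MediaFile] = field(default_factory=list)
    descriptive: Dict[str, str] = field(default_factory=dict)

    def add_video(self, video: MediaFile) -> None:
        """Link another video version to the same audio file."""
        self.videos.append(video)

    def pairs(self) -> List[Tuple[str, str]]:
        """Return (video path, audio path) for every linked video version."""
        return [(video.path, self.audio.path) for video in self.videos]


# Example with made-up paths: two video formats share a single audio file.
wrapper = AssetWrapper(
    asset_id="news-0001",
    audio=MediaFile("/mam/audio/news-0001.wav", "PCM 24-bit 48kHz"),
    descriptive={"title": "Evening news open"},
)
wrapper.add_video(MediaFile("/mam/video/news-0001-hd.mxf", "HD"))
wrapper.add_video(MediaFile("/mam/video/news-0001-web.mp4", "Web"))

for video_path, audio_path in wrapper.pairs():
    print(video_path, "->", audio_path)
```

Because every video version points at the same audio path, a fix applied to the audio file is picked up by all workflows that reference it.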

More than meets the ear

Program audio is now being used for automated metadata creation and program classification. Real-time speech-to-text conversion can extract descriptive information, much as a Web crawler extracts it from a Web site. Keywords can be identified and used to classify content. Electronic Program Guides (EPGs) and recommender systems can then use this information.
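As a simple illustration of the keyword step, the sketch below counts the most frequent non-trivial words in a transcript. It assumes the speech-to-text conversion has already produced plain text; the stopword list, function name and sample transcript are invented for the example, and a production classifier would be considerably more sophisticated.

```python
import re
from collections import Counter

# Small, illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "on", "for",
    "with", "that", "this", "it", "as", "at", "by", "be", "are", "was",
}


def extract_keywords(transcript, top_n=5, min_length=4):
    """Return the top_n most frequent words that are not stopwords."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(
        word for word in words
        if word not in STOPWORDS and len(word) >= min_length
    )
    return [word for word, _ in counts.most_common(top_n)]


# Made-up transcript text standing in for speech-to-text output.
sample = (
    "The council approved the stadium budget. Stadium construction "
    "begins in spring, and the council expects the budget to hold."
)
keywords = extract_keywords(sample, top_n=3)
print(keywords)
```

The resulting keyword list could then be stored as descriptive metadata alongside the clip for use by an EPG or recommender system.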

The obvious next step in the evolutionary development of such an application is a click-and-buy t-commerce feature. Web repurposing of content will beat DTV to the punch in offering program-related commercial transaction capabilities. ATSC standards (such as ACAP) are in place that can support t-commerce; broadcasters just need to implement the technology in a way that yields a viable business model.

One for all

The conclusion to draw from this analysis of audio ingest is that, when storing multiple copies of video content, keeping a single copy of baseband or Dolby E audio for production and a finished program for archiving is a prudent and sufficient approach to short- and long-term multiplatform production and dissemination.

DTV can deliver data as well as audio and video. Closed captions and other forms of ancillary data require special attention during the ingest process and will be the subject of the next Transition to Digital.
