MPEG-2 and H.264/AVC

The introduction of recording equipment employing H.264, with Fidelity Range Extensions (FRExt), has created a viable alternative to equipment using long-GOP MPEG-2. The MPEG-4, Part 10, H.264/AVC codec is currently available in two recording implementations: AVC-Intra (Panasonic) and AVCHD (Canon, Panasonic and Sony).

AVC-Intra, intended for professional use, employs intraframe encoding, while AVCHD, currently aimed at consumers, employs interframe encoding. Although they are not versions of the same codec, both are marketed with similar claims of being twice as efficient as MPEG-2. To check the validity of this claim, we will examine both MPEG-2 and H.264 encoding.

MPEG-2: frame encoding

After appropriate filtering to reduce noise, a picture is partitioned into 16 × 16 pixel macroblocks. A discrete cosine transform (DCT) is applied to each of the four 8 × 8 luminance blocks within a macroblock. The DCT organizes information so the level of detail to which the human eye is most sensitive is not discarded. Conversely, very fine detail will be discarded first.
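As a rough illustration of the transform step, the sketch below computes a textbook 8 × 8 DCT in plain Python. The orthonormal scaling and the name dct_2d are assumptions made for the example; real encoders use fast, fixed-point approximations rather than this direct formula.

```python
import math

N = 8  # MPEG-2 transforms 8 x 8 blocks


def dct_2d(block):
    """Return the 2-D DCT-II of an N x N block given as a list of rows."""
    def scale(k):
        # Orthonormal scaling (an assumption of this sketch).
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency
        for v in range(N):      # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


# A featureless block: all of its energy lands in the single DC coefficient.
flat = [[128] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 3))
```

The coefficient at the top left carries the block's average level, while coefficients farther from it represent progressively finer detail, which is what lets the encoder discard fine detail first.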

Compression occurs through quantization of coefficients from the DCT. Quantizing reduces the number of bits representing each coefficient. After quantization, further data reduction is applied using variable length coding (VLC) and run length coding (RLC). The result is an MPEG-2 I-frame.
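The quantization and run length steps can be sketched as follows. This is a simplified illustration: it assumes a single uniform step size (real MPEG-2 encoders weight each frequency with a quantizer matrix), and the function names and coefficient values are invented for the example.

```python
def quantize(block, step):
    """Divide every coefficient by one step size and round to an integer."""
    return [[round(c / step) for c in row] for row in block]


def zigzag_order(n):
    """(row, col) positions of an n x n block in zigzag scan order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))


def run_length(levels):
    """Encode a scan as (zeros skipped, nonzero level) pairs."""
    pairs, run = [], 0
    for level in levels:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs  # trailing zeros would be covered by an end-of-block code


# Hypothetical coefficients: a large DC term and a few small AC terms.
coeffs = [[0] * 8 for _ in range(8)]
coeffs[0][0], coeffs[0][1], coeffs[1][0] = 1024, -66, 35
coeffs[1][1], coeffs[0][2] = 50, 7

levels = quantize(coeffs, 16)
scan = [levels[r][c] for r, c in zigzag_order(8)]
pairs = run_length(scan)
print(pairs)
```

The smallest coefficient quantizes to zero and disappears, and the long run of zeros at the end of the scan costs almost nothing to represent. The variable length coder would then assign short codes to the most common pairs.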

These steps are essentially the same for MPEG-2 and DVC (DV, DVCAM and DVCPRO) encoders.

H.264: slice encoding

H.264 intra-encoding employs techniques that reduce spatial redundancies. Redundancies exist because some image areas are naturally correlated with other image areas.

The process begins by partitioning an incoming picture into 16 × 16 pixel macroblocks. Each macroblock is further partitioned into sixteen 4 × 4 pixel submacroblocks. The encoder uses the former block size for gross detail and the latter for fine detail.

Previous (reference) blocks are used as a source of reference pixels. These blocks will have already been encoded and decoded. (H.264 supports an adaptive deblocking filter that attenuates compression blocking artifacts and operates during encoding and decoding.)

Reference pixels are located at the left and upper boundaries between previous blocks and the current block. Predictions are made for 4 × 4 or 8 × 8 blocks using nine prediction modes. (See Figure 1.) The 16 × 16 macroblock predictions are made using four prediction modes. (See Figure 2.)

The mode that best predicts the content of the current block is selected as the current mode. This mode is used to generate a predicted block from the reference pixels. (See Figure 3.)
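Mode selection can be sketched with three of the nine 4 × 4 modes: vertical, horizontal and DC. The sketch assumes the encoder picks the mode with the lowest sum of absolute differences (SAD); production encoders typically weigh bit cost as well. All names here are illustrative.

```python
def predict_4x4(top, left):
    """Build candidate predictions from the reference pixels above and to the left."""
    dc = (sum(top) + sum(left) + 4) >> 3          # rounded mean of the 8 neighbors
    return {
        "vertical": [list(top) for _ in range(4)],        # copy the row above downward
        "horizontal": [[left[r]] * 4 for r in range(4)],  # copy the left column across
        "dc": [[dc] * 4 for _ in range(4)],
    }


def sad(a, b):
    """Sum of absolute differences between two blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def best_mode(block, top, left):
    """Return the name and predicted block of the best-matching mode."""
    candidates = predict_4x4(top, left)
    mode = min(candidates, key=lambda m: sad(block, candidates[m]))
    return mode, candidates[mode]


# A block of vertical stripes that continues the pixels above it.
block = [[10, 20, 30, 40] for _ in range(4)]
mode, predicted = best_mode(block, top=[10, 20, 30, 40], left=[10, 10, 10, 10])
print(mode, sad(block, predicted))
```

Because the stripes continue straight down from the reference row, vertical prediction reproduces the block exactly and the residual is empty.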

A residual (error) block, computed as the difference between the predicted block and the current block, is then integer transformed. H.264 employs a 4 × 4 or 8 × 8 integer transform rather than a DCT. (Integer transforms prevent mismatches between encoders and decoders.) Next, the results from the transform are quantized and entropy coded.
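A minimal sketch of the residual and 4 × 4 forward transform follows. The matrix is the H.264 4 × 4 core transform; the helper names and sample values are assumptions for illustration, and the scaling that the standard folds into quantization is left out.

```python
# H.264 4 x 4 forward core transform matrix (integer entries only).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def residual(current, predicted):
    """Per-pixel difference between the actual and the predicted block."""
    return [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, predicted)]


def forward_transform(block):
    """Compute CF * block * CF^T using integer arithmetic throughout."""
    return matmul(matmul(CF, block), transpose(CF))


current = [[13] * 4 for _ in range(4)]
predicted = [[10] * 4 for _ in range(4)]
coeffs = forward_transform(residual(current, predicted))
print(coeffs[0])
```

Every intermediate value is an integer, so an encoder and a decoder built on different hardware produce bit-identical results, which is the mismatch protection noted above.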

While MPEG-2 uses variable length coding (VLC) to further reduce data, H.264 employs context-adaptive binary arithmetic coding (CABAC) or context-adaptive variable-length coding (CAVLC) entropy coding. CAVLC and CABAC are supported by the Main, High-10 and High-422 Profiles.

When grouped together, intra-encoded blocks yield an I-slice. A picture — a field or frame — is encoded as one or more slices, up to a maximum of eight slices.

AVC-Intra codecs

Panasonic offers two AVC-Intra codecs as alternatives to its DVCPRO HD codec on selected P2-based devices: the 50Mb/s High-10 Profile (Hi10P) codec and the 100Mb/s High-422 Profile (H422P) codec.

Several caveats apply to Panasonic's AVC-Intra. First, as shown in Table 1, these codecs encode dissimilar numbers of luma and chroma samples — with different sample widths. Nevertheless, after equalizing these data, the H.264 block prediction tools provide nearly twice (≈1.87X) the efficiency of the DVC-based DVCPRO HD codec — or I-frame-only MPEG-2.
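The equalization can be checked with a few lines of arithmetic using the Table 1 values. The sketch assumes that DVCPRO HD's 2.6:1.3:1.3 sampling means each chroma channel carries half as many samples as luma, and it compares uncompressed bits per frame delivered per megabit of data rate.

```python
def bits_per_frame(width, height, chroma_divisor, bit_depth):
    """Uncompressed bits in one frame.

    chroma_divisor is luma samples per sample of EACH chroma channel:
    2 for 4:2:2 (and, by assumption, 2.6:1.3:1.3), 4 for 4:2:0.
    """
    luma = width * height
    chroma = 2 * luma // chroma_divisor
    return (luma + chroma) * bit_depth


dvcpro_hd = bits_per_frame(1280, 1080, 2, 8) / 100    # per Mb/s of data rate
avc_i_100 = bits_per_frame(1920, 1080, 2, 10) / 100
avc_i_50 = bits_per_frame(1440, 1080, 4, 10) / 50

print(avc_i_100 / dvcpro_hd)
print(avc_i_50 / dvcpro_hd)
```

Under these assumptions the 100Mb/s AVC-Intra codec handles 1.875 times as many source bits per coded bit as DVCPRO HD, in line with the figure quoted above; the 50Mb/s codec comes out slightly higher.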

Second, the claim of AVC-Intra's 2X greater efficiency does not apply to long-GOP MPEG-2. Long-GOP MPEG-2 is about 160 percent more efficient than intra H.264, which is why 1920 × 1080, 4:2:2, 8-bit video can be encoded using XDCAM HD 422 at only 50Mb/s.

MPEG-2: P- and B-frame encoding

A newly encoded I-frame is decoded to regenerate an initial picture, which is divided into 16 × 16 pixel macroblocks. Starting with the upper, leftmost macroblock, a search is made to determine its location in the next picture. A correlation technique measures how closely the macroblock matches each searched macroblock.

In a methodical pattern, the macroblock is moved in all directions at increasing distances from its origin. The displacement — direction and distance moved when a match is made — becomes the macroblock's motion vector. This process is repeated for each macroblock until all motion vectors have been computed and saved in a motion compensation block. Next, a predicted frame is constructed using these vectors applied to the initial picture.
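The search can be sketched as an exhaustive block match that tries candidate displacements in order of increasing distance. SAD as the correlation measure, the small block size and the test pictures are assumptions chosen to keep the example short; MPEG-2 macroblocks are 16 × 16 and real encoders use faster search patterns.

```python
def sad_at(src, dst, bx, by, dx, dy, size):
    """SAD between the block at (bx, by) in src and its displaced position in dst."""
    return sum(abs(src[by + y][bx + x] - dst[by + dy + y][bx + dx + x])
               for y in range(size) for x in range(size))


def motion_search(src, dst, bx, by, size=16, search=7):
    """Return (dx, dy, cost) of the best match within +/- search pixels."""
    height, width = len(dst), len(dst[0])
    offsets = sorted(((dx, dy)
                      for dy in range(-search, search + 1)
                      for dx in range(-search, search + 1)),
                     key=lambda d: d[0] ** 2 + d[1] ** 2)  # nearest candidates first
    best = None
    for dx, dy in offsets:
        x0, y0 = bx + dx, by + dy
        if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
            continue  # candidate falls outside the picture
        cost = sad_at(src, dst, bx, by, dx, dy, size)
        if best is None or cost < best[2]:
            best = (dx, dy, cost)
    return best


# Initial picture, and a next picture whose content moved right 2 and down 1.
first = [[10 * y + x for x in range(8)] for y in range(8)]
nxt = [[first[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(8)]
       for y in range(8)]

vector = motion_search(first, nxt, bx=2, by=2, size=4, search=2)
print(vector)
```

The returned displacement is the block's motion vector, and a cost of zero means the match was exact.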

A difference (error) frame is generated from the discrepancy between a predicted frame and an actual picture. Were the motion vectors to create a perfect predicted frame, the difference frame would be empty.

Difference frames, each with its own motion compensation block, are DCT transformed, and the coefficients are quantized. VLC and run length coding (RLC) then generate a bit stream. This process is repeated for the remaining frames of a GOP.

When a predicted frame is constructed, as described earlier, only in relation to an initial frame, it is a P-frame. When a predicted frame is computed in relation to a previous frame and a frame after the predicted frame, a B-frame is generated.

Decoding MPEG-2 is much simpler and faster than encoding. After decompression, a frame's motion vectors are applied to the current video frame to create a predicted frame. A difference frame, after being decompressed, is added to the predicted frame, thereby correcting prediction errors. The result is a new video frame.
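The decoder's two steps can be sketched as below. The sketch assumes each vector gives the offset at which a block's prediction is fetched from the reference frame and that vectors never point outside the picture; the tiny frame, block size and names are illustrative only.

```python
def motion_compensate(ref, vectors, size):
    """Build a predicted frame by copying each block from its displaced position in ref."""
    height, width = len(ref), len(ref[0])
    pred = [[0] * width for _ in range(height)]
    for (bx, by), (dx, dy) in vectors.items():
        for y in range(size):
            for x in range(size):
                pred[by + y][bx + x] = ref[by + dy + y][bx + dx + x]
    return pred


def reconstruct(pred, diff):
    """Add the decoded difference frame to the prediction, clipping to 8 bits."""
    return [[max(0, min(255, p + d)) for p, d in zip(pr, dr)]
            for pr, dr in zip(pred, diff)]


ref = [[4 * r + c for c in range(4)] for r in range(4)]
vectors = {(0, 0): (1, 0), (2, 0): (0, 0), (0, 2): (0, 0), (2, 2): (0, 0)}
diff = [[0] * 4 for _ in range(4)]
diff[0][0] = 3  # one pixel the prediction got wrong

pred = motion_compensate(ref, vectors, size=2)
frame = reconstruct(pred, diff)
print(pred[0], frame[0])
```

No searching is involved, only copying and adding, which is why decoding is so much cheaper than encoding.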

H.264: P- and B-slice encoding

H.264 generates P- and B-slices using techniques that reduce temporal redundancies. As with MPEG-2, temporal compression employs motion estimation. However, unlike MPEG-2, H.264 motion estimates may be calculated with a finer, quarter-pixel accuracy.
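To show what quarter-pixel accuracy involves, the sketch below interpolates luma samples along one row. It uses the six-tap filter (1, -5, 20, 20, -5, 1) that H.264 specifies for half-sample positions and a rounded average for quarter-sample positions. Only the horizontal case is shown, and the function names are assumptions.

```python
def clip8(value):
    return max(0, min(255, value))


def half_sample(row, i):
    """Value halfway between row[i] and row[i + 1] (needs 2 pixels left, 3 right)."""
    e, f, g, h, k, j = row[i - 2:i + 4]
    return clip8((e - 5 * f + 20 * g + 20 * h - 5 * k + j + 16) >> 5)


def quarter_sample(row, i):
    """Value a quarter of the way from row[i] toward row[i + 1]."""
    return (row[i] + half_sample(row, i) + 1) >> 1


ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(half_sample(ramp, 3), quarter_sample(ramp, 3))
```

A motion vector can then point at these interpolated positions instead of whole pixels, which improves prediction when motion is not an exact number of pixels per picture.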

H.264 encoding can use 16 × 16, 16 × 8, 8 × 16 or 8 × 8 pixel macroblocks or 8 × 4, 4 × 8 or 4 × 4 pixel submacroblocks. (See Figure 4.) The encoder first determines the best block size to which it will apply motion estimation. For example, 16 × 16 mode is ideal for large, uniform areas such as clear sky.

Within a restricted spatial range, the other blocks are searched for the contents of the current macroblock or submacroblock. H.264 supports the use of blocks within up to five pictures either before or after the current picture. (See Figure 5.)

Searched blocks, whose contents significantly match the contents of the current macroblock, become reference blocks. A motion vector is computed from the displacement between the current macroblock and each reference block.

Predicted blocks for the current picture are computed from reference blocks using these motion vectors. Residual (error) blocks are then calculated from the differences between predicted blocks and actual blocks. The residual blocks are integer transformed and quantized, then entropy coded along with the motion vectors.

Residual blocks constructed as described earlier — using a single motion vector per reference block — are grouped into a P-slice. Residual blocks computed from two vectors are grouped into a B-slice. With H.264, “bi” means two vectors — not two directions.
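The benefit of a second vector can be sketched with the default, unweighted case, in which the two motion-compensated predictions are averaged with rounding. Weighted prediction is omitted, and the block values are invented to mimic a fade.

```python
def bi_predict(pred0, pred1):
    """Average two predictions sample by sample, rounding up at .5."""
    return [[(a + b + 1) >> 1 for a, b in zip(r0, r1)]
            for r0, r1 in zip(pred0, pred1)]


def residual_energy(current, predicted):
    """Sum of absolute residual values; smaller means fewer bits to code."""
    return sum(abs(c - p) for cr, pr in zip(current, predicted)
               for c, p in zip(cr, pr))


# A fade: the current block sits midway between its two reference blocks.
pred0 = [[100] * 4 for _ in range(4)]
pred1 = [[120] * 4 for _ in range(4)]
current = [[110] * 4 for _ in range(4)]

print(residual_energy(current, pred0))
print(residual_energy(current, bi_predict(pred0, pred1)))
```

Either reference alone leaves a sizeable residual, while the average of the two leaves none in this example.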

Each picture — a video field or frame — is encoded as one or more slices. (See Figure 6.) The encoder selects from seven possible slice arrangements. Within a picture, I-, P- and B-slices can coexist.

At regular frame counts and/or at scene changes, an H.264 encoder can output an I-frame — a picture with all intra-encoded slices — thereby breaking a data stream into GOPs.

The role of interframe encoding

Although intraframe H.264 provides greater efficiency than does intraframe MPEG-2, both codecs gain their greatest increase in efficiency by using interframe encoding. The sophisticated interframe tools in the H.264 encoding toolkit account for the claims that H.264 provides twice the efficiency of long-GOP MPEG-2.

Current AVCHD camcorders record full HD (1920 × 1080) with 4:2:0 sampling. Assuming a 2X advantage, they should offer HDV quality at 16Mb/s. And indeed, tests report AVCHD visual quality matches 25Mb/s, 4:2:0, HDV once data rates reach 16Mb/s. While AVCHD offers HDV quality with higher storage efficiency than does HDV — a virtue when recording to SD cards — native AVCHD editing, unfortunately for the consumer, requires almost 8X more compute power.
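A rough check of the 16Mb/s figure is sketched below. It assumes HDV's 1440 × 1080 raster (not stated above), scales the HDV data rate by pixel count to full HD and then applies the 2X advantage. The recording-time helper assumes decimal gigabytes and ignores audio and file-system overhead.

```python
HDV_RATE_MBPS = 25.0
HDV_WIDTH = 1440      # assumption: HDV 1080-line raster width
FULL_HD_WIDTH = 1920
H264_ADVANTAGE = 2.0  # the claimed efficiency gain over long-GOP MPEG-2

# MPEG-2 rate needed for 1920-wide pictures at HDV quality, then halved.
mpeg2_full_hd = HDV_RATE_MBPS * FULL_HD_WIDTH / HDV_WIDTH
avchd_rate = mpeg2_full_hd / H264_ADVANTAGE


def minutes_per_gigabyte(rate_mbps):
    """Recording time per 1 GB (8,000 megabits) at a given video data rate."""
    return 8000.0 / rate_mbps / 60.0


print(round(avchd_rate, 1))
print(round(minutes_per_gigabyte(16), 1), round(minutes_per_gigabyte(25), 1))
```

On these assumptions the estimate lands just under 17Mb/s, close to the 16Mb/s figure, and each gigabyte of card space holds roughly half as much again at 16Mb/s as at 25Mb/s.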

The current focus on AVCHD's storage efficiency will change when Panasonic's first AVCCAM camcorder ships this fall. The AG-HMC150 will record 1080p at 21Mb/s — a data rate approximately equal to 32Mb/s long-GOP MPEG-2.

Today and tomorrow

The more powerful the H.264 encoding tool, the more computationally intensive it is. When designing low-cost, low-power, real-time encoders, it is often necessary to omit the most powerful tools. As large-scale integration technology advances, H.264 encoder chips should be able to incorporate more powerful tools from the H.264 toolkit.

Steve Mullen is owner of Digital Video Consulting, which provides consulting and conducts seminars on digital video technology.

1080i               Horiz.   Vert.   Sampling      Width     Data rate
DVCPRO HD           1280     1080    2.6:1.3:1.3   8 bits    100Mb/s
50Mb/s AVC-Intra    1440     1080    4:2:0         10 bits   50Mb/s
100Mb/s AVC-Intra   1920     1080    4:2:2         10 bits   100Mb/s

Table 1. Panasonic professional compressed formats
