Digital Video Primer

A recent conversation with my father led me to realize that the world of digital video is awfully confusing for the layman. I aim to reduce that confusion with this primer. A quick reference is also available.

Files

Let's start with the unit of digital video most familiar to most people: the file. You have a digital video file. You open it in a program like VLC. It (hopefully) plays some video (and probably some audio). What's in there, really? You may be tempted to rely on the extension on the file's name (e.g. .avi, .mov, .mp4). Don't. Besides the fact that there is nothing stopping you from changing the extension to anything you want, it can confuse the issue. How? Well, that brings us to our first class of data format.

Container Formats

A container format is a method of combining a variety of data streams (e.g. video and audio) into a single file. Examples include Matroska, MP4, and AVI. Since these formats only specify how video and audio are combined, not how they are encoded (more on that later), knowing that a file is an "AVI file" leaves out some crucial information. Generally speaking, different container formats do not have a particularly large impact on file size, although the overhead of AVI can become significant with long videos, like feature-length films. The main factors behind the choice of container format are codec support and player support. Matroska is a nice format, but not all video players support it. AVI is supported by almost everything, but it can't handle some codecs (for example, H.264).
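Since the extension is unreliable, tools identify a container by its leading bytes instead. The sketch below checks a few well-known "magic numbers" (the EBML header used by Matroska/WebM, the RIFF/AVI signature, and the `ftyp` box of the MP4 family); the function name and the minimal set of formats checked are my own choices for illustration.

```python
# Identify a video file's container format from its leading bytes
# ("magic numbers"), ignoring the file extension entirely.
def identify_container(header: bytes) -> str:
    if header[:4] == b"\x1a\x45\xdf\xa3":          # EBML magic number
        return "Matroska/WebM"
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "AVI"
    if header[4:8] == b"ftyp":                      # ISO base media "ftyp" box
        return "MP4 family"
    return "unknown"

# e.g. identify_container(open("movie.bin", "rb").read(12))
```

Renaming `movie.mkv` to `movie.avi` changes nothing here, which is exactly why players ignore the extension and inspect the bytes.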

Codecs

A codec is a method of representing information (commonly video, audio, or a still image) as a digital data stream, typically involving some form of compression to reduce the data size. Examples include MPEG-4, H.264, and VP8. Note the similarity between the names of the MPEG-4 video codec and the MP4 container format (especially their full names, MPEG-4 Part 2 and MPEG-4 Part 14, respectively). This is no accident. Both are described in the MPEG-4 standard, which was produced by the Moving Picture Experts Group. While the MP4 format and the MPEG-4 video codec are specifically designed to work together, they can be used separately. For example, a Matroska container has no problem holding MPEG-4 video, and an MP4 container has no problem holding H.264 video.

Video is composed of a series of still images that are displayed in rapid succession to create the illusion of movement. On a traditional film system, the frames are complete images arranged on a strip of translucent film. In digital video, the principle is much the same, except that the frames are divided into grids of pixels. The obvious approach would be to simply store the frame rate, the frame size, and a list of all the pixels in all of the frames. The problem with this approach is data size. With 1920x1080 frames (the size most often found in HD video like Blu-Ray discs), 3 bytes per pixel (one byte for each color channel), and a frame rate of 24 fps (the standard for professional film) a feature film 120 minutes long would take up approximately 1074 GB, read at a rate of about 150 MB/s. A dual-layer Blu-Ray disc only stores 50 GB and can read at only 4.5 MB/s. This is a problem. How do we solve it?
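The arithmetic behind those numbers is simple enough to check directly:

```python
# Back-of-the-envelope size of uncompressed 1080p video at 24 fps.
width, height = 1920, 1080
bytes_per_pixel = 3            # one byte each for red, green, and blue
fps = 24
minutes = 120

bytes_per_frame = width * height * bytes_per_pixel   # 6,220,800 bytes
bytes_per_second = bytes_per_frame * fps
total_bytes = bytes_per_second * minutes * 60

print(bytes_per_second / 1e6)  # 149.2992 -- about 150 MB/s
print(total_bytes / 1e9)       # 1074.95424 -- about 1074 GB
```

Both figures dwarf what a dual-layer Blu-Ray disc can store (50 GB) or read (4.5 MB/s), so heavy compression is unavoidable.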

Most popular video codecs get around the data size problem by not storing every single frame in its entirety. Each frame represents (in our example) 1/24 of a second of action. Thus, if you compare any two adjacent frames of an average video, you will find that not much changes. Unfortunately, the real world is a messy place, and a naive method of "taking the difference" will not gain you much (if anything). Thus, a certain amount of "wiggle room" is needed. The precise methods of determining how much change is enough to be worthy of attention (and especially the methods of detecting when part of the image has moved) are quite complex and involve fairly advanced math, so I will not describe them here.

There are a few high-level concepts that are important, though: I-frames, P-frames, and B-frames. An I-frame is also known as a keyframe because it stores a frame in its entirety. These frames are typically added to the video stream at regular intervals, since small errors can accumulate into large errors over time, and I-frames let the calculations "reset" in order to remove these errors. P-frames rely on information from previous frames for decoding, and B-frames rely on both previous and future frames for decoding. Since P- and B-frames can use information from other frames, they don't need to store as much information. Since B-frames can draw on more surrounding frames than P-frames can, they can store less data, at the cost of being more difficult to decode. Also, not all container formats can handle B-frames; this is why AVI containers can't handle H.264 video.
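A toy sketch can make the idea concrete. Treating each frame as a flat list of pixel values (real codecs work on 2D blocks and do motion search, none of which is modeled here), we store the first frame whole and then, for each later frame, only the pixels that changed by more than a threshold:

```python
# Toy inter-frame compression: store the first frame whole (an "I-frame"),
# then only the significantly changed pixels of each later frame
# (a crude stand-in for a "P-frame").
def encode(frames, threshold=4):
    keyframe = frames[0]
    deltas = []
    for prev, cur in zip(frames, frames[1:]):
        # Record (index, new_value) only where the change exceeds the
        # threshold -- this is the "wiggle room".
        deltas.append([(i, c) for i, (p, c) in enumerate(zip(prev, cur))
                       if abs(c - p) > threshold])
    return keyframe, deltas

def decode(keyframe, deltas):
    frames = [list(keyframe)]
    for delta in deltas:
        frame = list(frames[-1])       # start from the previous decoded frame
        for i, value in delta:
            frame[i] = value
        frames.append(frame)
    return frames

frames = [[10, 10, 10, 10], [10, 10, 200, 10], [10, 10, 200, 10]]
key, deltas = encode(frames, threshold=0)
assert decode(key, deltas) == frames   # lossless when threshold is 0
```

With a nonzero threshold the small ignored changes accumulate from frame to frame, which is exactly why real codecs insert fresh I-frames at intervals to reset the error.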

Audio codecs, on the other hand, take advantage of the wave nature of their input. Fancy math like the Fourier transform enables techniques that remove audio information that the human ear would miss anyway. This is what makes a collection of MP3 files so much smaller than the size of the data on the CD they were converted from.
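The key fact those transforms exploit is that typical audio concentrates its energy in a handful of frequency coefficients, so most coefficients can be coarsely quantized or dropped. A pure tone is the extreme case, and a plain discrete Fourier transform (written out by hand here; real coders use related transforms like the MDCT) shows it:

```python
import cmath
import math

# A pure tone: 8 cycles across a 64-sample window.
N = 64
tone = [math.sin(2 * math.pi * 8 * n / N) for n in range(N)]

def dft(x):
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]

spectrum = [abs(c) for c in dft(tone)]
significant = [k for k, mag in enumerate(spectrum) if mag > 1.0]
print(significant)   # [8, 56] -- 2 of 64 coefficients carry the whole tone
```

Discarding the near-zero coefficients (and, in real codecs, the ones a psychoacoustic model says you can't hear) is where the compression comes from.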

Codecs are a classic example of a time/space tradeoff, especially in video. If you encode the same video with H.264 and MPEG-4, the resulting data stream will be smaller for H.264, but encoding (and more importantly, decoding) it will take considerably more effort. Videos destined to be played on a wide range of devices are often encoded with less complex codecs like MPEG-4 rather than H.264, since older computers will have a better chance of being able to play them back without problems. Since storage space is at a premium on small devices like smartphones, these devices often contain specialized hardware to decode complex codecs like H.264, since the increased manufacturing cost is deemed worthwhile in order to fit more video on the same device.

Resolution

I casually mentioned the resolution 1920x1080 earlier. What does that mean, exactly? By convention, resolutions are specified with the horizontal dimension first. So, a 1920x1080 image would be 1920 pixels by 1080 pixels. Note that an image of this size is wider than it is tall. How much wider? This information is usually represented with an aspect ratio. The aspect ratio of 1920x1080 is 16:9. That is, there are 9 vertical pixels for every 16 horizontal pixels (or 16 horizontal pixels for every 9 vertical pixels if you would prefer to think of it that way). This is the standard aspect ratio for HD (high definition) digital video. Since we have this standard aspect ratio, 1920x1080 video is often referred to as 1080p or 1080i. The number represents the vertical dimension, and the horizontal dimension can be calculated from the implied 16:9 aspect ratio. What about the letters, though?
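Reducing a resolution to an aspect ratio is just dividing out the greatest common divisor, and going the other way is one multiplication:

```python
import math

def aspect_ratio(width, height):
    # Reduce the resolution to lowest terms, e.g. 1920x1080 -> 16:9.
    d = math.gcd(width, height)
    return width // d, height // d

print(aspect_ratio(1920, 1080))  # (16, 9)
print(aspect_ratio(640, 480))    # (4, 3) -- the classic SD shape

# Given a height and an implied 16:9 ratio, the width follows:
print(1080 * 16 // 9)            # 1920
```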

The p and i refer to how the image is "scanned out" onto the display screen: progressive or interlaced. Progressive scan is the obvious way: start at the upper left corner and move from left to right and top to bottom, like reading English text. Interlacing makes two passes on each frame, scanning out all the odd rows first, then all the even rows (or the other way around, it doesn't matter as long as it is consistent). This trick originated in analog broadcast TV, and it served much the same purpose as it does in digital video: reducing the required data rate. Since each pass displays most of the image, 30 fps interlaced video looks about as smooth as 60 fps progressive. Fewer frames per second means less data per second. However, there is a cost: artifacts (e.g. combing, see the Wikipedia page for examples). Because computer monitors and most recent HDTVs are "natively" progressive scan, and most video playback systems are fast enough to handle progressive scan, interlacing is slowly dying out. (Editorial: good riddance!) When producing video today, progressive scan should be used. Typical resolutions are 1080p, 720p, and 480p (the latter is sometimes called SD video, since it matches the resolution used by analog TV).
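Modeling a frame as a list of rows makes the field-splitting trick easy to see; the slicing below is a sketch, not how any real codec stores fields:

```python
# Interlacing sends each frame as two "fields": the odd rows in one
# pass, the even rows in the next.
frame = [f"row{r}" for r in range(6)]

field_a = frame[0::2]   # rows 0, 2, 4
field_b = frame[1::2]   # rows 1, 3, 5

# Each pass carries only half the rows...
assert len(field_a) == len(frame) // 2

# ...and weaving the fields back together restores the full frame.
deinterlaced = [row for pair in zip(field_a, field_b) for row in pair]
assert deinterlaced == frame
```

The combing artifacts appear when the two fields were captured at different instants but get woven back into a single frame anyway.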

Subtitles

Subtitles don't generally come up in home video, so I'm going to gloss over this subject a bit. Suffice it to say that there are a number of codecs for subtitles, including SubStation Alpha (SSA) and Advanced SubStation Alpha (which has the unfortunate acronym of ASS). Alternatively, you can "burn in" subtitles by adding them to the video during editing. This method is generally considered the least desirable, since there is no way to turn off the subtitles, change the language, or extract the subtitles for translation. If you find yourself wanting to use subtitles, SSA should suffice for anything you would want to do.
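For flavor, here is a minimal, made-up sketch of what an Advanced SubStation Alpha file looks like; real files usually also carry a styles section defining fonts and colors, which I have left out:

```
[Script Info]
Title: Hypothetical home video
ScriptType: v4.00+

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:01.00,0:00:04.00,Default,,0,0,0,,Hello from the back yard!
Dialogue: 0,0:00:05.50,0:00:08.00,Default,,0,0,0,,Subtitles are just timed text.
```

Being plain text with timestamps, these are trivial to edit, translate, or turn off, which is exactly what burned-in subtitles give up.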

Putting It All Together

So, you recorded a home video and you want to share it. What format and codecs do you use? Well, it depends on how you want to share it, for one thing. If you want to make a DVD, the video will need to be in VOB (essentially MPEG PS) format with MPEG-2 video and MP2, PCM, DTS, or AC-3 audio, since that is what the DVD Video standard specifies. Fortunately, any decent DVD authoring program will take care of that for you. All you need to do is adjust the parameters (and possibly choose the audio codec). If your DVD authoring software gives you a choice for the audio codec, choose PCM. MP2 is not guaranteed to be supported on DVD players sold in North America (although it often is), and DTS and AC-3 are mainly important if you want surround sound (not an issue with home video). While PCM is uncompressed (and therefore large), it is still typically small in comparison to the video (even with the MPEG-2 compression). Only choose a different audio codec if you are really hard pressed to fit the video on the disc and you don't want to split it across two discs.

For video, you can generally choose from a wide range of bit rates. Choosing a bit rate that is too low will result in video that looks horrible. If your DVD authoring program has presets for various quality levels, pick one of those. As a general rule of thumb, you want the highest bit rate that fits on the disc. For MPEG-2, you probably don't want to go below 6 Mbps (the lower-case b means bits rather than bytes; there are 8 bits in a byte). If you are really pressed for space, you should consider using a dual-layer DVD if your DVD burner supports it. Another thing to note regarding discs: there are several kinds of writable DVDs: DVD-R, DVD+R, RW variants of both, and DVD-RAM. For most people, being able to rewrite the DVD is not particularly important, so don't bother with the RW discs. DVD-RAM never really took off and isn't widely supported, so don't use that, either. Personally, I have had the best luck with DVD-R, although some players are rather finicky.
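Whether a given bit rate fits on a disc is a one-line calculation: duration times total bit rate, compared to disc capacity. The function below is my own rough sketch and ignores container overhead, menus, and filesystem bookkeeping:

```python
# Rough check: does this combination of bit rates fit on the disc?
def fits_on_disc(duration_min, video_mbps, audio_mbps, disc_gb):
    total_bits = (video_mbps + audio_mbps) * 1e6 * duration_min * 60
    return total_bits / 8 <= disc_gb * 1e9

# 6 Mbps MPEG-2 video plus ~1.5 Mbps stereo PCM audio:
print(fits_on_disc(60, 6.0, 1.5, 4.7))    # True  -- single-layer DVD is fine
print(fits_on_disc(120, 6.0, 1.5, 4.7))   # False -- too long for one layer
print(fits_on_disc(120, 6.0, 1.5, 8.5))   # True  -- a dual-layer disc works
```

This is also why the "don't go below 6 Mbps" rule collides with long videos on single-layer discs: a two-hour video at that rate simply doesn't fit.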

If you want to share the video over the Internet, MPEG-2 won't do. It doesn't compress as well as many of the more recent codecs, and many computers do not have MPEG-2 playback software installed. If you want to upload the video to YouTube, choose a combination of formats from this list. If your video editing program supports it, I suggest WebM (Matroska container, VP8 video, Vorbis audio). If not, I would use an MP4 container, H.264 video (or MPEG-4 if you find that H.264 takes too long to encode), and AAC audio. If you want to host the video file yourself (if you already have your own web site, for example), I again suggest WebM if your video editing program can produce it, since Firefox and Chrome will play such video in the browser without any plugins or external programs. No muss, no fuss. On the other hand, Internet Explorer and Safari don't support it (they support MP4/H.264/AAC instead, which Firefox doesn't). Also, be aware that the web server hosting it will need a good amount of bandwidth and a reasonably high data transfer budget. Really, YouTube is the easiest option. You can even do basic editing in your browser on YouTube. For any codec in the MPEG-4/H.264/VP8 generation, I have found that 1 Mbps works pretty well for 480p, with higher-resolution video needing more. A bit of trial and error will help you develop good judgement regarding bit rates.
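To put the self-hosting transfer budget in perspective, every full view downloads roughly duration times bit rate worth of data; the function and the example numbers here are illustrative assumptions, not measurements:

```python
# Rough monthly transfer for self-hosting a video: each complete view
# downloads (duration * bit rate) worth of data.
def monthly_transfer_gb(views, duration_min, mbps):
    bits = views * duration_min * 60 * mbps * 1e6
    return bits / 8 / 1e9

# 1,000 full views of a 10-minute clip encoded at 1 Mbps:
print(monthly_transfer_gb(1000, 10, 1.0))   # 75.0 GB
```

Numbers like that add up quickly against a typical hosting plan, which is a big part of why letting YouTube carry the traffic is the easy option.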

If you want to archive video (possibly for future editing), you want to lose as little as possible. After all, if you keep compressing the same video clip with a lossy codec, you will lose a little more quality each time until the video is completely ruined. For 480p video, the DV codec is a good choice. It compresses each frame individually, losing much less data than the MPEG family of codecs. For HD video, there are many options, but not all are well supported. H.264 has a lossless mode, but not all H.264-capable software can handle it. Also, the resulting file will be huge. Thus, for HD video, it is probably best to encode with H.264 or MPEG-4 at a high bitrate.

A Word of Caution

Some consumer-grade video editing software uses the terminology in confusing ways. For example, the encoding options for video capture may be AVI, MPEG-1, and MPEG-2. The "AVI" option actually means "AVI with DV video and LPCM audio", and the MPEG-1 and MPEG-2 options imply certain choices for the container and audio codec. Don't let these programs confuse you!
