The OBE blog

Open Broadcast Systems teamed up with the Society of Motion Picture and Television Engineers (SMPTE) UK Section to host an event looking at IT in broadcast. As it was close to Christmas, the event also featured mulled wine and mince pies, of course!

The main aim of the evening was to show the audience of broadcasters and manufacturers that broadcast data centres can already look exactly like IT data centres, and work extremely well. Many people within this industry, especially the traditional broadcasters and manufacturers, find that hard to believe, but an IT approach comes with a number of major advantages and is already being used for numerous broadcasts.

 

(This is quite a technical blog post, but projects like this are the reason we’re able to deliver broadcast infrastructure faster, at lower cost and better than anyone else. For some background visit: http://www.slideshare.net/kierank12/implementing-uncompressed-over-ip-in-software-and-the-pitfalls)

At Open Broadcast Systems we push £200 Blackmagic video boards in ways the creators didn’t intend. We add functionality that people continue to tell us can only be done with specialist hardware with price tags ten or even a hundred times higher.

We also have an SMPTE 2022-6 (SDI over IP) stack written entirely in software and designed for use with standard network cards. This surprised many visitors from hardware-centric vendors to our booth at IBC.

One of the great things about having rack space in our new office is that we can now use our equipment to support open source projects such as FFmpeg and Libav. They are critical parts of our software and underpin much of the multimedia processing in the world today.

Fuzzing is one of the ways in which we can improve the robustness of the decoders when they are exposed to corrupted input. It involves randomly or systematically corrupting the input of a program in order to make it crash. The Heartbleed vulnerability was one of the most famous bugs found via fuzzing [1].
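To make the idea concrete, here is a minimal sketch of random bit-flip fuzzing in Python. It is not how zzuf or american fuzzy lop are implemented, and the names corrupt, fuzz, ratio and seed are hypothetical, chosen only for illustration. The "decoder" is any Python callable, and a raised exception stands in for a crash.

```python
import random

def corrupt(data, ratio=0.004, seed=0):
    """Flip each bit of data with probability ratio (hypothetical parameter).

    Using a seed makes every corrupted input reproducible, so a crash
    can be replayed later from the seed alone.
    """
    rng = random.Random(seed)
    out = bytearray(data)
    for bit in range(len(out) * 8):
        if rng.random() < ratio:
            out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)

def fuzz(target, sample, runs=100, ratio=0.004):
    """Feed corrupted copies of sample to target and record failures."""
    crashes = []
    for seed in range(runs):
        mutated = corrupt(sample, ratio, seed)
        try:
            target(mutated)
        except Exception as exc:  # stand-in for a real decoder crash
            crashes.append((seed, repr(exc)))
    return crashes
```

A real setup would run the decoder as a separate process and watch for signals such as SIGSEGV rather than Python exceptions, but the loop has the same shape: mutate, run, record the seed that triggered the failure.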

Google notably fuzzed FFmpeg and Libav at a relatively large scale, leading to a thousand fixes. But after seeing crashes in the H264 decoder earlier in the year, triggered by real-world events such as packet loss and video splices, it was clear that something was wrong. One possibility is that Google only fuzzed progressive H264 content using frame threads, and neither included interlaced content nor tried decoding in the lower-latency sliced-threads mode. Another is that the codebase changed significantly enough to introduce new bugs.

Using basic tools like zzuf, and later the more advanced american fuzzy lop, on a single quad-core server (in contrast to Google’s 2000 cores), we found the following unique bugs, a few of which caused easily triggered, real-world crashes.

H264 Frame Threads

https://trac.ffmpeg.org/ticket/4428

H264 Sliced Threads

https://trac.ffmpeg.org/ticket/4440

https://trac.ffmpeg.org/ticket/4438

https://trac.ffmpeg.org/ticket/4431

https://trac.ffmpeg.org/ticket/4408

https://trac.ffmpeg.org/ticket/4977

FFv1

https://trac.ffmpeg.org/ticket/4931

https://trac.ffmpeg.org/ticket/4932

https://trac.ffmpeg.org/ticket/4939

VP9

https://trac.ffmpeg.org/ticket/4935

Opus

https://bugzilla.libav.org/show_bug.cgi?id=876

https://bugzilla.libav.org/show_bug.cgi?id=909

Thanks to @rilian for providing fuzzing scripts and thanks to those who investigated and fixed the bugs, Michael Niedermayer in particular.

[1] http://www.codenomicon.com/files/pdf/Heartbleed-Story.pdf

 

Note: This is a more technical post than usual, and about 5 months late.

The decoding in the OBE C-100 decoder was optimised to make use of instructions in modern CPUs and this blog post explains how we did it:

HD-SDI video uses 10-bit pixels, but computers operate in bytes (8 bits), and 10-bit professional video doesn’t fit nicely into bytes. Instead, 10-bit video on a computer is stored in memory like this:



The X represents an unused bit. Note how in total 12 out of 32 of the bits are unused (that’s 37.5%). This is very wasteful if the data needs to be transferred to a piece of hardware like a Blackmagic SDI card. Virtually all professional SDI cards use the ‘v210’ format, which was first introduced by Apple in the 90s [1]. v210 improves the efficiency of 10-bit storage by packing the 10-bit video samples as follows:

(adapted from [1])

Now only 2 out of every 32 bits are unused, a major improvement. In the old v210 encoder in FFmpeg, each pixel is loaded from memory, shifted to the correct position and “inserted” using the OR operation. On 1920x1080 material, this involves about 250 million of these operations every second. More CPU time is spent packing the pixels for display than actually decompressing them from the encoded video!
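The one-sample-at-a-time approach can be sketched in a few lines of Python. This is an illustration rather than FFmpeg's actual code: pack3 and pack_row_scalar are hypothetical names, and the sketch ignores the line padding that real v210 buffers use. The component order within each group of four 32-bit words follows the v210 layout described in [1].

```python
import struct

def pack3(a, b, c):
    """Shift three 10-bit samples into place and OR them into one
    32-bit word; bits 30 and 31 stay unused."""
    return (a & 0x3FF) | ((b & 0x3FF) << 10) | ((c & 0x3FF) << 20)

def pack_row_scalar(y, cb, cr):
    """Pack planar 4:2:2 samples into v210 words.

    y holds the luma samples (length a multiple of 6); cb and cr each
    hold half as many chroma samples. Six pixels become four
    little-endian 32-bit words (16 bytes).
    """
    if len(y) % 6 or len(cb) != len(y) // 2 or len(cr) != len(y) // 2:
        raise ValueError("need 6n luma samples and 3n of each chroma")
    out = bytearray()
    for i in range(0, len(y), 6):
        c = i // 2  # index of the first chroma sample in this group
        out += struct.pack(
            "<4I",
            pack3(cb[c], y[i], cr[c]),
            pack3(y[i + 1], cb[c + 1], y[i + 2]),
            pack3(cr[c + 1], y[i + 3], cb[c + 2]),
            pack3(y[i + 4], cr[c + 2], y[i + 5]),
        )
    return bytes(out)
```

As a rough check on the operation count: a 1920x1080 4:2:2 frame has 1920 x 1080 x 2 = 4,147,200 samples, so at an assumed 60 frames per second (the frame rate is not stated above) that is about 249 million load, shift and OR steps every second.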

Clearly, we’ve got to do something about this. Thanks to the magic of SIMD instructions (in this case SSSE3 and AVX), we can instead process 12 pixels in one go [2]:

  1. Load luma pixels from memory
  2. Make sure they are within the v210 range
  3. Shift each pixel (if necessary) to appropriate position
  4. Shuffle pixels to rearrange them to v210 order
  5. Repeat 1-4 for chroma
  6. OR the luma and chroma registers together
  7. Store in memory
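The steps above can be mimicked in plain Python to show the data flow. This is only a scalar model of the idea, not the assembly in [2]: a real SIMD version does each step on a whole register at once, whereas here a small list of four words plays the role of a register. The names place, pack_group, LUMA_SLOTS and CHROMA_SLOTS are hypothetical, the group size of six luma plus six chroma samples is an assumption, and the 4 to 1019 limits are assumed to be the "v210 range" (they exclude the codes SDI reserves for timing).

```python
import struct

V210_MIN, V210_MAX = 4, 1019  # assumed legal 10-bit range

# (output word, bit shift) for luma samples Y0..Y5
LUMA_SLOTS = [(0, 10), (1, 0), (1, 20), (2, 10), (3, 0), (3, 20)]
# (output word, bit shift) for chroma in the order Cb0, Cr0, Cb1, Cr1, Cb2, Cr2
CHROMA_SLOTS = [(0, 0), (0, 20), (1, 10), (2, 0), (2, 20), (3, 10)]

def place(samples, slots):
    """Steps 1-4: load, clamp, shift and move each sample to its slot."""
    register = [0, 0, 0, 0]
    for sample, (word, shift) in zip(samples, slots):
        clamped = min(max(sample, V210_MIN), V210_MAX)
        register[word] |= clamped << shift
    return register

def pack_group(y, cb, cr):
    """Pack 6 luma and 3+3 chroma samples into 16 bytes of v210."""
    chroma = [c for pair in zip(cb, cr) for c in pair]  # interleave Cb/Cr
    luma_reg = place(y, LUMA_SLOTS)
    chroma_reg = place(chroma, CHROMA_SLOTS)                 # step 5
    words = [l | c for l, c in zip(luma_reg, chroma_reg)]    # step 6
    return struct.pack("<4I", *words)                        # step 7
```

Because luma and chroma never occupy the same bits, the two partial registers can be built independently and merged with a single OR, which is what makes the approach a good fit for SIMD.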

This can be (unscientifically) benchmarked with the command:

ffmpeg -pix_fmt yuv422p10 -s 1920x1080 -f rawvideo -i /dev/zero -f rawvideo -vcodec v210 -y /dev/null

Before: 168fps

After: 480fps

Roughly a 3x speed boost.

However, a lot of the content that the decoder receives is 8-bit, which has this packing format:



In existing software decoders, this needs to be converted to the 10-bit samples in the first picture and then packed into v210, a two-step process. We can now do this in a single step.
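The reason the two steps can be merged is simple arithmetic: widening an 8-bit sample to 10 bits is a left shift by 2, and that shift can be folded into the shift that positions the sample in the v210 word. A small Python sketch shows the equivalence. The function names are hypothetical, and any range clamping is left out for brevity.

```python
def widen(sample8):
    """Convert an 8-bit sample to 10 bits by shifting left by 2."""
    return sample8 << 2

def pack3_two_step(a, b, c):
    """Old route: widen to 10 bits first, then pack into a v210 word."""
    a10, b10, c10 = widen(a), widen(b), widen(c)
    return a10 | (b10 << 10) | (c10 << 20)

def pack3_one_step(a, b, c):
    """Merged route: the widening shift is folded into the packing
    shift (0+2, 10+2, 20+2), so no 10-bit intermediate is stored."""
    return (a << 2) | (b << 12) | (c << 22)
```

Both functions return the same word for any 8-bit inputs, but the one-step version never writes the intermediate 10-bit samples out to memory.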

ffmpeg -pix_fmt yuv422p -s 1920x1080 -f rawvideo -i /dev/zero -f rawvideo -vcodec v210 -y /dev/null

Before: 95fps

After: 620fps

Now 6.5x faster!

What more could be done: 

  • Allow the decoder to decode straight to v210 using FFmpeg's draw_horiz_band capability. 
  • Try using AVX2 on newer Haswell CPUs. This should provide a small speed increase, but at the cost of increased complexity.
  • Use multiple CPU cores on the conversion - this isn’t really useful for OBE but people creating v210 files may find it useful (especially UHD content).

Thanks must go to those who helped review this code.

[1] https://developer.apple.com/library/mac/technotes/tn2162/_index.html#//apple_ref/doc/uid/DTS40013070-CH1-TNTAG8-V210__4_2_2_COMPRESSION_TYPE

(This is from Apple’s venerable Letters from the Ice Floe)

[2] http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavcodec/x86/v210enc.asm