By James Darnley

Introduction

As in previous blog posts, we work extensively on using high SIMD instructions on Intel CPUs to speed up video processing in open source libraries such as FFmpeg and Upipe.

Recently we have been considering using Intel’s new instruction set AVX-512 and its wider vector registers, 512-bit 64-byte ZMM registers, to see if we can eke more speed out of the code anywhere.  While we were gearing up to to test this, incorporating a very new assembler and an update to the x264asm compat layer, Cloudflare published its own findings on using these features in On the dangers of Intel's frequency scaling.

Briefly put they showed that only using a little bit of code that uses ZMM registers can slow everything else down. The processor will reduce its operating frequency when it hits a ZMM instruction to reduce power consumption and heat output.

Because of that we decided to not try testing any ZMM registers.  Like Cloudflare we don’t spend enough of time in assembly functions to be able to take the CPU clock speed hit.  However the new instructions and EVEX prefix are available for narrower XMM and YMM registers and increases these to 32 registers.  Specifically this requires the AVX-512 Vector Length Extensions (VL) feature which the Skylake-X and new Xeon processors have.  If you can make use of the new features they may provide you with some speed gains.

Where to Start

Where would one begin?  There are so many new features that it can be hard to know.  There are op-masks, op-mask registers, blends, compares, permutes, conversions, scatters, and more.

I will start by covering a couple of instructions I have emulated in the past: maximum and minimum of packed signed quadwords; arithmetic right shift of packed quadwords; convert quadwords to doublewords.  These now exist as single instructions.  AVX-512 has added or extended many functions for quadwords, see Intel's Optimization Reference Manual (pdf) section 15.13.

Arithmetic shift right of quadwords could be emulated with a pregenerated sign-extend mask and pxor; pcmpgtq; pand; psrlq; por and a spare register.  5 instructions only 1 of which could be done in parallel with the others, plus however many are needed to create the mask.  For the function I needed this the shift was constant for the duration of the function so it was a once-only cost to create the mask.  The five instructions could have a latency of 7 cycles whereas vpsraq is 1, 4, or 8 cycles, depending on the precise form used, according to Intel’s own documents about latency (pdf).

Maximum and minimum of packed signed quadwords can be emulated with pcmpgtq; pand; pandn; por and a spare register.  4 instructions, 5 if a memory operand is needed for the minimum, none can be done in parallel.  The four instructions to emulate could have a 6 cycle latency whereas vpmaxsq is 3 cycles or 10 with a memory operand.

Convert quadwords to doublewords: it now exists.  AVX-512 adds many down convert instructions for doublewords and quadwords with truncation, signed and unsigned saturation.  These are a bit like the reverse operation of the pmovsx and pmovzx instructions, move with sign or zero extend from SSE 4.1.  The min/max mentioned above was to work around this particular limitation.  I needed to pack and saturate the quadwords so I was clipping with min/max and then shuffling or blending values back together.

It would need a rewrite of the function to make good use of the new features because the rather ugly logic is partly a result of the limitations of older instruction sets.  It would also need a rewrite because the older blend instructions do not have an EVEX encoded form so cannot use the new 16 registers.  Because the x264asm compat layer, which Upipe and FFmpeg use, prefers the new registers AVX-512 isn't a simple drop-in replacement for this.

Op-masks

Which brings me onto op-masks.  Op-masks are a feature that could see a great deal of use in code which has run-time branching, conditionals, or masking.  Blends can now done with op-masks.

The EVEX encoding means instructions now have a form like this vpaddw m1 {k1}, m2, m3 in which k1 is the op-mask.  k1 is one of eight dedicated op-mask registers.  They are manipulated using dedicated instructions, see the instruction set reference of Intel’s Software Development Manuals, the instructions begin with a 'K'.  They can also be set using the result of the various compare instructions.  In this example each word in m1 will only be changed to the result of m2+m3 if the corresponding bit in k1 is set otherwise it is left unchanged.  The lowest word will check bit 0 up to the highest word which will check bit 15.

It is similar for a move, which you can turn into a blend with an op-mask.  New move instructions have been added vmovdqu8; vmovdq16; vmovdq32; vmovdq64.  With movdqu16 m1 {k1}, m2 each word value in the destination will only be changed to the source value if the corresponding bit is set.  Either the destination or the source could also be a memory location, like with the older moves.  This is a conditional move of packed values.

Another feature of these op-masks is the zeroing bit of the EVEX encoding.  In the form vpaddw m1 {k1}{z}, m2, m3 the instruction will will change m1 to be m2+m3 where the corresponding bit is in k1.  However when the bit is not set then the corresponding word value will be set to zero.  This benefits by not depending on the values in m1 before the instruction.  If you can use the zero values then it will be useful in that fashion too.

These op-masks are probably the biggest reason to rewrite functions because of the conditionals they let you use.  With the op-mask registers freeing vector registers from holding masks and with the new instructions freeing more registers that may have been used in emulation and with the added 16 registers there are now more registers than I know what to do with.  Most of the functions I've worked on were not short on registers, at least on x86-64.  I could store more constants in them rather than loading from memory but that only gets you a small speedup in most cases.

Summary

For those looking for a summary or a TL;DR of what they should look at in their own code I think you should focus these areas:

  • Any function that stores intermediate data into memory because of register pressure.
  • Any function with conditionals, any function with a compare instruction.
  • Any function that uses quadwords, uint64_t, or int64_t data types.

 

Note: This is a more technical post than usual, and about 5 months late.

The decoding in the OBE C-100 decoder was optimised to make use of instructions in modern CPUs and this blog post explains how we did it:

HD-SDI video uses 10-bit pixels but computers operate in bytes (8-bits). However, 10-bit professional video doesn’t fit nicely into bytes. Instead, 10-bit video on a computer is stored in memory like this:



The X represents an unused bit - note how in total 12 out of 32 of the bits are unused (that’s 37.5%). It’s very wasteful if the data needs to be transferred to a piece of hardware like a Blackmagic SDI card. Virtually all professional SDI cards use the ‘v210’ format that was first introduced by Apple in the 90s [1] and v210 improves the efficiency of 10-bit storage by packing the 10-bit video samples as follows:

(adapted from [1])

Now only 2 out of the 32-bits are unused, a major improvement. Using the old v210 encoder in FFmpeg, each pixel is loaded from memory, shifted to the correct position and “inserted” using the OR operation. When doing this on 1920x1080 material, this involves about 250 million of these operations every second. More CPU time is spent packing the pixels for display than actually decompressing them from the encoded video!

Clearly, we’ve got to do something about this - Thanks to the magic of SIMD instructions (in this case SSSE3 and AVX) we can instead process 12 pixels in one go [2]: 

  1. Load luma pixels from memory
  2. Make sure they are within the v210 range
  3. Shift each pixel (if necessary) to appropriate position
  4. Shuffle pixels to rearrange them to v210 order
  5. Repeat 1-4 for chroma
  6. OR the luma and chroma registers together
  7. Store in memory

This can be (unscientifically) benchmarked with the command:

ffmpeg -pix_fmt yuv422p10 -s 1920x1080 -f rawvideo -i /dev/zero -f rawvideo -vcodec v210 -y /dev/null

Before: 168fps

After: 480fps

A 3x speed boost.

But, a lot of content that the decoder receives is 8-bit which has this packing format:



In existing software decoders, this needs to be converted to the 10-bit samples in the first picture and then packed into v210, a two step process. But, we can now just do this in a single step.

ffmpeg -pix_fmt yuv422p -s 1920x1080 -f rawvideo -i /dev/zero -f rawvideo -vcodec v210 -y /dev/null

Before: 95fps

After: 620fps

Now 6.5x faster!

What more could be done: 

  • Allow the decoder to decode straight to v210 using FFmpeg's draw_horiz_band capability. 
  • Try using AVX2 on newer Haswell CPUs - should provide a small speed increase but with an increased complexity.
  • Use multiple CPU cores on the conversion - this isn’t really useful for OBE but people creating v210 files may find it useful (especially UHD content).

Thanks must go to those who helped review this code.

[1] https://developer.apple.com/library/mac/technotes/tn2162/_index.html#//apple_ref/doc/uid/DTS40013070-CH1-TNTAG8-V210__4_2_2_COMPRESSION_TYPE

(This is from Apple’s venerable Letters from the Ice Floe)

[2] http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavcodec/x86/v210enc.asm