The OBE blog

Every four years the World Cup comes by and gives us a unique opportunity to see how encoders are progressing. Four years ago we looked at some of the coding tools that the BBC were using for UHD on DTT. This year UHD is online using BBC iPlayer, and after some reverse engineering of the MPEG-DASH stream it was possible to dump a copy of the raw stream at 35 Mbit/s. This meant we could analyse some of the technical decisions the encoder was making.

Disclaimer: This is merely an objective analysis of the coding features the encoder uses and not an analysis of the subjective or objective picture quality of the encoder. It’s also worth saying that this information comes from a small clip but, in the main, short clips can provide a good indication of the coding decisions an encoder is making.

A very good introduction to HEVC coding tools can be found here: http://forum.doom9.org/showthread.php?t=167081 

Thanks to Parabola Research for their help in producing this post. A bitstream analysis report from Parabola Explorer Pro 4.2, which helped produce this analysis, can be downloaded below.

As before in no particular order:

  • The GOP structure is pretty boring IBBPBBPBBP. It appears not to adapt. It’s quite similar to MPEG-2 in that it only keeps a maximum of one frame in L0 and L1. It uses L0 and L1 on B-Frames. 50 frames closed GOP.
  • 8 Slices per frame
  • It’s doing basic constant QP throughout, with no use of adaptive quantisation to try and improve visual quality, so it’s very likely trying to optimise for PSNR and not visual quality.
  • All intra coding is 16x16. Nothing smaller or larger, so it is similar to using only i16x16 in MPEG-4, but the full HEVC range of intra modes appears to be searched and used (in contrast to the encoder used in 2014)
  • Inter coding is 16x16 and 32x32. Again nothing smaller or larger.
  • Only one partition type used (2Nx2N, i.e. no prediction splits).
  • Only merge_idx 0 used (there are 4 modes left unsearched/unused).
  • SAO is not used

The full report is available here: http://downloads.obe.tv/Parabola-Explorer-BBC-Russia-UHD.pdf

So in conclusion this encoder is using a very limited subset of the features available in HEVC (arguably MPEG-4 with a few new coding tools), probably due to performance limitations onboard in order to hit realtime UHD. It would be interesting to see UHD at a lower bitrate to see if the encoder would end up using more HEVC features. It’s quite surprising in 2018 to see such a limited set of features being used and it’s likely that there are better examples of state of the art realtime encoding out there.

By James Darnley

Introduction

As in previous blog posts, we work extensively on using high SIMD instructions on Intel CPUs to speed up video processing in open source libraries such as FFmpeg and Upipe.

Recently we have been considering using Intel’s new instruction set AVX-512 and its wider vector registers, the 512-bit (64-byte) ZMM registers, to see if we can eke more speed out of the code anywhere.  While we were gearing up to test this, incorporating a very new assembler and an update to the x264asm compat layer, Cloudflare published its own findings on using these features in On the dangers of Intel's frequency scaling.

Briefly put, they showed that using even a little code that touches ZMM registers can slow everything else down. The processor will reduce its operating frequency when it hits a ZMM instruction to reduce power consumption and heat output.

Because of that we decided not to test any ZMM registers.  Like Cloudflare, we don’t spend enough time in assembly functions to be able to take the CPU clock speed hit.  However, the new instructions and the EVEX prefix are also available for the narrower XMM and YMM registers, and the number of those registers increases to 32.  Specifically this requires the AVX-512 Vector Length Extensions (VL) feature which the Skylake-X and new Xeon processors have.  If you can make use of the new features they may provide you with some speed gains.

Where to Start

Where would one begin?  There are so many new features that it can be hard to know.  There are op-masks, op-mask registers, blends, compares, permutes, conversions, scatters, and more.

I will start by covering a couple of instructions I have emulated in the past: maximum and minimum of packed signed quadwords; arithmetic right shift of packed quadwords; convert quadwords to doublewords.  These now exist as single instructions.  AVX-512 has added or extended many functions for quadwords, see Intel's Optimization Reference Manual (pdf) section 15.13.

Arithmetic shift right of quadwords could be emulated with a pregenerated sign-extend mask and pxor; pcmpgtq; pand; psrlq; por and a spare register.  That is 5 instructions, only 1 of which could be done in parallel with the others, plus however many are needed to create the mask.  For the function where I needed this, the shift was constant for the duration of the function so it was a once-only cost to create the mask.  The five instructions could have a latency of 7 cycles whereas vpsraq is 1, 4, or 8 cycles, depending on the precise form used, according to Intel’s own documents about latency (pdf).
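To make the five-instruction sequence concrete, here is a minimal scalar sketch in C of what it computes for a single 64-bit lane.  It is a model of the idea, not the original assembly: the function names are made up for illustration, and the exact operand order of the real code is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the pregenerated sign-extend mask, i.e. the top
 * n bits set.  Built once per function because the shift is constant. */
static uint64_t sign_extend_mask(unsigned n)
{
    assert(n < 64);                       /* shifting by 64 or more is undefined in C */
    return n ? ~UINT64_C(0) << (64 - n) : 0;
}

/* Scalar model of one lane of: pxor; pcmpgtq; pand; psrlq; por.
 * The lane is handled as a raw 64-bit pattern. */
static uint64_t sra64_emulated(uint64_t x, unsigned n, uint64_t ext_mask)
{
    assert(n < 64);
    uint64_t is_neg  = (x >> 63) ? ~UINT64_C(0) : 0; /* pxor + pcmpgtq: 0 > x   */
    uint64_t fill    = is_neg & ext_mask;            /* pand: sign bits to add  */
    uint64_t shifted = x >> n;                       /* psrlq: logical shift    */
    return shifted | fill;                           /* por: restore sign bits  */
}
```

For example, the bit pattern of -8 shifted right by 1 comes out as the bit pattern of -4, which is what a single vpsraq would give.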

Maximum and minimum of packed signed quadwords can be emulated with pcmpgtq; pand; pandn; por and a spare register.  That is 4 instructions, 5 if a memory operand is needed for the minimum, none of which can be done in parallel.  The four instructions to emulate could have a 6 cycle latency whereas vpmaxsq is 3 cycles or 10 with a memory operand.
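The compare-and-select pattern can be sketched per lane in scalar C as follows.  This is only an illustration of the logic, under the assumption that the mask comes from a signed "a greater than b" compare; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as signed without relying on
 * implementation-defined conversion. */
static int64_t from_bits(uint64_t u)
{
    int64_t v;
    memcpy(&v, &u, sizeof v);
    return v;
}

/* Scalar model of one lane of: pcmpgtq; pand; pandn; por. */
static int64_t max_sq_emulated(int64_t a, int64_t b)
{
    uint64_t gt     = a > b ? ~UINT64_C(0) : 0; /* pcmpgtq: all ones if a > b */
    uint64_t keep_a = (uint64_t)a & gt;         /* pand:  a where a > b       */
    uint64_t keep_b = ~gt & (uint64_t)b;        /* pandn: b elsewhere         */
    return from_bits(keep_a | keep_b);          /* por                        */
}

/* Same compare, with the two selections swapped. */
static int64_t min_sq_emulated(int64_t a, int64_t b)
{
    uint64_t gt     = a > b ? ~UINT64_C(0) : 0;
    uint64_t keep_b = (uint64_t)b & gt;         /* b where a > b */
    uint64_t keep_a = ~gt & (uint64_t)a;        /* a elsewhere   */
    return from_bits(keep_a | keep_b);
}
```

Each step depends on the result of the previous one, which is why none of the four instructions can overlap.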

Convert quadwords to doublewords: it now exists.  AVX-512 adds many down convert instructions for doublewords and quadwords with truncation, signed and unsigned saturation.  These are a bit like the reverse operation of the pmovsx and pmovzx instructions, move with sign or zero extend from SSE 4.1.  The min/max mentioned above was to work around this particular limitation.  I needed to pack and saturate the quadwords so I was clipping with min/max and then shuffling or blending values back together.
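The three flavours of down conversion can be modelled per lane in scalar C as below.  The mnemonics in the comments (vpmovqd, vpmovsqd, vpmovusqd) are taken from Intel's instruction set reference rather than from this post, and the function names are invented for the sketch.

```c
#include <stdint.h>

/* vpmovqd-style truncation: keep the low 32 bits of the quadword. */
static uint32_t qword_to_dword_trunc(uint64_t q)
{
    return (uint32_t)q;
}

/* vpmovsqd-style signed saturation: clip to the int32_t range first.
 * This is what the min/max clipping workaround was emulating. */
static int32_t qword_to_dword_ssat(int64_t q)
{
    if (q > INT32_MAX) return INT32_MAX;
    if (q < INT32_MIN) return INT32_MIN;
    return (int32_t)q;
}

/* vpmovusqd-style unsigned saturation: the source is treated as unsigned. */
static uint32_t qword_to_dword_usat(uint64_t q)
{
    return q > UINT32_MAX ? UINT32_MAX : (uint32_t)q;
}
```

With the single instruction, the clip and the pack happen together, so the separate shuffle or blend step is no longer needed.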

It would need a rewrite of the function to make good use of the new features because the rather ugly logic is partly a result of the limitations of older instruction sets.  It would also need a rewrite because the older blend instructions do not have an EVEX encoded form so cannot use the new 16 registers.  Because the x264asm compat layer, which Upipe and FFmpeg use, prefers the new registers, AVX-512 isn't a simple drop-in replacement for this.

Op-masks

Which brings me onto op-masks.  Op-masks are a feature that could see a great deal of use in code which has run-time branching, conditionals, or masking.  Blends can now be done with op-masks.

The EVEX encoding means instructions now have a form like this: vpaddw m1 {k1}, m2, m3, in which k1 is the op-mask.  k1 is one of eight dedicated op-mask registers.  They are manipulated using dedicated instructions whose names begin with a 'K'; see the instruction set reference of Intel’s Software Development Manuals.  They can also be set using the result of the various compare instructions.  In this example each word in m1 will only be changed to the result of m2+m3 if the corresponding bit in k1 is set, otherwise it is left unchanged.  The lowest word will check bit 0, up to the highest word which will check bit 15.

It is similar for a move, which you can turn into a blend with an op-mask.  New move instructions have been added: vmovdqu8; vmovdqu16; vmovdqu32; vmovdqu64.  With vmovdqu16 m1 {k1}, m2 each word value in the destination will only be changed to the source value if the corresponding bit is set.  Either the destination or the source could also be a memory location, like with the older moves.  This is a conditional move of packed values.
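As a rough illustration of a compare feeding an op-mask and a masked move acting as a blend, here is a scalar C model over the 16 words of a YMM register.  The function names are hypothetical and the choice of a signed greater-than compare is just an example.

```c
#include <stdint.h>

#define WORDS 16 /* a 256-bit YMM register holds 16 words */

/* Model of a word compare writing an op-mask:
 * bit i is set when a[i] > b[i] (signed). */
static uint16_t cmpgt_words(const int16_t a[WORDS], const int16_t b[WORDS])
{
    uint16_t k = 0;
    for (int i = 0; i < WORDS; i++)
        if (a[i] > b[i])
            k |= (uint16_t)(1u << i);
    return k;
}

/* Model of vmovdqu16 dst {k}, src: only lanes whose mask bit is set
 * are overwritten, the rest keep their old value. */
static void masked_move_words(int16_t dst[WORDS], const int16_t src[WORDS],
                              uint16_t k)
{
    for (int i = 0; i < WORDS; i++)
        if ((k >> i) & 1)
            dst[i] = src[i];
}
```

Calling cmpgt_words(a, b) and then masked_move_words(b, a, k) leaves the per-lane maximum in b, without spending a vector register on the mask.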

Another feature of these op-masks is the zeroing bit of the EVEX encoding.  In the form vpaddw m1 {k1}{z}, m2, m3 the instruction will change m1 to be m2+m3 where the corresponding bit is set in k1.  However, when the bit is not set the corresponding word value will be set to zero.  The benefit is that the instruction does not depend on the values in m1 before the instruction.  If you can use the zero values then it will be useful in that fashion too.
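The difference between the merging form and the zeroing form can be sketched in scalar C like this.  It models 16 word lanes; the function name and the boolean parameter are inventions for the example, not part of any real API.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORDS 16 /* a 256-bit YMM register holds 16 words */

/* Model of vpaddw dst {k}, a, b      (zeroing == false, merge-masking)
 *      and vpaddw dst {k}{z}, a, b   (zeroing == true,  zero-masking). */
static void masked_add_words(uint16_t dst[WORDS], const uint16_t a[WORDS],
                             const uint16_t b[WORDS], uint16_t k, bool zeroing)
{
    for (int i = 0; i < WORDS; i++) {
        if ((k >> i) & 1)
            dst[i] = (uint16_t)(a[i] + b[i]); /* wraps modulo 2^16 like paddw */
        else if (zeroing)
            dst[i] = 0;                       /* {z}: unselected lanes cleared */
        /* otherwise: merge-masking leaves dst[i] untouched */
    }
}
```

In the zeroing case the old contents of dst are never read, which is the dependency-breaking property described above.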

These op-masks are probably the biggest reason to rewrite functions because of the conditionals they let you use.  With the op-mask registers freeing vector registers from holding masks and with the new instructions freeing more registers that may have been used in emulation and with the added 16 registers there are now more registers than I know what to do with.  Most of the functions I've worked on were not short on registers, at least on x86-64.  I could store more constants in them rather than loading from memory but that only gets you a small speedup in most cases.

Summary

For those looking for a summary or a TL;DR of what they should look at in their own code, I think you should focus on these areas:

  • Any function that stores intermediate data into memory because of register pressure.
  • Any function with conditionals, any function with a compare instruction.
  • Any function that uses quadwords, uint64_t, or int64_t data types.

 

 

2017 has certainly been a packed year, with a number of interesting projects, including an end-to-end IP, end-to-end IT broadcast, and a slew of events.

For quite some time, we have been championing the merits and feasibility of IP for contribution. Whilst many in the industry are still hanging back, unsure it is quite there yet, we have already proven that it is. This year, things started to take off in a big way, with many others embracing the change. This was obvious walking around the halls at IBC, but in particular, this was noticeable from the Broadcast Tech IP Summit in October. Whilst there was a room full of traditional broadcasters and a focus on whether IP is feasible, it was obvious that the rhetoric has shifted somewhat over recent months, even in that traditional broadcast space. Most broadcasters are looking to go IP even if they are not quite there yet.

From our point of view, we have worked on quite a few interesting projects. This includes working with MSTV Live Broadcasting, which provides live coverage of a wide range of sports events across Europe. Uniquely, this small company is providing these feeds using a satellite service designed for newsgathering and an IP connection. Using a combination of low cost, high quality tools, MSTV is able to ensure a compelling experience for its entire, growing, viewer base.

2017 also saw the completion of a long-term project we have been working on for Sky in the UK. We have, of course, been working with Sky for some time and it has been contributing a number of feeds over IP and using our encoders. However, this project went beyond all of that and saw the creation of an all IP, all IT master control facility. It really is using entirely off-the-shelf standard IT hardware. What is more, Sky is now using it on a daily basis to contribute live news coverage. As you can imagine, this was a massive project and involved a number of vendors. We already know that IP brings scalability, flexibility, and massive cost efficiencies, especially if you use standard off-the-shelf equipment. Above all, this project proves that IP is entirely possible and reliable.

All that said, we are not quite there yet, but I expect things to develop further as we move into 2018. These early examples will pave the way and I believe we will see many more developments throughout 2018. This will include more broadcasters switching to IP for contribution feeds and I think we will likely see more announcements of plans to build all IP facilities. I also believe we will start to see more and more IP OB trucks being used in the field. Current IP OB trucks have some legacy kit inside too. I believe over the coming months more of that will be replaced with IP technology.

We will in particular see some interesting innovations in IP for remote production. This is where IP connectivity is good enough to backhaul all traffic over an IP network to a facility, which naturally saves a great deal on travel expenses. Following on from that, we will begin to see the emergence of interactive personalised live events. Currently broadcasters sending feeds over the web simply show the same thing that is on TV, but with a 60-second delay because it is on the web. However, IP opens up the opportunity to create a much more immersive experience as broadcasters now have more than the single world feed. We will start to see these versions become much more interesting and interactive, giving consumers a real reason for watching the web versions rather than the TV broadcast.

So, whilst 2017 was the year of (almost) IP with a few broadcasters dipping their toes in the water, I expect 2018 may turn out to be the year we actually begin to see a shift in mindsets and workflows.

 

Warning: like Part One of this series, these posts are very technical!

After converting the old MMX simple IDCT of FFmpeg from inline assembly to external (as described in Part One) I was to look at making the IDCT faster. A naive approach is to convert from directly using the mm registers to using xmm registers. This can usually be done with minimal changes just paying attention to packs, unpacks, and moves. This can make things faster on Skylake and related microarches from Intel. A discussion of why is beyond the scope of this post. The point is that you can measure that functions are faster if they use xmm registers.