
Research GPU-accelerated transcode of WebM VP9 video
Closed, ResolvedPublic

Description

Intel Kaby Lake-generation CPUs with an integrated GPU support hardware decoding and encoding of VP9 video (as well as H.264 and HEVC). Investigate the feasibility of increasing transcoding performance versus pure-software encoding.

Currently, the low performance of pure-software libvpx encoding is holding us back from creating VP9 transcodes (T63805).

Todo:

  • obtain access to a Linux machine with a Kaby Lake or Coffee Lake CPU with integrated GPU
  • figure out how the Intel drivers for it work and interface with ffmpeg
  • test performance of single streams and multiple parallel streams on pure-CPU vs using the GPU acceleration
  • compare quality at same bandwidth, and bandwidth at same quality
  • if promising, pitch to ops the idea of a small number of dedicated transcoding machines with the necessary CPU configuration

Possible:

  • compare with AMD (ATI) GPUs and check whether they have similar support
  • note that NVidia hardware lacks support for VP9 encoding, and the non-free drivers are a no-go

Event Timeline

OIT says they should be able to send me a temporary loaner MacBook Pro with a Kaby Lake CPU, which will give me what I need to run tests in isolation before I go advocating for server purchases.

Setup:

  • test machine is a MacBook Pro 14,1
  • Debian Stretch does *not* like the keyboard, wifi, screen, etc. on this Mac. ;) Installed it as a pseudo-server config, ssh'ing in.
  • Intel GPU kernel driver (i915) requires a firmware blob that's in "firmware-misc-nonfree" package
  • Requires filesystem access to /dev/dri/renderD128
  • ffmpeg 3.2.8 is not new enough to support encoding with vp8_vaapi or vp9_vaapi; I built 3.4 locally from git using the existing dependency packages.
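With that setup in place, a VAAPI VP9 encode with the locally built ffmpeg looks roughly like this (a sketch; the input/output filenames and the 3 Mbit/s target are placeholders):

```shell
# Open the render node mentioned above, upload frames to the GPU,
# and encode with the vp9_vaapi hardware encoder.
ffmpeg -vaapi_device /dev/dri/renderD128 \
       -i input.webm \
       -vf 'format=nv12,hwupload' \
       -c:v vp9_vaapi -b:v 3M \
       output.webm
```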

Compression:

  • The compression is *fast*, often exceeding 100 fps -- roughly 20x faster than libvpx
  • Picture quality is not as good as libvpx at the same bitrate.
  • Bitrate control seems to be calibrated for 30fps -- for a 24fps file, multiply the specified bitrate by 1.25; for a 60fps file, halve it to hit the desired target
  • Unsure how to use alt-ref frames; VAAPI exposes a "b-frame" option that is supposedly supported for VP9, which might mean alt-ref frames or might mean something else. Either way, it doesn't seem to help picture quality.
  • The compressed frames do not use tile columns, so clients can't use multi-threaded decoding.
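The 30fps scaling observed above amounts to multiplying the target by 30/fps. As a tiny helper (a sketch; `vaapi_bitrate` is a made-up name, and it uses integer math):

```shell
# Bitrate (kbit/s) to request so the VAAPI encoder hits the real target,
# assuming its rate control is calibrated for 30 fps: target * 30 / fps.
vaapi_bitrate() {
    local target_kbps=$1 fps=$2
    echo $(( target_kbps * 30 / fps ))
}
```

For example, `vaapi_bitrate 1000 24` prints 1250 and `vaapi_bitrate 1000 60` prints 500, matching the 1.25x and 0.5x factors above.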

Preliminary assessment:

  • Picture quality is noticeably worse at target bitrate than libvpx, with bad artifacting in high-motion scenes in a couple files I tested. Big lean-against.
  • Multiple tile columns are strongly recommended to optimize multicore decoding (not yet supported in ogv.js, but it will be one day) -- there's no way I can find to enable them. This is a moderate lean-against.
  • But this'd be GREAT for dealing with live video streams, where the alternative is "too slow, can't do anything".
  • Might or might not be relevant to consider using the hardware *decoder* for transcoding from H.264 files...?
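The decode-only variant would keep libvpx for encoding but offload H.264 decoding to the GPU; a sketch of what that invocation might look like (filenames and bitrate are placeholders; without a hardware output format, decoded frames are copied back to system memory for the software encoder):

```shell
# Decode H.264 on the Intel GPU via VAAPI, encode in software with libvpx-vp9.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
       -i input.mp4 \
       -c:v libvpx-vp9 -b:v 3M \
       output.webm
```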

I'll do a little more testing in case I can improve the compression, or to verify the decoding alternative.

Based on the earlier assessment, recommending against hardware compression for batch/video-on-demand transcoding. It would be useful for live encoding if we switch from YouTube etc. to doing our own streaming, but that's not a high priority right now.

Have changed the software encoding settings based on the comparisons I was doing:

  • use -speed 4 for first-pass
  • use -speed 1 for second-pass (or -speed 2 in some cases, based on Google's recommendations)
  • use -crf X with suitable quality values to enable constrained-quality mode, which saved bandwidth on many files
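The settings above translate into a two-pass libvpx-vp9 invocation along these lines (a sketch; the filenames, the 1500 kbit/s target, and CRF 33 are placeholder values):

```shell
# Pass 1: fast analysis pass (-speed 4), no audio, stats to ffmpeg2pass-0.log
ffmpeg -i input.mp4 -c:v libvpx-vp9 -b:v 1500k -crf 33 \
       -speed 4 -pass 1 -an -f null /dev/null

# Pass 2: slower, higher-quality pass (-speed 1) using the collected stats
ffmpeg -i input.mp4 -c:v libvpx-vp9 -b:v 1500k -crf 33 \
       -speed 1 -pass 2 -c:a libopus output.webm
```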

With these settings, VP9 is 4x slower than VP8 to encode, which isn't great but will parallelize well once chunking is done (T158716).

Closing out as resolved (having done the research).