Transforming compressed video on the GPU using OpenCV
Mar 5, 2021

In a previous post, I described various FFmpeg filters which I experimented with for the
purpose of lens correction, and I mentioned I might follow it up with a similar post about
video stabilisation. This post doesn’t quite fulfill that promise, but at least I have something
to report about GPU acceleration!

For background: videos that I recorded of my dodgeball matches had not only lens
distortion, but also unwanted shaking. Sometimes the balls would hit the net that the
camera was attached to, and the video became very shaky.

Attempt 1: FFmpeg filter


The first thing I tried was to find an FFmpeg filter which could solve the problem. I found
that the combination of vidstabdetect and vidstabtransform (wrappers for the vid.stab
library) produced reasonably good results. However, this method had a number of issues:

• It required 2 passes: one for detection of camera movement, and one to compensate
for it.
• It was very slow. The combination of perspective remapping and stabilisation
resulted in a framerate of about 3fps. This meant that a 40 minute dodgeball match
took half a day to process!
• It created “wobbling” when the camera was shaking.

Wobbling
The model used by vid.stab to represent the effect of camera movement is a limited affine
transformation, including only translation, rotation, and scaling. In my application, the main
way that the camera moved was by twisting – i.e. the camera remained at the same
location, but it turned to face different directions as it shook. There was little rotation in
practice, and little change in the position of the camera, so I don’t think that vid.stab
detected much rotation or scaling. Instead I think it applied translation (basically moving a
rectangle in 2 dimensions) in order to correct for changes in the angle of the camera.

The problem is that translation is not what happens to the image when you twist a camera
– what happens is a perspective transformation. Close to the center of the image or at a
high zoom level translation is a good approximation, but it gets worse further away from
the center of the image and with a wider field of view. My camera had a very wide field of
view, so the effect was quite significant.
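As an illustration (my own sketch, not code from the post, and assuming an undistorted pinhole image): for a camera that only rotates, the warp between frames is a homography built from the camera intrinsics K and the relative rotation R, which OpenCV can apply with warpPerspective. A translation, by contrast, can only slide the whole frame by a fixed offset. The function name and arguments here are hypothetical.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Undo a pure camera rotation R (3x3, CV_64F) given the intrinsic matrix K.
// Images from a rotating camera are related by the homography H = K * R * K^-1,
// so compensating for the rotation means warping by its inverse.
cv::Mat compensateRotation(const cv::Mat &frame, const cv::Mat &R, const cv::Mat &K) {
    cv::Mat H = K * R * K.inv();
    cv::Mat out;
    cv::warpPerspective(frame, out, H.inv(), frame.size());
    return out;
}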

Speed
There were a few reasons why the processing speed was so slow. One was that the
expensive (and destructive) interpolation step was happening twice – once to correct for
lens distortion, and then again for stabilisation. No matter how optimised the interpolation
process was, this was a waste of time. In theory there is no reason not to perform the
interpolation for both steps at once, but this wasn’t supported by the FFmpeg filters, and
probably wouldn’t even make sense to do with the FFmpeg filter API.
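For the sake of illustration (my own sketch, not something the FFmpeg filters support): with OpenCV-style remap maps, two chained remaps can be fused by sampling the first map through the second, so the expensive image interpolation happens only once. The function name and the lens_map/stab_map inputs are hypothetical.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// lens_map: CV_32FC2 map for the lens correction (output pixel -> source position)
// stab_map: CV_32FC2 map for the stabilisation (output pixel -> lens-corrected position)
cv::Mat fuseMaps(const cv::Mat &lens_map, const cv::Mat &stab_map) {
    // Sampling the lens map at the positions given by the stabilisation map
    // yields a single map that performs both transformations at once
    cv::Mat combined;
    cv::remap(lens_map, combined, stab_map, cv::noArray(), cv::INTER_LINEAR);
    return combined;
}

// Usage: cv::remap(src, dst, fuseMaps(lens_map, stab_map), cv::noArray(), cv::INTER_LINEAR);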

Another opportunity was to use the GPU to speed up the transformations. FFmpeg
supports the use of GPUs with various APIs. The easiest thing to get working is
compression and decompression. On Linux the established API for this is VA-API, which
FFmpeg supports. I was already using VA-API to decompress the H.264 video from my
GoPro camera, and to compress the H.264/H.265 output videos I was creating, but the
CPU was still needed for the projection change and video stabilisation.

For more general computation on GPUs, there are various other APIs, including Vulkan
and OpenCL. Although there are some FFmpeg filters that support these APIs, neither the
lensfun nor the vid.stab filters do. The consequence for me was that during the processing,
the decoded video frames (a really large amount of data) had to be copied from the GPU
memory to the main memory so that the CPU based filters could perform their tasks, and
then the transformed frames copied back to the GPU for encoding.

This copying takes significant time. For example, I found that an FFmpeg pipeline which
decoded and reencoded a video entirely on the GPU ran at about 380fps, whereas
modifying that pipeline to copy the frames to the main memory and back again dropped
this to about 100fps.

Attempt 2: OpenCV
At this point I felt like I had exhausted my ability to solve the problem with scripts that
called the FFmpeg CLI, and that to make more progress I would need to work at a lower
level. Here are the tools I used:

• libavformat, libavcodec, libavutil: C libraries for muxing/demuxing and encoding/
decoding (part of FFmpeg)
• OpenCV: An extensive library for computer vision written in C++
• VA-API: Linux API for GPU video encoding and decoding
• OpenCL: API for working with objects in GPU memory, well supported by OpenCV

I knew that there were methods in OpenCV to do things like perspective remapping, and
that many of its more popular methods had implementations that operated directly on
GPU memory with OpenCL. In order to take advantage of this, I needed to take the VA-
API frames from the GPU video decoder and convert them to OpenCV Mat objects. To
make the process run as fast as possible, I wanted to do this entirely on the GPU, without
copying frames to the main memory at any point.

Decoding with OpenCV VideoCapture


The first thing to do was to decode the input video and get VA-API frames. I first
attempted to use OpenCV’s VideoCapture API to do so. Depending on the platform there
is a choice of backing APIs from which to retrieve decoded video. The applicable choices
were CAP_FFMPEG and CAP_GSTREAMER . There weren’t any capture properties in the
OpenCV capture API at the time related to hardware decoding. While the FFmpeg
backend only accepted a file path as input, the GStreamer backend also accepted a
GStreamer pipeline. So with a bit of experimentation I came up with a GStreamer pipeline
which decoded the video with VA-API (confirmed by running intel_gpu_top from igt-
gpu-tools).

Mat frame;
VideoCapture cap(
    "filesrc location=/path/to/input.mp4 ! qtdemux ! h264parse ! vaapih264dec ! appsink sync=false",
    CAP_GSTREAMER
);
while (cap.read(frame)) {
    // Do stuff with frame
}

Although the decoding was done with VA-API, the resulting frame was not backed by
GPU memory – instead the VideoCapture API copied the result to the main memory
before returning it.

Aside: recently, support for hardware codec properties has been added to the VideoCapture
and VideoWriter APIs. Although this would simplify using VA-API with the GStreamer
backend (and make it possible with the FFmpeg backend), it still doesn’t return hardware
backed frames. You can see the FFmpeg capture implementation copying the data to
main memory in retrieveFrame , and the writer copying it back in writeFrame . Similarly, in
the GStreamer backend it looks like the buffer is always copied to main memory in
retrieveFrame .
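For reference, the newer properties look something like this (a sketch assuming OpenCV 4.5.2 or later; the decoded frames are still returned in main memory):

#include <opencv2/videoio.hpp>

// Ask the FFmpeg backend to decode with VA-API
cv::VideoCapture cap(
    "/path/to/input.mp4",
    cv::CAP_FFMPEG,
    { cv::CAP_PROP_HW_ACCELERATION, cv::VIDEO_ACCELERATION_VAAPI }
);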

Decoding with libavcodec

VideoCapture seemed like a dead end, so instead I turned my attention to demuxing
and decoding the video with libavformat and libavcodec. Although this required a lot more
code, I found that it worked very well. There are lots of examples in the documentation,
including for hardware codecs, OpenCL, and mapping different types of hardware frames.
I wrote code to open a file, and create a demuxer and video decoder. Then I set up a finite
state machine to pull video stream packets from the demuxer and send them to the
decoder, as well as code to pull raw frames from the decoder and process them. It was
something like this pseudocode:

state = AWAITING_INPUT_PACKETS
while (state != COMPLETE):
    switch (state):
        case COMPLETE:
            break
        case FRAMES_AVAILABLE:
            while (frame = get_raw_frame_from_decoder()):
                process_frame()
            state = AWAITING_INPUT_PACKETS
            break
        case AWAITING_INPUT_PACKETS:
            if (input_exhausted()):
                state = COMPLETE
            else:
                send_demuxed_video_packet_to_decoder()
                state = FRAMES_AVAILABLE
            break

This was mostly the result of copying examples like this one (except for the part that
copies the VA-API buffer to main memory).
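In real libavcodec terms, the inner loop is built from avcodec_send_packet and avcodec_receive_frame. A stripped down sketch of my understanding of that loop (error handling omitted; fmt_ctx, decoder_ctx, video_stream_index and process_frame are assumed to be set up as described above):

AVPacket *pkt = av_packet_alloc();
AVFrame *frame = av_frame_alloc();

while (av_read_frame(fmt_ctx, pkt) >= 0) {
    if (pkt->stream_index == video_stream_index) {
        // AWAITING_INPUT_PACKETS -> FRAMES_AVAILABLE
        avcodec_send_packet(decoder_ctx, pkt);
        // Drain every frame the decoder can produce so far
        while (avcodec_receive_frame(decoder_ctx, frame) >= 0) {
            process_frame(frame);
        }
    }
    av_packet_unref(pkt);
}

// Input exhausted: flush the decoder
avcodec_send_packet(decoder_ctx, NULL);
while (avcodec_receive_frame(decoder_ctx, frame) >= 0) {
    process_frame(frame);
}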

Converting VA-API frames to OpenCL memory


With the VA-API frames available, it was time to convert them into OpenCL backed
OpenCV Mat objects. OpenCL has an Intel specific extension
cl_intel_va_api_media_sharing which allows VA-API frames to be converted into OpenCL
memory without copying them to the main memory. Luckily I had an Intel GPU.

I could see two options for using this extension. One was to use OpenCV’s interop with
VA-API, and another was to first map from VA-API to OpenCL in libavcodec. On the first
attempt with libavcodec I couldn’t find a way to expose the OpenCL memory, so I chose
the OpenCV VA-API interop option.

OpenCV VA-API interop


There were a few basic snags with OpenCV’s VA-API interop. OpenCV is built without it
by default, and the Arch Linux package doesn’t include the necessary build flags, so I had
to create a custom PKGBUILD and build it myself. In the process it became apparent that
OpenCV was not compatible with the newer header provided by OpenCL-Headers, and
only worked with the header from a legacy Intel specific package. So I also had to patch
OpenCV to build with the more up to date headers (this is no longer necessary after this
recent fix to OpenCV).

Making it work required some additional effort. The VA-API and OpenCL APIs both refer
to memory on a specific GPU and driver, and also with a specific scope (a “display” in the
case of VA-API and a “context” for OpenCL). So it’s necessary to initialise the scope of
each API such that the memory is compatible and can be mapped between the APIs. The
easiest way seemed to be to choose a DRM device, use it to create a VA-API
VADisplay , and then use this to create an OpenCL context (which the OpenCV VA-API
interop handles automatically). The code looked something like this:

#include <fcntl.h>
#include <opencv2/core/va_intel.hpp>
extern "C" {
#include <va/va.h>
#include <va/va_drm.h>
}

VADisplay vaDisplay; // kept for later use when mapping frames

void initOpenClFromVaapi() {
    // The DRM fd must remain open for as long as the VADisplay is in use
    int drm_device = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
    vaDisplay = vaGetDisplayDRM(drm_device);

    // Initialise the display before handing it to OpenCV
    int major, minor;
    vaInitialize(vaDisplay, &major, &minor);

    // Creates an OpenCL context that can share memory with this VA-API
    // display and makes it OpenCV's default OpenCL context
    va_intel::ocl::initializeContextFromVA(vaDisplay, true);
}

The OpenCV API handles the OpenCL context in an implicit way - so after
initializeContextFromVA you can expect that all the other functionality in OpenCV
that uses OpenCL will use the VA-API compatible OpenCL context.
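A quick way to sanity check this (my own addition, with a hypothetical helper name) is to inspect OpenCV's default OpenCL device after initialisation:

#include <iostream>
#include <opencv2/core/ocl.hpp>

void checkOpenClContext() {
    // Both of these should reflect the device behind the VA-API display
    std::cout << "OpenCL available: " << cv::ocl::haveOpenCL() << "\n";
    std::cout << "Default device: " << cv::ocl::Device::getDefault().name() << "\n";
}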

From there it was reasonably simple to create OpenCL backed Mat objects from VA-API
backed AVFrame s:

UMat get_mat_from_vaapi_frame(AVFrame *frame) {
    // UMat is the OpenCL backed array type in OpenCV
    UMat result;

    // For AV_PIX_FMT_VAAPI frames, the VASurfaceID is stored in data[3]
    VASurfaceID surface = (VASurfaceID)(uintptr_t) frame->data[3];

    va_intel::convertFromVASurface(
        vaDisplay,
        surface,
        Size(frame->width, frame->height),
        result
    );
    return result;
}

This method worked, but it wasn’t as fast as I had hoped. After reading the code I had a
reasonably good idea why.

Video codecs like H.264 (and by extension APIs like VA-API) usually deal with video in
NV12 format. NV12 is a semi-planar format: instead of storing all of the colour channels of
each pixel together, it stores a full resolution plane containing the luminance/brightness of
the whole image, followed by a half resolution plane containing the chroma/colour (which
incorporates 2 interleaved channels).
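As an illustration of the layout (my own sketch, assuming a tightly packed buffer with no row padding; width, height and data are hypothetical variables):

// Offsets into a tightly packed NV12 buffer
size_t luma_size   = (size_t) width * height;   // Y plane: 1 byte per pixel
size_t chroma_size = luma_size / 2;             // interleaved UV at half resolution
uint8_t *y_plane   = data;                      // full resolution luminance
uint8_t *uv_plane  = data + luma_size;          // alternating U and V bytes
// Total: 1.5 bytes per pixel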

Also, OpenCL has various different types of memory, and they cannot all be treated the
same way. OpenCV Mat objects, when backed by OpenCL memory, use an OpenCL
Buffer , whereas VA-API works with instances of Image2D . So in order to create an
OpenCL backed Mat from a VA-API frame, it’s necessary to first remap from an OpenCL
Image2D to an OpenCL Buffer . What this means physically is dependent on the
hardware and drivers.

The OpenCL VA-API interop handles both of these problems transparently. It maps VA-
API frames to and from 2 Image2D s corresponding to the luminance (Y) and chroma
(UV) planes, and it uses an OpenCL kernel to convert between these images and a single
OpenCL Buffer in a BGR pixel format. Both of these steps take time, so the speed to
decode a video and convert each frame to a Mat with a BGR pixel format was about
260fps, compared to about 500fps for just decoding in VA-API.

libavcodec hw_frame_map

The OpenCV VA-API interop worked, but required patches to OpenCV and its build script,
and it took away control over how the NV12 pixel format was handled. So I took another
stab at doing the mapping with libavcodec. libavcodec has a lot more options for different
types of hardware acceleration and for mapping data between the different APIs, so I was
hopeful that there might be a way, now or in the future, to do it on non Intel GPUs.

As with the OpenCV VA-API interop, it was necessary to derive an OpenCL context from
the VA-API display so that the VA-API frames could be mapped to OpenCL. It was also
necessary to initialise OpenCV with the same OpenCL context as the libavcodec
hardware context so that they could both work with the same OpenCL memory.

// These contexts need to be used for the decoder
// and for mapping VA-API frames to OpenCL
AVBufferRef *vaapi_device_ctx;
AVBufferRef *ocl_device_ctx;

void init_opencl_contexts() {
    // Create a libavcodec VA-API context
    av_hwdevice_ctx_create(
        &vaapi_device_ctx,
        AV_HWDEVICE_TYPE_VAAPI,
        NULL,
        NULL,
        0
    );

    // Create a libavcodec OpenCL context from the VA-API context
    av_hwdevice_ctx_create_derived(
        &ocl_device_ctx,
        AV_HWDEVICE_TYPE_OPENCL,
        vaapi_device_ctx,
        0
    );

    // Initialise OpenCV with the same OpenCL context
    init_opencv_opencl_context(ocl_device_ctx);
}

void init_opencv_opencl_context(AVBufferRef *ocl_device_ctx) {
    AVHWDeviceContext *ocl_hw_device_ctx =
        (AVHWDeviceContext *) ocl_device_ctx->data;
    AVOpenCLDeviceContext *ocl_device_ocl_ctx =
        (AVOpenCLDeviceContext *) ocl_hw_device_ctx->hwctx;
    size_t param_value_size;

    // Get context properties
    clGetContextInfo(
        ocl_device_ocl_ctx->context,
        CL_CONTEXT_PROPERTIES,
        0,
        NULL,
        &param_value_size
    );
    cl_context_properties *props =
        (cl_context_properties *) malloc(param_value_size);
    clGetContextInfo(
        ocl_device_ocl_ctx->context,
        CL_CONTEXT_PROPERTIES,
        param_value_size,
        props,
        NULL
    );

    // Find the platform prop
    cl_platform_id platform = NULL;
    for (int i = 0; props[i] != 0; i = i + 2) {
        if (props[i] == CL_CONTEXT_PLATFORM) {
            platform = (cl_platform_id) props[i + 1];
        }
    }

    // Get the name of the platform
    clGetPlatformInfo(
        platform,
        CL_PLATFORM_NAME,
        0,
        NULL,
        &param_value_size
    );
    char *platform_name = (char *) malloc(param_value_size);
    clGetPlatformInfo(
        platform,
        CL_PLATFORM_NAME,
        param_value_size,
        platform_name,
        NULL
    );

    // Finally: attach the context to OpenCV
    ocl::attachContext(
        platform_name,
        platform,
        ocl_device_ocl_ctx->context,
        ocl_device_ocl_ctx->device_id
    );

    free(platform_name);
    free(props);
}

To make this work I had to fix a bug in FFmpeg where the header providing
AVOpenCLDeviceContext was not copied to the include directory.

Next, I attached the VA-API hardware context to the decoder context and configured the
decoder to output VA-API frames:

// AVCodecContext *decoder_ctx = avcodec_alloc_context3(decoder);
// ...etc

// Attach the previously created VA-API context to the decoder context
decoder_ctx->hw_device_ctx = av_buffer_ref(vaapi_device_ctx);

// Configure the decoder to output VA-API frames
decoder_ctx->get_format = get_vaapi_format;

// This just selects AV_PIX_FMT_VAAPI if present and errors otherwise
static enum AVPixelFormat get_vaapi_format(
    AVCodecContext *ctx,
    const enum AVPixelFormat *pix_fmts
);
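The implementation isn't shown here; a minimal version, following the pattern in FFmpeg's hw_decode example, could look like this:

static enum AVPixelFormat get_vaapi_format(
    AVCodecContext *ctx,
    const enum AVPixelFormat *pix_fmts
) {
    // The decoder offers a NONE-terminated list of formats it can output;
    // pick the VA-API one if it is there
    for (const enum AVPixelFormat *p = pix_fmts; *p != AV_PIX_FMT_NONE; p++) {
        if (*p == AV_PIX_FMT_VAAPI)
            return *p;
    }
    fprintf(stderr, "Decoder did not offer a VA-API pixel format\n");
    return AV_PIX_FMT_NONE;
}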

At this point the decoder was generating VA-API backed frames, so we could map them
to OpenCL frames on the GPU:

AVFrame* map_vaapi_frame_to_opencl_frame(AVFrame *vaapi_frame) {
    AVFrame *ocl_frame = av_frame_alloc();
    AVBufferRef *ocl_hw_frames_ctx;

    // Create an OpenCL hardware frames context from the VA-API
    // frame's frames context
    av_hwframe_ctx_create_derived(
        &ocl_hw_frames_ctx,
        AV_PIX_FMT_OPENCL,
        ocl_device_ctx, // <- The OpenCL device context from earlier
        vaapi_frame->hw_frames_ctx,
        AV_HWFRAME_MAP_DIRECT
    );

    // Assign this hardware frames context to our new OpenCL frame
    ocl_frame->hw_frames_ctx = av_buffer_ref(ocl_hw_frames_ctx);

    // Set the pixel format for our new frame to OpenCL
    ocl_frame->format = AV_PIX_FMT_OPENCL;

    // Map the contents of the VA-API frame to the OpenCL frame
    av_hwframe_map(ocl_frame, vaapi_frame, AV_HWFRAME_MAP_READ);

    return ocl_frame;
}

Internally, av_hwframe_map uses the same Intel OpenCL extension as the OpenCV VA-
API interop. However libavcodec supports many other types of hardware, and for all I
know there are or will be other options that work on non Intel GPUs. For example it might
work to first convert to a DRM hardware frame, then to an OpenCL frame.

Next we need to convert the OpenCL backed AVFrame into an OpenCL Mat :

UMat map_opencl_frame_to_mat(AVFrame *ocl_frame) {
    // Extract the two OpenCL Image2Ds from the OpenCL frame
    cl_mem cl_luma = (cl_mem) ocl_frame->data[0];
    cl_mem cl_chroma = (cl_mem) ocl_frame->data[1];

    size_t luma_w = 0;
    size_t luma_h = 0;
    size_t chroma_w = 0;
    size_t chroma_h = 0;

    clGetImageInfo(cl_luma, CL_IMAGE_WIDTH, sizeof(size_t), &luma_w, NULL);
    clGetImageInfo(cl_luma, CL_IMAGE_HEIGHT, sizeof(size_t), &luma_h, NULL);
    clGetImageInfo(cl_chroma, CL_IMAGE_WIDTH, sizeof(size_t), &chroma_w, NULL);
    clGetImageInfo(cl_chroma, CL_IMAGE_HEIGHT, sizeof(size_t), &chroma_h, NULL);

    // You can/should also check things like bit depth and channel order
    // (I'm assuming that the input is in NV12),
    // and you can probably avoid repeating this for each frame.

    // One byte per element: luma_h rows of luma followed by chroma_h rows
    // of interleaved chroma
    UMat dst;
    dst.create((int)(luma_h + chroma_h), (int) luma_w, CV_8U);

    cl_mem dst_buffer = (cl_mem) dst.handle(ACCESS_WRITE);
    cl_command_queue queue = (cl_command_queue) ocl::Queue::getDefault().ptr();
    size_t src_origin[3] = { 0, 0, 0 };
    size_t luma_region[3] = { luma_w, luma_h, 1 };
    // The chroma image has 2 channels, so each of its rows fills a full
    // luma_w byte row of the buffer
    size_t chroma_region[3] = { chroma_w, chroma_h, 1 };

    // Copy the contents of each Image2D to the right place in the
    // OpenCL buffer which backs the Mat
    clEnqueueCopyImageToBuffer(
        queue,
        cl_luma,
        dst_buffer,
        src_origin,
        luma_region,
        0,
        0,
        NULL,
        NULL
    );
    clEnqueueCopyImageToBuffer(
        queue,
        cl_chroma,
        dst_buffer,
        src_origin,
        chroma_region,
        luma_w * luma_h,
        0,
        NULL,
        NULL
    );

    // Block until the copying is done
    clFinish(queue);

    return dst;
}

I made a different choice from the OpenCV VA-API interop in this case – rather than
converting the image to the BGR pixel format immediately, I copied it in the simplest/
fastest way possible, preserving the NV12 pixel format. This makes sense to me because
many algorithms operate only on single channel images anyway, so it seems pointless to
throw away the readily separated luminance plane. If I want to convert the frame to
BGR, then I can do so with cvtColor, which also has an OpenCL implementation.
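For example (my own sketch, with frame_height standing in for the luma plane height): the top rows of the NV12 UMat can be used directly as a greyscale image, and a full BGR conversion is one cvtColor call away, with both staying on the GPU.

UMat nv12 = map_opencl_frame_to_mat(ocl_frame);

// The luminance plane doubles as a greyscale image, no conversion needed
UMat gray = nv12.rowRange(0, frame_height);

// Convert to BGR only when an algorithm actually needs colour
UMat bgr;
cvtColor(nv12, bgr, COLOR_YUV2BGR_NV12);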

The combination of libavcodec mapping between VA-API and OpenCL hardware frames,
OpenCL conversion from Image2D to Buffer , and cvtColor seems to be about as
fast as the OpenCV VA-API interop.

The video stabilisation part


Anyway, this was an interesting adventure. The next step is to actually use the OpenCV
API to do the change in lens projection and video stabilisation. That requires some more
experimentation, so I will leave this here for now. At least I’m confident that even a very
slow implementation will be miles faster than the 3fps I started out with!

P.S. In case you really want to see the source code, it’s here (probably in a mostly
working state).
