Transforming compressed video on the GPU using OpenCV
In a previous post, I described various FFmpeg filters which I experimented with for the
purpose of lens correction, and I mentioned I might follow it up with a similar post about
video stabilisation. This post doesn’t quite fulfill that promise, but at least I have something
to report about GPU acceleration!
For background: videos that I recorded of my dodgeball matches had not only lens
distortion, but also unwanted shaking. Sometimes the balls would hit the net that the
camera was attached to, and the video became very shaky. To fix this I used FFmpeg's
vid.stab filter, which mostly worked, but it had some drawbacks:
• It required 2 passes: one for detection of camera movement, and one to compensate
for it.
• It was very slow. The combination of perspective remapping and stabilisation
resulted in a framerate of about 3fps. This meant that a 40 minute dodgeball match
took half a day to process!
• It created “wobbling” when the camera was shaking.
Wobbling
The model used by vid.stab to represent the effect of camera movement is a limited affine
transformation, including only translation, rotation, and scaling. In my application, the main
way that the camera moved was by twisting – i.e. the camera remained at the same
location, but it turned to face different directions as it shook. There was little rotation in
practice, and little change in the position of the camera, so I don’t think that vid.stab
detected much rotation or scaling. Instead I think it applied translation (basically moving a
rectangle in 2 dimensions) in order to correct for changes in the angle of the camera.
The problem is that translation is not what happens to the image when you twist a camera
– what happens is a perspective transformation. Close to the center of the image or at a
high zoom level translation is a good approximation, but it gets worse further away from
the center of the image and with a wider field of view. My camera had a very wide field of
view, so the effect was quite significant.
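A quick pinhole camera calculation (my own illustration, with an assumed focal length in
pixels) shows why translation only works near the centre: a point at horizontal image
coordinate x lies at angle atan(x/f) off the optical axis, so twisting the camera by
`pan` radians moves it to f*tan(atan(x/f) + pan), and the shift depends on where the
point is in the frame.

```cpp
#include <cassert>
#include <cmath>

// Pinhole model: how many pixels a point at image coordinate x moves when
// the camera pans by `pan` radians. f is the focal length in pixels.
double shiftAfterPan(double x, double f, double pan) {
    return f * std::tan(std::atan(x / f) + pan) - x;
}
```

With f = 1000 and a 0.05 radian twist, a point at the centre moves about 50px while a
point 1000px off axis (45 degrees at this focal length) moves about 105px, so no single
translation can correct both at once: the edges "wobble".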
Speed
There were a few reasons why the processing speed was so slow. One was that the
expensive (and destructive) interpolation step was happening twice – once to correct for
lens distortion, and then again for stabilisation. No matter how optimised the interpolation
process was, this was a waste of time. In theory there is no reason not to perform the
interpolation for both steps at once, but this wasn’t supported by the FFmpeg filters, and
probably wouldn’t even make sense to do with the FFmpeg filter API.
Another opportunity was to use the GPU to speed up the transformations. FFmpeg
supports the use of GPUs with various APIs. The easiest thing to get working is
compression and decompression. On Linux the established API for this is VA-API, which
FFmpeg supports. I was already using VA-API to decompress the H.264 video from my
GoPro camera, and to compress the H.264/H.265 output videos I was creating, but the
CPU was still needed for the projection change and video stabilisation.
For more general computation on GPUs, there are various other APIs, including Vulkan
and OpenCL. Although there are some FFmpeg filters that support these APIs, neither the
lensfun filter nor the vid.stab filter does. The consequence for me was that during the processing,
the decoded video frames (a really large amount of data) had to be copied from the GPU
memory to the main memory so that the CPU based filters could perform their tasks, and
then the transformed frames copied back to the GPU for encoding.
This copying takes significant time. For example, I found that an FFmpeg pipeline which
decoded and reencoded a video entirely on the GPU ran at about 380fps, whereas
modifying that pipeline to copy the frames to the main memory and back again dropped
this to about 100fps.
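Those two throughput numbers let us estimate what the round trip itself costs per frame.
This is my own back-of-the-envelope calculation, assuming the rest of the pipeline is
unchanged:

```cpp
#include <cassert>

// Estimate the per-frame cost (in ms) of copying frames between GPU and
// main memory, given throughput with and without the copies.
double copyCostMs(double fpsWithoutCopies, double fpsWithCopies) {
    return 1000.0 * (1.0 / fpsWithCopies - 1.0 / fpsWithoutCopies);
}
```

At 380fps versus 100fps this comes out at roughly 7.4ms per frame spent on copies alone.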
Attempt 2: OpenCV
At this point I felt like I had exhausted my ability to solve the problem with scripts that
called the FFmpeg CLI, and that to make more progress I would need to work at a lower
level.
I knew that there were methods in OpenCV to do things like perspective remapping, and
that many of its more popular methods had implementations that operated directly on
GPU memory with OpenCL. In order to take advantage of this, I needed to take the VA-
API frames from the GPU video decoder and convert them to OpenCV Mat objects. To
make the process run as fast as possible, I wanted to do this entirely on the GPU, without
copying frames to the main memory at any point.
Mat frame;
VideoCapture cap(
    "filesrc location=/path/to/input.mp4 ! qtdemux ! h264parse ! vaapih264dec ! appsink sync=false",
    CAP_GSTREAMER
);
while (cap.read(frame)) {
    // Do stuff with frame
}
Although the decoding was done with VA-API, the resulting frame was not backed by
GPU memory – instead the VideoCapture API copied the result to the main memory
before returning it.
Aside: recently, support for hardware codecs has been added to the VideoCapture
and VideoWriter APIs. Although this would simplify using VA-API with the GStreamer
backend (and make it possible with the FFmpeg backend), it still doesn’t return hardware
backed frames. You can see the FFmpeg capture implementation copying the data to
main memory in retrieveFrame and vice versa in writeFrame . Similarly in the
GStreamer backend it looks like the buffer is always copied to main memory in
retrieveFrame .
state = AWAITING_INPUT_PACKETS
while (state != COMPLETE):
    switch (state):
        case COMPLETE:
            break
        case FRAMES_AVAILABLE:
            while (frame = get_raw_frame_from_decoder()):
                process_frame()
            state = AWAITING_INPUT_PACKETS
            break
        case AWAITING_INPUT_PACKETS:
            if (input_exhausted()):
                state = COMPLETE
            else:
                send_demuxed_video_packet_to_decoder()
                state = FRAMES_AVAILABLE
            break
This was mostly the result of copying examples like this one (except for the part that
copies the VA-API buffer to main memory).
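The state machine above can be exercised with decoder stubs. Here is a toy,
self-contained version of mine, where a queue stands in for
avcodec_send_packet/avcodec_receive_frame and each "packet" yields exactly one
"frame" (which real codecs don't guarantee):

```cpp
#include <cassert>
#include <queue>

enum State { AWAITING_INPUT_PACKETS, FRAMES_AVAILABLE, COMPLETE };

int run_decode_loop(int num_packets) {
    std::queue<int> decoder_output;  // frames waiting to be collected
    int packets_sent = 0;
    int frames_processed = 0;
    State state = AWAITING_INPUT_PACKETS;
    while (state != COMPLETE) {
        switch (state) {
            case FRAMES_AVAILABLE:
                // get_raw_frame_from_decoder() until it runs dry
                while (!decoder_output.empty()) {
                    decoder_output.pop();
                    frames_processed++;  // process_frame()
                }
                state = AWAITING_INPUT_PACKETS;
                break;
            case AWAITING_INPUT_PACKETS:
                if (packets_sent == num_packets) {  // input_exhausted()
                    state = COMPLETE;
                } else {
                    // send_demuxed_video_packet_to_decoder(); this toy
                    // decoder emits one frame per packet
                    decoder_output.push(packets_sent++);
                    state = FRAMES_AVAILABLE;
                }
                break;
            default:
                break;
        }
    }
    return frames_processed;
}
```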
I could see two options for using this extension. One was to use OpenCV's interop with
VA-API, and the other was to first map from VA-API to OpenCL in libavcodec. On the first
attempt with libavcodec I couldn't find a way to expose the OpenCL memory, so I chose
the OpenCV VA-API interop option.
Making it work required some additional effort. The VA-API and OpenCL APIs both refer
to memory on a specific GPU and driver, and also with a specific scope (a “display” in the
case of VA-API and a “context” for OpenCL). So it’s necessary to initialise the scope of
each API such that the memory is compatible and can be mapped between the APIs. The
easiest way seemed to be to choose a DRM device, use it to create a VA-API
VADisplay , and then use this to create an OpenCL context (which the OpenCV VA-API
interop handles automatically). The code looked something like this:
#include <fcntl.h>
#include <unistd.h>

#include <opencv2/core/va_intel.hpp>
extern "C" {
#include <va/va_drm.h>
}

void initOpenClFromVaapi() {
    int drm_device = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
    VADisplay vaDisplay = vaGetDisplayDRM(drm_device);
    // Create an OpenCL context compatible with this VADisplay.
    // Keep the DRM fd open until the display has been initialised.
    va_intel::ocl::initializeContextFromVA(vaDisplay, true);
    close(drm_device);
}
The OpenCV API handles the OpenCL context in an implicit way: after
initializeContextFromVA you can expect that all the other functionality in OpenCV
that uses OpenCL will use the VA-API compatible OpenCL context.
From there it was reasonably simple to create OpenCL backed Mat objects from VA-API
backed AVFrames:
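The snippet itself isn't shown here, but using OpenCV's VA-API interop it presumably
looked something like this sketch. The function name is my own, vaDisplay is the
display created during initialisation, and I'm relying on FFmpeg's convention of
storing the VASurfaceID in data[3] of a VA-API backed AVFrame:

```cpp
#include <opencv2/core/va_intel.hpp>

// Sketch: map a decoded VA-API surface into a Mat via OpenCV's interop.
cv::Mat vaapiFrameToMat(VADisplay vaDisplay, AVFrame *frame) {
    // For VA-API frames, FFmpeg stores the surface ID in data[3]
    VASurfaceID surface = (VASurfaceID)(uintptr_t)frame->data[3];
    cv::Mat bgr;
    cv::va_intel::convertFromVA(vaDisplay, surface,
                                cv::Size(frame->width, frame->height), bgr);
    return bgr;  // BGR pixels, converted by the interop
}
```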
This method worked, but it wasn’t as fast as I had hoped. After reading the code I had a
reasonably good idea why.
Video codecs like H.264 (and by extension APIs like VA-API) usually deal with video in
NV12 format. NV12 is a semi-planar format: instead of storing all of the colour channels
of each pixel together, it stores a full resolution plane for the luminance/brightness of
the whole image, followed by a separate half resolution plane for the chroma/colour
(which interleaves 2 channels).
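Concretely (my own illustration), for an even-sized W x H frame the Y plane is W*H bytes
and the interleaved UV plane is half that, because each chroma channel is stored at half
resolution in both dimensions:

```cpp
#include <cassert>
#include <cstddef>

// Byte layout of an NV12 frame with even dimensions: a full resolution
// luma (Y) plane followed by an interleaved chroma (UV) plane in which
// U and V are each at half resolution horizontally and vertically.
struct Nv12Layout {
    std::size_t luma_bytes;
    std::size_t chroma_bytes;
};

Nv12Layout nv12_layout(std::size_t w, std::size_t h) {
    return { w * h, (w / 2) * (h / 2) * 2 };
}
```

For a 1920x1080 frame that's 2073600 + 1036800 = 3110400 bytes, i.e. 1.5 bytes per
pixel, half the size of a 3 byte per pixel BGR image.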
Also, OpenCL has various different types of memory, and they cannot all be treated the
same way. OpenCV Mat objects when backed by OpenCL memory use an OpenCL
Buffer, whereas VA-API works with instances of Image2D. So in order to create an
OpenCL backed Mat from a VA-API frame, it's necessary to first remap from an OpenCL
Image2D to an OpenCL Buffer. What this means physically is dependent on the
hardware and drivers.
The OpenCL VA-API interop handles both of these problems transparently. It maps VA-
API frames to and from 2 Image2D s corresponding to the luminance (Y) and chroma
(UV) planes, and it uses an OpenCL kernel to convert between these images and a single
OpenCL Buffer in a BGR pixel format. Both of these steps take time, so the speed to
decode a video and convert each frame to a Mat with a BGR pixel format was about
260fps, compared to about 500fps for just decoding in VA-API.
libavcodec av_hwframe_map
The OpenCV VA-API interop worked, but required patches to OpenCV and its build script,
and it took away control over how the NV12 pixel format was handled. So I took another
stab at doing the mapping with libavcodec. libavcodec has a lot more options for different
types of hardware acceleration and for mapping data between the different APIs, so I was
hopeful that now or in the future there might be a way to do it on non-Intel GPUs.
As with the OpenCV VA-API interop, it was necessary to derive an OpenCL context from
the VA-API display so that the VA-API frames could be mapped to OpenCL. It was also
necessary to initialise OpenCV with the same OpenCL context as the libavcodec
hardware context so that they could both work with the same OpenCL memory.
void init_opencl_contexts() {
    // ...
}
To make this work I had to fix a bug in FFmpeg where the header providing
AVOpenCLDeviceContext was not copied to the include directory.
Next, I attached the VA-API hardware context to the decoder context and configured the
decoder to output VA-API frames:
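The code for this step isn't shown here, but the standard libavcodec pattern for it looks
roughly like the following sketch (names such as decoder_ctx and vaapi_device_ctx are
my assumptions, not from the original):

```cpp
// Sketch of standard libavcodec hwaccel setup (assumed variable names).
// Pick VA-API from the pixel formats the decoder offers:
static enum AVPixelFormat select_vaapi(AVCodecContext *ctx,
                                       const enum AVPixelFormat *formats) {
    for (const enum AVPixelFormat *p = formats; *p != AV_PIX_FMT_NONE; p++) {
        if (*p == AV_PIX_FMT_VAAPI)
            return *p;
    }
    return AV_PIX_FMT_NONE;
}

// ... then, during decoder setup:
// decoder_ctx->hw_device_ctx = av_buffer_ref(vaapi_device_ctx);
// decoder_ctx->get_format = select_vaapi;
```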
At this point the decoder was generating VA-API backed frames, so we could map them
to OpenCL frames on the GPU:
    // ... map the VA-API frame to an OpenCL frame with av_hwframe_map() ...
    return ocl_frame;
}
Internally, av_hwframe_map uses the same Intel OpenCL extension as the OpenCV VA-
API interop. However libavcodec supports many other types of hardware, and for all I
know there are or will be other options that work on non-Intel GPUs. For example it might
work to first convert to a DRM hardware frame, and then to an OpenCL frame.
Next we need to convert the OpenCL backed AVFrame into an OpenCL Mat :
UMat map_opencl_frame_to_mat(AVFrame *ocl_frame) {
    // Extract the two OpenCL Image2Ds from the OpenCL frame
    cl_mem luma_image = (cl_mem) ocl_frame->data[0];
    cl_mem chroma_image = (cl_mem) ocl_frame->data[1];
    size_t luma_w = 0;
    size_t luma_h = 0;
    size_t chroma_w = 0;
    size_t chroma_h = 0;
    // ... query the dimensions of each image with clGetImageInfo() ...
    // You can/should also check things like bit depth and channel order
    // (I'm assuming that the input is in NV12),
    // and you can probably avoid repeating this for each frame.
    UMat dst;
    dst.create(luma_h + chroma_h, luma_w, CV_8U);
    // ... copy both images into the OpenCL buffer behind dst ...
    return dst;
}
I made a different choice to the OpenCV VA-API interop in this case – rather than
converting the image to the BGR pixel format immediately, I copied it in the simplest/
fastest way possible, preserving the NV12 pixel format. This makes sense to me because
there are many algorithms that operate only on single channel images anyway, so it
seems pointless to throw away the luminance plane. If I want to convert the frame to
BGR, then I can do so with cvtColor, which also has an OpenCL implementation.
The combination of libavcodec mapping between VA-API and OpenCL hardware frames,
OpenCL conversion from Image2D to Buffer , and cvtColor seems to be about as
fast as the OpenCV VA-API interop.
P.S. In case you really want to see the source code, it’s here (probably in a mostly
working state).
Daniel Playfair Cal's Blog Rage, rage against the dying of the light
daniel.playfair.cal@gmail.com hedgepigdaniel