[Cin] FFmpeg and opencl (in case anyone with OpenCL hw will want to experiment)

Mon May 13 12:42:36 CEST 2019

Hello again. While I successfully compiled Cin-GG with OpenCL support via ffmpeg, I have no hardware to test it on (in theory I can install a CPU-only OpenCL implementation, but then it will be sloooow ...)
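
For anyone who wants to check their own build first, here is a small sketch that lists the OpenCL filters an ffmpeg binary was compiled with. It assumes an ffmpeg binary built with the same configuration is in PATH (the ffmpeg bundled with Cin-GG may not be installed that way), and the helper name list_opencl_filters is made up for this mail:

```shell
# Hypothetical helper: reads `ffmpeg -filters` output on stdin and
# prints the names of all filters whose name ends in "_opencl".
# In that listing the first column holds flags, the second the filter name.
list_opencl_filters() {
    awk '$2 ~ /_opencl$/ { print $2 }'
}

# Only query ffmpeg if a binary is actually available in PATH.
if command -v ffmpeg >/dev/null 2>&1; then
    ffmpeg -hide_banner -filters 2>/dev/null | list_opencl_filters
else
    echo "ffmpeg not found in PATH" >&2
fi
```

An empty list would suggest that this particular binary was configured without --enable-opencl.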

Still, I found something curious that I want to share:

"Enhanced Video Processing with FFMPEG and OpenCL"
Kelvin Phan, Massey University, 2016

Unfortunately, it seems the main page where this file was hosted is down at the moment, and archive.org has no copy of it either ...

So the Google cache was used:

http://webcache.googleusercontent.com/search?q=cache:H-ISEoUS43UJ:https://kelvinphan.weebly.com/uploads/2/6/9/3/26933593/kelvin_phan_-_master%2560s_report.pdf%2BEnhanced+Video+Processing+with+FFMPEG+and+OpenCL&hl=ru&ct=clnk

The whole 45 pages are a bit too big for an email body, so I just copy a few excerpts, ending with the conclusion:

========

1. Introduction:
In the old implementation (OpenCL version 1.0 to 1.2), the host and the device couldn’t share the
same address space. To transfer memory data between host and device, buffers need to be created.
One buffer for transferring data from host to device and another buffer for transferring data back from
device to host. An offset needs to be passed to and from devices for accessing a location within a
buffer such as accessing a region within an image. A host memory pointer cannot be used on the
device.
In OpenCL version 2.0, communication between host and device does not need a copied
buffer. With Shared Virtual Memory (SVM), host and device may use the same virtual address space. This
address space can be effectively used to share virtual pointers created in this space. The data structures
which are based on those memory pointers can be shared between host and device.

[...]

• OS: Ubuntu 14.04 (64 bits)
• CPU: AMD Phenom(tm) II X4 955 Processor × 4
• GPU: AMD Radeon (TM) R7 360 Series

[...]

On the other hand, the same kind of result was expected with the un-sharp filter, but
it did not happen. It is true that the filter with OpenCL enabled performed better than
without OpenCL, but the speed-up was not even nearly the two times that we see
with the de-shake filter. A possible reason is that the program spent much more time
transferring data between host buffers and device buffers, and there are many such
transfers to do.
==============

Hm, so it was not as big a win as the author hoped ... yet those filters appeared in mainline ffmpeg:

http://ffmpeg.org/pipermail/ffmpeg-devel/2017-November/219828.html
[FFmpeg-devel] [PATCH 00/15] OpenCL infrastructure, filters

-----------------
Changes since the last time this was posted:
* Add unsharp filter (to replace existing unsharp).
* Remove old experimental API.
* Miscellaneous fixes.

Now also tested with AMD OpenCL on Windows (DXVA2 mapping works nicely, D3D11 does not because it wants the Intel extension for NV12 support).

Thanks,

- Mark

Silly example using everything (for i965 VAAPI + Beignet):

./ffmpeg_g -y -init_hw_device vaapi=va:/dev/dri/renderD128 -init_hw_device opencl=ocl@va -hwaccel vaapi -hwaccel_device va -hwaccel_output_format vaapi -i in.mp4 -f image2 -r 1 -i overlays/%d.png -an -filter_hw_device ocl -filter_complex '[1:v]format=yuva420p,hwupload[x2]; [0:v]scale_vaapi=1280:720:yuv420p,hwmap[x1]; [x1][x2]overlay_opencl=0:0,program_opencl=test.cl:rotate_image,unsharp_opencl=lx=17:ly=17:la=5,hwmap=derive_device=vaapi:reverse=1,scale_vaapi=1280:720:nv12' -c:v h264_vaapi -frames:v 1000 out.mp4

test.cl:

__kernel void rotate_image(__write_only image2d_t dst,
                           __read_only  image2d_t src,
                           unsigned int index)
{
  const sampler_t sampler = (CLK_NORMALIZED_COORDS_FALSE |
                             CLK_FILTER_LINEAR);

  float angle = (float)index / 100;

  float2 dst_dim = convert_float2(get_image_dim(dst));
  float2 src_dim = convert_float2(get_image_dim(src));

  float2 dst_cen = dst_dim / 2;
  float2 src_cen = src_dim / 2;

  int2   dst_loc = (int2)(get_global_id(0), get_global_id(1));

  float2 dst_pos = convert_float2(dst_loc) - dst_cen;
  float2 src_pos = {
    cos(angle) * dst_pos.x - sin(angle) * dst_pos.y,
    sin(angle) * dst_pos.x + cos(angle) * dst_pos.y
  };
  src_pos = src_pos * src_dim / dst_dim;

  float2 src_loc = src_pos + src_cen;

  if (src_loc.x < 0         || src_loc.y < 0 ||
      src_loc.x > src_dim.x || src_loc.y > src_dim.y)
    write_imagef(dst, dst_loc, 0.5);
  else
    write_imagef(dst, dst_loc, read_imagef(src, sampler, src_loc));
}
-----------

and another patch was added in 2018, for tonemapping:

https://patchwork.ffmpeg.org/patch/9032/
[FFmpeg-devel] lavfi: add opencl tonemap filter.

--------------
This filter does HDR(HDR10/HLG) to SDR conversion with tone-mapping.

An example command to use this filter with vaapi codecs:
FFMPEG -init_hw_device vaapi=va:/dev/dri/renderD128 -init_hw_device \
opencl=ocl@va -hwaccel vaapi -hwaccel_device va -hwaccel_output_format \
vaapi -i INPUT -filter_hw_device ocl -filter_complex \
'[0:v]hwmap,tonemap_opencl=t=bt2020:tonemap=linear:format=p010[x1]; [x1]hwmap=derive_device=vaapi:reverse=1' \
-c:v hevc_vaapi -profile 2 OUTPUT

-------------------

So their command-line examples are quite complex, and I am not sure whether the FFmpeg libraries (libavcodec/libavfilter) can construct such a pipeline automagically when called by external code (cinelerra-GG in our case) ...
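
For experimenting without VAAPI, a simpler pipeline should be possible: decode in software, upload the frames to the OpenCL device, filter, and download them again. This is only a sketch and untested here (no OpenCL hardware). The function name opencl_unsharp and the default device index 0.0 (platform.device) are assumptions, and the unsharp parameters are simply reused from the example above:

```shell
# Hypothetical helper (not part of FFmpeg or Cinelerra-GG).
# Usage: opencl_unsharp INPUT OUTPUT [PLATFORM.DEVICE]
opencl_unsharp() {
    # $3 selects the OpenCL platform.device index; 0.0 is only a guess
    # and may have to be changed on systems with several platforms.
    ffmpeg -y \
        -init_hw_device "opencl=ocl:${3:-0.0}" \
        -filter_hw_device ocl \
        -i "$1" \
        -vf 'format=yuv420p,hwupload,unsharp_opencl=lx=17:ly=17:la=5,hwdownload,format=yuv420p' \
        -an "$2"
}

# Example call (commented out so that sourcing this file runs nothing):
# opencl_unsharp in.mp4 out.mp4
```

The hwupload/hwdownload pair is exactly the kind of host-to-device copying that the report blames for the modest unsharp speed-up, so timing this against plain unsharp would be an interesting comparison.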