[Cin] wildest dreams: stream-copy and/or skipping avoidable pixel format conversions

Thu Apr 23 12:36:30 CEST 2020

Hi all,

coming back to cinelerra after a whole decade, I was really surprised to
find that even hardware accelerated decoding and encoding are implemented
now... Even though the landscape of cinelerra forks seems to become
increasingly confusing, THANK YOU to all who contributed in the last years
and of course especially to Adam, Einar and William for their tireless
commitment!

With this mail I want to put forward two ideas, or rather wildest dreams,
how cinelerra could improve performance considerably in cases where little
(or rather: "nothing") is to be done to the input video frames.

Motivation / Introduction

In the last weeks I had to deal a lot with "trivial editing" of long
and/or large files: chopping off a few seconds at the beginning or end,
sometimes concatenating two or three takes, removing a constant offset
between audio and video. As my videos had a constant group-of-pictures
(GOP) length, I knew the position of the keyframes, and shifting the cuts
around by up to half a GOP was no problem either. So I could completely
avoid reencoding the video stream by using the command line tool ffmpeg.
FFmpeg works perfectly fine for single "trivial edits", but the command
(and the required filters) admittedly becomes complex as soon as multiple
edits have to be made.
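
To illustrate the "shift the cut by at most half a GOP" idea, here is a
minimal sketch in Python. It assumes keyframes sit exactly at frame 0,
gop_frames, 2*gop_frames, ... The function name snap_to_keyframe and the
GOP length of 12 frames are made up for the example; only the 25 fps
match my material.

```python
from fractions import Fraction


def snap_to_keyframe(cut_seconds, gop_frames, fps):
    """Return the keyframe time (in seconds) nearest to cut_seconds.

    Assumption: constant GOP length, keyframes at frame 0, gop_frames,
    2 * gop_frames, ... Exact fractions avoid float rounding surprises.
    """
    gop_seconds = Fraction(gop_frames) / Fraction(fps)
    gop_index = round(Fraction(cut_seconds) / gop_seconds)
    return gop_index * gop_seconds


# Hypothetical example: 12-frame GOPs at 25 fps, desired cut at 10.3 s.
snapped = snap_to_keyframe(Fraction("10.3"), gop_frames=12, fps=25)
print(float(snapped))  # 10.08
```

The snapped position never moves by more than half a GOP (0.24 s in this
example), which is exactly the tolerance I could live with.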

Stream copying

So in my wildest dreams I dreamed of good old cinelerra learning how to do
stream-copying (read up on ffmpeg's -c:v copy and -c:a copy if you are not
familiar with that concept!). As stream-copying does not require decoding
the input, the achievable speed is typically bound by disk IO -- it can
be as fast as your SSD RAID at nearly negligible CPU cost.

Please note that stream-copying by definition only works if the packets
from the input are not to be altered at all and the output stream has
exactly the same encoding settings [1]. Only the container format would be
allowed to be different, as long as it can carry the unmodified stream.
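
For readers who have never used stream copying, here is a small sketch of
what a single "trivial edit" looks like. It only builds the ffmpeg
argument list (it does not run ffmpeg). The helper name
stream_copy_trim_command, the file names and the times are placeholders
for illustration; -ss, -t, -c:v copy and -c:a copy are regular ffmpeg
options.

```python
import shlex


def stream_copy_trim_command(src, dst, start_seconds, duration_seconds):
    """Build an ffmpeg command that trims src without reencoding.

    -ss before -i seeks in the input, -t limits the duration, and the
    copy "codecs" pass the packets through unmodified. The cut can
    therefore only start cleanly on a keyframe.
    """
    return [
        "ffmpeg",
        "-ss", f"{start_seconds:.3f}",
        "-i", src,
        "-t", f"{duration_seconds:.3f}",
        "-c:v", "copy",
        "-c:a", "copy",
        dst,
    ]


cmd = stream_copy_trim_command("INPUT.MOV", "OUTPUT.mkv", 10.08, 60)
print(shlex.join(cmd))
```

Note that the container changes here (MOV to Matroska) while the streams
themselves stay untouched, which is the one freedom stream copying leaves.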

Implementing this in cinelerra would definitely be a huge, nontrivial
change. It would require at least detecting the input encoding settings
and matching the output settings accordingly, plus a shortcut around the
whole 'decoding->camera->intermediate->projector->encoding' pipeline
wherever no effects (in the wider sense!) are active and whole GOPs can be
stream-copied. I haven't even looked up yet whether it would be feasible
to adapt any of the current rendering/muxing backends to accept "already
encoded" input (being forwarded through the shortcut).
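
To make the "whole GOPs could be stream-copied" part a bit more concrete,
here is a rough sketch of the decision such a shortcut would have to take
for one effect-free segment of the timeline. It is not based on any
cinelerra code. The function plan_segment is hypothetical, and it assumes
a constant GOP length with keyframes at multiples of gop, closed GOPs and
matching encoder settings. It is also deliberately conservative: partial
GOPs at both ends are simply reencoded.

```python
def plan_segment(start, end, gop):
    """Split source frames [start, end) into reencode and copy parts.

    Returns a list of (action, first_frame, end_frame) tuples, where
    end_frame is exclusive and action is "copy" or "reencode".
    """
    first_key = -(-start // gop) * gop  # first keyframe at or after start
    last_key = (end // gop) * gop       # last GOP boundary at or before end
    if first_key >= last_key:
        # No complete GOP inside the segment: everything is reencoded.
        return [("reencode", start, end)]
    plan = []
    if start < first_key:
        plan.append(("reencode", start, first_key))
    plan.append(("copy", first_key, last_key))
    if last_key < end:
        plan.append(("reencode", last_key, end))
    return plan


# Hypothetical example: 12-frame GOPs, segment uses source frames 5..39.
for action, first, last in plan_segment(5, 40, 12):
    print(action, first, last)
```

For long takes with only a few cuts, nearly all frames end up in the
"copy" part, which is where the speedup would come from.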

Nevertheless, I wanted to share this vision, just in case someone should be
on the look-out for a *real* challenge! ;)

Transcoding bottlenecks

Coming down to earth, I tested the hardware accelerated decoding and
encoding in cinelerra-gg. This apparently works. Having shifted the heavy
codec work away from the CPU, new bottlenecks appear, e.g. pixel format
conversions and memory operations.

I profiled cinelerra-gg with operf while rendering, using an Intel UHD
Graphics 630 GPU (gen9.5) for HW decoding and an Nvidia Quadro P2000
(Pascal nvenc) for encoding.

The most time-consuming parts appear to be:

When setting format to YUV 8bit:
17.7664  BC_Xfer::xfer_yuv888_to_yuv420p(unsigned int, unsigned int)
13.1723  BC_Xfer::xfer_yuv444p_to_yuv888(unsigned int, unsigned int)
10.7678  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
10.7615  ff_hscale8to15_4_ssse3
 8.5718  BC_Xfer::xfer_yuv888_to_bgr8888(unsigned int, unsigned int)
 2.8518  ff_yuv2plane1_8_avx

When setting format to RGB 8bit:
17.8958  yuv2rgb24_1_c
13.4321  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
10.9851  lumRangeToJpeg_c
10.2374  ff_hscale8to15_4_ssse3
 8.7581  ff_hscale14to15_4_ssse3
 7.5900  ff_rgb24ToY_avx
 7.1143  rgb24ToUV_half_c
 4.8434  chrRangeToJpeg_c
 4.7945  BC_Xfer::xfer_rgb888_to_bgr8888(unsigned int, unsigned int)
 2.0201  ff_yuv2plane1_8_avx

When setting format to RGBA 8bit:
16.3639  yuv2rgbx32_1_1_c
14.8711  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
10.0083  ff_hscale8to15_4_ssse3
 9.1448  lumRangeToJpeg_c
 8.7119  ff_rgbaToY_avx
 8.5619  ff_hscale14to15_4_ssse3
 8.2640  rgb32ToUV_half_c
 5.1650  BC_Xfer::xfer_rgbx8888_to_bgr8888(unsigned int, unsigned int)
 5.1056  chrRangeToJpeg_c
 1.9289  ff_yuv2plane1_8_avx

When setting format to RGB-Float:
15.7817  BC_Xfer::xfer_rgba_float_to_rgba16161616(unsigned int, unsigned int)
15.4870  BC_Xfer::xfer_rgba8888_to_rgba_float(unsigned int, unsigned int)
12.9261  rgb64LEToY_c
 7.4284  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
 6.6232  av_pix_fmt_desc_get
 6.0252  rgb64LEToUV_half_c
 3.7100  yuv2rgbx32_1_1_c
 2.9902  BC_Xfer::xfer_rgbx_float_to_bgr8888(unsigned int, unsigned int)
 2.1572  ff_hscale8to15_4_ssse3
 2.0333  lumRangeToJpeg_c
 1.8625  ff_hscale16to15_4_ssse3

During rendering, two of eight cores were at 70-85% (according to htop). As
neither reached 100% on its own, but their sum is above 100%, I'm not
really sure whether rendering is currently CPU bound or rather memory
bound. If someone knows a good tool to distinguish between these two
cases, please tell me! In case this should be CPU bound, multithreading in
this part of the code might help, as I have (more than) 6 idling CPU cores
left on this machine ;)

With RGBA 8bit, transcoding (i.e. rendering a timeline consisting of a
single "unmodified" input clip) a FullHD 25p h264 video with HW
acceleration in cinelerra now takes only a quarter of the playback time (in
ffmpeg notation: speed=4x).

Comparison to ffmpeg transcoding

While this might seem impressive at first sight (this is equivalent to 4K
transcoding in realtime!), it is still only a fraction of the speed ffmpeg
achieves for the same transcoding path (decoding on intel, encoding on nvidia):

ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i INPUT.MOV \
    -c:v h264_nvenc -preset:v slow -c:a copy OUTPUT.mkv

is still nearly four times faster at 375 fps (speed = 15x), with a single
CPU core used that does not even exceed 50% according to htop. Using
-preset:v medium, 450 fps are reached (speed = 18x), and if speed is more
important than compression quality, -preset:v fast allows for a whopping
500 fps (speed = 20x) with a single CPU core at 60-70% according to htop.
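
The relation between the numbers quoted here is simple arithmetic, spelled
out in the following sketch. It assumes that "speed" is just the encoding
frame rate divided by the 25 fps of the source, and that "4K" means
3840x2160 (UHD); the function name speed_factor is made up.

```python
SOURCE_FPS = 25  # the test clip is FullHD 25p


def speed_factor(fps, source_fps=SOURCE_FPS):
    """ffmpeg-style speed: processed frames per second of playback."""
    return fps / source_fps


# cinelerra at speed=4x processes 4 * 25 = 100 frames per second.
cinelerra_fps = 4 * SOURCE_FPS

for preset, fps in [("slow", 375), ("medium", 450), ("fast", 500)]:
    print(preset, speed_factor(fps), fps / cinelerra_fps)
# slow 15.0 3.75
# medium 18.0 4.5
# fast 20.0 5.0

# A UHD frame has four times the pixels of a FullHD frame, so 4x FullHD
# is the same pixel throughput as UHD in realtime.
uhd_to_fullhd = (3840 * 2160) / (1920 * 1080)
print(uhd_to_fullhd)  # 4.0
```

The last column (3.75 to 5) is the factor by which ffmpeg is ahead of
cinelerra on this transcoding path.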

Of course this is comparing apples to oranges, especially when comparing it
to stream-copying (instead of reencoding), which can still be at least an
order of magnitude (!) faster. But I hope it helps to see that there is
still quite some room left for improvements in (corner?) cases where the
input video is not altered at all.

Feasibility considerations

Regarding the feasibility of improving cinelerra within finite time, I
would imagine that skipping avoidable conversions and reducing memory
operations is the lower hanging fruit compared to stream copying. However,
I think this would still imply a nontrivial extension of the internals:
frames that are not affected by any effect in the wider sense (including
scaling and translation through camera/projector) would have to be
forwarded directly from input to output, possibly in pixel formats the
rest of cinelerra (e.g. camera/projector/effects) doesn't understand at
all.

Ideally, cinelerra would of course implement both: where whole GOPs can be
stream-copied, do this. Where cuts do not align with GOP borders, effects
affect only some frames within a GOP, or input and output do not use the
same codec (+ settings): avoid at least as many pixel format conversions as
possible, especially for the effectively unaltered frames.

Disclaimer

I'm not working on implementing either of these two ideas and I don't
expect that I will ever find time to do so. So please take away the idea
you like.

Conclusion

With hardware acceleration used in cinelerra for decoding AND encoding,
pixel format conversions and memory operations appear to be the new
bottleneck when effectively transcoding a video. Avoiding them wherever
possible is expected to allow for up to 4-5 times faster rendering when
effectively transcoding.

If the output settings are exactly[1] equal to the input's codec settings,
(selective) stream-copying could achieve two orders of magnitude faster
rendering, but is expected to require a major modification of cinelerra's
internals.

As both of my wildest dreams imply nontrivial, possibly major changes, I
could very likely be totally wrong about which is easier to implement, so
please don't quote me on that ;)

Best regards and happy hacking,
Simeon

[1] the "exactly" can be weakened sometimes, but going into this in detail
would take us way too far.