[Cin] wildest dreams: stream-copy and/or skipping avoidable pixel format conversions

Andrew Randrianasulu randrianasulu at gmail.com
Thu Apr 23 16:45:48 CEST 2020


On Thursday 23 April 2020 13:36:30, Simeon Völkel wrote:
> Hi all,
> 
> coming back to cinelerra after a whole decade, I was really surprised to
> find that even hardware accelerated decoding and encoding is now implemented...
> Even though the landscape of cinelerra-forks seems to become increasingly
> confusing, THANK YOU, to all who contributed in the last years and of
> course especially to Adam, Einar and William for their tireless commitment!
> 
> 
> With this mail I want to put forward two ideas, or rather wildest dreams, of
> how cinelerra could improve performance considerably in cases where little
> (or rather: "nothing") is to be done to the input video frames.
> 
> 
> 
> Motivation / Introduction
> 
> In the last weeks I had to deal a lot with "trivial editing" of long
> and/or large files: chopping off a few seconds at the beginning or end,
> sometimes concatenating two or three takes, removing a constant offset
> between audio and video. As my videos had a constant group-of-pictures
> (GOP) length, I knew the position of the keyframes, and since shifting the
> cuts around by half a GOP was no problem either, I could completely avoid
> reencoding the video stream using the command line tool ffmpeg. FFmpeg
> works perfectly fine for single "trivial edits", but the command (and
> required filters) becomes admittedly complex as soon as multiple edits have
> to be made.
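> 
> For illustration only (filenames made up), a single such lossless cut,
> dropping roughly the first five seconds, looks like
> 
>   ffmpeg -ss 00:00:05 -i take1.mov -c:v copy -c:a copy take1_trimmed.mov
> 
> (with stream copy the actual cut lands on a keyframe); chaining several
> such cuts and joining the pieces again (e.g. via the concat demuxer) is
> where it gets unwieldy.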
> 
> 
> 
> Stream copying
> 
> So in my wildest dreams I dreamed of good old cinelerra learning how to do
> stream-copying (read up on ffmpeg's -c:v copy and -c:a copy if you are not
> familiar with that concept!). As stream-copying does not require decoding
> the input, the achievable speed is typically bound by disk IO -- it can
> be as fast as your SSD RAID at nearly negligible CPU cost.
> 
> Please note that stream-copying by definition only works if the packets
> from the input are not to be altered at all and the output stream has
> exactly the same encoding settings [1]. Only the container format would be
> allowed to be different, as long as it can carry the unmodified stream.
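> 
> Not cinelerra code, just a minimal sketch of what stream-copying boils down
> to at the libavformat level (error handling, stream selection and cleanup of
> the output AVIOContext omitted; the function name is made up):
> 
>   /* remux: copy packets from input to output without ever decoding them */
>   #include <libavformat/avformat.h>
>   #include <libavcodec/avcodec.h>
> 
>   int remux(const char *in_name, const char *out_name)
>   {
>       AVFormatContext *ic = NULL, *oc = NULL;
>       if (avformat_open_input(&ic, in_name, NULL, NULL) < 0) return -1;
>       if (avformat_find_stream_info(ic, NULL) < 0) return -1;
>       avformat_alloc_output_context2(&oc, NULL, NULL, out_name);
> 
>       for (unsigned i = 0; i < ic->nb_streams; i++) {
>           AVStream *os = avformat_new_stream(oc, NULL);
>           /* copy the codec parameters verbatim -- the "same settings" part */
>           avcodec_parameters_copy(os->codecpar, ic->streams[i]->codecpar);
>           os->codecpar->codec_tag = 0;
>       }
>       if (!(oc->oformat->flags & AVFMT_NOFILE))
>           avio_open(&oc->pb, out_name, AVIO_FLAG_WRITE);
>       avformat_write_header(oc, NULL);
> 
>       AVPacket *pkt = av_packet_alloc();
>       while (av_read_frame(ic, pkt) >= 0) {
>           /* only the timestamps are rescaled, the payload stays bit-exact */
>           av_packet_rescale_ts(pkt, ic->streams[pkt->stream_index]->time_base,
>                                oc->streams[pkt->stream_index]->time_base);
>           av_interleaved_write_frame(oc, pkt);
>           av_packet_unref(pkt);
>       }
>       av_write_trailer(oc);
>       av_packet_free(&pkt);
>       avformat_close_input(&ic);
>       avformat_free_context(oc);
>       return 0;
>   }
> 
> The CPU never touches the coded bitstream, which is why disk IO is the limit.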
> 
> Implementing this in cinelerra would definitely be a huge, nontrivial
> change. It would require at least detection of the input encoding settings
> and matching output settings, plus a shortcut around the whole
> 'decoding->camera->intermediate->projector->encoding' pipeline wherever no
> effects (in the wider sense!) are active and whole GOPs could be
> stream-copied. And I haven't even looked up yet whether it would be feasible
> to adapt any of the current rendering/muxing backends to accept "already
> encoded" input (forwarded through that shortcut).

I think this was done in old Cinelerra for some DV variants and mjpeg (in mov and avi)

In old CinelerraCV it was removed in commits

https://github.com/cinelerra-gg/cinelerra-cv/commit/8364fc6af3eb9b105ecf0853f79885090b12005f
https://github.com/cinelerra-gg/cinelerra-cv/commit/0ff51f4c53e17ff33701e8cc1096de33a87313b9

I remember this because I tested this (mis)feature, and found it working for
mjpeg avi (so I was not convinced by this reasoning and just kept my copy at the commit before this :E)

https://git.cinelerra-gg.org/git/?p=goodguy/cinelerra.git;a=blob;f=cinelerra-5.1/guicast/bccmodels.h;h=28b58459fb74eabb72e3fbf74f371ea51786cd18;hb=HEAD

enum BC_CModel {
        BC_TRANSPARENCY = 0,
        BC_COMPRESSED   = 1,

[..]

I think this 'colormodel' (BC_COMPRESSED) was used for that.
But now ffmpeg rules the world, so my idea was to just copy
packets/frames exactly as they come out of the libavformat demuxers,
and carry them in some extended structure (in vframe ?), along with the
frame type (I, B, P). So at the encoding end the libavformat muxers will
see the same thing as if they were connected directly... (a rough sketch
of such a carrier is below)

But unfortunately this is also just 0.1% of the road
(no coding experiments were done by me on this front).
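
A very rough sketch of what such a carrier could look like (the struct and its
fields are invented for illustration, nothing like it exists in the code yet):

  #include <libavcodec/packet.h>
  #include <libavutil/avutil.h>
  #include <libavutil/rational.h>

  /* hypothetical "compressed frame" travelling through the pipeline
     instead of (or attached to) a decoded VFrame */
  struct CompressedFrame {
      AVPacket *pkt;             /* packet exactly as the libavformat demuxer produced it */
      enum AVPictureType ptype;  /* AV_PICTURE_TYPE_I / _P / _B, for GOP-aware cutting */
      AVRational time_base;      /* time base of the source stream */
      int src_stream_index;      /* which input stream it came from */
  };

At the render/mux end such a packet could then be handed straight to
av_interleaved_write_frame(), as long as the whole GOP passes through untouched.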


Thanks for your experiments and detailed thoughts!

> 
> Nevertheless, I wanted to share this vision, just in case someone should be
> on the look-out for a *real* challenge! ;)
> 
> 
> 
> Transcoding bottlenecks
> 
> Coming down to earth, I tested the hardware accelerated decoding and
> encoding in cinelerra-gg. This apparently works. Having shifted the heavy
> codec work away from the CPU, new bottlenecks appear, e.g. pixel format
> conversions and memory operations.
> 
> I profiled cinelerra-gg with operf during rendering, using an Intel
> UHD Graphics 630 GPU (gen9.5) for HW decoding and an Nvidia Quadro P2000
> (Pascal nvenc) for encoding.
> 
> The most time-consuming parts appear to be:
> 
> When setting format to YUV 8bit:
> 17.7664  BC_Xfer::xfer_yuv888_to_yuv420p(unsigned int, unsigned int)
> 13.1723  BC_Xfer::xfer_yuv444p_to_yuv888(unsigned int, unsigned int)
> 10.7678  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
> 10.7615  ff_hscale8to15_4_ssse3
>  8.5718  BC_Xfer::xfer_yuv888_to_bgr8888(unsigned int, unsigned int)
>  2.8518  ff_yuv2plane1_8_avx
> 
> When setting format to RGB 8bit:
> 17.8958  yuv2rgb24_1_c
> 13.4321  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
> 10.9851  lumRangeToJpeg_c
> 10.2374  ff_hscale8to15_4_ssse3
>  8.7581  ff_hscale14to15_4_ssse3
>  7.5900  ff_rgb24ToY_avx
>  7.1143  rgb24ToUV_half_c
>  4.8434  chrRangeToJpeg_c
>  4.7945  BC_Xfer::xfer_rgb888_to_bgr8888(unsigned int, unsigned int)
>  2.0201  ff_yuv2plane1_8_avx
> 
> When setting format to RGBA 8bit:
> 16.3639  yuv2rgbx32_1_1_c
> 14.8711  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
> 10.0083  ff_hscale8to15_4_ssse3
>  9.1448  lumRangeToJpeg_c
>  8.7119  ff_rgbaToY_avx
>  8.5619  ff_hscale14to15_4_ssse3
>  8.2640  rgb32ToUV_half_c
>  5.1650  BC_Xfer::xfer_rgbx8888_to_bgr8888(unsigned int, unsigned int)
>  5.1056  chrRangeToJpeg_c
>  1.9289  ff_yuv2plane1_8_avx
> 
> When setting format to RGB-Float:
> 15.7817  BC_Xfer::xfer_rgba_float_to_rgba16161616(unsigned int, unsigned int)
> 15.4870  BC_Xfer::xfer_rgba8888_to_rgba_float(unsigned int, unsigned int)
> 12.9261  rgb64LEToY_c
>  7.4284  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
>  6.6232  av_pix_fmt_desc_get
>  6.0252  rgb64LEToUV_half_c
>  3.7100  yuv2rgbx32_1_1_c
>  2.9902  BC_Xfer::xfer_rgbx_float_to_bgr8888(unsigned int, unsigned int)
>  2.1572  ff_hscale8to15_4_ssse3
>  2.0333  lumRangeToJpeg_c
>  1.8625  ff_hscale16to15_4_ssse3
> 
> During rendering, two of eight cores were at 70-85% (according to htop). As
> no single core reached 100%, but the sum is above 100%, I'm not really sure
> whether rendering is currently CPU bound or rather memory bound. If someone
> knows a good tool to discriminate between these two bounds, please tell
> me! In case this is CPU bound, multithreading in this part of the
> code might help, as I have (more than) 6 idling CPU cores left on this
> machine ;)
> 
> With RGBA 8bit, transcoding (i.e. rendering a timeline consisting of a
> single "unmodified" input clip) of a FullHD 25p h264 video with HW
> acceleration now takes cinelerra only a quarter of the playback time (in
> ffmpeg notation: speed=4x).
> 
> 
> 
> Comparison to ffmpeg transcoding
> 
> While this might seem impressive at first sight (this is equivalent to 4K
> transcoding in realtime!), it is still a fraction of the speed ffmpeg
> achieves for the same transcoding path (decoding on Intel, encoding on Nvidia):
> 
> ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i INPUT.MOV -c:v
> h264_nvenc -preset:v slow -c:a copy OUTPUT.mkv
> 
> is still nearly four times faster at 375 fps (speed = 15x) (a single CPU core
> used, not even exceeding 50%, according to htop). Using -preset:v medium,
> 450 fps are reached (speed = 18x), and if speed is more important than
> compression quality, -preset:v fast allows for a whopping 500 fps (speed =
> 20x) (a single CPU core at 60-70% according to htop).
> 
> Of course this is comparing apples to oranges, especially when comparing it
> to stream-copying (instead of reencoding), which can still be at least an
> order of magnitude (!) faster. But I hope it helps to see that there is
> still quite some room left for improvement in (corner?) cases where the
> input video is not altered at all.
> 
> 
> 
> Feasibility considerations
> 
> Regarding the feasibility of improving cinelerra within finite time, I
> would imagine that skipping avoidable conversions and reducing memory
> operations is the lower hanging fruit compared to stream copying. However,
> I think this would still imply a nontrivial extension of the internals:
> frames that are not affected by any effect in the wider sense (including
> scaling and translation through camera/projector) would have to be
> forwarded directly from input to output, possibly in pixel formats the rest
> of cinelerra (e.g. camera/projector/effects) doesn't understand at all.
> 
> Ideally, cinelerra would of course implement both: where whole GOPs can be
> stream-copied, do this. Where cuts do not align with GOP borders, effects
> affect only some frames within a GOP, or input and output do not use the
> same codec (+ settings): avoid at least as many pixel format conversions as
> possible, especially for the effectively unaltered frames.
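> 
> A minimal sketch of the "no avoidable conversion" idea at the libavcodec
> level (not how cinelerra is structured today; codec contexts, output stream
> and all error handling are assumed to be set up elsewhere, and the function
> name is made up):
> 
>   #include <libavcodec/avcodec.h>
>   #include <libavformat/avformat.h>
> 
>   /* decode one input packet and feed the frames straight to the encoder,
>      with no sws_scale / colormodel conversion in between */
>   void pass_through(AVCodecContext *dec, AVCodecContext *enc,
>                     AVPacket *in_pkt, AVFormatContext *oc, AVStream *ost)
>   {
>       AVFrame  *frame   = av_frame_alloc();
>       AVPacket *out_pkt = av_packet_alloc();
> 
>       avcodec_send_packet(dec, in_pkt);
>       while (avcodec_receive_frame(dec, frame) == 0) {
>           /* works only if enc->pix_fmt matches frame->format, i.e. the
>              encoder accepts the decoder's pixel format as-is */
>           avcodec_send_frame(enc, frame);
>           while (avcodec_receive_packet(enc, out_pkt) == 0) {
>               av_packet_rescale_ts(out_pkt, enc->time_base, ost->time_base);
>               out_pkt->stream_index = ost->index;
>               av_interleaved_write_frame(oc, out_pkt);
>           }
>       }
>       av_frame_free(&frame);
>       av_packet_free(&out_pkt);
>   }
> 
> This only helps for the frames that no effect (in the wider sense) touches;
> everything else would still go through the normal pipeline.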
> 
> 
> 
> Disclaimer
> 
> I'm not working on implementing either of these two ideas and I don't expect
> that I will ever find time to do so. So please take away the idea you like.
> 
> 
> 
> Conclusion
> 
> When using hardware acceleration in cinelerra for decoding AND encoding,
> pixel format conversions and memory operations appear to be the new
> bottleneck when effectively transcoding a video. Avoiding them wherever
> possible is expected to allow for up to 4-5 times faster rendering when
> effectively transcoding.
> 
> If the output settings are exactly[1] equal to the inputs' codec settings,
> (selective) stream-copying could achieve two orders of magnitude faster
> rendering, but is expected to require a major modification of cinelerra's
> internals.
> 
> As both of my wildest dreams imply nontrivial, possibly major changes, I
> could very likely be totally wrong about what is easier to implement, so please
> don't quote me on that ;)
> 
> 
> 
> Best regards and happy hacking,
> Simeon
> 
> [1] the "exactly" can be weakened sometimes, but going into this in detail
> would take us way too far.
> 