Hi Andrew, this response came way quicker than I expected :D Please see below for my interleaved comments. On 23.04.20 at 16:45, Andrew Randrianasulu wrote:
In a message from Thursday 23 April 2020 13:36:30, Simeon Völkel wrote:
Hi all,
coming back to cinelerra after a whole decade, I was really surprised to find that by now even hardware-accelerated decoding and encoding are implemented... Even though the landscape of cinelerra forks seems to become increasingly confusing, THANK YOU to all who contributed in the last years, and of course especially to Adam, Einar and William for their tireless commitment!
With this mail I want to put forward two ideas, or rather wildest dreams, of how cinelerra could improve performance considerably in cases where little (or rather: "nothing") is to be done to the input video frames.
Motivation / Introduction
In the last weeks I had to deal a lot with "trivial editing" of long and/or large files: chopping off a few seconds at the beginning or end, sometimes concatenating two or three takes, removing a constant offset between audio and video. As my videos had a constant group-of-pictures (GOP) length, I knew the positions of the keyframes, and since shifting the cuts around by half a GOP was no problem either, I could completely avoid re-encoding the video stream by using the command-line tool ffmpeg. FFmpeg works perfectly fine for single "trivial edits", but the command (and the required filters) admittedly becomes complex as soon as multiple edits have to be made.
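For illustration, a single "trivial edit" of this kind can look roughly like this (file names, timestamps and the audio offset are placeholders, not from my actual sessions):

# keep one minute starting at ~5 s, without re-encoding;
# with -c copy, -ss snaps to the keyframe at or before the requested position
ffmpeg -ss 00:00:05 -i take1.mov -t 00:01:00 -c:v copy -c:a copy trimmed.mov

# remove a constant audio/video offset (here: delay the audio by 0.3 s),
# again without touching the encoded packets
ffmpeg -i take1.mov -itsoffset 0.3 -i take1.mov -map 0:v:0 -map 1:a:0 -c copy resynced.mkv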
Stream copying
So in my wildest dreams I dreamed of good old cinelerra learning how to do stream copying (read up on ffmpeg's -c:v copy and -c:a copy if you are not familiar with that concept!). As stream copying does not require decoding the input, the achievable speed is typically bound by disk IO -- it can be as fast as your SSD RAID, at nearly negligible CPU cost.
Please note that stream copying by definition only works if the packets from the input are not altered at all and the output stream has exactly the same encoding settings [1]. Only the container format is allowed to differ, as long as it can carry the unmodified stream.
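In ffmpeg terms, such a container-only change is a plain remux; a minimal example (file names chosen to match the benchmark command further below):

# copy both streams untouched from a mov container into a matroska container
ffmpeg -i INPUT.MOV -c:v copy -c:a copy OUTPUT.mkv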
Implementing this in cinelerra would definitely be a huge, nontrivial change. It would require at least: detection of the input encoding settings and matching of the output settings accordingly; a shortcut around the whole 'decoding->camera->intermediate->projector->encoding' pipeline wherever no effects (in the wider sense!) are active and whole GOPs could be stream-copied; and I haven't even looked up yet whether it could be feasible to adapt any of the current rendering/muxing backends to accept "already encoded" input (forwarded through that shortcut).
I think this was done in old Cinelerra for some DV variants and mjpeg (in mov and avi).
In old CinelerraCV it was removed in these commits:
https://github.com/cinelerra-gg/cinelerra-cv/commit/8364fc6af3eb9b105ecf0853... https://github.com/cinelerra-gg/cinelerra-cv/commit/0ff51f4c53e17ff33701e8cc...
I remember this because I tested this (mis)feature and found it working for mjpeg avi (so I was not convinced by this reasoning and just kept my copy at the commit before this :E)
https://git.cinelerra-gg.org/git/?p=goodguy/cinelerra.git;a=blob;f=cinelerra...
enum BC_CModel {
    BC_TRANSPARENCY = 0,
    BC_COMPRESSED = 1,
[..]
I think this 'colormodel' define was used for this.
This is a very interesting finding, and something I was not aware of at all. So thanks for bringing this piece of history to my attention! I'm wondering, too, whether the noticeable differences were really rounding problems or rather a mismatch between "full swing" and "studio swing", i.e. using all possible (e.g. 8-bit) values or only a limited sub-range. I seem to remember from the 2000s that pausing and playing a DV video could (or would in every case?) switch between "darker black" and "brighter black", reminiscent of an "MPEG YUV" vs. "JPEG YUV" range mismatch. Maybe Einar can give additional details concerning the circumstances of the removal?

But back to the stream-copying: DV and MJPEG are intra-frame coded, so even in the encoded packet stream each frame is independent of all other frames. This I-frame-only property makes it comparatively easy to (mis-)use a BC_COMPRESSED "color model" for transferring them without decoding. Newer codecs feature B- and P-frames, which require information from adjacent frames (or sometimes frames even further away). In addition, some codecs permit open GOPs, i.e. the last frame of one GOP may be a B-frame requiring information from the consecutive, first I-frame of the next GOP. Furthermore, I- and IDR-frames have to be differentiated, and when looking at https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/avutil.h#L272

enum AVPictureType {
    AV_PICTURE_TYPE_NONE = 0, ///< Undefined
    AV_PICTURE_TYPE_I,        ///< Intra
    AV_PICTURE_TYPE_P,        ///< Predicted
    AV_PICTURE_TYPE_B,        ///< Bi-dir predicted
    AV_PICTURE_TYPE_S,        ///< S(GMC)-VOP MPEG-4
    AV_PICTURE_TYPE_SI,       ///< Switching Intra
    AV_PICTURE_TYPE_SP,       ///< Switching Predicted
    AV_PICTURE_TYPE_BI,       ///< BI type
};

there seem to be even more cases to consider...
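(The frame types ffmpeg assigns can be inspected per frame with ffprobe, by the way; a sketch with a placeholder file name, and note that this decodes the whole stream, so it is slow:)

# print the keyframe flag and picture type (I/P/B/...) for every video frame
ffprobe -v error -select_streams v:0 -show_entries frame=key_frame,pict_type -of csv INPUT.MOV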
But now ffmpeg rules the world, so my idea was to just copy packets/frames exactly as they come out of the libavformat demuxers, and carry them in some extended structure (in vframe?), along with the frame type (I, B, P). So at the encoding end the libavformat muxers would see the same thing as if they were connected directly...
I do agree that carrying the avpackets straight from the demuxer to the muxer is the way one would like to go. However, there are quite a few cases where this is locally (around an edit) not possible. I mentioned some in my last mail, but I guess there will be even more, especially when considering open GOPs. So cinelerra would need some logic that separates the frames around an edit that have to be decoded and re-encoded from those that can be forwarded directly. Unfortunately I don't know how flexibly GOP sizes can be chosen within one video stream, i.e. whether there are cases in which it would be impossible to "stream-append" a second clip if the first one has an unfavorable number of frames, even though we would want to create a (or multiple) GOP(s) around the edit.
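(At least the positions where a split is possible at all can be listed cheaply, without decoding; a sketch with a placeholder file name, and the exact flags field varies slightly between ffprobe versions:)

# list the timestamps of all keyframe packets ('K' in the flags column)
ffprobe -v error -select_streams v:0 -show_entries packet=pts_time,flags -of csv INPUT.MOV | grep ,K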
But unfortunately this is also just 0.1% of the road (I have done no coding experiments on this front)
I'm afraid your estimate could be quite realistic... Regarding the limitations of stream copying and appending, I would suggest having a look at ffmpeg's concat demuxer, if someone is willing to go down that road. Note that there are at least three ways to concatenate in ffmpeg: the concat demuxer, the concat protocol and the concat filter; for stream copying and appending the closest fit is the concat demuxer. I've already thought about whether it might be a good idea to let cinelerra compose, basically, the input file for the concat demuxer, denoting which section from which input to concatenate. However, this would still require cinelerra to learn about GOP structures etc., which is why I haven't pursued this route further.
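For reference, such a generated input file and the matching call could look like this (file names and timestamps are made up; with -c copy, inpoint is snapped to a keyframe):

# cuts.txt -- one entry per section to concatenate
file 'take1.mov'
inpoint 12.48
outpoint 73.80
file 'take2.mov'

# concatenate the listed sections without re-encoding
ffmpeg -f concat -safe 0 -i cuts.txt -c copy joined.mov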
Thanks for your experiments and detailed thoughts!
Thank you for your thoughtful response! (: Regards, Simeon
Nevertheless, I wanted to share this vision, just in case someone should be on the look-out for a *real* challenge! ;)
Transcoding bottlenecks
Coming down to earth, I tested the hardware-accelerated decoding and encoding in cinelerra-gg. This apparently works. With the heavy codec work shifted away from the CPU, new bottlenecks appear, e.g. pixel format conversions and memory operations.
I profiled cinelerra-gg using operf during rendering, using an Intel UHD Graphics 630 GPU (gen9.5) for HW decoding and an Nvidia Quadro P2000 (Pascal nvenc) for encoding.
The most time-consuming parts appear to be (numbers are the percentage of profiled samples):
When setting format to YUV 8bit:
  17.7664  BC_Xfer::xfer_yuv888_to_yuv420p(unsigned int, unsigned int)
  13.1723  BC_Xfer::xfer_yuv444p_to_yuv888(unsigned int, unsigned int)
  10.7678  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
  10.7615  ff_hscale8to15_4_ssse3
   8.5718  BC_Xfer::xfer_yuv888_to_bgr8888(unsigned int, unsigned int)
   2.8518  ff_yuv2plane1_8_avx

When setting format to RGB 8bit:
  17.8958  yuv2rgb24_1_c
  13.4321  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
  10.9851  lumRangeToJpeg_c
  10.2374  ff_hscale8to15_4_ssse3
   8.7581  ff_hscale14to15_4_ssse3
   7.5900  ff_rgb24ToY_avx
   7.1143  rgb24ToUV_half_c
   4.8434  chrRangeToJpeg_c
   4.7945  BC_Xfer::xfer_rgb888_to_bgr8888(unsigned int, unsigned int)
   2.0201  ff_yuv2plane1_8_avx

When setting format to RGBA 8bit:
  16.3639  yuv2rgbx32_1_1_c
  14.8711  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
  10.0083  ff_hscale8to15_4_ssse3
   9.1448  lumRangeToJpeg_c
   8.7119  ff_rgbaToY_avx
   8.5619  ff_hscale14to15_4_ssse3
   8.2640  rgb32ToUV_half_c
   5.1650  BC_Xfer::xfer_rgbx8888_to_bgr8888(unsigned int, unsigned int)
   5.1056  chrRangeToJpeg_c
   1.9289  ff_yuv2plane1_8_avx

When setting format to RGB-Float:
  15.7817  BC_Xfer::xfer_rgba_float_to_rgba16161616(unsigned int, unsigned int)
  15.4870  BC_Xfer::xfer_rgba8888_to_rgba_float(unsigned int, unsigned int)
  12.9261  rgb64LEToY_c
   7.4284  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
   6.6232  av_pix_fmt_desc_get
   6.0252  rgb64LEToUV_half_c
   3.7100  yuv2rgbx32_1_1_c
   2.9902  BC_Xfer::xfer_rgbx_float_to_bgr8888(unsigned int, unsigned int)
   2.1572  ff_hscale8to15_4_ssse3
   2.0333  lumRangeToJpeg_c
   1.8625  ff_hscale16to15_4_ssse3
During rendering, two of the eight cores were at 70-85% (according to htop). As no single core reached 100%, but the sum is above 100%, I'm not really sure whether rendering is currently CPU-bound or rather memory-bound. If someone knows a good tool to discriminate between these two bounds, please tell me! In case this should be CPU-bound, multithreading in this part of the code might help, as I have (more than) 6 idling CPU cores left on this machine ;)
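(Perhaps perf could already give a hint here: an IPC well below 1 combined with many cache misses would point towards memory-bound. A sketch, assuming cinelerra-gg's binary is called 'cin':)

# attach to the running process for 30 s and count cycles, instructions and cache misses
perf stat -e cycles,instructions,cache-misses -p $(pidof cin) sleep 30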
With RGBA 8bit, transcoding a FullHD 25p h264 video (i.e. rendering a timeline consisting of a single "unmodified" input clip) using HW acceleration now takes cinelerra only a quarter of the playback time (in ffmpeg notation: speed=4x).
Comparison to ffmpeg transcoding
While this might seem impressive at first sight (it is equivalent to 4K transcoding in realtime!), it is still a fraction of the speed ffmpeg achieves for the same transcoding path (decoding on Intel, encoding on Nvidia):
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i INPUT.MOV -c:v h264_nvenc -preset:v slow -c:a copy OUTPUT.mkv
is still nearly four times faster at 375 fps (speed = 15x) (a single CPU core used, not even exceeding 50%, according to htop); using -preset:v medium, 450 fps are reached (speed = 18x), and if speed is more important than compression quality, -preset:v fast allows for a whopping 500 fps (speed = 20x) (a single CPU core at 60-70% according to htop).
Of course this is comparing apples to oranges, especially when comparing it to stream copying (instead of re-encoding), which can still be at least an order of magnitude (!) faster. But I hope it helps to see that there is still quite some room left for improvement in (corner?) cases where the input video is not altered at all.
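(For completeness: the stream-copy ceiling on the same file can be estimated without writing anything to disk, by muxing into /dev/null; -benchmark makes ffmpeg print the CPU time used:)

ffmpeg -y -benchmark -i INPUT.MOV -c:v copy -c:a copy -f matroska /dev/null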
Feasibility considerations
Regarding the feasibility of improving cinelerra within finite time, I would imagine that skipping avoidable conversions and reducing memory operations is the lower-hanging fruit compared to stream copying. However, I think this would still imply a nontrivial extension of the internals: decoded frames that are not affected by any effect in the wider sense (including scaling and translation through camera/projector) would have to be forwarded directly from input to output, in pixel formats the rest of cinelerra (e.g. camera/projector/effects) possibly doesn't understand at all.
Ideally, cinelerra would of course implement both: where whole GOPs can be stream-copied, do this. Where cuts do not align with GOP borders, effects affect only some frames within a GOP, or input and output do not use the same codec (+ settings): avoid at least as many pixel format conversions as possible, especially for the effectively unaltered frames.
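Just to make this hybrid idea concrete, the manual equivalent with today's tools would be something like the following (purely hypothetical file names and timestamps; the hard part, as per [1], is that the re-encoded bridge must match the original stream's encoding settings exactly, otherwise the final concat produces a broken file):

# suppose an edit removes everything before t=60.0 s, but the next keyframe sits at t=62.0 s
ffmpeg -ss 62.0 -i INPUT.MOV -c copy tail.mov                              # GOP-aligned rest, stream-copied
ffmpeg -ss 60.0 -i INPUT.MOV -t 2.0 -c:v h264_nvenc -c:a copy bridge.mov  # 2 s re-encoded bridge
# parts.txt lists: file 'bridge.mov' / file 'tail.mov'
ffmpeg -f concat -safe 0 -i parts.txt -c copy OUTPUT.mov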
Disclaimer
I'm not working on implementing either of these two ideas and I don't expect that I will ever find time to do so. So please feel free to take whichever idea you like.
Conclusion
When using hardware acceleration in cinelerra for decoding AND encoding, pixel format conversions and memory operations appear to be the new bottlenecks when effectively transcoding a video. Avoiding them wherever possible is expected to allow for up to 4-5 times faster rendering when effectively transcoding.
If the output settings are exactly[1] equal to the input's codec settings, (selective) stream copying could achieve two orders of magnitude faster rendering, but is expected to require a major modification of cinelerra's internals.
As both of my wildest dreams imply nontrivial, possibly major changes, I could very well be totally wrong about which one is easier to implement, so please don't quote me on that ;)
Best regards and happy hacking, Simeon
[1] The "exactly" can sometimes be weakened, but going into the details would take us way too far.