[Cin] wildest dreams: stream-copy and/or skipping avoidable pixel format conversions

Simeon Völkel simeon-voelkel-cinelerra-gg at nlogn.org
Thu Apr 23 22:54:10 CEST 2020

Hi Andrew,

this response came way quicker than I expected :D Please see below for my
interleaved comments.

Am 23.04.20 um 16:45 schrieb Andrew Randrianasulu:
> In a message dated Thursday 23 April 2020 13:36:30, Simeon Völkel wrote:
>> Hi all,
>> coming back to cinelerra after a whole decade, I was really surprised to
>> find even hardware accelerated decoding and encoding implemented by now...
>> Even though the landscape of cinelerra forks seems to be becoming increasingly
>> confusing, THANK YOU to all who contributed in the last years and of
>> course especially to Adam, Einar and William for their tireless commitment!
>> With this mail I want to put forward two ideas, or rather wildest dreams,
>> how cinelerra could improve performance considerably in cases where little
>> (or rather: "nothing") is to be done to the input video frames.
>> Motivation / Introduction
>> In the last weeks I had to deal a lot with "trivial editing" with long
>> and/or large files: Chopping of a few seconds at the beginning or end,
>> sometimes concatenating two or three takes, removing a constant offset
>> between audio and video. As my videos had a constant group-of-pictures
>> (GOP) length, I knew the position of the keyframes, and since shifting the
>> cuts around by half a GOP was no problem either, I could completely avoid
>> reencoding the video stream using the command line tool ffmpeg. FFmpeg
>> works perfectly fine for single "trivial edits", but the command (and
>> required filters) becomes admittedly complex as soon as multiple edits have
>> to be made.
>> Stream copying
>> So in my wildest dreams I dreamed of good old cinelerra learning how to do
>> stream-copying (read up on ffmpeg's -c:v copy and -c:a copy if you are not
>> familiar with that concept!). As stream-copying does not require decoding
>> the input, the achievable speed is typically bound by disk IO -- it can
>> be as fast as your SSD RAID at nearly negligible CPU cost.
>> Please note that stream-copying per definition only works if the packets
>> from the input are not to be altered at all and the output stream has
>> exactly the same encoding settings [1]. Only the container format would be
>> allowed to be different, as long as it can carry the unmodified stream.
>> Implementing this in cinelerra would definitely be a huge, nontrivial
>> change. It would require at least detecting the input encoding settings
>> and matching the output settings accordingly, plus a shortcut around the
>> whole 'decoding->camera->intermediate->projector->encoding' pipeline
>> wherever no effects (in the wider sense!) are active and whole GOPs could
>> be stream-copied. And I haven't even looked up yet whether it would be
>> feasible to adapt any of the current rendering/muxing backends to accept
>> "already encoded" input (forwarded through that shortcut).
> I think this was done in old Cinelerra for some DV variants and mjpeg (in mov and avi).
> In old CinelerraCV it was removed in these commits:
> https://github.com/cinelerra-gg/cinelerra-cv/commit/8364fc6af3eb9b105ecf0853f79885090b12005f
> https://github.com/cinelerra-gg/cinelerra-cv/commit/0ff51f4c53e17ff33701e8cc1096de33a87313b9
> I remember this because I tested this (mis)feature and found it working for
> mjpeg avi (so I was not convinced by that reasoning and just kept my copy at the commit before this :E)
> https://git.cinelerra-gg.org/git/?p=goodguy/cinelerra.git;a=blob;f=cinelerra-5.1/guicast/bccmodels.h;h=28b58459fb74eabb72e3fbf74f371ea51786cd18;hb=HEAD
> enum BC_CModel {
>         BC_TRANSPARENCY = 0,
>         BC_COMPRESSED   = 1,
> [..]
> I think this 'colormodel' define was used for this.

This is a very interesting finding, and something I was not aware of at all.
So thanks for bringing this piece of history to my attention!

I'm wondering, too, whether the noticeable differences are really rounding
problems or rather a mismatch between "full swing" and "studio swing", i.e.
using all possible (e.g. 8-bit) values versus only a limited sub-range. I
seem to remember from the 2000s that pausing and playing a DV video could
(or would in every case?) switch between "darker black" and "brighter
black", reminiscent of an "MPEG YUV" vs "JPEG YUV" range mismatch.
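
Just to make explicit what I mean by "swing" (a back-of-the-envelope sketch
of my own, not code from cinelerra or FFmpeg): 8-bit "studio swing" confines
luma to 16..235 (chroma to 16..240), whereas "full swing" uses the whole
0..255 range, and expanding the one into the other is roughly what
libswscale's lumRangeToJpeg_c from the profiles quoted further down does:

/* illustration only: expand 8-bit limited-range ("MPEG") luma
 * to full range ("JPEG") */
static inline unsigned char luma_to_full_range(unsigned char y_limited)
{
    int y = ((int)y_limited - 16) * 255 / 219;   /* 219 = 235 - 16 */
    if (y < 0)   y = 0;                          /* clamp footroom/headroom */
    if (y > 255) y = 255;
    return (unsigned char)y;
}

Getting this step wrong (or applying it twice, or not at all) shifts the
black level, which is exactly the "darker black"/"brighter black" effect
described above.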

Maybe Einar can give additional details concerning the circumstances of
that removal.

But back to the stream-copying:

However, DV and MJPEG are intra-frame coded, so even in the encoded packet
stream each frame is independent of all other frames. This I-frame-only
property makes it comparatively easy to (mis-)use a BC_COMPRESSED "color
model" for transferring them without decoding.

Newer codecs feature B- and P-frames, which require information from
adjacent frames (or sometimes frames even further away).
In addition, some codecs permit open GOPs, i.e. the last frame of one GOP
may be a B-frame requiring information from the consecutive, first I-frame
of the next GOP. Furthermore, I- and IDR-frames have to be differentiated,
and when looking at FFmpeg's AVPictureType


enum AVPictureType {
    AV_PICTURE_TYPE_NONE = 0, ///< Undefined
    AV_PICTURE_TYPE_I,     ///< Intra
    AV_PICTURE_TYPE_P,     ///< Predicted
    AV_PICTURE_TYPE_B,     ///< Bi-dir predicted
    AV_PICTURE_TYPE_SI,    ///< Switching Intra
    AV_PICTURE_TYPE_SP,    ///< Switching Predicted
    AV_PICTURE_TYPE_BI,    ///< BI type

there seem to be even more cases to consider...
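
At least the raw information needed here is readily exposed by the FFmpeg
API. A purely illustrative sketch (my own, not cinelerra code) of the two
fields such a shortcut would have to inspect:

#include <libavcodec/avcodec.h>   /* AVPacket, AV_PKT_FLAG_KEY */
#include <libavutil/frame.h>      /* AVFrame, enum AVPictureType */

/* demuxer side: does this packet start at a point we may cut/copy at? */
static int packet_is_keyframe(const AVPacket *pkt)
{
    return (pkt->flags & AV_PKT_FLAG_KEY) != 0;
}

/* decoder side: picture type of an already decoded frame */
static enum AVPictureType frame_picture_type(const AVFrame *frame)
{
    return frame->pict_type;      /* AV_PICTURE_TYPE_I / P / B / ... */
}

Telling an IDR-frame apart from a plain I-frame is not visible at this
level, though; as far as I know that requires looking at the bitstream
itself (for H.264, the NAL unit type).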

> But now ffmpeg rules the world, so my idea was to just copy
> packets/frames exactly as they come out of libavformat demuxers,
> and carry them in some extended structure (in vframe?), along with the frame type
> (I, B, P). So at the encoding end libavformat muxers will see the same thing as if
> they were connected directly...

I do agree that carrying the AVPackets straight from the demuxer to the
muxer is the way one would like to go. However, there are quite a few cases
where this is locally (around an edit) not possible. I mentioned some in
my last mail, but I guess there will be even more, especially when
considering open GOPs.

So cinelerra would need some logic that sets apart the frames around an
edit that have to be decoded and reencoded from those whose packets can be
copied through directly.
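
In pseudo-C, the routing I imagine would look roughly like this (all names
are hypothetical, nothing of the sort exists in cinelerra today):

typedef struct Gop Gop;   /* one group of pictures from the input */

/* hypothetical predicates/actions, declared here only for illustration */
extern int  gop_inside_unedited_span(const Gop *g); /* no cut falls inside g  */
extern int  gop_free_of_effects(const Gop *g);      /* incl. camera/projector */
extern int  gop_matches_output_codec(const Gop *g); /* same codec + settings  */
extern void copy_gop_packets(Gop *g);               /* demuxer -> muxer as-is */
extern void decode_render_reencode(Gop *g);         /* the classic pipeline   */

static void route_gop(Gop *g)
{
    if (gop_inside_unedited_span(g) &&
        gop_free_of_effects(g) &&
        gop_matches_output_codec(g))
        copy_gop_packets(g);        /* selective "stream copy" */
    else
        decode_render_reencode(g);  /* frames around the edit, effects, ... */
}

If I understand open GOPs correctly, they make the first predicate trickier
still: a GOP whose leading B-frames reference the (possibly re-encoded)
previous GOP cannot simply be copied either.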

Unfortunately I don't know how flexibly GOP sizes can be chosen within one
video stream, i.e. whether there are cases in which it would be impossible
to "stream-append" a second clip if the first one has an unfavorable number
of frames, even though we would want to create one (or multiple) GOP(s)
around the edit.

> But unfortunately this is also just 0.1% of road 
> (no coding experiments were done by me on this front)

I'm afraid your estimation could be quite reasonable...

Regarding the limitations of stream copying and appending, I would suggest
having a look at ffmpeg's concat demuxer if someone is willing to go down
that road. Note that there are at least three ways to concatenate in
ffmpeg: the concat demuxer, the concat protocol and the concat filter; for
stream-copying and -appending the closest match is the concat demuxer.

I've already thought about whether it might be a good idea to let cinelerra
compose, essentially, the input file for the concat demuxer, denoting which
section of which input to concatenate. However, this would still require
cinelerra to learn about GOP structures etc., which is why I haven't
pursued this route further.
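
Just for the record, what I had in mind was roughly a list file like the
following (file names and times made up), which would then be rendered with
something like "ffmpeg -f concat -safe 0 -i cuts.ffconcat -c copy output.mkv":

# cuts.ffconcat -- purely hypothetical example
ffconcat version 1.0
file 'take1.mov'
inpoint  12.0
outpoint 95.5
file 'take2.mov'
inpoint  3.0

As far as I understand, with -c copy the in/out points are only honored at
keyframe granularity, which is exactly the GOP awareness cinelerra would
have to gain first.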

> Thanks for your experiments and detailed thoughts!

Thank you for your thoughtful response! (:


>> Nevertheless, I wanted to share this vision, just in case someone should be
>> on the look-out for a *real* challenge! ;)
>> Transcoding bottlenecks
>> Coming down to earth, I tested the hardware accelerated decoding and
>> encoding in cinelerra-gg. This apparently works. With the heavy codec work
>> shifted away from the CPU, new bottlenecks appear, e.g. pixel format
>> conversions and memory operations.
>> I profiled cinelerra-gg using operf during rendering when using an Intel
>> UHD Graphics 630 GPU (gen9.5) for HW decoding and a Nvidia Quadro P2000
>> (Pascal nvenc) for encoding.
>> The most time-consuming parts appear to be:
>> When setting format to YUV 8bit:
>> 17.7664  BC_Xfer::xfer_yuv888_to_yuv420p(unsigned int, unsigned int)
>> 13.1723  BC_Xfer::xfer_yuv444p_to_yuv888(unsigned int, unsigned int)
>> 10.7678  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
>> 10.7615  ff_hscale8to15_4_ssse3
>>  8.5718  BC_Xfer::xfer_yuv888_to_bgr8888(unsigned int, unsigned int)
>>  2.8518  ff_yuv2plane1_8_avx
>> When setting format to RGB 8bit:
>> 17.8958  yuv2rgb24_1_c
>> 13.4321  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
>> 10.9851  lumRangeToJpeg_c
>> 10.2374  ff_hscale8to15_4_ssse3
>>  8.7581  ff_hscale14to15_4_ssse3
>>  7.5900  ff_rgb24ToY_avx
>>  7.1143  rgb24ToUV_half_c
>>  4.8434  chrRangeToJpeg_c
>>  4.7945  BC_Xfer::xfer_rgb888_to_bgr8888(unsigned int, unsigned int)
>>  2.0201  ff_yuv2plane1_8_avx
>> When setting format to RGBA 8bit:
>> 16.3639  yuv2rgbx32_1_1_c
>> 14.8711  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
>> 10.0083  ff_hscale8to15_4_ssse3
>>  9.1448  lumRangeToJpeg_c
>>  8.7119  ff_rgbaToY_avx
>>  8.5619  ff_hscale14to15_4_ssse3
>>  8.2640  rgb32ToUV_half_c
>>  5.1650  BC_Xfer::xfer_rgbx8888_to_bgr8888(unsigned int, unsigned int)
>>  5.1056  chrRangeToJpeg_c
>>  1.9289  ff_yuv2plane1_8_avx
>> When setting format to RGB-Float:
>> 15.7817  BC_Xfer::xfer_rgba_float_to_rgba16161616(unsigned int, unsigned int)
>> 15.4870  BC_Xfer::xfer_rgba8888_to_rgba_float(unsigned int, unsigned int)
>> 12.9261  rgb64LEToY_c
>>  7.4284  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
>>  6.6232  av_pix_fmt_desc_get
>>  6.0252  rgb64LEToUV_half_c
>>  3.7100  yuv2rgbx32_1_1_c
>>  2.9902  BC_Xfer::xfer_rgbx_float_to_bgr8888(unsigned int, unsigned int)
>>  2.1572  ff_hscale8to15_4_ssse3
>>  2.0333  lumRangeToJpeg_c
>>  1.8625  ff_hscale16to15_4_ssse3
>> During rendering, two of eight cores were at 70-85% (according to htop). As
>> no single core reached 100% but the sum was above 100%, I'm not really sure
>> whether rendering is currently CPU bound or rather memory bound. If someone
>> knows a good tool to discriminate between these two bounds, please tell
>> me! In case this is CPU bound, multithreading in this part of the
>> code might help, as I have (more than) 6 idle CPU cores left on this
>> machine ;)
>> With RGBA 8bit, transcoding (i.e. rendering a timeline consisting of a
>> single "unmodified" input clip) a FullHD 25p h264 video using HW
>> acceleration now takes cinelerra only a quarter of the playback time (in
>> ffmpeg notation: speed=4x).
>> Comparison to ffmpeg transcoding
>> While this might seem impressive at first sight (this is equivalent to 4K
>> transcoding in realtime!) this is still a fraction of the speed ffmpeg
>> achieves for the same transcoding path (decoding on intel, encoding on nvidia):
>> ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i INPUT.MOV -c:v
>> h264_nvenc -preset:v slow -c:a copy OUTPUT.mkv
>> is still nearly four times faster at 375 fps (speed = 15x) (a single CPU core
>> used, not even exceeding 50%, according to htop); using -preset:v medium,
>> 450 fps are reached (speed = 18x), and if speed is more important than
>> compression quality, -preset:v fast allows for a whopping 500 fps (speed =
>> 20x) (a single CPU core at 60-70% according to htop).
>> Of course this is comparing apples to oranges, especially when comparing it
>> to stream-copying (instead of reencoding), which can still be at least an
>> order of magnitude (!) faster. But I hope it helps to see that there is
>> still quite some room left for improvement in (corner?) cases where the
>> input video is not altered at all.
>> Feasibility considerations
>> Regarding the feasibility of improving cinelerra within finite time, I
>> would imagine that skipping avoidable conversions and reducing memory
>> operations is the lower-hanging fruit compared to stream copying. However,
>> I think this would still imply a nontrivial extension of the internals: one
>> that allows decoded frames, as long as they are not affected by any effect
>> in the wider sense (including scaling and translation through
>> camera/projector), to be forwarded directly from input to output in pixel
>> formats the rest of cinelerra (e.g. camera/projector/effects) possibly
>> doesn't understand at all.
>> Ideally, cinelerra would of course implement both: where whole GOPs can be
>> stream-copied, do this. Where cuts do not align with GOP borders, effects
>> affect only some frames within a GOP, or input and output do not use the
>> same codec (+ settings): avoid at least as many pixel format conversions as
>> possible, especially for the effectively unaltered frames.
>> Disclaimer
>> I'm not working on implementing any of these two ideas and I don't expect
>> that I will ever find time to do so. So please take away the idea you like.
>> Conclusion
>> When using hardware acceleration in cinelerra for decoding AND encoding,
>> pixel format conversions and memory operations appear to be the new
>> bottlenecks when effectively transcoding a video. Avoiding them wherever
>> possible is expected to allow for up to 4-5 times faster rendering when
>> effectively transcoding.
>> If the output settings are exactly[1] equal to the inputs' codec settings,
>> (selective) stream-copying could achieve two orders of magnitude faster
>> rendering, but is expected to require a major modification of cinelerra's
>> internals.
>> As both of my wildest dreams imply nontrivial, possibly major changes, I
>> could very likely be totally wrong about which one is easier to implement,
>> so please don't quote me on that ;)
>> Best regards and happy hacking,
>> Simeon
>> [1] the "exactly" can be weakened sometimes, but going into this in detail
>> would take us way too far.
