Hi Andrew, this response came way quicker than I expected :D Please see below for my interleaved comments. On 23.04.20 at 16:45, Andrew Randrianasulu wrote:
In a message from Thursday 23 April 2020 13:36:30, Simeon Völkel wrote:
Hi all,
coming back to cinelerra after a whole decade, I was really surprised to find that by now even hardware-accelerated decoding and encoding are implemented... Even though the landscape of cinelerra forks seems to become increasingly confusing, THANK YOU to all who contributed in the last years, and of course especially to Adam, Einar and William for their tireless commitment!
With this mail I want to put forward two ideas, or rather wildest dreams, of how cinelerra could improve performance considerably in cases where little (or rather: "nothing") is to be done to the input video frames.
Motivation / Introduction
In the last weeks I had to deal a lot with "trivial editing" of long and/or large files: chopping off a few seconds at the beginning or end, sometimes concatenating two or three takes, removing a constant offset between audio and video. As my videos had a constant group-of-pictures (GOP) length, I knew the positions of the keyframes, and since shifting the cuts around by half a GOP was no problem either, I could completely avoid re-encoding the video stream by using the command-line tool ffmpeg. FFmpeg works perfectly fine for single "trivial edits", but the command (and the required filters) admittedly becomes complex as soon as multiple edits have to be made.
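For illustration, a single "trivial edit" of this kind can look roughly like this (file names, timestamps and the audio offset are placeholders, not from my actual sessions):

# keep one minute starting at ~5 s, without re-encoding;
# with -c copy, -ss snaps to the keyframe at or before the requested position
ffmpeg -ss 00:00:05 -i take1.mov -t 00:01:00 -c:v copy -c:a copy trimmed.mov

# remove a constant audio/video offset (here: delay the audio by 0.3 s),
# again without touching the encoded packets
ffmpeg -i take1.mov -itsoffset 0.3 -i take1.mov -map 0:v:0 -map 1:a:0 -c copy resynced.mkv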
Stream copying
So in my wildest dreams I dreamed of good old cinelerra learning how to do stream copying (read up on ffmpeg's -c:v copy and -c:a copy if you are not familiar with that concept!). As stream copying does not require decoding the input, the achievable speed is typically bound by disk IO -- it can be as fast as your SSD RAID, at nearly negligible CPU cost.
Please note that stream copying by definition only works if the packets from the input are not altered at all and the output stream has exactly the same encoding settings [1]. Only the container format is allowed to differ, as long as it can carry the unmodified stream.
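In ffmpeg terms, such a container-only change is a plain remux; a minimal example (file names chosen to match the benchmark command further below):

# copy both streams untouched from a mov container into a matroska container
ffmpeg -i INPUT.MOV -c:v copy -c:a copy OUTPUT.mkv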
Implementing this in cinelerra would definitely be a huge, nontrivial change. It would require at least: detection of the input encoding settings and matching of the output settings accordingly; a shortcut around the whole 'decoding->camera->intermediate->projector->encoding' pipeline wherever no effects (in the wider sense!) are active and whole GOPs could be stream-copied; and I haven't even looked up yet whether it could be feasible to adapt any of the current rendering/muxing backends to accept "already encoded" input (forwarded through that shortcut).
I think this was done in old Cinelerra for some DV variants and mjpeg (in mov and avi).
In old CinelerraCV it was removed in these commits:
https://github.com/cinelerra-gg/cinelerra-cv/commit/8364fc6af3eb9b105ecf0853... https://github.com/cinelerra-gg/cinelerra-cv/commit/0ff51f4c53e17ff33701e8cc...
I remember this because I tested this (mis)feature and found it working for mjpeg avi (so I was not convinced by this reasoning and just kept my copy at the commit before this :E)
https://git.cinelerra-gg.org/git/?p=goodguy/cinelerra.git;a=blob;f=cinelerra...
enum BC_CModel {
    BC_TRANSPARENCY = 0,
    BC_COMPRESSED = 1,
[..]
I think this 'colormodel' define was used for this.
This is a very interesting finding, and something I was not aware of at all. So thanks for bringing this piece of history to my attention! I'm wondering, too, whether the noticeable differences were really rounding problems or rather a mismatch between "full swing" and "studio swing", i.e. using all possible (e.g. 8-bit) values or only a limited sub-range. I seem to remember from the 2000s that pausing and playing a DV video could (or would in every case?) switch between "darker black" and "brighter black", reminiscent of an "MPEG YUV" vs. "JPEG YUV" range mismatch. Maybe Einar can give additional details concerning the circumstances of the removal?

But back to the stream-copying: DV and MJPEG are intra-frame coded, so even in the encoded packet stream each frame is independent of all other frames. This I-frame-only property makes it comparatively easy to (mis-)use a BC_COMPRESSED "color model" for transferring them without decoding. Newer codecs feature B- and P-frames, which require information from adjacent frames (or sometimes frames even further away). In addition, some codecs permit open GOPs, i.e. the last frame of one GOP may be a B-frame requiring information from the consecutive, first I-frame of the next GOP. Furthermore, I- and IDR-frames have to be differentiated, and when looking at https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/avutil.h#L272

enum AVPictureType {
    AV_PICTURE_TYPE_NONE = 0, ///< Undefined
    AV_PICTURE_TYPE_I,        ///< Intra
    AV_PICTURE_TYPE_P,        ///< Predicted
    AV_PICTURE_TYPE_B,        ///< Bi-dir predicted
    AV_PICTURE_TYPE_S,        ///< S(GMC)-VOP MPEG-4
    AV_PICTURE_TYPE_SI,       ///< Switching Intra
    AV_PICTURE_TYPE_SP,       ///< Switching Predicted
    AV_PICTURE_TYPE_BI,       ///< BI type
};

there seem to be even more cases to consider...
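(The frame types ffmpeg assigns can be inspected per frame with ffprobe, by the way; a sketch with a placeholder file name, and note that this decodes the whole stream, so it is slow:)

# print the keyframe flag and picture type (I/P/B/...) for every video frame
ffprobe -v error -select_streams v:0 -show_entries frame=key_frame,pict_type -of csv INPUT.MOV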
But now ffmpeg rules the world, so my idea was to just copy packets/frames exactly as they come out of the libavformat demuxers, and carry them in some extended structure (in vframe?), along with the frame type (I, B, P). So at the encoding end the libavformat muxers would see the same thing as if they were connected directly...
I do agree that carrying the avpackets straight from the demuxer to the muxer is the way one would like to go. However, there are quite a few cases where this is locally (around an edit) not possible. I mentioned some in my last mail, but I guess there will be even more, especially when considering open GOPs. So cinelerra would need some logic that separates the frames around an edit that have to be decoded and re-encoded from those that can be forwarded directly. Unfortunately I don't know how flexibly GOP sizes can be chosen within one video stream, i.e. whether there are cases in which it would be impossible to "stream-append" a second clip if the first one has an unfavorable number of frames, even though we would want to create a (or multiple) GOP(s) around the edit.
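(At least the positions where a split is possible at all can be listed cheaply, without decoding; a sketch with a placeholder file name, and the exact flags field varies slightly between ffprobe versions:)

# list the timestamps of all keyframe packets ('K' in the flags column)
ffprobe -v error -select_streams v:0 -show_entries packet=pts_time,flags -of csv INPUT.MOV | grep ,K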
But unfortunately this is also just 0.1% of the road (I have done no coding experiments on this front)
I'm afraid your estimate could be quite realistic... Regarding the limitations of stream copying and appending, I would suggest having a look at ffmpeg's concat demuxer, if someone is willing to go down that road. Note that there are at least three ways to concatenate in ffmpeg: the concat demuxer, the concat protocol and the concat filter; for stream copying and appending the closest fit is the concat demuxer. I've already thought about whether it might be a good idea to let cinelerra compose, basically, the input file for the concat demuxer, denoting which section from which input to concatenate. However, this would still require cinelerra to learn about GOP structures etc., which is why I haven't pursued this route further.
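For reference, such a generated input file and the matching call could look like this (file names and timestamps are made up; with -c copy, inpoint is snapped to a keyframe):

# cuts.txt -- one entry per section to concatenate
file 'take1.mov'
inpoint 12.48
outpoint 73.80
file 'take2.mov'

# concatenate the listed sections without re-encoding
ffmpeg -f concat -safe 0 -i cuts.txt -c copy joined.mov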
Thanks for your experiments and detailed thoughts!
Thank you for your thoughtful response! (: Regards, Simeon
Nevertheless, I wanted to share this vision, just in case someone should be on the look-out for a *real* challenge! ;)
Transcoding bottlenecks
Coming down to earth, I tested the hardware-accelerated decoding and encoding in cinelerra-gg. This apparently works. With the heavy codec work shifted away from the CPU, new bottlenecks appear, e.g. pixel format conversions and memory operations.
I profiled cinelerra-gg using operf during rendering, using an Intel UHD Graphics 630 GPU (gen9.5) for HW decoding and an Nvidia Quadro P2000 (Pascal nvenc) for encoding.
The most time-consuming parts appear to be (numbers are the percentage of profiled samples):
When setting format to YUV 8bit:
  17.7664  BC_Xfer::xfer_yuv888_to_yuv420p(unsigned int, unsigned int)
  13.1723  BC_Xfer::xfer_yuv444p_to_yuv888(unsigned int, unsigned int)
  10.7678  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
  10.7615  ff_hscale8to15_4_ssse3
   8.5718  BC_Xfer::xfer_yuv888_to_bgr8888(unsigned int, unsigned int)
   2.8518  ff_yuv2plane1_8_avx

When setting format to RGB 8bit:
  17.8958  yuv2rgb24_1_c
  13.4321  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
  10.9851  lumRangeToJpeg_c
  10.2374  ff_hscale8to15_4_ssse3
   8.7581  ff_hscale14to15_4_ssse3
   7.5900  ff_rgb24ToY_avx
   7.1143  rgb24ToUV_half_c
   4.8434  chrRangeToJpeg_c
   4.7945  BC_Xfer::xfer_rgb888_to_bgr8888(unsigned int, unsigned int)
   2.0201  ff_yuv2plane1_8_avx

When setting format to RGBA 8bit:
  16.3639  yuv2rgbx32_1_1_c
  14.8711  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
  10.0083  ff_hscale8to15_4_ssse3
   9.1448  lumRangeToJpeg_c
   8.7119  ff_rgbaToY_avx
   8.5619  ff_hscale14to15_4_ssse3
   8.2640  rgb32ToUV_half_c
   5.1650  BC_Xfer::xfer_rgbx8888_to_bgr8888(unsigned int, unsigned int)
   5.1056  chrRangeToJpeg_c
   1.9289  ff_yuv2plane1_8_avx

When setting format to RGB-Float:
  15.7817  BC_Xfer::xfer_rgba_float_to_rgba16161616(unsigned int, unsigned int)
  15.4870  BC_Xfer::xfer_rgba8888_to_rgba_float(unsigned int, unsigned int)
  12.9261  rgb64LEToY_c
   7.4284  __memmove_avx_unaligned_erms [ in libc-2.31.so ]
   6.6232  av_pix_fmt_desc_get
   6.0252  rgb64LEToUV_half_c
   3.7100  yuv2rgbx32_1_1_c
   2.9902  BC_Xfer::xfer_rgbx_float_to_bgr8888(unsigned int, unsigned int)
   2.1572  ff_hscale8to15_4_ssse3
   2.0333  lumRangeToJpeg_c
   1.8625  ff_hscale16to15_4_ssse3
During rendering, two of the eight cores were at 70-85% (according to htop). As no single core reached 100%, but the sum is above 100%, I'm not really sure whether rendering is currently CPU-bound or rather memory-bound. If someone knows a good tool to discriminate between these two bounds, please tell me! In case this should be CPU-bound, multithreading in this part of the code might help, as I have (more than) 6 idling CPU cores left on this machine ;)
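(Perhaps perf could already give a hint here: an IPC well below 1 combined with many cache misses would point towards memory-bound. A sketch, assuming cinelerra-gg's binary is called 'cin':)

# attach to the running process for 30 s and count cycles, instructions and cache misses
perf stat -e cycles,instructions,cache-misses -p $(pidof cin) sleep 30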
With RGBA 8bit, transcoding a FullHD 25p h264 video (i.e. rendering a timeline consisting of a single "unmodified" input clip) using HW acceleration now takes cinelerra only a quarter of the playback time (in ffmpeg notation: speed=4x).
Comparison to ffmpeg transcoding
While this might seem impressive at first sight (it is equivalent to 4K transcoding in realtime!), it is still a fraction of the speed ffmpeg achieves for the same transcoding path (decoding on Intel, encoding on Nvidia):
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i INPUT.MOV -c:v h264_nvenc -preset:v slow -c:a copy OUTPUT.mkv
is still nearly four times faster at 375 fps (speed = 15x) (a single CPU core used, not even exceeding 50%, according to htop); using -preset:v medium, 450 fps are reached (speed = 18x), and if speed is more important than compression quality, -preset:v fast allows for a whopping 500 fps (speed = 20x) (a single CPU core at 60-70% according to htop).
Of course this is comparing apples to oranges, especially when comparing it to stream copying (instead of re-encoding), which can still be at least an order of magnitude (!) faster. But I hope it helps to see that there is still quite some room left for improvement in (corner?) cases where the input video is not altered at all.
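(For completeness: the stream-copy ceiling on the same file can be estimated without writing anything to disk, by muxing into /dev/null; -benchmark makes ffmpeg print the CPU time used:)

ffmpeg -y -benchmark -i INPUT.MOV -c:v copy -c:a copy -f matroska /dev/null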
Feasibility considerations
Regarding the feasibility of improving cinelerra within finite time, I would imagine that skipping avoidable conversions and reducing memory operations is the lower-hanging fruit compared to stream copying. However, I think this would still imply a nontrivial extension of the internals: decoded frames that are not affected by any effect in the wider sense (including scaling and translation through camera/projector) would have to be forwarded directly from input to output, in pixel formats the rest of cinelerra (e.g. camera/projector/effects) possibly doesn't understand at all.
Ideally, cinelerra would of course implement both: where whole GOPs can be stream-copied, do this. Where cuts do not align with GOP borders, effects affect only some frames within a GOP, or input and output do not use the same codec (+ settings): avoid at least as many pixel format conversions as possible, especially for the effectively unaltered frames.
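Just to make this hybrid idea concrete, the manual equivalent with today's tools would be something like the following (purely hypothetical file names and timestamps; the hard part, as per [1], is that the re-encoded bridge must match the original stream's encoding settings exactly, otherwise the final concat produces a broken file):

# suppose an edit removes everything before t=60.0 s, but the next keyframe sits at t=62.0 s
ffmpeg -ss 62.0 -i INPUT.MOV -c copy tail.mov                              # GOP-aligned rest, stream-copied
ffmpeg -ss 60.0 -i INPUT.MOV -t 2.0 -c:v h264_nvenc -c:a copy bridge.mov  # 2 s re-encoded bridge
# parts.txt lists: file 'bridge.mov' / file 'tail.mov'
ffmpeg -f concat -safe 0 -i parts.txt -c copy OUTPUT.mov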
Disclaimer
I'm not working on implementing either of these two ideas and I don't expect that I will ever find time to do so. So please feel free to take whichever idea you like.
Conclusion
When using hardware acceleration in cinelerra for decoding AND encoding, pixel format conversions and memory operations appear to be the new bottlenecks when effectively transcoding a video. Avoiding them wherever possible is expected to allow for up to 4-5 times faster rendering when effectively transcoding.
If the output settings are exactly[1] equal to the input's codec settings, (selective) stream copying could achieve two orders of magnitude faster rendering, but is expected to require a major modification of cinelerra's internals.
As both of my wildest dreams imply nontrivial, possibly major changes, I could very well be totally wrong about which one is easier to implement, so please don't quote me on that ;)
Best regards and happy hacking, Simeon
[1] The "exactly" can sometimes be weakened, but going into the details would take us way too far.