[Cin] wildest dreams: stream-copy and/or skipping avoidable pixel format conversions

Thu Apr 23 20:23:24 CEST 2020

Simeon,

WOW, thank you for your detailed analysis and for taking the time to get
your information down in writing.  It is very much appreciated.  I have read
your email to Bill for discussion and he has some thoughts to relay.
Considerable thought has already been put into your ideas in the past.  (I
am going to log this in our Feature/Bug Tracker because you never know if
someone will want to implement some parts of your ideas.)

> how cinelerra could improve performance considerably in cases where little
> (or rather: "nothing") is to be done to the input video frames.
>
> "trivial editing" with long and/or large files:  I could completely avoid
> reencoding of the video stream using the command line tool ffmpeg. FFmpeg
> works perfectly fine for single "trivial edits", but the command (and
> required filters) becomes admittedly complex as soon as multiple edits have
> to be made.
> ...
> So in my wildest dreams I dreamed of good old cinelerra learning how to do
> stream-copying (read up on ffmpeg's -c:v copy and -c:a copy if you are not
> familiar with that concept!). As stream-copying does not require to decode
> the input, the achievable speed is typically bound by the disk IO -- it can
> be as fast as your SSD-Raid at nearly negligible CPU cost.
>

Cinelerra could carry compressed AND uncompressed data in the vframe, but
you want to be able to see the video in the composer, so the uncompressed
data is needed.  You would need a second uncompressed channel.  If you are
using an NLE, the compositor is needed to look at the video and this
obviously requires a decode operation.  It turns out that Cinelerra already
has Direct Copy, BUT it is applicable and feasible only for a small set of
operations.
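
For anyone reading along who has not used stream copy, here is a minimal
sketch of the kind of single "trivial edit" described above, written as a
small Python helper that only builds the ffmpeg command line.  The function
name, the file names and the timestamps are made up for illustration; the
ffmpeg options themselves (-ss, -to, -i, -c:v copy, -c:a copy) are the
standard ones.

```python
import shlex


def stream_copy_cut_cmd(src, dst, start, end):
    """Build an ffmpeg argv that cuts [start, end] out of src without re-encoding.

    Hypothetical helper for illustration; it only assembles the command.
    """
    return [
        "ffmpeg",
        "-ss", start,     # seek in the input to the start of the cut
        "-to", end,       # stop reading the input at this position
        "-i", src,
        "-c:v", "copy",   # copy video packets as they are, no decode/encode
        "-c:a", "copy",   # same for audio
        dst,
    ]


# Hypothetical file names; only the container changes (mp4 -> mkv).
cmd = stream_copy_cut_cmd("input.mp4", "output.mkv", "00:01:00", "00:02:30")
print(shlex.join(cmd))
```

Because the packets are copied untouched, a cut like this can only begin
cleanly on a keyframe, which is one reason this kind of copy fits only a
limited set of edits.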

>
> Please note that stream-copying per definition only works if the packets
> from the input are not to be altered at all and the output stream has
> exactly the same encoding settings [1]. Only the container format would be
> allowed to be different, as long as it can carry the unmodified stream.
>
> Implementing this in cinelerra would definitely be a huge, nontrivial change.
> ...  a *real* challenge! ;)
>

A test that Bill uses in deciding whether or not to spend the time on an
implementation is "developer time needed" versus "user time saved".  In
this case, maybe 500 developer hours to implement stream copying have to be
balanced against the 2% of the time that users would take advantage of this
feature.  That 2% is another guess, based on the fact that the majority of
people using Cinelerra are actually planning on doing a lot of editing, not
just a trivial amount.
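
As a rough way to picture that trade-off, here is a small sketch of the
break-even arithmetic.  The 500 hours and the 2% are the guesses from the
paragraph above; the function name and the 0.5 (the share of rendering time
that stream copying would save in the cases where it applies) are purely
illustrative assumptions, not measurements.

```python
def break_even_user_hours(developer_hours, usage_fraction, time_saved_fraction):
    """Total user rendering hours needed before saved time equals developer time.

    usage_fraction:      share of user time where the feature applies (guess: 0.02)
    time_saved_fraction: share of that time the feature saves (assumed value)
    """
    return developer_hours / (usage_fraction * time_saved_fraction)


# 500 h and 2% are the guesses above; 0.5 is an illustrative assumption.
hours = break_even_user_hours(500, 0.02, 0.5)
print(f"break-even after about {hours:,.0f} user rendering hours")
```

With a smaller saving or a smaller share of trivial edits, the break-even
point moves out further still.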

>
> I profiled cinelerra-gg using operf during rendering when using an Intel
> UHD Graphics 630 GPU (gen9.5) for HW decoding and a Nvidia Quadro P2000
> (Pascal nvenc) for encoding.
>

Again, thanks for passing along the profiling information.  (Bill loves
this kind of stuff!)

>
> The most time-consuming parts appear to be:
>
> When setting format to YUV 8bit:
> 17.7664  BC_Xfer::xfer_yuv888_to_yuv420p(unsigned int, unsigned int)
> 13.1723  BC_Xfer::xfer_yuv444p_to_yuv888(unsigned int, unsigned int)
> 10.7678  __memmove_avx_unaligned_erms   [ in libc-2.31.so ]
> 10.7615  ff_hscale8to15_4_ssse3
>  8.5718  BC_Xfer::xfer_yuv888_to_bgr8888(unsigned int, unsigned int)
>  2.8518  ff_yuv2plane1_8_avx
>

About the above, Bill says "memmove is a tough act to follow"!!
And the purpose of "bgr8888" is to draw the video on the display -- got to
have this.
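
To show what the first two entries in the profile involve, here is a small
pure-Python sketch of the same two kinds of conversion: planar yuv444p to
packed yuv888, then packed yuv888 to planar yuv420p.  This is NOT the
BC_Xfer code; the function bodies are an illustration only, and averaging
each 2x2 chroma block is an assumption about how the subsampling is done.
Width and height are assumed to be even.

```python
def yuv444p_to_yuv888(y, u, v):
    """Interleave three full-resolution planes into packed Y,U,V triples."""
    packed = bytearray(3 * len(y))
    packed[0::3] = y
    packed[1::3] = u
    packed[2::3] = v
    return packed


def yuv888_to_yuv420p(packed, width, height):
    """Split packed YUV into planes, averaging chroma over 2x2 blocks.

    Illustrative only; assumes even width and height.
    """
    y = bytearray(packed[0::3])          # luma is kept at full resolution
    cw, ch = width // 2, height // 2     # chroma planes are half size each way
    u = bytearray(cw * ch)
    v = bytearray(cw * ch)
    for row in range(ch):
        for col in range(cw):
            total_u = total_v = 0
            for dy in (0, 1):
                for dx in (0, 1):
                    i = 3 * ((2 * row + dy) * width + (2 * col + dx))
                    total_u += packed[i + 1]
                    total_v += packed[i + 2]
            u[row * cw + col] = total_u // 4
            v[row * cw + col] = total_v // 4
    return y, u, v


# A tiny 2x2 test frame with made-up sample values.
packed = yuv444p_to_yuv888(bytes([10, 20, 30, 40]),
                           bytes([100, 104, 108, 112]),
                           bytes([50, 50, 50, 50]))
y420, u420, v420 = yuv888_to_yuv420p(packed, 2, 2)
```

Each of the two steps reads and writes every pixel of every frame, which is
why they show up so prominently next to memmove.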

In any of the profile information provided, anything that has "ff" in its
name is obviously ffmpeg.  It uses up a lot of the time, but it has most
likely been fine-tuned to work as well as possible; it is very good and
almost always fast.  Cinelerra-GG takes full advantage of ffmpeg, so even
if it is not as fast as it could be, it is well worth it.  Earlier versions
of Cinelerra had their own implementations; some could be improved upon,
others were and still are exceptional work.

>
> During rendering, two of eight cores were at 70-85% (according to htop). As
> none reached 100% alone, but the sum is above 100%, I'm not really sure
> whether rendering is currently CPU bound or rather memory bound. If someone
> knows a good tool how to discriminate between these two bounds, please tell
> me! In case this should be CPU bound, multithreading in this part of the
> code might help, as I have (more than) 6 idling CPU cores left on this
> machine ;)
>

The contention is most likely to be "logic bound", that is, probably due to
locks; i.e. a thread has to wait for something to happen before it can
proceed.  One thing that is likely to help is "having the plugin stack use
threads" - that way the plugins would be queuing up in advance and ready to
go, pipelined instead of demand pull.
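
As a rough picture of what "pipelined instead of demand pull" means, here is
a minimal sketch with one thread per plugin and a bounded queue between
neighbouring plugins.  It is not Cinelerra code: the names run_pipeline,
_stage and depth are invented for the example, and the "plugins" are plain
functions acting on a frame.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the frame stream


def _stage(func, inbox, outbox):
    """Run one plugin in its own thread: take a frame, process it, pass it on."""
    while True:
        frame = inbox.get()
        if frame is _DONE:
            outbox.put(_DONE)   # tell the next stage we are finished
            return
        outbox.put(func(frame))


def run_pipeline(frames, plugins, depth=4):
    """Push frames through the plugins in order, one thread per plugin.

    depth bounds how many frames may queue up in front of each plugin.
    """
    queues = [queue.Queue(maxsize=depth) for _ in range(len(plugins) + 1)]
    threads = [
        threading.Thread(target=_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(plugins)
    ]
    for t in threads:
        t.start()

    def feed():
        # The source pushes frames ahead of demand instead of waiting to be asked.
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    feeder.join()
    for t in threads:
        t.join()
    return results


# Two stand-in "plugins" applied to five stand-in "frames".
out = run_pipeline(range(5), [lambda f: f + 1, lambda f: f * 2])
```

In CPython the threads here do not run CPU-bound work in parallel, so the
sketch only shows the structure: each stage already has its next frame
waiting in the queue instead of asking the stage before it on demand.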

>
> With RGBA 8bit transcoding (i.e. rendering a timeline consisting of a
> single "unmodified" input clip) of a FullHD 25p h264 video using HW
> accelerated cinelerra can take now only a quarter of the playback time (in
> ffmpeg notation: speed=4x).
>

Yes, the above really is quite impressive.
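
To put speed=4x into numbers, here is a quick sketch of the pixel
throughput.  It assumes "FullHD" means 1920x1080 and takes 3840x2160 as the
4K frame size for comparison; the function name is made up for the example.

```python
def pixels_per_second(width, height, fps, speed=1):
    """Pixels processed per second at a given frame rate and speed factor."""
    return width * height * fps * speed


fullhd_at_4x = pixels_per_second(1920, 1080, 25, speed=4)  # 100 frames/s
uhd_realtime = pixels_per_second(3840, 2160, 25)           # 25 frames/s

print(fullhd_at_4x, uhd_realtime)
```

Both come to 207,360,000 pixels per second, so FullHD 25p at speed=4x is the
same pixel rate as 3840x2160 at 25p in realtime.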

>
> While this might seem impressive at first sight (this is equivalent to 4K
> transcoding in realtime!) this is still a fraction of the speed ffmpeg
> achieves for the same transcoding path
>

Believe it or not, we added Settings->Transcode to Cinelerra-GG despite the
fact that we informed the users that the convert/copy operation would be
much faster when done from the command line.

Phyllis/Bill