Simeon,

WOW, thank you for your detailed analysis and for taking the time to put your information in writing. It is very much appreciated. I have read your email to Bill for discussion, and he has some thoughts to relay. Considerable thought has already been put into these ideas in the past. (I am going to log this in our Feature/Bug Tracker because you never know whether someone will want to implement parts of your ideas.)

how cinelerra could improve performance considerably in cases where little
(or rather: "nothing") is to be done to the input video frames.

"trivial editing" with long and/or large files: I could completely avoid
reencoding of the video stream using the command line tool ffmpeg. FFmpeg
works perfectly fine for single "trivial edits", but the command (and
required filters) becomes admittedly complex as soon as multiple edits have
to be made.
...
So in my wildest dreams I dreamed of good old cinelerra learning how to do
stream-copying (read up on ffmpeg's -c:v copy and -c:a copy if you are not
familiar with that concept!). As stream-copying does not require to decode
the input, the achievable speed is typically bound by the disk IO -- it can
be as fast as your SSD-Raid at nearly negligible CPU cost.
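
For readers not familiar with stream-copying, here is a minimal sketch of a single "trivial edit" done that way. It only assembles the ffmpeg command line and does not run it. The helper name build_stream_copy_cmd and the file names are hypothetical; -ss, -to, -i, -c:v copy and -c:a copy are regular ffmpeg options. Keep in mind that a stream-copied cut can generally only start on a keyframe, so the cut points may not be frame accurate.

```python
import shlex


def build_stream_copy_cmd(src, dst, start=None, end=None):
    """Assemble an ffmpeg argv that cuts src to [start, end] without re-encoding.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    cmd = ["ffmpeg"]
    if start is not None:
        cmd += ["-ss", str(start)]  # seek position, given as an input option
    if end is not None:
        cmd += ["-to", str(end)]    # stop position, given as an input option
    # Copy video and audio packets unchanged into the new container.
    cmd += ["-i", src, "-c:v", "copy", "-c:a", "copy", dst]
    return cmd


if __name__ == "__main__":
    argv = build_stream_copy_cmd("input.mp4", "cut.mkv", "00:01:00", "00:02:30")
    print(shlex.join(argv))
```

Every additional cut needs its own command plus a concatenation step, which is where the complexity mentioned above comes from.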

Cinelerra could carry compressed AND uncompressed data in the vframe, but you want to be able to see the video in the composer, so the uncompressed data is needed. That means you would need a second, uncompressed channel. If you are using an NLE, the compositor is needed to look at the video, and this obviously requires a decode operation. It turns out that Cinelerra already has Direct Copy, BUT it is applicable and feasible only for a small set of operations.

Please note that stream-copying per definition only works if the packets
from the input are not to be altered at all and the output stream has
exactly the same encoding settings [1]. Only the container format would be
allowed to be different, as long as it can carry the unmodified stream.

Implementing this in cinelerra would definitely be a huge, nontrivial change. ... a *real* challenge! ;)

A test that Bill uses in deciding whether or not to spend the time on an implementation is "developer time needed" versus "user time saved". In this case, maybe 500 developer hours to implement stream copying have to be balanced against the 2% of the time that users would take advantage of this feature. That 2% is another guess, based on the fact that the majority of people using Cinelerra are actually planning on doing a lot of editing, not just a trivial amount.

I profiled cinelerra-gg using operf during rendering when using an Intel
UHD Graphics 630 GPU (gen9.5) for HW decoding and a Nvidia Quadro P2000
(Pascal nvenc) for encoding.

Again, thanks for passing along the profiling information. (Bill loves this kind of stuff!)

The most time-consuming parts appear to be:

When setting format to YUV 8bit:
17.7664 BC_Xfer::xfer_yuv888_to_yuv420p(unsigned int, unsigned int)
13.1723 BC_Xfer::xfer_yuv444p_to_yuv888(unsigned int, unsigned int)
10.7678 __memmove_avx_unaligned_erms [ in libc-2.31.so ]
10.7615 ff_hscale8to15_4_ssse3
8.5718 BC_Xfer::xfer_yuv888_to_bgr8888(unsigned int, unsigned int)
2.8518 ff_yuv2plane1_8_avx

About the above, Bill says "memmove is a tough act to follow" !!

And the purpose of "bgr8888" is to draw the video on the display, so we have got to have this.

In any of the profile information provided, anything that has "ff" in it is obviously ffmpeg, which uses up a lot of time but has most likely been fine-tuned to work as well as possible; it is very good and almost always fast. Cinelerra-GG takes full advantage of ffmpeg, so even if it is not as fast as it could be, it is well worth it. Earlier versions of Cinelerra had their own implementations; some could be improved upon, while others were and still are exceptional work.

During rendering, two of eight cores were at 70-85% (according to htop). As
none reached 100% alone, but the sum is above 100%, I'm not really sure
whether rendering is currently CPU bound or rather memory bound. If someone
knows a good tool how to discriminate between these two bounds, please tell
me! In case this should be CPU bound, multithreading in this part of the
code might help, as I have (more than) 6 idling CPU cores left on this
machine ;)

The contention is most likely "logic bound", probably due to locks; i.e. a thread has to wait for something to happen before it can proceed. One thing that is likely to help is "having the plugin stack use threads"; that way the plugins would be queued up in advance and ready to go, pipelined instead of demand pull.
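
To make the "pipelined instead of demand pull" idea a bit more concrete, here is a rough sketch in Python. It is not how Cinelerra's plugin stack is written; the names stage and run_pipeline are hypothetical and the "plugins" are plain functions. Each plugin runs on its own thread, and bounded queues between the stages let upstream plugins work ahead on later frames while downstream ones are still busy.

```python
import queue
import threading

SENTINEL = object()  # unique end-of-stream marker


def stage(plugin, inbox, outbox):
    """Run one plugin on its own thread: take a frame, process it, pass it on."""
    while True:
        frame = inbox.get()
        if frame is SENTINEL:
            outbox.put(SENTINEL)  # forward the end marker downstream
            return
        outbox.put(plugin(frame))


def run_pipeline(frames, plugins, depth=4):
    """Push frames through the plugins in order, one thread per plugin."""
    # One bounded queue before each plugin, plus one for the final output.
    queues = [queue.Queue(maxsize=depth) for _ in range(len(plugins) + 1)]
    workers = [
        threading.Thread(target=stage, args=(p, queues[i], queues[i + 1]))
        for i, p in enumerate(plugins)
    ]

    def feed():
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(SENTINEL)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    # Two toy "plugins" applied to five toy "frames".
    print(run_pipeline(range(5), [lambda f: f + 1, lambda f: f * 2]))
```

The sketch only shows the structure: frame order is preserved because each stage is a single thread reading from a FIFO queue, and the queue depth limits how far ahead a stage may run.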

With RGBA 8bit transcoding (i.e. rendering a timeline consisting of a
single "unmodified" input clip) of a FullHD 25p h264 video using HW
accelerated cinelerra can take now only a quarter of the playback time (in
ffmpeg notation: speed=4x).

Yes, the above really is quite impressive.

While this might seem impressive at first sight (this is equivalent to 4K
transcoding in realtime!) this is still a fraction of the speed ffmpeg
achieves for the same transcoding path
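
The "4K in realtime" comparison can be checked with a little arithmetic on pixel throughput. This sketch assumes that "4K" means UHD (3840x2160) at the same 25 frames per second; the function name is hypothetical.

```python
def pixels_per_second(width, height, fps, speed=1):
    """Pixel throughput of a transcode running at `speed` times realtime."""
    return width * height * fps * speed


# FullHD 25p at speed=4x versus UHD 25p at realtime (assumed meaning of "4K").
full_hd_4x = pixels_per_second(1920, 1080, 25, speed=4)
uhd_realtime = pixels_per_second(3840, 2160, 25)

if __name__ == "__main__":
    print(full_hd_4x, uhd_realtime, full_hd_4x == uhd_realtime)
```

Under that assumption both cases come out to the same number of pixels per second, since UHD has exactly four times the pixels of FullHD.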

Believe it or not, we added Settings->Transcode to Cinelerra-GG despite the fact that we informed the users that the convert/copy operation would be much faster done from the command line.

Phyllis/Bill