[Cin] OpenMP

Thu Mar 12 00:40:20 CET 2020

On Wednesday 11 March 2020 19:18:00 Good Guy wrote:
> Hi,
> 
> Boy, this is interesting.  It does illustrate you can simplify a complex
> pipeline of computation, but
> it is sort of wishy washy on how it sets up the thread pools and controls
> the client thread resources.
> 
> begin blathering...
> A long time ago, there was a design called HEP (heterogeneous element
> processor, back in the Cray days)
> but it did not catch on well.  It was similar to this kind of pipeline
> simplification in that it did not require a
> great deal of programming to setup.  It employed an extra bit (the sting
> bit) on every reg/mem cell that
> indicated that the result was stored (ready).  The CPUs all were not bound
> strictly to a user context, and the
> instruction fetch would block and switch to a ready context if the input
> operands had not yet been stung
> (stored with a pipeline result).  This meant that the computation would
> inherently create a serial pipeline
> sqrt(f(x)**2 + f(y)**2)
> cpu1 f(x) stings cpu3, then cpu3 runs **2 and stings cpu5(lt) then cpu5
> runs +, and then sqrt
> cpu2 f(y) stings cpu4, then cpu4 runs **2 and stings cpu5(rt)
> The programmer normally did not see any of this functional parallelism in
> the code design, and
> the instruction fetch and context binding was per functional unit, so that
> float/int/simd etc were all
> part of a pool of computational resources that could decode and execute any
> next operation with
> all of the input regs/memory sting bits set.  Hell, there were event
> context register sets with lambda
> parameter registers so that tail recursions could be unwound into
> iterations by the hardware.
> 
> In reality, much of this is good in the classroom, but not as easy to use
> as all of that in real life.
> The stung memory idea could get hung up with A waiting on B, and B waiting
> on A by some unrealized
> recursive dependency, and much of it was way past what could be implemented
> by hardware design,
> time, and money... since it did eventually have to make it into the market
> and do something useful.
> end blathering...

Thanks for the history lesson! I tend to hang around the #nouveau channel on
freenode, so I overhear a lot of compiler/OpenCL talk, because one of the
developers is currently working on it. I try to understand at least some of it!

> 
> As for the OpenMP pragmas.  This may be a great way to speed up a bunch of
> striping done in
> many of the standard transformations.  To that extent, much of this code
> has already been painstakingly
> threaded using pthreads.  This is mostly done using LoadBalance, a class in
> cinelerra.  It is not nearly
> as easy as the OpenMP design, but does some of the code parallel operation
> that OpenMP seems to
> offer.   I have found that there frequently is a trade with the overhead of
> setting up a thread to do a
> function, and the function itself.  It does not make sense (usually) to
> create a thread for a trivial work load.
> The thread management is not free, and the needed locks and atomic ops do
> apply a cost in the code.
> Threading works well when the loop is much bigger than the cpu count, but
> not as well with lots of cpus.
> That is sort of a problem, especially with debugging.  On my devel machine
> there are 128 cpus.  This
> means that when you ask gdb: "inf thr", you get hundreds of results, and
> only one or two are interesting.
> A lot of effort has to be added to the lock trace just to get a handle on
> what is actually happening.
> 
> I am concerned that by using a more opaque thread system, that simple
> things like tracing a thread,
> determining a thread owner, lock holder, or thread client set may be much
> more difficult.  It is already
> quite difficult to just address a threaded program under the debugger.
> 
> For computations near the outer edge of the code graph, this could be
> great.  The simple setup and
> high degree of parallelism is exactly what you need for the math to code
> abstraction. It might be a good
> idea to make  a sort of "test" plugin, and see if it is worth it.  IF the
> results are good, and not difficult to
> achieve, then there may be a case to "backport" some of the threaded code
> just to see how it goes.
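
On the point above that it rarely pays to start threads for a trivial work
load: OpenMP has an if() clause on the parallel construct, which runs the
region on a single thread when its condition is false. A minimal sketch,
assuming a plain float buffer; the function name scale_samples and the
threshold kMinParallelSamples are made up for illustration, and the threshold
would have to be measured on the actual machine.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cut-off, not a measured value: below this many samples the
// loop stays on one thread.
constexpr long kMinParallelSamples = 64 * 1024;

// Hypothetical helper: multiply every sample by `gain`.
void scale_samples(float *data, long n, float gain)
{
    assert(data != nullptr || n == 0);

    // Decide once, outside the region, whether a thread team is worth it.
    const bool worth_it = n >= kMinParallelSamples;

    // With if(false) the region is executed by a team of one thread, so
    // small buffers avoid the cost of distributing work across a team.
    #pragma omp parallel for if(worth_it) schedule(static)
    for (long i = 0; i < n; i++)
        data[i] *= gain;
}
```

Built without -fopenmp the pragma is ignored and the loop simply runs
serially, which makes it easy to compare timings of both builds.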

I tried it with the 'brightness' plugin in Broadcast2000 (yes, "my" toy editor),
but for now it only seems to make things slower. See the attachments.

At least I got the colors correct by declaring the tmp variables as
thread-private ...
The thing is, with all this pointer manipulation on an (apparently) locked
VFrame .... I don't think the old broadcast2000 _c_ was as ready for multi-CPU
as I assumed (2000a was probably different; I have it locally for study too).
Parallel mjpeg encoding, sure, and probably mpeg2 decoding via libmpeg3,
but not much else? Even though plugins in this version are actually separate
processes that I can see in 'top'.
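
To illustrate the thread-private temporaries mentioned above: this is a
minimal sketch, not the code from the attached brightness.C. The function name
adjust_brightness and the packed 8-bit RGB layout are assumptions. A temporary
declared outside the loop is shared between threads by default, so the threads
overwrite each other's values (hence wrong colors) unless it is listed in
private().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: add `offset` to every channel of a packed 8-bit RGB
// frame and clamp the result to 0..255.
void adjust_brightness(std::vector<std::uint8_t> &rgb,
                       int width, int height, int offset)
{
    assert(rgb.size() == static_cast<std::size_t>(width) * height * 3);

    int tmp; // declared outside the loop, as old C-style code often does

    // private(tmp) gives each thread its own copy of tmp. The loop
    // variables y and x, and `row`, are private automatically.
    #pragma omp parallel for private(tmp)
    for (int y = 0; y < height; y++) {
        std::uint8_t *row =
            rgb.data() + static_cast<std::size_t>(y) * width * 3;
        for (int x = 0; x < width * 3; x++) {
            tmp = row[x] + offset;
            row[x] = static_cast<std::uint8_t>(std::min(255, std::max(0, tmp)));
        }
    }
}
```

Declaring tmp inside the inner loop would have the same effect without the
private() clause, since variables declared inside the region belong to the
thread that executes it.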

I had this brilliant idea of making a private array in the plugin, copying the
data into it, running the threads, and copying back to the VFrame ... but you
see, it will eat memory bandwidth on the copying.
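
A rough sketch of that copy-in / process / copy-out idea, only to make the
cost visible. It assumes the frame is reachable as an array of row pointers;
the function name process_via_private_copy and the "invert every byte"
operation are placeholders, not anything from bcast2000.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: `rows` is an array of `height` row pointers, each row
// holding `row_bytes` bytes.
void process_via_private_copy(std::uint8_t **rows, int height,
                              std::size_t row_bytes)
{
    assert(rows != nullptr || height == 0);

    // Private scratch buffer owned by the plugin.
    std::vector<std::uint8_t> scratch(
        static_cast<std::size_t>(height) * row_bytes);

    // Extra pass 1: copy the frame into the scratch buffer.
    for (int y = 0; y < height; y++)
        std::memcpy(scratch.data() + y * row_bytes, rows[y], row_bytes);

    // Parallel work touches only the private buffer (here: invert bytes).
    std::uint8_t *buf = scratch.data();
    const long total = static_cast<long>(scratch.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < total; i++)
        buf[i] = static_cast<std::uint8_t>(255 - buf[i]);

    // Extra pass 2: copy the result back into the frame.
    for (int y = 0; y < height; y++)
        std::memcpy(rows[y], scratch.data() + y * row_bytes, row_bytes);
}
```

For an operation this cheap, the two memcpy passes move as much data as the
processing loop itself, which is the bandwidth cost described above.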

It seems bc2000 actually worked internally in RGB(A)/RGBA_float whenever any
processing was done, and only passed YUV data through, for Xv display, when no
processing was requested.

Still, interesting piece of software, with some comments :}

Also, back to the Cin topic: maybe on machines with a 'big' CPU count an
automatic local renderfarm setup would provide a better experience
out-of-the-box?

Or at least make it into the Tip of the day ....
> 
> gg
> 
> 
> On Tue, Mar 10, 2020 at 6:29 PM Andrew Randrianasulu <
> randrianasulu at gmail.com> wrote:
> 
> > Hi, all!
> >
> > Currently I'm experimenting with OpenMP
> >
> > https://bisqwit.iki.fi/story/howto/openmp/
> >
> > --quote---
> > Support in different compilers
> > GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1,
> > OpenMP 4.0 since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0
> > since version 4.4, and OpenMP 2.5 since version 4.2. Add the commandline
> > option -fopenmp to enable it. OpenMP offloading is supported for Intel MIC
> > targets only (Intel Xeon Phi KNL + emulation) since version 5.1, and to
> > NVidia (NVPTX) targets since version 7 or so.
> >
> > [...]
> >
> > The syntax
> >  All OpenMP constructs in C and C++ are indicated with a #pragma omp
> > followed by parameters, ending in a newline. The pragma usually applies
> > only into the statement immediately following it, except for the barrier
> > and flush commands, which do not have associated statements.
> >
> > The parallel construct
> >  The parallel construct starts a parallel block. It creates a team of N
> > threads (where N is determined at runtime, usually from the number of CPU
> > cores, but may be affected by a few things), all of which execute the next
> > statement (or the next block, if the statement is a {…} -enclosure). After
> > the statement, the threads join back into one.
> >
> >   #pragma omp parallel
> >   {
> >     // Code inside this region runs in parallel.
> >     printf("Hello!\n");
> >   }
> >
> >  This code creates a team of threads, and each thread executes the same
> > code. It prints the text "Hello!" followed by a newline, as many times as
> > there are threads in the team created. For a dual-core system, it will
> > output the text twice. (Note: It may also output something like
> > "HeHlellolo", depending on system, because the printing happens in
> > parallel.) At the }, the threads are joined back into one, as if in
> > non-threaded program.
> > Internally, GCC implements this by creating a magic function and moving
> > the associated code into that function, so that all the variables declared
> > within that block become local variables of that function (and thus, locals
> > to each thread).
> >
> >  ICC, on the other hand, uses a mechanism resembling fork(), and does not
> > create a magic function. Both implementations are, of course, valid, and
> > semantically identical.
> > Variables shared from the context are handled transparently, sometimes by
> > passing a reference and sometimes by using register variables which are
> > flushed at the end of the parallel block (or whenever a flush is executed).
> > --quote end---
> >
> > http://gregslabaugh.net/publications/OpenMP_SPM.pdf
> > Multicore Image Processing with OpenMP
> > Greg Slabaugh, Richard Boyes, Xiaoyun Yang
> >
> >
> > https://nccastaff.bournemouth.ac.uk/jmacey/OpenMP/
> > -quote-
> > OpenMP by Rob Bateman
> > Introduction
> >
> >  OpenMP is an open standard that lets you easily make use of
> > multi-threaded processors. It's currently supported by the following
> > compilers: Visual C++, gcc (though not the Win32 version that comes with
> > cygwin), XCode, and the Intel compiler; and It's supported on the following
> > platforms: Win32, Linux, MacOS, XBox360*, and PS3*.
> >
> >  * Not amazingly well on those platforms
> > --quote end--
> >
> > I used a bcast2000 example, namely bcast/overlayframe.C
> >
> > and these CFLAGS:
> >
> > CFLAGS = -O3 -fpermissive -fomit-frame-pointer -march=pentium3 -ffast-math
> > -mfpmath=both -fopenmp -I/usr/local/include
> > + enabled linking with libgomp (gcc 5.5.0) by adding  -lgomp to
> > bcast-2000c/bcast/Makefile
> >
> > it makes code slower, so far  :}
> >
> > but it eats all processors :} unlike original code
> >
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: omp.diff
Type: text/x-diff
Size: 1675 bytes
Desc: not available
URL: <https://lists.cinelerra-gg.org/pipermail/cin/attachments/20200312/15fcb06d/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: brightness.C
Type: text/x-c++src
Size: 5732 bytes
Desc: not available
URL: <https://lists.cinelerra-gg.org/pipermail/cin/attachments/20200312/15fcb06d/attachment-0001.C>