[Cin] OpenMP

Good Guy good1.2guy at gmail.com
Wed Mar 11 17:18:00 CET 2020


Hi,

Boy, this is interesting.  It does illustrate that you can simplify a
complex pipeline of computation, but it is sort of wishy-washy on how it
sets up the thread pools and controls the client thread resources.
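
For reference, the control OpenMP itself exposes there is fairly coarse,
as far as I know: a team size per region or a global default, and not much
else about how the pool is managed.  A tiny sketch of just the standard
knobs, nothing cinelerra specific:

    // minimal sketch of the OpenMP team-size knobs; the API names are the
    // standard omp.h ones, the numbers are made up
    #include <omp.h>
    #include <cstdio>

    int main()
    {
        omp_set_num_threads(4);              // global default for later regions
        #pragma omp parallel num_threads(2)  // per-region override via clause
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

The OMP_NUM_THREADS environment variable sets the same default from the
outside.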

begin blathering...
A long time ago, there was a design called HEP (heterogeneous element
processor, back in the Cray days) but it did not catch on well.  It was
similar to this kind of pipeline simplification in that it did not require
a great deal of programming to set up.  It employed an extra bit (the
sting bit) on every reg/mem cell that indicated that the result was stored
(ready).  The CPUs were not all bound strictly to a user context, and the
instruction fetch would block and switch to a ready context if the input
operands had not yet been stung (stored with a pipeline result).  This
meant that the computation would inherently turn serial code into a
pipeline.  For example, with
  sqrt(f(x)**2 + f(y)**2)
cpu1 runs f(x) and stings cpu3, then cpu3 runs **2 and stings cpu5 (lt);
cpu2 runs f(y) and stings cpu4, then cpu4 runs **2 and stings cpu5 (rt);
then cpu5 runs + and finally sqrt.
The programmer normally did not see any of this functional parallelism in
the code design, and the instruction fetch and context binding was per
functional unit, so that float/int/simd etc. were all part of a pool of
computational resources that could decode and execute any next operation
whose input reg/memory sting bits were all set.  Hell, there were even
context register sets with lambda parameter registers so that tail
recursions could be unwound into iterations by the hardware.
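
In modern software terms, that sting-bit dataflow looks a lot like
futures.  A purely illustrative sketch (my own analogy, not HEP hardware
and not cinelerra code), where a std::future becoming ready plays the role
of the sting bit:

    #include <cmath>
    #include <cstdio>
    #include <future>

    static double f(double v) { return 2.0 * v + 1.0; }   // stand-in for f()

    int main()
    {
        // cpu1/cpu2: f(x) and f(y) run independently
        auto fx = std::async(std::launch::async, f, 3.0);
        auto fy = std::async(std::launch::async, f, 4.0);
        // cpu3/cpu4: square each result as soon as its input is "stung" (ready)
        auto fx2 = std::async(std::launch::async, [&fx]{ double v = fx.get(); return v * v; });
        auto fy2 = std::async(std::launch::async, [&fy]{ double v = fy.get(); return v * v; });
        // cpu5: + and then sqrt once both inputs are ready
        printf("%f\n", std::sqrt(fx2.get() + fy2.get()));
        return 0;
    }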

In reality, much of this was good in the classroom, but not nearly as easy
to use in real life.  The stung memory idea could get hung up with A
waiting on B and B waiting on A through some unrealized recursive
dependency, and much of it was way past what could be implemented with the
hardware design time and money available... since it did eventually have
to make it into the market and do something useful.
end blathering...

As for the OpenMP pragmas: this may be a great way to speed up the
striping done in many of the standard transformations.  To that extent,
much of this code has already been painstakingly threaded using pthreads,
mostly via LoadBalance, a class in cinelerra.  It is not nearly as easy to
use as the OpenMP design, but it provides some of the parallel operation
that OpenMP seems to offer.  I have found that there is frequently a
trade-off between the overhead of setting up a thread to do a function and
the cost of the function itself.  It does not (usually) make sense to
create a thread for a trivial workload.  The thread management is not
free, and the needed locks and atomic ops add a cost to the code.
Threading works well when the loop count is much bigger than the cpu
count, but not as well when lots of cpus leave only a little work per
thread.
That is sort of a problem, especially with debugging.  On my devel machine
there are 128 cpus.  This
means that when you ask gdb: "inf thr", you get hundreds of results, and
only one or two are interesting.
A lot of effort has to be added to the lock trace just to get a handle on
what is actually happening.
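
For what it is worth, OpenMP can at least express that threshold directly;
a minimal sketch (not cinelerra code, the cutoff is made up) that only
spawns the team when the loop is big enough to pay for it:

    #include <omp.h>

    // hypothetical per-row operation; only goes parallel when there are
    // enough rows to amortize the thread setup, otherwise it runs serially
    void invert_rows(unsigned char **rows, int h, int w)
    {
        #pragma omp parallel for if(h >= 64) schedule(static)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                rows[y][x] = 255 - rows[y][x];
    }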

I am concerned that with a more opaque thread system, simple things like
tracing a thread, or determining a thread owner, lock holder, or thread
client set, may be much more difficult.  It is already quite difficult
just to address a threaded program under the debugger.

For computations near the outer edge of the code graph, this could be
great.  The simple setup and high degree of parallelism is exactly what
you need for the math-to-code abstraction.  It might be a good idea to
make a sort of "test" plugin and see if it is worth it.  If the results
are good, and not difficult to achieve, then there may be a case to
"backport" some of the threaded code just to see how it goes.
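
A rough sketch of what such a test could measure (again, not a real
plugin; the frame size and the per-row work are invented), timing the same
loop serially and under the pragma, built with -fopenmp as in Andrew's
flags below:

    #include <omp.h>
    #include <cstdio>

    const int H = 1080, W = 1920;
    static float frame[H][W];

    static void smooth_row(int y)                      // hypothetical per-row work
    {
        for (int x = 1; x < W - 1; ++x)
            frame[y][x] = (frame[y][x-1] + frame[y][x] + frame[y][x+1]) / 3.0f;
    }

    int main()
    {
        double t0 = omp_get_wtime();
        for (int y = 0; y < H; ++y) smooth_row(y);     // serial baseline
        double t1 = omp_get_wtime();
        #pragma omp parallel for schedule(static)
        for (int y = 0; y < H; ++y) smooth_row(y);     // OpenMP version
        double t2 = omp_get_wtime();
        printf("serial %.6f s, openmp %.6f s\n", t1 - t0, t2 - t1);
        return 0;
    }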

gg


On Tue, Mar 10, 2020 at 6:29 PM Andrew Randrianasulu <
randrianasulu at gmail.com> wrote:

> Hi, all!
>
> Currently I'm experimenting with OpenMP
>
> https://bisqwit.iki.fi/story/howto/openmp/
>
> --quote---
> Support in different compilers
> GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1,
> OpenMP 4.0 since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0
> since version 4.4, and OpenMP 2.5 since version 4.2. Add the commandline
> option -fopenmp to enable it. OpenMP offloading is supported for Intel MIC
> targets only (Intel Xeon Phi KNL + emulation) since version 5.1, and to
> NVidia (NVPTX) targets since version 7 or so.
>
> [...]
>
> The syntax
>  All OpenMP constructs in C and C++ are indicated with a #pragma omp
> followed by parameters, ending in a newline. The pragma usually applies
> only into the statement immediately following it, except for the barrier
> and flush commands, which do not have associated statements.
>
> The parallel construct
>  The parallel construct starts a parallel block. It creates a team of N
> threads (where N is determined at runtime, usually from the number of CPU
> cores, but may be affected by a few things), all of which execute the next
> statement (or the next block, if the statement is a {…} -enclosure). After
> the statement, the threads join back into one.
>
>   #pragma omp parallel
>   {
>     // Code inside this region runs in parallel.
>     printf("Hello!\n");
>   }
>
>  This code creates a team of threads, and each thread executes the same
> code. It prints the text "Hello!" followed by a newline, as many times as
> there are threads in the team created. For a dual-core system, it will
> output the text twice. (Note: It may also output something like
> "HeHlellolo", depending on system, because the printing happens in
> parallel.) At the }, the threads are joined back into one, as if in
> non-threaded program.
> Internally, GCC implements this by creating a magic function and moving
> the associated code into that function, so that all the variables declared
> within that block become local variables of that function (and thus, locals
> to each thread).
>
>  ICC, on the other hand, uses a mechanism resembling fork(), and does not
> create a magic function. Both implementations are, of course, valid, and
> semantically identical.
> Variables shared from the context are handled transparently, sometimes by
> passing a reference and sometimes by using register variables which are
> flushed at the end of the parallel block (or whenever a flush is executed).
> --quote end---
>
> http://gregslabaugh.net/publications/OpenMP_SPM.pdf
> Multicore Image Processing with OpenMP
> Greg Slabaugh, Richard Boyes, Xiaoyun Yang
>
>
> https://nccastaff.bournemouth.ac.uk/jmacey/OpenMP/
> -quote-
> OpenMP by Rob Bateman
> Introduction
>
>  OpenMP is an open standard that lets you easily make use of
> multi-threaded processors. It's currently supported by the following
> compilers: Visual C++, gcc (though not the Win32 version that comes with
> cygwin), XCode, and the Intel compiler; and It's supported on the following
> platforms: Win32, Linux, MacOS, XBox360*, and PS3*.
>
>  * Not amazingly well on those platforms
> --quote end--
>
> I used bcast2000 example , namely bcast/overlayframe.C
>
> and those CFLAGS:
>
> CFLAGS = -O3 -fpermissive -fomit-frame-pointer -march=pentium3 -ffast-math
> -mfpmath=both -fopenmp -I/usr/local/include
> + enabled linking with libgomp (gcc 5.5.0) by adding  -lgomp to
> bcast-2000c/bcast/Makefile
>
> it makes code slower, so far  :}
>
> but it eats all processors :} unlike original code
> --
> Cin mailing list
> Cin at lists.cinelerra-gg.org
> https://lists.cinelerra-gg.org/mailman/listinfo/cin
>

