[Cin] Re2: OpenMP
Andrew Randrianasulu
randrianasulu at gmail.com
Thu Mar 12 03:47:03 CET 2020
On Wednesday 11 March 2020 19:18:00 Good Guy wrote:
> Hi,
>
> Boy, this is interesting. It does illustrate you can simplify a complex
> pipeline of computation, but
> it is sort of wishy washy on how it sets up the thread pools and controls
> the client thread resources.
>
> begin blathering...
> A long time ago, there was a design called HEP (heterogeneous element
> processor, back in the Cray days)
> but it did not catch on well. It was similar to this kind of pipeline
> simplification in that it did not require a
> great deal of programming to set up. It employed an extra bit (the sting
> bit) on every reg/mem cell that
> indicated that the result was stored (ready). The CPUs all were not bound
> strictly to a user context, and the
> instruction fetch would block and switch to a ready context if the input
> operands had not yet been stung
> (stored with a pipeline result). This meant that the computation would
> inherently create a serial pipeline
> sqrt(f(x)**2 + f(y)**2)
> cpu1 f(x) stings cpu3, then cpu3 runs **2 and stings cpu5(lt) then cpu5
> runs +, and then sqrt
> cpu2 f(y) stings cpu4, then cpu4 runs **2 and stings cpu5(rt)
> The programmer normally did not see any of this functional parallelism in
> the code design, and
> the instruction fetch and context binding was per functional unit, so that
> float/int/simd etc were all
> part of a pool of computational resources that could decode and execute any
> next operation with
> all of the input regs/memory sting bits set. Hell, there were event
> context register sets with lambda
> parameter registers so that tail recursions could be unwound into
> iterations by the hardware.
>
> In reality, much of this is good in the classroom, but not as easy to use
> as all of that in real life.
> The stung memory idea could get hung up with A waiting on B, and B waiting
> on A by some unrealized
> recursive dependency, and much of it was way past what could be implemented
> by hardware design,
> time, and money... since it did eventually have to make it into the market
> and do something useful.
> end blathering...
>
> As for the OpenMP pragmas. This may be a great way to speed up a bunch of
> striping done in
> many of the standard transformations. To that extent, much of this code
> has already been painstakingly
> threaded using pthreads. This is mostly done using LoadBalance, a class in
> cinelerra. It is not nearly
> as easy as the OpenMP design, but does some of the code parallel operation
> that OpenMP seems to
> offer. I have found that there is frequently a trade-off between the overhead of
> setting up a thread to do a
> function, and the cost of the function itself. It does not make sense (usually) to
> create a thread for a trivial work load.
> The thread management is not free, and the needed locks and atomic ops do
> apply a cost in the code.
> Threading works well when the loop is much bigger than the cpu count, but
> not as well with lots of cpus.
> That is sort of a problem, especially with debugging. On my devel machine
> there are 128 cpus. This
> means that when you ask gdb: "inf thr", you get hundreds of results, and
> only one or two are interesting.
> A lot of effort has to be added to the lock trace just to get a handle on
> what is actually happening.
Yes, common wisdom seems to be to disable OpenMP for debugging
(via CFLAGS, the same place it was enabled) ...
--
Pro Tip : Debugging
It's a nice idea to actually disable openMP in debug builds, since you'll find stepping over the for loop iterations can be a bit strange.
--
src: https://nccastaff.bournemouth.ac.uk/jmacey/OpenMP/
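
A minimal sketch of how that can look in code, assuming GCC and its -fopenmp flag. The compiler defines the standard _OPENMP macro only when OpenMP is enabled, so the same source builds serially once the flag is dropped from CFLAGS. The names worker_count and sum_to are made up for illustration and are not from the bcast2000 sources.

```cpp
#ifdef _OPENMP
#include <omp.h>   // only needed (and only guaranteed present) with -fopenmp
#endif

// Hypothetical helper: how many threads a parallel region may use.
// In a build without -fopenmp this simply reports 1.
static int worker_count()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Sums 0..n-1. Without -fopenmp the pragma is ignored and the loop
// runs serially, which is much easier to step through in gdb.
static long sum_to(int n)
{
    long total = 0;
#pragma omp parallel for reduction(+:total)
    for(int i = 0; i < n; i++)
        total += i;
    return total;
}
```

sum_to(1000) returns 499500 in both builds; only the number of threads doing the work differs.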
>
> I am concerned that with a more opaque thread system, simple
> things like tracing a thread,
> determining a thread owner, lock holder, or thread client set may be much
> more difficult. It is already
> quite difficult to just address a threaded program under the debugger.
>
> For computations near the outer edge of the code graph, this could be
> great. The simple setup and
> high degree of parallelism is exactly what you need for the math to code
> abstraction. It might be a good
> idea to make a sort of "test" plugin, and see if it is worth it. IF the
> results are good, and not difficult to
> achieve, then there may be a case to "backport" some of the threaded code
> just to see how it goes.
I think I get a very slight increase in processing speed with this diff:
2.8 fps -> 3.3 fps for a 1920x1080 mjpeg, 30 fps video, on a 4-core AMD computer set to 1.4 GHz.
At the full 'performance' setting (up to 3.9 GHz turbo) the speed is around 7-8 fps for the same video,
with fade set to any value.
The diff and a screenshot at full speed are attached.
------
diff --git a/plugins/brightness/brightness.C b/plugins/brightness/brightness.C
index fd75cb7..c3f4ad1 100644
--- a/plugins/brightness/brightness.C
+++ b/plugins/brightness/brightness.C
@@ -44,15 +44,20 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
//printf("BrightnessMain::process_realtime %f %f\n", brightness, contrast);
+
for(i = 0; i < size; i++)
{
input_rows = ((VPixel**)input_ptr[i]->get_rows());
output_rows = ((VPixel**)output_ptr[i]->get_rows());
+
if(brightness != 0)
{
// Use int since brightness is also subtractive.
int offset = (int)(brightness / 100 * VMAX);
+#pragma omp parallel for private(i,j,k,r,g,b) \
+schedule(static) num_threads(4) collapse(2)
+
for(j = 0; j < project_frame_h; j++)
{
for(k = 0; k < project_frame_w; k++)
@@ -83,7 +88,8 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
// Floating point math.
float scalar = contrast_f;
float offset = VMAX / 2 - (VMAX * scalar) / 2;
-
+#pragma omp parallel for private (j,k,r,g,b) \
+num_threads(4) collapse(2)
for(j = 0; j < project_frame_h; j++)
{
for(k = 0; k < project_frame_w; k++)
@@ -107,7 +113,8 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
int r, g, b;
int scalar = (int)(contrast_f * 0x100);
int offset = VMAX * 0x100 / 2 - (VMAX * scalar) / 2;
-
+#pragma omp parallel for private(j,k,r,g,b) \
+num_threads(4) collapse(2)
for(j = 0; j < project_frame_h; j++)
{
for(k = 0; k < project_frame_w; k++)
@@ -135,6 +142,9 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
// Data never processed so copy if necessary
if(!buffers_identical(0))
{
+//#pragma omp parallel for private(j,k) \
+//schedule(dynamic) collapse(2)
+
for(j = 0; j < project_frame_h; j++)
{
for(k = 0; k < project_frame_w; k++)
---
patch relative to broadcast2000 sources...
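
For anyone who wants to try the idea outside the plugin, here is a self-contained sketch of the same loop shape. It is not the plugin code: Frame is a hypothetical stand-in for VFrame, the pixels are assumed to be 8-bit RGB, and min_pixels is an arbitrary guess rather than a measured threshold. The if() clause keeps small frames serial, which is one way to handle the thread setup overhead mentioned above.

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>
#include <algorithm>

// Hypothetical stand-in for one 8-bit RGB frame; not the real VFrame class.
struct Frame {
    int w, h;
    std::vector<uint8_t> rgb;   // w * h * 3 bytes, row-major
    Frame(int w_, int h_, uint8_t fill)
        : w(w_), h(h_), rgb((size_t)w_ * h_ * 3, fill) {}
};

// Adds a signed offset to every channel, clamping to 0..255.
static void add_brightness(Frame &f, int offset)
{
    const int w = f.w, h = f.h;
    uint8_t *data = f.rgb.data();
    // Assumption: below this many pixels, threading is not worth it.
    const long min_pixels = 64 * 64;

    // j and k are the collapsed loop variables, so OpenMP makes them
    // private automatically; p and c are declared inside the loop body
    // and are therefore private to each thread as well.
#pragma omp parallel for collapse(2) schedule(static) if((long)w * h >= min_pixels)
    for(int j = 0; j < h; j++)
    {
        for(int k = 0; k < w; k++)
        {
            uint8_t *p = data + ((size_t)j * w + k) * 3;
            for(int c = 0; c < 3; c++)
                p[c] = (uint8_t)std::min(255, std::max(0, p[c] + offset));
        }
    }
}
```

The sketch leaves out num_threads(4) on purpose, so the runtime picks the thread count and it can still be limited from outside with the OMP_NUM_THREADS environment variable. Declaring the temporaries inside the loop also removes the need for a private() list.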
>
> gg
>
>
> On Tue, Mar 10, 2020 at 6:29 PM Andrew Randrianasulu <
> randrianasulu at gmail.com> wrote:
>
> > Hi, all!
> >
> > Currently I'm experimenting with OpenMP
> >
> > https://bisqwit.iki.fi/story/howto/openmp/
> >
> > --quote---
> > Support in different compilers
> > GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1,
> > OpenMP 4.0 since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0
> > since version 4.4, and OpenMP 2.5 since version 4.2. Add the commandline
> > option -fopenmp to enable it. OpenMP offloading is supported for Intel MIC
> > targets only (Intel Xeon Phi KNL + emulation) since version 5.1, and to
> > NVidia (NVPTX) targets since version 7 or so.
> >
> > [...]
> >
> > The syntax
> > All OpenMP constructs in C and C++ are indicated with a #pragma omp
> > followed by parameters, ending in a newline. The pragma usually applies
> > only into the statement immediately following it, except for the barrier
> > and flush commands, which do not have associated statements.
> >
> > The parallel construct
> > The parallel construct starts a parallel block. It creates a team of N
> > threads (where N is determined at runtime, usually from the number of CPU
> > cores, but may be affected by a few things), all of which execute the next
> > statement (or the next block, if the statement is a {…} -enclosure). After
> > the statement, the threads join back into one.
> >
> > #pragma omp parallel
> > {
> > // Code inside this region runs in parallel.
> > printf("Hello!\n");
> > }
> >
> > This code creates a team of threads, and each thread executes the same
> > code. It prints the text "Hello!" followed by a newline, as many times as
> > there are threads in the team created. For a dual-core system, it will
> > output the text twice. (Note: It may also output something like
> > "HeHlellolo", depending on system, because the printing happens in
> > parallel.) At the }, the threads are joined back into one, as if in
> > non-threaded program.
> > Internally, GCC implements this by creating a magic function and moving
> > the associated code into that function, so that all the variables declared
> > within that block become local variables of that function (and thus, locals
> > to each thread).
> >
> > ICC, on the other hand, uses a mechanism resembling fork(), and does not
> > create a magic function. Both implementations are, of course, valid, and
> > semantically identical.
> > Variables shared from the context are handled transparently, sometimes by
> > passing a reference and sometimes by using register variables which are
> > flushed at the end of the parallel block (or whenever a flush is executed).
> > --quote end---
> >
> > http://gregslabaugh.net/publications/OpenMP_SPM.pdf
> > Multicore Image Processing with OpenMP
> > Greg Slabaugh, Richard Boyes, Xiaoyun Yang
> >
> >
> > https://nccastaff.bournemouth.ac.uk/jmacey/OpenMP/
> > -quote-
> > OpenMP by Rob Bateman
> > Introduction
> >
> > OpenMP is an open standard that lets you easily make use of
> > multi-threaded processors. It's currently supported by the following
> > compilers: Visual C++, gcc (though not the Win32 version that comes with
> > cygwin), XCode, and the Intel compiler; and It's supported on the following
> > platforms: Win32, Linux, MacOS, XBox360*, and PS3*.
> >
> > * Not amazingly well on those platforms
> > --quote end--
> >
> > I used a bcast2000 example, namely bcast/overlayframe.C
> >
> > and those CFLAGS:
> >
> > CFLAGS = -O3 -fpermissive -fomit-frame-pointer -march=pentium3 -ffast-math
> > -mfpmath=both -fopenmp -I/usr/local/include
> > + enabled linking with libgomp (gcc 5.5.0) by adding -lgomp to
> > bcast-2000c/bcast/Makefile
> >
> > it makes code slower, so far :}
> >
> > but it eats all processors :} unlike original code
> > --
> > Cin mailing list
> > Cin at lists.cinelerra-gg.org
> > https://lists.cinelerra-gg.org/mailman/listinfo/cin
> >
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: omp-2.diff
Type: text/x-diff
Size: 1948 bytes
Desc: not available
URL: <https://lists.cinelerra-gg.org/pipermail/cin/attachments/20200312/5275880a/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BROADCAST2000_6.jpeg
Type: image/jpeg
Size: 189621 bytes
Desc: not available
URL: <https://lists.cinelerra-gg.org/pipermail/cin/attachments/20200312/5275880a/attachment-0001.jpeg>