Hi, all!

Currently I'm experimenting with OpenMP.

https://bisqwit.iki.fi/story/howto/openmp/

--quote---

Support in different compilers

GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1, OpenMP 4.0 since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0 since version 4.4, and OpenMP 2.5 since version 4.2. Add the commandline option -fopenmp to enable it. OpenMP offloading is supported for Intel MIC targets only (Intel Xeon Phi KNL + emulation) since version 5.1, and to NVidia (NVPTX) targets since version 7 or so.

[...]

The syntax

All OpenMP constructs in C and C++ are indicated with a #pragma omp followed by parameters, ending in a newline. The pragma usually applies only into the statement immediately following it, except for the barrier and flush commands, which do not have associated statements.

The parallel construct

The parallel construct starts a parallel block. It creates a team of N threads (where N is determined at runtime, usually from the number of CPU cores, but may be affected by a few things), all of which execute the next statement (or the next block, if the statement is a {…} -enclosure). After the statement, the threads join back into one.

  #pragma omp parallel
  {
    // Code inside this region runs in parallel.
    printf("Hello!\n");
  }

This code creates a team of threads, and each thread executes the same code. It prints the text "Hello!" followed by a newline, as many times as there are threads in the team created. For a dual-core system, it will output the text twice. (Note: It may also output something like "HeHlellolo", depending on system, because the printing happens in parallel.) At the }, the threads are joined back into one, as if in non-threaded program.

Internally, GCC implements this by creating a magic function and moving the associated code into that function, so that all the variables declared within that block become local variables of that function (and thus, locals to each thread). ICC, on the other hand, uses a mechanism resembling fork(), and does not create a magic function. Both implementations are, of course, valid, and semantically identical. Variables shared from the context are handled transparently, sometimes by passing a reference and sometimes by using register variables which are flushed at the end of the parallel block (or whenever a flush is executed).

--quote end---

http://gregslabaugh.net/publications/OpenMP_SPM.pdf
"Multicore Image Processing with OpenMP", Greg Slabaugh, Richard Boyes, Xiaoyun Yang

https://nccastaff.bournemouth.ac.uk/jmacey/OpenMP/

-quote-

OpenMP by Rob Bateman

Introduction

OpenMP is an open standard that lets you easily make use of multi-threaded processors. It's currently supported by the following compilers: Visual C++, gcc (though not the Win32 version that comes with cygwin), XCode, and the Intel compiler; and it's supported on the following platforms: Win32, Linux, MacOS, XBox360*, and PS3*.

* Not amazingly well on those platforms

--quote end--

I used bcast2000 as an example, namely bcast/overlayframe.C, with these CFLAGS:

CFLAGS = -O3 -fpermissive -fomit-frame-pointer -march=pentium3 -ffast-math -mfpmath=both -fopenmp -I/usr/local/include

plus linking with libgomp (gcc 5.5.0), enabled by adding -lgomp to bcast-2000c/bcast/Makefile.

It makes the code slower, so far :} but it eats all processors :} unlike the original code.
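For reference, the general shape of such a change looks roughly like this -- a minimal, self-contained sketch, not the actual overlayframe.C code (build with g++ -fopenmp; passing -fopenmp at link time also pulls in libgomp automatically):

  #include <cstdio>
  #include <vector>

  int main()
  {
      const int w = 1920, h = 1080;
      std::vector<unsigned char> frame((size_t)w * h, 128);

  // Rows are independent, so the row loop can be divided among the
  // threads of the team created by the pragma.
  #pragma omp parallel for
      for(int j = 0; j < h; j++)
      {
          unsigned char *row = &frame[(size_t)j * w];
          for(int k = 0; k < w; k++)
              row[k] = row[k] / 2;    // some per-pixel operation
      }

      printf("done, pixel[0][0] = %d\n", frame[0]);
      return 0;
  }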
Hi,

Boy, this is interesting. It does illustrate that you can simplify a complex pipeline of computation, but it is sort of wishy-washy on how it sets up the thread pools and controls the client thread resources.

begin blathering...

A long time ago, there was a design called HEP (Heterogeneous Element Processor, back in the Cray days) but it did not catch on well. It was similar to this kind of pipeline simplification in that it did not require a great deal of programming to set up. It employed an extra bit (the sting bit) on every reg/mem cell that indicated that the result was stored (ready). The CPUs were not all bound strictly to a user context, and the instruction fetch would block and switch to a ready context if the input operands had not yet been stung (stored with a pipeline result). This meant that the computation would inherently create a serial pipeline:

  sqrt(f(x)**2 + f(y)**2)

  cpu1 f(x) stings cpu3, then cpu3 runs **2 and stings cpu5(lt), then cpu5 runs + and then sqrt
  cpu2 f(y) stings cpu4, then cpu4 runs **2 and stings cpu5(rt)

The programmer normally did not see any of this functional parallelism in the code design, and the instruction fetch and context binding was per functional unit, so that float/int/simd etc. were all part of a pool of computational resources that could decode and execute any next operation with all of the input regs/memory sting bits set. Hell, there were even context register sets with lambda parameter registers so that tail recursions could be unwound into iterations by the hardware.

In reality, much of this is good in the classroom, but not as easy to use as all of that in real life. The stung memory idea could get hung up with A waiting on B, and B waiting on A, by some unrealized recursive dependency, and much of it was way past what could be implemented by hardware design, time, and money... since it did eventually have to make it into the market and do something useful.

end blathering...

As for the OpenMP pragmas: this may be a great way to speed up a bunch of the striping done in many of the standard transformations. To that extent, much of this code has already been painstakingly threaded using pthreads. This is mostly done using LoadBalance, a class in cinelerra. It is not nearly as easy as the OpenMP design, but it does some of the parallel code operation that OpenMP seems to offer.

I have found that there frequently is a trade between the overhead of setting up a thread to do a function, and the function itself. It does not make sense (usually) to create a thread for a trivial workload. The thread management is not free, and the needed locks and atomic ops do apply a cost in the code. Threading works well when the loop is much bigger than the cpu count, but not as well with lots of cpus.

That is sort of a problem, especially with debugging. On my devel machine there are 128 cpus. This means that when you ask gdb: "inf thr", you get hundreds of results, and only one or two are interesting. A lot of effort has to be added to the lock trace just to get a handle on what is actually happening. I am concerned that by using a more opaque thread system, simple things like tracing a thread, or determining a thread owner, lock holder, or thread client set, may be much more difficult. It is already quite difficult to just address a threaded program under the debugger.

For computations near the outer edge of the code graph, this could be great. The simple setup and high degree of parallelism is exactly what you need for the math-to-code abstraction.
It might be a good idea to make a sort of "test" plugin, and see if it is worth it. If the results are good, and not difficult to achieve, then there may be a case to "backport" some of the threaded code just to see how it goes.

gg
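A small illustration of the overhead trade described above -- a minimal sketch, not cinelerra's LoadBalance code, and the threshold is arbitrary: OpenMP's if() clause keeps a loop serial when the workload is too small to pay for waking up a thread team.

  void scale_buffer(float *data, long n, float gain)
  {
  // Below the (arbitrary) threshold the region runs with a single thread
  // in the calling context, so a trivial workload skips the team entirely.
  #pragma omp parallel for if(n > 65536) schedule(static)
      for(long i = 0; i < n; i++)
          data[i] *= gain;
  }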
In a message from Wednesday 11 March 2020 19:18:00, Good Guy wrote:
Hi,
Boy, this is interesting. It does illustrate you can simplify a complex pipeline of computation, but it is sort of wishy washy on how it sets up the thread pools and controls the client thread resources.
begin blathering... A long time ago, there was a design called HEP (heterogeneous element processor, back in the Cray days) but it did not catch on well. It was similar to this kind of pipeline simplification in that it did not require a great deal of programming to setup. It employed an extra bit (the sting bit) on every reg/mem cell that indicated that the result was stored (ready). The CPUs all were not bound strictly to a user context, and the instruction fetch would block and switch to a ready context if the input operands had not yet been stung (stored with a pipeline result). This meant that the computation would inherently create a serial pipeline

  sqrt(f(x)**2 + f(y)**2)
  cpu1 f(x) stings cpu3, then cpu3 runs **2 and stings cpu5(lt) then cpu5 runs +, and then sqrt
  cpu2 f(y) stings cpu4, then cpu4 runs **2 and stings cpu5(rt)

The programmer normally did not see any of this functional parallelism in the code design, and the instruction fetch and context binding was per functional unit, so that float/int/simd etc were all part of a pool of computational resources that could decode and execute any next operation with all of the input regs/memory sting bits set. Hell, there were event context register sets with lambda parameter registers so that tail recursions could be unwound into iterations by the hardware.
In reality, much of this is good in the classroom, but not as easy to use as all of that in real life. The stung memory idea could get hung up with A waiting on B, and B waiting on A by some unrealized recursive dependency, and much of it was way past what could be implemented by hardware design, time, and money... since it did eventually have to make it into the market and do something useful. end bathering...
Thanks for the historical lesson! I tend to hang around the #nouveau channel on freenode, so I overhear a lot of compiler/OpenCL talk, because one of the developers currently works on it. I try to understand at least something!
As for the OpenMP pragmas. This may be a great way to speed up a bunch of striping done in many of the standard transformations. To that extent, much of this code has already been painstakingly threaded using pthreads. This is mostly done using LoadBalance, a class in cinelerra. It is not nearly as easy as the OpenMP design, but does some of the code parallel operation that OpenMP seems to offer. I have found that there frequently is a trade with the overhead of setting up a thread to do a function, and the function itself. It does not make sense (usually) to create a thread for a trivial work load. The thread management is not free, and the needed locks and atomic ops do apply a cost in the code. Threading works well when the loop is much bigger than the cpu count, but not as well with lots of cpus. That is sort of a problem, especially with debugging. On my devel machine there are 128 cpus. This means that when you ask gdb: "inf thr", you get hundreds of results, and only one or two are interesting. A lot of effort has to be added to the lock trace just to get a handle on what is actually happening.
I am concerned that by using a more opaque thread system, that simple things like tracing a thread, determining a thread owner, lock holder, or thread client set may be much more difficult. It is already quite difficult to just address a threaded program under the debugger.
For computations near the outer edge of the code graph, this could be great. The simple setup and high degree of parallelism is exactly what you need for the math to code abstraction. It might be a good idea to make a sort of "test" plugin, and see if it is worth it. IF the results are good, and not difficult to achieve, then there may be a case to "backport" some of the threaded code just to see how it goes.
I tried it with the 'brightness' plugin in Broadcast2000 (yes, "my" toy editor), but for now it seems to make things only slower? See attachment. At least I got the colors correct by declaring the temporary variables as thread-private ...

The thing is, with all this pointer manipulation on an (apparently) locked VFrame, I don't think old broadcast2000 _c_ was as ready for multi-CPU as I assumed (probably 2000a was different; I have it locally for study too). Parallel mjpeg encoding - sure, and probably mpeg2 decoding via libmpeg3, but not much else? Even though the plugins in this version are actually separate processes that I can see in 'top'.

I had this brilliant idea about making a private array in the plugin, copying the data into it, running the threads, and copying it back to the VFrame ... but you see, it will eat memory bandwidth on the copying. It seems bc2000 actually worked internally in RGB(A)/RGBA_float all the time you do processing, and only passed yuv data through, for Xv display, when no processing was requested. Still, an interesting piece of software, with some comments :}

Also, back to the Cin topic: maybe on 'big' cpu-count machines an automatic local renderfarm setup would provide a better experience out-of-the-box? Or at least make it into the Tip of the day ....
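To illustrate the thread-private point -- a minimal sketch, not the Broadcast2000 code, with made-up names and pixel layout: if the per-pixel temporaries are shared, every thread writes the same r/g/b variables and reads back values another thread just stored, which scrambles the colors.

  #include <stdint.h>

  void brighten(uint8_t **rows, int h, int w, int offset)
  {
      int r, g, b;    // shared by default across the team -> data race
  // private(r,g,b) gives each thread its own copies, so the loop is safe.
  #pragma omp parallel for private(r, g, b)
      for(int j = 0; j < h; j++)
      {
          for(int k = 0; k < w; k++)
          {
              r = rows[j][3*k + 0] + offset;
              g = rows[j][3*k + 1] + offset;
              b = rows[j][3*k + 2] + offset;
              rows[j][3*k + 0] = r > 255 ? 255 : r;
              rows[j][3*k + 1] = g > 255 ? 255 : g;
              rows[j][3*k + 2] = b > 255 ? 255 : b;
          }
      }
  }

Declaring r, g and b inside the inner loop body would make them private automatically and avoid the clause entirely.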
Andrew: I tried it with 'brightness' plugin in Broadcast2000 (yes, "my" toy editor)
but for now it seems to make things only slower? See attach.
So surprised that Broadcast2000 still compiles, thanks to your work to get it to do so. GG has never seen it.
Also, back to Cin topic, may be on 'big' cpu-count machines automatic local renderfarm setup will provide some better experience out-of-the-box?
Making renderfarm setup automatic is a good idea but too hard. A tutorial on this would be nice to have, though, as opposed to the voluminous scattered information in the manual. Using the renderfarm even on a computer with 12 cpus is quite helpful for rendering. We will at least add a Tip of the Day for it.
In a message from Wednesday 11 March 2020 19:18:00, Good Guy wrote:
Hi,
Boy, this is interesting. It does illustrate you can simplify a complex pipeline of computation, but it is sort of wishy washy on how it sets up the thread pools and controls the client thread resources.
begin blathering... A long time ago, there was a design called HEP (heterogeneous element processor, back in the Cray days) but it did not catch on well. It was similar to this kind of pipeline simplification in that it did not require a great deal of programming to setup. It employed an extra bit (the sting bit) on every reg/mem cell that indicated that the result was stored (ready). The CPUs all were not bound strictly to a user context, and the instruction fetch would block and switch to a ready context if the input operands had not yet been stung (stored with a pipeline result). This meant that the computation would inherently create a serial pipeline

  sqrt(f(x)**2 + f(y)**2)
  cpu1 f(x) stings cpu3, then cpu3 runs **2 and stings cpu5(lt) then cpu5 runs +, and then sqrt
  cpu2 f(y) stings cpu4, then cpu4 runs **2 and stings cpu5(rt)

The programmer normally did not see any of this functional parallelism in the code design, and the instruction fetch and context binding was per functional unit, so that float/int/simd etc were all part of a pool of computational resources that could decode and execute any next operation with all of the input regs/memory sting bits set. Hell, there were event context register sets with lambda parameter registers so that tail recursions could be unwound into iterations by the hardware.
In reality, much of this is good in the classroom, but not as easy to use as all of that in real life. The stung memory idea could get hung up with A waiting on B, and B waiting on A by some unrealized recursive dependency, and much of it was way past what could be implemented by hardware design, time, and money... since it did eventually have to make it into the market and do something useful. end bathering...
As for the OpenMP pragmas. This may be a great way to speed up a bunch of striping done in many of the standard transformations. To that extent, much of this code has already been painstakingly threaded using pthreads. This is mostly done using LoadBalance, a class in cinelerra. It is not nearly as easy as the OpenMP design, but does some of the code parallel operation that OpenMP seems to offer. I have found that there frequently is a trade with the overhead of setting up a thread to do a function, and the function itself. It does not make sense (usually) to create a thread for a trivial work load. The thread management is not free, and the needed locks and atomic ops do apply a cost in the code. Threading works well when the loop is much bigger than the cpu count, but not as well with lots of cpus. That is sort of a problem, especially with debugging. On my devel machine there are 128 cpus. This means that when you ask gdb: "inf thr", you get hundreds of results, and only one or two are interesting. A lot of effort has to be added to the lock trace just to get a handle on what is actually happening.
Yes, the common wisdom seems to be to disable OpenMP for debugging (via CFLAGS, since that is how it was enabled) ...

-- Pro Tip: Debugging. It's a nice idea to actually disable openMP in debug builds, since you'll find stepping over the for loop iterations can be a bit strange. --
src: https://nccastaff.bournemouth.ac.uk/jmacey/OpenMP/
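A sketch of how that can look without touching the source (illustrative only): the #pragma omp lines are simply ignored when -fopenmp is absent, so only calls into the OpenMP runtime need guarding with the _OPENMP macro; at runtime, setting OMP_NUM_THREADS=1 in the environment also shrinks the team to one thread for a gdb session (although a hard-coded num_threads() clause overrides it).

  #ifdef _OPENMP          // defined by the compiler whenever -fopenmp is in effect
  #include <omp.h>
  #endif
  #include <stdio.h>

  static int worker_count()
  {
  #ifdef _OPENMP
      return omp_get_max_threads();   // real team size in OpenMP builds
  #else
      return 1;                       // plain serial debug build
  #endif
  }

  int main()
  {
      printf("workers: %d\n", worker_count());
      return 0;
  }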
I am concerned that by using a more opaque thread system, that simple things like tracing a thread, determining a thread owner, lock holder, or thread client set may be much more difficult. It is already quite difficult to just address a threaded program under the debugger.
For computations near the outer edge of the code graph, this could be great. The simple setup and high degree of parallelism is exactly what you need for the math to code abstraction. It might be a good idea to make a sort of "test" plugin, and see if it is worth it. IF the results are good, and not difficult to achieve, then there may be a case to "backport" some of the threaded code just to see how it goes.
I think I get a very slight increase in processing speed with this diff: 2.8 fps -> 3.3 fps for 1920x1080 mjpeg, 30 fps video, on a 4-core AMD computer set to 1.4Ghz. With the full 'performance' setting (up to 3.9Ghz turbo) the speed is more like 7-8 fps for the same video, with fade set to any value. Attached are the diff and a screenshot at full speed.

------

diff --git a/plugins/brightness/brightness.C b/plugins/brightness/brightness.C
index fd75cb7..c3f4ad1 100644
--- a/plugins/brightness/brightness.C
+++ b/plugins/brightness/brightness.C
@@ -44,15 +44,20 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
 //printf("BrightnessMain::process_realtime %f %f\n", brightness, contrast);
 
+
 	for(i = 0; i < size; i++)
 	{
 		input_rows = ((VPixel**)input_ptr[i]->get_rows());
 		output_rows = ((VPixel**)output_ptr[i]->get_rows());
 
+
 		if(brightness != 0)
 		{
 // Use int since brightness is also subtractive.
 			int offset = (int)(brightness / 100 * VMAX);
+#pragma omp parallel for private(i,j,k,r,g,b) \
+schedule(static) num_threads(4) collapse(2)
+
 			for(j = 0; j < project_frame_h; j++)
 			{
 				for(k = 0; k < project_frame_w; k++)
@@ -83,7 +88,8 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
 // Floating point math.
 			float scalar = contrast_f;
 			float offset = VMAX / 2 - (VMAX * scalar) / 2;
-
+#pragma omp parallel for private (j,k,r,g,b) \
+num_threads(4) collapse(2)
 			for(j = 0; j < project_frame_h; j++)
 			{
 				for(k = 0; k < project_frame_w; k++)
@@ -107,7 +113,8 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
 			int r, g, b;
 			int scalar = (int)(contrast_f * 0x100);
 			int offset = VMAX * 0x100 / 2 - (VMAX * scalar) / 2;
-
+#pragma omp parallel for private(j,k,r,g,b) \
+num_threads(4) collapse(2)
 			for(j = 0; j < project_frame_h; j++)
 			{
 				for(k = 0; k < project_frame_w; k++)
@@ -135,6 +142,9 @@ int BrightnessMain::process_realtime(long size, VFrame **input_ptr, VFrame **out
 // Data never processed so copy if necessary
 	if(!buffers_identical(0))
 	{
+//#pragma omp parallel for private(j,k) \
+//schedule(dynamic) collapse(2)
+
 		for(j = 0; j < project_frame_h; j++)
 		{
 			for(k = 0; k < project_frame_w; k++)

---

patch relative to broadcast2000 sources...
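One possible variation on the pragma placement -- a sketch only, not a drop-in replacement for the patch above, and the names below are placeholders for the real Broadcast2000 ones: loop counters declared in the for statements, and variables declared inside the loop body, are private automatically, so the private() lists can usually be dropped; and collapse(2) is only valid when the two loops are perfectly nested (nothing between the two for lines), which is worth double-checking in brightness.C.

  void apply_offset(unsigned char **input_rows, unsigned char **output_rows,
      int height, int width, int offset)
  {
  // Counters j and k are private automatically; the per-row pointers are
  // declared inside the region, so each thread gets its own.
  #pragma omp parallel for schedule(static)
      for(int j = 0; j < height; j++)
      {
          unsigned char *in = input_rows[j];
          unsigned char *out = output_rows[j];
          for(int k = 0; k < width; k++)
          {
              int v = in[k] + offset;     // per-iteration temporary
              out[k] = v > 255 ? 255 : v;
          }
      }
  }

With the per-row pointers declared between the two loops, as here, collapse(2) would not apply at all, which is why it is left out of this sketch.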
Participants (3):
- Andrew Randrianasulu
- Good Guy
- Phyllis Smith