Hi,
Boy, this is interesting. It does illustrate that you can simplify a complex pipeline of computation,
but it is sort of wishy-washy on how it sets up the thread pools and controls the client thread resources.
begin blathering...
A long time ago, there was a design called HEP (heterogeneous element processor, back in the Cray days)
but it did not catch on well. It was similar to this kind of pipeline simplification in that it did not require a
great deal of programming to set up. It employed an extra bit (the sting bit) on every reg/mem cell that
indicated that the result was stored (ready). The CPUs were not bound strictly to a user context, and the
instruction fetch would block and switch to a ready context if the input operands had not yet been stung
(stored with a pipeline result). This meant that the computation would inherently create a serial pipeline.
For example:
    sqrt(f(x)**2 + f(y)**2)
    cpu1 f(x) stings cpu3, then cpu3 runs **2 and stings cpu5(lt), then cpu5 runs +, and then sqrt
    cpu2 f(y) stings cpu4, then cpu4 runs **2 and stings cpu5(rt)
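To make the firing rule concrete, here is a rough single-threaded C++ sketch of the idea (not of the real
hardware). The names Cell, Op, run and hypot_dataflow are made up for illustration, and f() is only a
placeholder, since what f computes does not matter here. An op fires only once all of its input cells
have been stung, so the order in which the ops are listed does not matter.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A reg/mem cell with its sting bit: stung is set once a result is stored.
struct Cell {
    double value;
    bool stung;
};

// One operation. It may fire only when every input cell has been stung.
struct Op {
    std::string name;
    std::vector<int> in;  // indices of the input cells
    int out;              // index of the cell this op stings
    std::function<double(const std::vector<double> &)> fn;
};

// Assumption: a placeholder for f(); any pure function would do.
static double f(double v) { return v + 1.0; }

// Sweep the ops until nothing more can fire; returns the names in firing order.
std::vector<std::string> run(std::vector<Cell> &cells, const std::vector<Op> &ops) {
    std::vector<std::string> order;
    std::vector<bool> fired(ops.size(), false);
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < ops.size(); ++i) {
            if (fired[i]) continue;
            assert(ops[i].out >= 0 && (std::size_t)ops[i].out < cells.size());
            std::vector<double> args;
            bool ready = true;
            for (int c : ops[i].in) {
                ready = ready && cells[c].stung;
                args.push_back(cells[c].value);
            }
            if (!ready) continue;  // the hardware would switch to a ready context here
            cells[ops[i].out] = Cell{ops[i].fn(args), true};  // store and sting
            fired[i] = true;
            order.push_back(ops[i].name);
            progress = true;
        }
    }
    return order;
}

// sqrt(f(x)**2 + f(y)**2), with the ops deliberately listed back to front.
double hypot_dataflow(double x, double y, std::vector<std::string> *order = nullptr) {
    // cells: 0=x 1=y 2=f(x) 3=f(y) 4=f(x)**2 5=f(y)**2 6=sum 7=result
    std::vector<Cell> cells(8);
    cells[0] = Cell{x, true};
    cells[1] = Cell{y, true};
    std::vector<Op> ops = {
        {"sqrt",  {6},    7, [](const std::vector<double> &a) { return std::sqrt(a[0]); }},
        {"add",   {4, 5}, 6, [](const std::vector<double> &a) { return a[0] + a[1]; }},
        {"sq(y)", {3},    5, [](const std::vector<double> &a) { return a[0] * a[0]; }},
        {"sq(x)", {2},    4, [](const std::vector<double> &a) { return a[0] * a[0]; }},
        {"f(y)",  {1},    3, [](const std::vector<double> &a) { return f(a[0]); }},
        {"f(x)",  {0},    2, [](const std::vector<double> &a) { return f(a[0]); }},
    };
    std::vector<std::string> fired = run(cells, ops);
    if (order) *order = fired;
    return cells[7].value;
}
```

With the placeholder f, hypot_dataflow(2.0, 3.0) works out to sqrt(3**2 + 4**2) = 5, and sqrt is the last
op to fire even though it is listed first.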
The programmer normally did not see any of this functional parallelism in the code design, and
the instruction fetch and context binding were per functional unit, so that float/int/simd etc. were all
part of a pool of computational resources that could decode and execute any next operation that had
all of its input regs/memory sting bits set. Hell, there were event context register sets with lambda
parameter registers so that tail recursions could be unwound into iterations by the hardware.
In reality, much of this is good in the classroom, but not as easy as all that to use in real life.
The stung memory idea could get hung up with A waiting on B, and B waiting on A, through some unnoticed
recursive dependency, and much of it was way past what could be implemented given the limits of hardware
design, time, and money... since it did eventually have to make it into the market and do something useful.
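The A-waits-on-B hang is easy to show with a toy model. This is only a sketch under my own assumptions:
Dep and stuck_ops are invented names, and all it does is sweep until no more ops can fire and then
report whatever is left waiting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One op in the toy model: it needs every cell in `in` stung, then stings `out`.
struct Dep {
    std::vector<int> in;
    int out;
};

// Returns the indices of the ops that can never fire from the initially stung cells.
std::vector<int> stuck_ops(int ncells, const std::vector<int> &initially_stung,
                           const std::vector<Dep> &ops) {
    assert(ncells >= 0);
    std::vector<bool> stung(ncells, false);
    for (int c : initially_stung) stung[c] = true;

    std::vector<bool> fired(ops.size(), false);
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < ops.size(); ++i) {
            if (fired[i]) continue;
            bool ready = true;
            for (int c : ops[i].in) ready = ready && stung[c];
            if (!ready) continue;   // still waiting on an unstung operand
            stung[ops[i].out] = true;
            fired[i] = true;
            progress = true;
        }
    }

    std::vector<int> stuck;
    for (std::size_t i = 0; i < ops.size(); ++i)
        if (!fired[i]) stuck.push_back((int)i);
    return stuck;
}
```

If op 0 reads cell 1 and writes cell 0, while op 1 reads cell 0 and writes cell 1, neither one ever becomes
ready, and both come back in the stuck list no matter how many other ops run around them.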
end blathering...
As for the OpenMP pragmas: this may be a great way to speed up a bunch of the striping done in
many of the standard transformations. Along those lines, much of this code has already been painstakingly
threaded using pthreads. This is mostly done using LoadBalance, a class in cinelerra. It is not nearly
as easy as the OpenMP design, but it does some of the parallel operation that OpenMP seems to
offer. I have found that there is frequently a trade-off between the overhead of setting up a thread to do a
function and the cost of the function itself. It does not make sense (usually) to create a thread for a trivial
work load. The thread management is not free, and the needed locks and atomic ops do add cost to the code.
Threading works well when the loop count is much bigger than the cpu count, but not as well with lots of cpus.
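As a rough sketch of that trade (my assumption of how one might write it, not code from cinelerra or
LoadBalance): the OpenMP if() clause lets a striped loop stay on the calling thread when the work load
is too small to be worth a thread team. The function brighten and the cutoff kMinPixelsForThreads are
made up for illustration, and the cutoff value would have to be measured on the target machine. Build
with -fopenmp on gcc; without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff: below this many pixels, do not bother with threads.
static const std::size_t kMinPixelsForThreads = 64 * 1024;

// Brighten an 8-bit single-channel frame in place, one row stripe per iteration.
void brighten(std::vector<unsigned char> &frame, int width, int height, int delta) {
    assert(width >= 0 && height >= 0);
    assert(frame.size() == (std::size_t)width * height);
    const std::size_t pixels = frame.size();

    // Rows are independent, so they can be handed out statically to the threads.
    // The if() clause keeps small frames on the calling thread.
    #pragma omp parallel for schedule(static) if(pixels >= kMinPixelsForThreads)
    for (int y = 0; y < height; ++y) {
        unsigned char *row = &frame[(std::size_t)y * width];
        for (int x = 0; x < width; ++x) {
            int v = row[x] + delta;
            row[x] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));  // clamp to 0..255
        }
    }
}
```

The result is the same either way; only the decision to spin up the thread team changes with frame size.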
That is sort of a problem, especially with debugging. On my devel machine there are 128 cpus. This
means that when you ask gdb: "inf thr", you get hundreds of results, and only one or two are interesting.
A lot of effort has to be added to the lock trace just to get a handle on what is actually happening.
I am concerned that with a more opaque thread system, simple things like tracing a thread or
determining a thread owner, lock holder, or thread client set may become much more difficult. It is already
quite difficult just to work with a threaded program under the debugger.
For computations near the outer edge of the code graph, this could be great. The simple setup and
high degree of parallelism are exactly what you need for the math-to-code abstraction. It might be a good
idea to make a sort of "test" plugin, and see if it is worth it. If the results are good, and not difficult to
achieve, then there may be a case to "backport" some of the threaded code just to see how it goes.
gg