Discussion:
[Cython] OpenMP thread private variable not recognized (bug report + discussion)
Leon Bottou
2014-08-11 01:20:16 UTC
Permalink
The attached Cython program uses an extension class to represent a unit of
work. The parallel() block temporarily acquires the GIL to allocate the object,
then releases the GIL and performs the task with a "for i in prange(...)"
statement. My expectation was to have w recognized as a thread-private
variable.

with nogil, parallel():
    with gil:
        w = Worker(n)              # should be thread private
        with nogil:
            for i in prange(0, m):
                r += w.run()       # should be reduction
        w = None                   # is this needed?


Cythonize (0.20.2) works without error but produces an incorrect C file.

hello.c: In function ‘__pyx_pf_5hello_run’:
hello.c:2193:42: error: expected identifier before ‘)’ token
hello.c: At top level:

The erroneous line is:

#pragma omp parallel private() reduction(+:__pyx_v_r) private(__pyx_t_5,
__pyx_t_4, __pyx_t_3) firstprivate(__pyx_t_1, __pyx_t_2)
private(__pyx_filename, __pyx_lineno, __pyx_clineno)
shared(__pyx_parallel_why, __pyx_parallel_exc_type, __pyx_parallel_exc_value,
__pyx_parallel_exc_tb)

where you can see that the first private() clause has no argument. The
variable __pyx_v_w is not declared as private either, as I would have expected.

I believe that the problem comes from line 7720 in Cython/Compiler/Nodes.py:

if self.privates:
    privates = [e.cname for e in self.privates
                if not e.type.is_pyobject]
    code.put('private(%s)' % ', '.join(privates))

And I further believe that the clause "if not e.type.is_pyobject" was added
because nothing would decrement the reference count of the thread-private
worker object when leaving the parallel block.

My quick fix would be to remove this clause and make sure that my program
contains the line "w = None" before leaving the thread. But I realize that
this is probably not a sufficient fix from your point of view.
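Concretely, the quick fix I have in mind is just to drop the filter (an
untested sketch against 0.20.2, which still leaves the reference-counting
question open):

if self.privates:
    # no longer skip Python-object variables, so __pyx_v_w is emitted in
    # the private() clause and the clause can no longer end up empty
    privates = [e.cname for e in self.privates]
    code.put('private(%s)' % ', '.join(privates))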

Note that the temporary Python objects generated by the call to the Worker
constructor are correctly recognized as thread-private, and their reference
counts are correctly decremented when they are no longer needed. The problem
here is the clash between Python's scoping rules and the semantics of
thread-private variables. This is one of those cases where I would have liked
to be able to write

with nogil, parallel():
    with gil:
        cdef w = Worker(n)         # block-scoped cdef
        with nogil:
            for i in prange(0, m):
                r += w.run()

with the understanding that the scope of the cdef variable is limited to the
block in which the cdef appears. But when you try this, Cython tells you that
cdef statements are not allowed there.
Sturla Molden
2014-08-12 14:26:31 UTC
Permalink
Cython is not making an error here:

- i is recognized as private
- r is recognized as reduction
- w is (correctly) recognized as shared

If you need thread local storage, use threading.local()
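
For example (a minimal sketch, reusing the Worker class from the original
post):

import threading

_tls = threading.local()        # each thread gets its own attribute namespace

def run_task(n):
    # created once per thread, on first use; invisible to other threads
    if not hasattr(_tls, "worker"):
        _tls.worker = Worker(n)
    return _tls.worker.run()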

I agree that scoped cdefs would be an advantage.

Personally I prefer to avoid OpenMP and just use Python threads and an
internal function (closure) or an internal class. If you start to use
OpenMP, Apple's libdispatch ("GCD"), Intel TBB, or Intel Cilk Plus, you will
soon discover that they are all variations on the same theme: a thread
pool and a closure. Whether you call it a parallel block in OpenMP or an
anonymous block in GCD, it is fundamentally a closure. That's all there is.
You can easily do this with Python threads: Python, unlike C, supports
closures and inner classes directly in the language and does not need
special extensions. Python threads and OpenMP threads will scale
equally well (they are all native OS threads, scheduled in the same way),
and there will be no scoping problems. The sooner you discover you do not
need Cython's prange, the less pain it will cause.
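
Here is a rough sketch of what I mean, as a single .pyx file. The do_chunk
function is just a placeholder for the real per-thread C-level work, and the
run_chunk/parallel_run names and the chunking are made up for illustration:

# sketch.pyx -- thread pool plus closure, instead of prange
from threading import Thread

cdef double do_chunk(long start, long stop) nogil:
    # stand-in for the real C-level work done by each thread
    cdef long i
    cdef double s = 0
    for i in range(start, stop):
        s += 1
    return s

cpdef double run_chunk(long start, long stop):
    cdef double r
    with nogil:                  # release the GIL around the C-level work
        r = do_chunk(start, stop)
    return r

def parallel_run(long m, int nthreads=4):
    results = [0.0] * nthreads

    def work(k):                 # the closure that each Python thread runs
        results[k] = run_chunk(k * m // nthreads, (k + 1) * m // nthreads)

    threads = [Thread(target=work, args=(k,)) for k in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

Because run_chunk releases the GIL, the Python threads can run the C-level
work concurrently, just as an OpenMP parallel block would.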

Sturla
Brett Calcott
2014-08-12 15:10:43 UTC
Permalink
Post by Sturla Molden
The sooner you discover you do not
need Cython's prange, the less pain it will cause.
For someone who has bumbled around trying to use prange & openmp on the mac
(but successfully used python threading), this sounds great. Is there an
example of this somewhere that you can point us to?

Thanks
Brett
Sturla Molden
2014-08-12 15:20:44 UTC
Permalink
For someone who has bumbled around trying to use prange & openmp on the mac
(but successfully used python threading), this sounds great. Is there an
example of this somewhere that you can point us to?
No, but I could make one :)

ipython notebook?


Sturla
Brett Calcott
2014-08-12 15:31:34 UTC
Permalink
That would be awesome (I love ipython notebook).
Post by Brett Calcott
For someone who has bumbled around trying to use prange & openmp on
the mac
(but successfully used python threading), this sounds great. Is there an
example of this somewhere that you can point us to?
No, but I could make one :)
ipython notebook?
Sturla
Leon Bottou
2014-08-12 16:06:23 UTC
Permalink
Cython is not making an error here: [...]
- i is recognized as private
- r is recognized as reduction
- w is (correctly) recognized as shared
Not according to the documentation. The documentation for
cython.parallel.parallel at
http://docs.cython.org/src/userguide/parallelism.html says: "A contained
prange will be a worksharing loop that is not parallel, so any variable
assigned to in the parallel section is also private to the prange. Variables
that are private in the parallel block are unavailable after the parallel
block." Variable w is such a variable.

Furthermore, if Cython is correct, why does GCC report an error on the
Cython-generated C code?

My point here is that there is a bug, because (a) Cython does not behave as
documented, and (b) it generates invalid C code without reporting any
error.
Personally I prefer to avoid OpenMP and just use Python threads and an
internal function (closure) or an internal class. If you start to use OpenMP,
Apple's libdispatch ("GCD"), Intel TBB, or Intel Cilk Plus, you will soon
discover that they are all variations on the same theme: a thread pool and a
closure.

I am making heavy use of OpenBLAS, which also uses OpenMP.
Using the same queue manager avoids lots of CPU provisioning problems.
Using multiple queue managers in the same code does not work as well because
they are not aware of what the others are doing.


- L.
Sturla Molden
2014-08-12 16:42:51 UTC
Permalink
Post by Leon Bottou
I am making heavy use of OpenBLAS, which also uses OpenMP.
Using the same queue manager avoids lots of CPU provisioning problems.
Using multiple queue managers in the same code does not work as well because
they are not aware of what the others are doing.
Normally OpenBLAS is built without OpenMP. Also, OpenMP is not fork-safe
(cf. multiprocessing) but OpenBLAS's own thread pool is. So it is recommended
to build OpenBLAS without an OpenMP dependency.

That is: if you build OpenBLAS with OpenMP, numpy.dot will hang if used
together with multiprocessing.
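
For illustration, a minimal sketch of the scenario (assuming an OpenMP-enabled
OpenBLAS and a fork-based multiprocessing start method; the array sizes and
the child_dot helper are made up):

import numpy as np
from multiprocessing import Pool

def child_dot(seed):
    a = np.random.RandomState(seed).rand(500, 500)
    return float(np.dot(a, a).sum())   # may hang here in a forked child

if __name__ == '__main__':
    a = np.random.rand(500, 500)
    np.dot(a, a)     # the parent starts the BLAS thread pool before the fork
    p = Pool(4)
    print(p.map(child_dot, range(4)))
    p.close()
    p.join()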


Sturla
Dave Hirschfeld
2014-08-13 06:53:01 UTC
Permalink
Post by Sturla Molden
Post by Leon Bottou
I am making heavy use of OpenBLAS, which also uses OpenMP.
Using the same queue manager avoids lots of CPU provisioning problems.
Using multiple queue managers in the same code does not work as well because
they are not aware of what the others are doing.
Normally OpenBLAS is built without OpenMP. Also, OpenMP is not fork-safe
(cf. multiprocessing) but OpenBLAS's own thread pool is. So it is recommended
to build OpenBLAS without an OpenMP dependency.
That is: if you build OpenBLAS with OpenMP, numpy.dot will hang if used
together with multiprocessing.
Sturla
Just wanting to clarify: is it only the GNU OpenMP implementation that
isn't fork-safe? AFAIK the Intel OpenMP runtime is, and it will at some stage
be available in the master branch of clang.

-Dave
Sturla Molden
2014-08-13 08:17:35 UTC
Permalink
Post by Dave Hirschfeld
Just wanting to clarify: is it only the GNU OpenMP implementation that
isn't fork-safe? AFAIK the Intel OpenMP runtime is, and it will at some stage
be available in the master branch of clang.
That is correct.
Leon Bottou
2014-08-13 18:21:02 UTC
Permalink
Post by Sturla Molden
Post by Leon Bottou
I am making heavy use of OpenBLAS, which also uses OpenMP.
Using the same queue manager avoids lots of CPU provisioning problems.
Using multiple queue managers in the same code does not work as well
because they are not aware of what the others are doing.
Normally OpenBLAS is built without OpenMP. Also, OpenMP is not fork-safe
(cf. multiprocessing) but OpenBLAS's own thread pool is. So it is recommended
to build OpenBLAS without an OpenMP dependency.
That is: if you build OpenBLAS with OpenMP, numpy.dot will hang if used
together with multiprocessing.
I am effectively using a version of OpenBLAS built with OpenMP because
Debian used to compile OpenBLAS this way. They seem to have reverted that now.
Note that I cannot use Python multiprocessing because my threads work on a
very large state vector. My current solution is to use Python threading and
nogil Cython-compiled routines, but this sometimes leads to weird
thread-provisioning effects.

This is why I wanted to try the pure OpenMP solution and found the
aforementioned bug in cython.parallel.

Is there somebody actively trying to make cython.parallel work correctly?
- If yes, then my bug report should be of interest to this person.
- If no, then one should avoid (and possibly deprecate) cython.parallel and
find other ways to do things.

Thanks for the replies.

- L.
