mac,thread: fix pthread_barrier_wait serial thread#2003
mac,thread: fix pthread_barrier_wait serial thread#2003ofrobots wants to merge 1 commit intolibuv:v1.xfrom
Conversation
|
Ping @cjihrig @bnoordhuis @saghul Would you mind taking a peek at this PR? |
|
Hi @ofrobots!
Plausible. But from this I understand that you don't have proof, just a guess? Is the implementation wrong?
Let us say instead, it is not safe for any other thread to successfully destroy the barrier until all threads have exited it safely. The The corresponding libuv polyfill for As a result, my understanding is that:
(I might be wrong, just sharing my perspective on the code/spec). What should the implementation do?The return value |
Am I reasonably certain? Yes. The crashes that I observe match the thread races as I have described them. With my change, the crashes get fixed.
I can buy that argument. The more interesting question is whether
The only documentation for With that in mind, I stand by the fix I provided here, but I am open to alternative suggestions if you have them. |
Discussion
Maybe I misunderstood the behavior you are experiencing.
The libuv docs say here that "The API largely follows the pthreads API", which is why I was looking at the POSIX spec. Your patchI am concerned about your implementation in the case of a barrier initialized with a limit of 1. |
src/unix/thread.c
Outdated
|
|
||
| pthread_mutex_unlock(&barrier->mutex); | ||
| return PTHREAD_BARRIER_SERIAL_THREAD; | ||
| return 0; |
There was a problem hiding this comment.
In a 1-thread barrier, shouldn't this return PTHREAD_BARRIER_SERIAL_THREAD?
Can we put the do-while loop under an else so that all threads share the same exit path?
src/unix/thread.c
Outdated
|
|
||
| int pthread_barrier_wait(pthread_barrier_t* barrier) { | ||
| int rc; | ||
| char last; |
There was a problem hiding this comment.
Why did you use char instead of int?
src/unix/thread.c
Outdated
| if ((rc = pthread_cond_wait(&barrier->cond, &barrier->mutex)) != 0) | ||
| break; | ||
| } while (barrier->in != 0); | ||
| assert(rc == 0); |
There was a problem hiding this comment.
This should probably be just:
do {
rc = pthread_cond_wait(&barrier->cond, &barrier->mutex);
assert(rc == 0);
} while (barrier->in != 0);There was a problem hiding this comment.
Now that I think about it... everywhere else we call abort() when pthread functions fail when they logically can't. Would be good to do that here too.
There was a problem hiding this comment.
Done. There are a few more places where we should be aborting if a pthread function fails illogically. I didn't include those in this PR to keep the review delta small. I think they can be addressed in a follow-on.
src/unix/thread.c
Outdated
| /* Mark thread exit. If this the last thread to exit, return | ||
| PTHREAD_BARRIER_SERIAL_THREAD. */ | ||
| last = (--barrier->out == 0); | ||
| pthread_cond_signal(&barrier->cond); |
There was a problem hiding this comment.
Optimization: if this is the last thread, don't call pthread_cond_signal(); there isn't anyone to signal.
cf025f6 to
fb3cfab
Compare
I have added a commit with a test ( |
fb3cfab to
e34cd4a
Compare
bnoordhuis
left a comment
There was a problem hiding this comment.
Thanks, mostly LGTM. Is the third commit (the one that registers barrier_serial_thread_single) supposed to be squashed with the first one?
test/test-barrier.c
Outdated
| } | ||
|
|
||
| static void serial_worker(void* data) { | ||
| uv_barrier_t* barrier = (uv_barrier_t*) data; |
test/test-barrier.c
Outdated
| exited the barrier. If this value is returned too early and the barrier is | ||
| destroyed prematurely, then this test may see a crash. */ | ||
| TEST_IMPL(barrier_serial_thread) { | ||
| uv_thread_t threads[NUM_WORKERS]; |
There was a problem hiding this comment.
It'd be slightly more idiomatic to declare this as uv_thread_t threads[4]; and then use ARRAY_SIZE(threads) everywhere.
test/test-barrier.c
Outdated
|
|
||
| ASSERT(0 == uv_barrier_init(&barrier, NUM_WORKERS + 1)); | ||
|
|
||
| for (int i = 0; i < NUM_WORKERS; ++i) { |
There was a problem hiding this comment.
Can you declare int i at the top?
I have (at last!) identified the cause. Maybe this was obvious to everyone already, but just in case... I have been studying the source of the libuv polyfills, and I can't find a thread schedule that has But you're calling The In summary:
I don't like the prospect of memory leaks, so I support your patch. |
e34cd4a to
01e18bf
Compare
It is not clear (to me) which spec applies here anyway. I suggest we move this discussion to another issue as it is not necessary to resolve the bug that I am trying to fix. @bnoordhuis I've squashed the commit and address the rest of the comments. Thank you both for your review and insights! |
|
Should this be targeting v1.x? |
|
Agreed, @ofrobots can you rebase? |
01e18bf to
a989af1
Compare
|
@davisjam can we fork that discussion into a separate issue? I would this change to land. This fix is blocking a cascade of PRs in Node.js. |
|
What are the next steps here, is there a CI somewhere that can be run? |
bnoordhuis
left a comment
There was a problem hiding this comment.
is there a CI somewhere that can be run?
Indeed there is: https://ci.nodejs.org/view/libuv/job/libuv-test-commit/1047/
src/unix/thread.c
Outdated
|
|
||
| int pthread_barrier_wait(pthread_barrier_t* barrier) { | ||
| int rc; | ||
| int rc, last; |
There was a problem hiding this comment.
Minor style issue: one declaration per line.
The thread returning PTHREAD_BARRIER_SERIAL_THREAD should be the one last to exit the barrier, otherwise it is unsafe to destroy the barrier – other threads may have yet to exit.
a989af1 to
67130df
Compare
|
Windows test failure seems unrelated. AIX failure is relevant and needs to be investigated. I have opened nodejs/build#1514 to get login access to AIX to be able to investigate this. |
|
It seems that the AIX implementation of This means that that default example of if (uv_barrier_wait(&barrier) > 0)
uv_barrier_destroy(&barrier);I am forced to conclude that the uv_barrier_wait(&barrier);
if (0 == uv_barrier_destroy(&barrier)) {
// any additional cleanup, e.g. free of the barrier.
} |
|
I also noticed that we have AIX 6.1 in the CI. As per https://www-01.ibm.com/support/docview.wss?uid=isg3T1012517, AIX 6.1 may already be EOL since last year? It is possible that the behaviour is different/fixed on AIX 7. Looking at node supported platforms: https://github.com/nodejs/node/blob/master/BUILDING.md#supported-platforms-1, AIX 7 is the minimum supported. libuv claims to support AIX 6 though. |
It was pointed out that pthread_barrier_wait() behaves slightly different from other platforms. Switch to libuv's own thread barrier for uniformity of behavior. Perhaps we'll do that for more platforms in the future. Refs: libuv#2003 (comment)
|
Maybe we should use our own barrier implementation everywhere for the sake of uniform behavior, or at least switch to it on AIX. I've opened #2019 and included the changes from this PR. |
|
👍. Deferring to #2019. |
It was pointed out that pthread_barrier_wait() behaves slightly different from other platforms. Switch to libuv's own thread barrier for uniformity of behavior. Perhaps we'll do that for more platforms in the future. PR-URL: libuv#2019 Refs: libuv#2003 (comment) Reviewed-By: Santiago Gimeno <[email protected]>
Libuv's own thread barrier implementation signaled completion to the
first waiter that saw the threshold being reached, contrary to what
some native pthreads barrier implementations do, which is to signal
it to the _last_ waiter.
Libuv's behavior is not strictly non-conforming but it's inconvenient
because it means this snippet (that appears in the libuv documentation)
has a race condition in it:
if (uv_barrier_wait(&barrier) > 0)
uv_barrier_destroy(&barrier); // can still have waiters
This issue was discovered and fixed by Ali Ijaz Sheikh, a.k.a @ofrobots,
but some refactoring introduced conflicts in his pull request and I
didn't have the heart to ask him to redo it from scratch. :-)
PR-URL: libuv#2019
Refs: libuv#2003
Reviewed-By: Santiago Gimeno <[email protected]>
Hi Everyone 👋. This is my first PR here. While trying to fix nodejs/node#23065 using
uv_barriers, I noticed a thread race on my mac that, I believe, stems from a buggy polyfill ofpthread_barrier_waitin libuv.Here's the documentation on
uv_barrier_wait:The actual implementation returned
PTHREAD_BARRIER_SERIAL_THREAD(i.e. > 0) on the last thread to enter the wait. IMHO this value should be returned on the last thread to exit the wait. It is not safe for any other thread to destroy the barrier as there may be threads yet to exit the barrier.Checking with implementations elsewhere, refer to the the implementation of
absl::Barrier::Blockfrom the excellent abseil libraries, and they seem to agree with my assessment.I didn't know about
pthread_barrier_tuntil yesterday so it is possible I am figuring things wrong. PTAL.