The glibc s390 ABI break
The GNU C library (glibc) project has long lived up to a reputation for conservatism; glibc developers know that an ill-chosen change can create a great deal of pain downstream, so they proceed with caution. Even so, mistakes can happen. A recent slip-up involving the s390 architecture makes it clear how one of those mistakes can cascade into a significant mess that is hard to clean up afterward.
The setjmp() and longjmp() functions have been part of the standard C library since something close to the beginning. They can be used to perform stack unwinding — a sort of "long return" from a function that skips over any number of intervening function calls. Both of these functions take an opaque jmp_buf data structure as an argument. The caller provides the buffer to setjmp(), which fills it with the information needed to make another return to the location of that call. A later call to longjmp() with that buffer will then cause setjmp() to appear to have returned a second time.
Back in April, developers from IBM committed a patch that changed the size of the jmp_buf structure on the s390 architecture; this change, which subsequently became part of the 2.19 release, was apparently needed to enable better hardware support for setjmp() and longjmp(). Since jmp_buf is a type that is visible to applications, this was a clear ABI change, with all of the possible problems that can go with it. For example, newer glibc releases expect the larger jmp_buf size, but they may be linked (at run time) against applications that have not been rebuilt and, thus, are still working with the older version of jmp_buf.
This possibility was taken into account, though. Symbol versioning was used to provide compatible versions of setjmp() and longjmp() for these older applications. So, in theory, things should Just Work without additional problems. This particular theory did not last long after its encounter with the real world, though.
The problem is that jmp_buf structures are often embedded into other structures, so a change in the size of that structure will change the containing structures too. To find victims, one need not even look outside of glibc; it turns out that glibc's POSIX threads (pthreads) implementation embeds a jmp_buf structure into its own __pthread_unwind_buf_t structure which, in turn, is visible to applications. So, as a result, a number of pthreads functions need to become versioned as well.
Versioning does not work, though, for problems that pop up outside of glibc. Consider, for example, the Perl interpreter, which embeds a jmp_buf in its main "this is a running Perl instance" structure. That has caused various Perl modules to fail and can only really be fixed by rebuilding the entire Perl environment. The PNG image format library (libpng) also has an embedded jmp_buf, in a structure that is used by all PNG-using applications.
Debian's developers, who were trying to clean up this mess, considered rebuilding all of Perl and then, perhaps, all (500 or so) packages depending on the PNG library. But, by this point, it became clear that the ripples from this change spread widely indeed and that playing whack-a-mole may never get all of them fixed. So the Debian developers have figured that the course they may have to consider is to "do like Red Hat, ie just rebuild everything and warn the users their system might break during upgrade." Needless to say, this approach lacks appeal, especially in the Debian world, where mass rebuilds are a rare event.
Even then, of course, there is the problem of end-user applications. Distributors cannot rebuild those; even worse, the user may not be able to either. So some things might just be broken.
One might be thinking that there is a mechanism in place for this kind of incompatible ABI change. Shared libraries have a shared-object name ("soname") built into them; applications linked against those libraries also contain that name. For glibc on your editor's system, for example, the soname is "libc.so.6". The runtime linker will not link an application against a shared object if the sonames do not match. In this way, the system can disallow running against a library that will not work. It also enables, in theory, the parallel installation of multiple versions of the library; older applications would continue to use the older library, while newly built binaries would use the current version.
So the glibc project could consider making a point release with a different soname (libc.so.6.1, say); distributors could then install the result alongside an older version of the library and, in theory, things should work. Except that glibc developer Carlos O'Donell tried it and concluded that:
The SO name bump in a mixed-ABI environment like debian results in two libc's being loaded and competing for effectively the same namespace of symbols with resolution (and therefore selection of the ABI) being determined by ELF interposition and scope rules. It's a nightmare. It's possible a worse solution than just telling everyone to rebuild and get on with their lives.
It also turns out to be painful to bootstrap a system with a new, ABI-incompatible version of the C library. So it seems that the soname change will not happen and that, on s390, a lot of rebuilding is going to have to go on. It will also become impossible to move affected applications between systems with pre- and post-change libraries. Not fun, but, as David Miller put it:
That leads to the obvious question: what can be done to avoid this kind of problem in the future? Carlos plans to put together a policy on how to manage ABI changes, with "don't break ABI ever" as the first item. There has been talk of improving the testing tools in an attempt to catch this kind of ABI break in the future.
In the end, though, nothing can replace a high level of care on the part of the developers involved. Glibc developers have always shown that care, which is why stories like this one are rare. In the aftermath of this mistake, one can assume that they will be doubly careful in the future. That, along with some testing support, should help to ensure that upcoming glibc releases are free of this kind of issue.
Posted Jul 17, 2014 7:07 UTC (Thu)
by airlied (subscriber, #9104)
[Link] (2 responses)
though whether that would be because he'd catch it or just have never applied the patch, who knows!
Posted Jul 17, 2014 9:30 UTC (Thu)
by jhhaller (guest, #56103)
[Link]
Posted Jul 22, 2014 18:38 UTC (Tue)
by fw (subscriber, #26023)
[Link]
Posted Jul 17, 2014 10:48 UTC (Thu)
by danpb (subscriber, #4831)
[Link] (1 responses)
Seems like some kind of automated testing of the public ABI could have caught this problem, i.e. something that validates that the size of each and every public struct does not change. Of course, changing the jmp_buf size was a deliberate decision, but the ripple effects it caused on other structs could have been identified sooner, perhaps causing a rethink of the change to jmp_buf.
Posted Jul 17, 2014 14:28 UTC (Thu)
by jtaylor (subscriber, #91739)
[Link]
Posted Jul 17, 2014 15:07 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
> …the Perl interpreter, which embeds a jmp_buf…
So it's translucent? This is not the definition of "opaque" structure I'm used to in C.
> ELF interposition and scope rules
Is the reason we don't support looking up symbols only in global and directly linked libraries due to performance and too much extra bookkeeping? I'd really like this to be possible as well:
- libA.so links libC.so
- libB.so links libC.so
- myapp does *not* link libC.so
- myapp: dlopen("libA.so", RTLD_LOCAL | RTLD_NOW); // opens libC.so implicitly
- myapp: dlopen("libB.so", RTLD_LOCAL | RTLD_NOW); // fails with missing symbols from libC.so
Since libB.so directly and explicitly links libC.so, why is it denied access to libC.so based on libA.so's transitive linking? If libC.so were opened directly with RTLD_LOCAL, I could see some logic behind it, but this makes much less sense to me and basically means when loading a plugin, I have to use RTLD_GLOBAL or risk this exact problem.
Posted Jul 17, 2014 17:44 UTC (Thu)
by RobSeace (subscriber, #4435)
[Link] (1 responses)
Yeah, jmp_buf is definitely not opaque... It's fully defined in <setjmp.h> (and some other files like <bits/setjmp.h> for the types of some of its members)... As you point out, if it were truly opaque, no one would be able to embed it anywhere, because they wouldn't have a full definition for it! They could basically only work with pointers to it... (I'm not sure if there are any true opaque structs in glibc... In theory, stdio FILE could probably be opaque, but in practice it's not... Maybe DIR is?)
I suppose it's "opaque" in a way, since the majority of it is just defined as a bunch of nondescript ints whose meaning is left as a complete mystery to the caller... So, one is obviously not meant to go poking in it...
Posted Jul 22, 2014 18:39 UTC (Tue)
by fw (subscriber, #26023)
[Link]
Posted Jul 17, 2014 18:05 UTC (Thu)
by Karellen (subscriber, #67644)
[Link] (1 responses)
https://wiki.debian.org/Multiarch
Posted Jul 18, 2014 1:18 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link]
Posted Jul 31, 2014 0:32 UTC (Thu)
by vomlehn (guest, #45588)
[Link]
Posted Jul 31, 2014 18:34 UTC (Thu)
by sharkcz (guest, #52232)
[Link]
https://sourceware.org/glibc/wiki/Testing/ABI_checker#gli...
— libB.so links libC.so
— myapp does *not* link libC.so
— myapp: dlopen("libA.so", RTLD_LOCAL | RTLD_NOW); // opens libC.so implicitly
— myapp: dlopen("libB.so", RTLD_LOCAL | RTLD_NOW); // fails with missing symbols from libC.so
ABIs are *hard*