The glibc s390 ABI break

By Jonathan Corbet
July 16, 2014
The GNU C library (glibc) project has long lived up to a reputation for conservatism; glibc developers know that an ill-chosen change can create a great deal of pain downstream, so they proceed with caution. Even so, mistakes can happen. A recent slip-up involving the s390 architecture makes it clear how one of those mistakes can cascade into a significant mess that is hard to clean up afterward.

The setjmp() and longjmp() functions have been part of the standard C library since something close to the beginning. They can be used to perform stack unwinding — a sort of "long return" from a function that skips over any number of intervening function calls. Both of these functions take an opaque jmp_buf data structure as an argument. The caller provides the buffer to setjmp(), which fills it with the information needed to make another return to the location of that call. A later call to longjmp() with that buffer will then cause setjmp() to appear to have returned a second time.

Back in April, developers from IBM committed a patch that changed the size of the jmp_buf structure on the s390 architecture; this change, which subsequently became part of the 2.19 release, was apparently needed to enable better hardware support for setjmp() and longjmp(). Since jmp_buf is a type that is visible to applications, this was a clear ABI change, with all of the possible problems that can go with it. For example, newer glibc releases expect the larger jmp_buf size, but applications that have not been rebuilt may be linked against them at run time while still working with the older, smaller version of jmp_buf.

This possibility was taken into account: symbol versioning was used to provide compatible versions of setjmp() and longjmp() for these older applications. So, in theory, things should Just Work without additional problems. This particular theory did not last long after its encounter with the real world, though.

The problem is that jmp_buf structures are often embedded into other structures, so a change in the size of that structure will change the containing structures too. To find victims, one need not even look outside of glibc; it turns out that glibc's POSIX threads (pthreads) implementation embeds a jmp_buf structure into its own __pthread_unwind_buf_t structure which, in turn, is visible to applications. So, as a result, a number of pthreads functions need to become versioned as well.

Versioning does not work, though, for problems that pop up outside of glibc. Consider, for example, the Perl interpreter, which embeds a jmp_buf in its main "this is a running Perl instance" structure. That has caused various Perl modules to fail (example) and can only really be fixed by rebuilding the entire Perl environment. The PNG image format library (libpng) also has an embedded jmp_buf — in a structure that is used by all PNG-using applications.

Debian's developers, who were trying to clean up this mess, considered rebuilding all of Perl and then, perhaps, all (500 or so) packages depending on the PNG library. But, by this point, it became clear that the ripples from this change spread widely indeed and that playing whack-a-mole may never get all of them fixed. So the Debian developers have concluded that they may have to "do like Red Hat, ie just rebuild everything and warn the users their system might break during upgrade." Needless to say, this approach lacks appeal, especially in the Debian world, where mass rebuilds are a rare event.

Even then, of course, there is the problem of end-user applications. Distributors cannot rebuild those; even worse, the user may not be able to either. So some things might just be broken.

One might be thinking that there is a mechanism in place for this kind of incompatible ABI change. Shared libraries have a shared-object name ("soname") built into them; applications linked against those libraries also contain that name. For glibc on your editor's system, for example, the soname is "libc.so.6". The runtime linker will not link an application against a shared object if the sonames do not match. In this way, the system can disallow running against a library that will not work. It also enables, in theory, the parallel installation of multiple versions of the library; older applications would continue to use the older library, while newly built binaries would use the current version.

So the glibc project could consider making a point release with a different soname (libc.so.6.1, say); distributors could then install the result alongside an older version of the library and, in theory, things should work. Except that glibc developer Carlos O'Donell tried it and concluded that:

It's unsupportable as a solution for glibc.

The SO name bump in a mixed-ABI environment like debian results in two libc's being loaded and competing for effectively the same namespace of symbols with resolution (and therefore selection of the ABI) being determined by ELF interposition and scope rules. It's a nightmare. It's possibly a worse solution than just telling everyone to rebuild and get on with their lives.

It also turns out to be painful to bootstrap a system with a new, ABI-incompatible version of the C library. So it seems that the soname change will not happen and that, on s390, a lot of rebuilding is going to have to go on. It will also become impossible to move affected applications between systems with pre- and post-change libraries. Not fun, but, as David Miller put it:

Therefore, on the negative side, we might be stuck with this. But, on the positive side, we can refer to this incident next time a similar incident arises. We now know exactly what the ramifications are for not handling this properly.

That leads to the obvious question: what can be done to avoid this kind of problem in the future? Carlos plans to put together a policy on how to manage ABI changes, with "don't break ABI ever" as the first item. There has been talk of improving the testing tools in an attempt to catch this kind of ABI break in the future.

In the end, though, nothing can replace a high level of care on the part of the developers involved. Glibc developers have always shown that care, which is why stories like this one are rare. In the aftermath of this mistake, one can assume that they will be doubly careful in the future. That, along with some testing support, should help to ensure that upcoming glibc releases are free of this kind of issue.



The glibc s390 ABI break

Posted Jul 17, 2014 7:07 UTC (Thu) by airlied (subscriber, #9104) [Link] (2 responses)

never would have happened on Uli's watch!

though whether that would be because he'd catch it or just have never applied the patch, who knows!

The glibc s390 ABI break

Posted Jul 17, 2014 9:30 UTC (Thu) by jhhaller (guest, #56103) [Link]

I know of one such patch he rejected. The semaphores in shared memory using sem_init are a different size for 32-bit binaries and 64-bit binaries, meaning that a semaphore can't be shared by 32-bit and 64-bit binaries. A change was proposed to make the 32-bit version compatible with the 64-bit version, and it was rapidly shot down by Uli for breaking ABI compatibility.

The glibc s390 ABI break

Posted Jul 22, 2014 18:38 UTC (Tue) by fw (subscriber, #26023) [Link]

What about the "extern int errno" business? (Yes, I know, that's pretty lame, but it still hurt when you were affected by it.)

The glibc s390 ABI break

Posted Jul 17, 2014 10:48 UTC (Thu) by danpb (subscriber, #4831) [Link] (1 responses)

> it turns out that glibc's POSIX threads (pthreads) implementation embeds a jmp_buf structure into its own __pthread_unwind_buf_t structure which, in turn, is visible to applications

Seems like some kind of automated testing of the public ABI could have caught this problem, i.e. something that validates that the size of any & every public struct does not change. Of course changing the jmp_buf size was a deliberate decision, but the ripple effects it caused on other structs could have been identified sooner, perhaps causing a rethink on the change to jmp_buf.

The glibc s390 ABI break

Posted Jul 17, 2014 14:28 UTC (Thu) by jtaylor (subscriber, #91739) [Link]

This is done, but only for x86. This ABI break affects only S390.
https://sourceware.org/glibc/wiki/Testing/ABI_checker#gli...

The glibc s390 ABI break

Posted Jul 17, 2014 15:07 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

> Both of these functions take an opaque jmp_buf data structure as an argument.

> …the Perl interpreter, which embeds a jmp_buf…

So it's translucent? This is not the definition of "opaque" structure I'm used to in C.

> ELF interposition and scope rules

Is the reason we don't support looking up symbols only in global and directly linked libraries due to performance and too much extra bookkeeping? I'd really like this to be possible as well:

— libA.so links libC.so
— libB.so links libC.so
— myapp does *not* link libC.so
— myapp: dlopen("libA.so", RTLD_LOCAL | RTLD_NOW); // opens libC.so implicitly
— myapp: dlopen("libB.so", RTLD_LOCAL | RTLD_NOW); // fails with missing symbols from libC.so

Since libB.so directly and explicitly links libC.so; why is it denied access to libC.so based on libA.so's transitive linking? If libC.so were opened directly with RTLD_LOCAL, I could see some logic behind it, but this makes much less sense to me and basically means when loading a plugin, I have to use RTLD_GLOBAL or risk this exact problem.

The glibc s390 ABI break

Posted Jul 17, 2014 17:44 UTC (Thu) by RobSeace (subscriber, #4435) [Link] (1 responses)

> So it's translucent? This is not the definition of "opaque" structure I'm used to in C.

Yeah, jmp_buf is definitely not opaque... It's fully defined in <setjmp.h> (and some other files like <bits/setjmp.h> for the types of some of its members)... As you point out, if it were truly opaque, no one would be able to embed it anywhere, because they wouldn't have a full definition for it! They could basically only work with pointers to it... (I'm not sure if there are any true opaque structs in glibc... In theory, stdio FILE could probably be opaque, but in practice it's not... Maybe DIR is?)

I suppose it's "opaque" in a way, since the majority of it is just defined as a bunch of nondescript ints whose meaning is left as a complete mystery to the caller... So, one is obviously not meant to go poking in it...

The glibc s390 ABI break

Posted Jul 22, 2014 18:39 UTC (Tue) by fw (subscriber, #26023) [Link]

DIR is opaque. Historically, DIR * was sometimes implemented as an integer file descriptor cast to a pointer, which is why readdir used a static, global buffer and was not thread-safe.

The glibc s390 ABI break

Posted Jul 17, 2014 18:05 UTC (Thu) by Karellen (subscriber, #67644) [Link] (1 responses)

I'm wondering if Debian could solve this better with Multiarch, to create two entirely distinct "architectures" for the same hardware, rather than attempting a libc soname bump within the current s390 arch.

https://wiki.debian.org/Multiarch

The glibc s390 ABI break

Posted Jul 18, 2014 1:18 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

Well, any new builds will be the "new" architecture, so it isn't like the "old" architecture has any kind of future. Are there really enough idle hands on the s390 porters team for Debian that this is viable anyways?

ABIs are *hard*

Posted Jul 31, 2014 0:32 UTC (Thu) by vomlehn (guest, #45588) [Link]

I spent a lot of time with the MIPS ABI Group and learned a lot about how hard it is to deal with ABIs. You really cannot change the size of anything without breaking compatibility. Ever. To avoid this, we scrutinized proposed data structures before adopting them to ensure that they would never need to grow. In one case, one vendor had a data structure several times larger than the size everyone else used. That became the size for everyone because we needed to support all the implementations. And then we lived with it. I still wonder why they really needed all that room.

The glibc s390 ABI break

Posted Jul 31, 2014 18:34 UTC (Thu) by sharkcz (guest, #52232) [Link]

Yes, it was a nice exercise :-) I ended up doing ad-hoc rebuilds of circa 85 Perl modules and 2 other libraries for Fedora; the rest was (and is being) processed during the continuous Rawhide build process.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds