Leading items
Welcome to the LWN.net Weekly Edition for February 1, 2018
This edition contains the following feature content, all drawn from the 2018 linux.conf.au:
- Too many lords, not enough stewards: Daniel Vetter on the problems with the kernel's development process and why they may not be solved anytime soon.
- Increasing open-source inclusivity with paper circuits: Andrew "bunnie" Huang with a scheme to bring more people into technology by making circuits and coding more accessible.
- The effect of Meltdown and Spectre in our communities: a panel discussion on the aftermath from these vulnerabilities.
- QUIC as a solution to protocol ossification: the QUIC protocol now carries 7% of the traffic on the Internet; Jana Iyengar explains how it works and how it came about.
- Containers from user space: Jessie Frazelle's keynote on interesting things that can be done with the Linux container model.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Too many lords, not enough stewards
For anyone who has followed Daniel Vetter's talks over the last year or two, it is fairly clear that he is not happy with the kernel development process and the role played by kernel maintainers. In a strongly worded talk at linux.conf.au (LCA) 2018 in Sydney, he further explored the topic (that he also raised at LCA 2017) in a talk entitled "Burning down the castle". In his view, kernel development is broken and it is unlikely to improve anytime soon.
He started by noting that this would be a "rather more personal talk than others I give". It traces his journey, which began when he first looked in on the kernel in high school to learn how operating systems work. The kernel developers were his heroes, creating this awesome operating system while discussing things out in the open.
Eventually he started scratching his own itch in the graphics subsystem, which led to him getting hired to work on Linux graphics professionally on a small team. He got volunteered to be the kernel maintainer for that team, which grew from three to twenty people in a year or two. In that time he learned the tough lesson that "leading teams is leading people". He also came to see that the way kernel maintainers work is making developers unhappy, including him; the talk would be a look at how he learned just how broken things are.
What's broken
The first thing that generally comes up when discussing what's broken in the kernel community is the discussion culture, for example Linus Torvalds cursing at someone. That culture is enshrined in the kernel's Code of Conflict, which says that if you want to contribute to the Linux kernel, "you will get shredded for the greater good of the project", he said. There is another paragraph that says if you get really uncomfortable, you can report it to the Technical Advisory Board (TAB) "and they might not be able to do a whole lot about it", he said.
![Daniel Vetter [Daniel Vetter]](https://web.archive.org/web/20221119200451im_/https://static.lwn.net/images/conf/2018/lca/DanielVetter-sm.jpg)
But if you talk to kernel developers at conferences, they will often say that the culture has gotten much better in the past few years. Generally what they mean, Vetter said, is that the "rather violent language and discussion" in the kernel community has decreased or disappeared. But he thinks that is not the real problem; it is only one aspect of the problem.
That led to his first interlude, which was about a book that he read as part of his thinking about Linux kernel culture. It was Why Does He Do That?: Inside the Minds of Angry and Controlling Men by Lundy Bancroft. The interesting part of that book is the archetypes of abusers that Bancroft has extracted and the patterns of behavior that abusers engage in so that they can stay in power, he said. Vetter's main takeaway from the book is that abuse comes down to two elements.
First, the abuser must have power over the victim; in the kernel case, maintainers have a lot of power over their contributors, since their right to reject code is absolute. To get code into the upstream kernel, developers have to deal with the maintainer of the subsystem.
The other element is controlling behavior, which is not necessarily violence. Clearly, violence can be completely controlling behavior that puts the safety of the victim at risk, but there are counterexamples such as martial arts. Those are violent but, if done correctly, respectful. There is plenty of non-violent controlling behavior, though, including determining who a victim can talk to or go out with.
Vetter wanted to highlight the kinds of controlling behavior he has seen, which maintainers use to dictate what their contributors can and cannot do. That list starts with the assertion that "only technical topics are in-scope" for the kernel community. This is nonsense, he said; the rule holds for developers, but it may not for maintainers.
For some maintainers, that means anger, screaming, and shouting, but others expect emotional support because they are overloaded. The emotional state of the developers, meanwhile, is treated as totally irrelevant. There is also a lot of micro-aggression, nagging, and bikeshedding from maintainers in an environment that is ostensibly technical-only. Maintainers can impose their emotional state on their contributors on the mailing list, but only the maintainers' emotions matter, not the contributors'. This is classic controlling behavior, Vetter said.
Beyond that, discussions about governance and fixing these problems are off-topic as well. That makes it hard to even discuss the problems with others.
Leading teams of people is not a valued contribution within the kernel community, which creates a negative space for leadership, he said. Since maintainers are held personally responsible for test failures, regressions in their subsystem, and so on, things become personal; it also turns maintainership into a high-stakes game. So maintainers self-censor and impose that on their sub-maintainers and contributors. Making it personal is, he thinks, a strong force that perpetuates the cycle of abusive and controlling behavior in the kernel community.
This leads to something of a personality cult in some subsystems. There are people who have been working in a subsystem for many years and are the keepers of much of the knowledge and history built up over those years. These people are "very hard to remove". Pretty much every subsystem has its "local toxic person" who cannot be removed because of their accumulated store of knowledge.
Maintainer power is not shared in most subsystems. The group maintainership model was pushed at Kernel Summits, but it has only been adopted in a few places: the x86 subsystem, ARM at the top level, and half of the graphics subsystem. In addition, maintainers are not documenting their implicit assumptions and rules. For example, even after writing, reviewing, and applying thousands of patches over the years, Vetter is still not sure how to format patches for other subsystems. Inevitably, there will be minor formatting or other issues that will arise when he submits patches elsewhere, but those rules are "outright not documented".
The tools and tests that maintainers use to check patches are not made available, at least easily, for contributors to use. That is slowly getting better, he said. But it would be easier for developers to find and fix problems before they get to the maintainer's tree if the checking and testing tools were accessible.
Most of the review that is done is between the maintainers and contributors. He crunched some numbers around a year ago and found that only 25% of review is done by peers; the rest is done by the maintainers. For the contributor, it becomes something of an exercise in conflict avoidance, since the patch is already on its way to acceptance. The contributor just needs to "go through the motions, respin the patch series ten times" to show that they are "sufficiently subordinate to the maintainer", he said, then it will be merged.
For some suggestions on how to do things better, he recommended two talks. One is "Life is better with Rust's community automation" [YouTube video], given by Emily Dunham at LCA 2016. The other is "Have it your way: maximizing drive-thru contributions" [YouTube video] by VM Brasseur. The latter targets one-off contributions, but by making the process easier for one-time contributors, it will also improve things for regular contributors. His takeaway is that submissions should be looked at by a bot that points out any stylistic issues, perhaps even provides a fixed-up version of the code, and points contributors to the appropriate part of the documentation. That way the rules are not only documented, but can also be easily referred to when problems arise.
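As a rough sketch of the kind of bot he described (this is an illustration, not anything presented in the talk), the following Python script runs incoming patch files through the kernel's own scripts/checkpatch.pl and points contributors at the submission documentation; the directory layout and the idea of "replying" by printing to the console are assumptions made for the example.
```python
#!/usr/bin/env python3
"""Minimal sketch of a patch-checking bot.

Assumptions: patches arrive as *.patch files in a directory, a Linux
kernel tree is checked out at KERNEL_TREE, and "replies" are simply
printed rather than mailed back to the contributor.
"""
import subprocess
import sys
from pathlib import Path

KERNEL_TREE = Path("/path/to/linux")   # hypothetical location of a kernel checkout
DOC_URL = "https://www.kernel.org/doc/html/latest/process/submitting-patches.html"

def check_patch(patch: Path) -> str:
    """Run the kernel's checkpatch.pl on one patch and return its report."""
    result = subprocess.run(
        [str(KERNEL_TREE / "scripts/checkpatch.pl"), "--terse", str(patch)],
        capture_output=True, text=True,
    )
    return result.stdout

def main(patch_dir: str) -> None:
    for patch in sorted(Path(patch_dir).glob("*.patch")):
        report = check_patch(patch)
        if report.strip():
            # Point the contributor at the documented rules rather than
            # leaving them implicit.
            print(f"{patch.name}: style issues found; see {DOC_URL}")
            print(report)
        else:
            print(f"{patch.name}: no style issues found")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```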
Sinking feeling
The subsystem he was maintaining was growing wildly, and he started to "get this sinking feeling that this is really not how you run a team". Around the same time, he was invited to his first Kernel Summit. He wanted to try to talk to some potential allies but, since graphics is kind of separate from the rest of the kernel, he did not know more than two or three of the hundred or so maintainers at the summit. He sought out some of those who had made public statements that made him think they might be up for "fighting the good fight" to change how kernel development is done.
But he found it hard to enlist aid and allies at the summit. Some of the seemingly "nice maintainers" turned out to be resistant to change. There is a risk in speaking out in the kernel community, he said, so it turns out that there are few maintainers willing to do so. When you chat with many of the people who have been around kernel development for a long time and challenge them about problematic people, toxic behavior, or "maybe we should do this better", the response is often "it is what it is and you just deal with it". In his view, this makes them complicit.
The people that you can talk to, Vetter said, are those who have left the community. That is great because you can compare notes; when you do, the same names often come up: maintainers who are something of a disappointment for not doing more to help fix the problems. There are a lot of "loud quitters", but an even larger number of contributors have just dropped out silently. This provides a sizable group of people to talk to.
Similar, in some ways, is the pool of burnt-out maintainers. They realize the problems, but don't want to leave the kernel entirely, so they step down as maintainers. You can have "really good conversations" with them, he said, but since they have already burned out, they do not have the energy to fight the next fight.
The participants on the dri-devel mailing list and the graphics subsystem in general have tried to do things a little differently to "make our little corner of the kernel more useful". The subsystem spearheaded the rewrite of the kernel documentation system and it is the only one that hands out commit rights. The latter makes graphics "more like a standard project".
In the last year, a code of conduct has been enacted and is being enforced, he said. The "surprising thing is that it seems to work". There has been a need to quiet a number of kernel developers—he thinks they may end up banning some. If you make it clear that you are serious, some unreasonable people become much more reasonable, Vetter said.
One group that would seem to have a strong incentive to fix the kernel process is the "sandwiched maintainers". "They get the harassment from above and they get the unhappy contributors from below." He has talked with other subsystems that would like to see things done differently, but it seems that once people start talking about fixing things, the effort eventually peters out and "nothing happens".
Before getting into the forces that tend to cement kernel development in place, Vetter provided a few places to learn about maintainership. The Community Leadership Summit and the affiliated CLSx events, which are held at OSCON, LCA, and elsewhere, as well as the Maintainerati event, make for great places to talk with other open-source maintainers. The community tracks at many different conferences are also good. One author he wanted to highlight is John Kotter, whose books on change management have been helpful to Vetter's thinking on how to change an existing community with a lot of inertia.
Strong forces
There are some strong forces that uphold the status quo in the kernel, he said. One is the spectator sport nature of the kernel mailing list. When Torvalds blows up publicly, it is picked up by Reddit, Hacker News, and elsewhere. It is something of an "abusive performance art" that reinforces the personality cults and shows the power that maintainers wield. It also demonstrates the high-stakes nature of speaking out, since every time one of Torvalds's rants gets posted, all of the previous episodes from years ago are rehashed.
Some maintainers are so scared of the next blowup, and have the scars from the last one, that they are unwilling to change anything because that's the safest option. So if a subsystem starts talking about group maintainership or handing out commit rights, it eventually boils down to the maintainer saying "no". They are worried that if a regression or some other problem occurs due to the change, they will be lambasted publicly (again). They are trying to handle their fear by keeping as much control as possible, which just perpetuates things, he said.
A lot of maintainers are employed because they are the bottleneck. Their managers should be forcing them to stop being the bottleneck (by adopting a commit-rights model and simply being the one who takes the blame when Torvalds freaks out), but that is not happening. The maintainer bottlenecks continue to have job security and the managers seem to be satisfied with that approach.
The Linux Foundation (LF) is part of the problem as well, he said. It was set up, partly, because some were concerned about the amount of power Torvalds was wielding, so they wanted a neutral foundation that would employ him. The steep hierarchy in the kernel-development world is an advantage for the foundation, because it can employ the top maintainers and thus provide exclusive access to them for its members and others. More recently, the LF has moved into the cloud world, so this is less of an issue, but the LF is still a factor in maintaining the hierarchy, he said.
As the secretary for the X.Org Foundation, he has heard some of the old stories where that foundation ran into difficulties along the way, so he sees conflicts of interest as inevitable. If the project governance and business sides are intertwined, things can get messy; after ten years, the community may have moved on, but the business is still stuck in the older thinking. So instead of best serving the community, the foundation's goal becomes one of keeping the people employed, even though there may not be a need for those jobs anymore.
In Vetter's opinion, there should be a strict separation between project governance and employing people. So the foundation would provide services and infrastructure for the project but not employ people to work on the project directly. Because, at some point, those people may not be doing work that is in the overall interest of the project and its community any longer. Similarly, he thinks that voting rights should be spread widely, so that when there are shifts in the community, the new people are not ignored because they have no voting power.
He wrapped up his talk by noting that the maintainers benefit from the status quo, so they are unlikely to try to change it. He would not suggest that contributors not work on the kernel, as it is a massive career boost, but did suggest they always have an exit plan. "If you stick out your head, and sooner or later you will stick out your head, it will get chopped at."
Steward, not lord
His talk title could be taken in different ways; it could be seen as a call to burn it all down and start over. But he sees things differently; it is a plea to the maintainers to please fix things before the revolution comes. That revolution would be a terrible thing for Linux. The current leadership refuses to see the problems, he said, and gets defensive when they are brought up.
So he suggested that maintainers share power, drop privileges, and not make their reviewer powers special. They should also document everything they can and automate things. It is important to "care about the people", because without people you don't have much of a project. He summed up his suggestions with "be a steward, not a lord".
He then answered a few audience questions. It is unclear to him how and why things have come to be this way. It may be that long ago, when Linux was started, we didn't know how to build a working community, as we do today. By the time people started speaking up, the power structure was too entrenched to be changed.
Vetter said he has a hard time recommending that people join or rejoin the kernel community. Outside of the graphics subsystem, the overall understanding of the problems is so lacking that you can't even start talking about solutions. One step might be to get people to a point where they agree that just apologizing for the current situation is not sufficient. Maybe there is another subsystem that could start to make positive changes; if that happens, it would make his talk a "resounding success", he said, but the time frame for that is probably something like five years.
There are probably elements of the talk that almost any participant or observer would quibble (at least) with, but it seems fairly likely that there would not be widespread agreement on which. The reaction to the talk has been laudatory in many quarters, and he clearly voiced concerns that have largely stayed under wraps. Vetter has identified some real problems in the kernel community; it remains to be seen where things go from here.
A YouTube video of the talk is available.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]
Increasing open-source inclusivity with paper circuits
Open-source software has an inclusiveness problem that will take some innovative approaches to fix. But, Andrew "bunnie" Huang said in his fast-moving linux.conf.au 2018 talk, if we don't fix it we may find we have bigger problems in the near future. His approach to improving the situation is to make technology more accessible — by enabling people to create electronic circuits on paper and write code for them.
Huang started by asking why we should care about making technology more inclusive. Open-source software gets its power from inclusiveness; that power is so strong that we can (quoting Chris DiBona) claim that, without open-source software, the Internet as we know it would not exist. As an engineer, he thinks that's great, but he has a concern: that means that the user base now includes politicians.
The problem with politicians is that they may look at the open nature of open-source software and conclude that random people, some of whom may be criminals, have access to the public infrastructure. That does indeed lead to the possibility of things going wrong. The left-pad fiasco was one example that affected a large group of developers, but there are scarier examples, such as criminals who are typo-squatting in the NPM repository, trying to get malware into others' applications.
Huang also mentioned a time bomb he encountered in XScreenSaver, which started complaining that it was too old and needed to be upgraded. This was not a malicious act; the developer was just finding a way to say that he's tired of Debian not shipping updates to the code, and using the warning to try to force a change. In a Debian bug-tracker discussion, Jamie Zawinski (the author of XScreenSaver) described his interactions with users, who evidently tell him "but I don't know how to compile from source, herp derp I eat paste".
To Huang, exchanges like this make it clear that the "everybody" that is empowered by open-source software is actually a small and elite group. Few people know how to compile a program from source.
What would happen, he asked, if some developer put in an obnoxious warning that popped up on every Android phone? It might be a well-intentioned act, but it would upset a lot of people and show how much power this small group has. That, in turn, can lead politicians to propose that, for example, programmers should be regulated and that a bonding requirement be put into place. One might say that such a thing would clearly never pass, but democracy is a tricky thing, and there are plenty of examples of things not going right. He cited the repeal of network neutrality in the US, the "Brexit" decision, and the recent US presidential election. One cannot, he warned, count on preposterous things not passing.
If one looks at voter demographics with regard to their understanding of technology, the picture is scary. Few people truly understand technology; the best of the bottom 50% in this regard can barely type a search query into a web browser. Far less than 50% of the electorate is qualified to vote on technical issues. So, he asked, do we really want to put issues like network neutrality, surveillance, DRM, right-to-repair, open-source software, or even emoji selection up for a vote? The outcome would be essentially random.
Open-source is politically powerless without the sort of inclusiveness that enables people to understand it, but nearly 100% of the population relies on it. We need the government to help to enforce our licenses, and we need society to support our work. If we believe that open-source software is important, he said, we need to empower more people to preserve and sustain it.
Efficient inclusiveness
That said, teaching people one-by-one to code is an inefficient process; the engineer in him wants to optimize the effort. That led him to look at the demographics of the technical community as it exists now, and he concluded that we're missing half of our society — there are few women in technology. If we could get them involved, we could double the size of our community in one stroke.
Currently, women earn 11% of the degrees in computer engineering, and just under 16% in computer science. Female representation in computer-science faculties is poor, worse than in physics or math. It is, he said, an "exclusive boys club". One might ask if this is a cultural problem, and he thinks that it is; he has noted that there is a far better gender balance in China, for example. So he has concluded that there is no fundamental reason why we couldn't get more people involved.
Others have done the same. An effort at Carnegie-Mellon University [PDF] revamped the approach taken in the computer-science curriculum and increased female participation from 10% to 40% in five years. At Harvey Mudd College, the female graduation rate was increased from 12% to 45% in five years. So it is possible to make things better.
The key insight Huang drew from these efforts is that computer-science programs assume that students know the material before they begin to study it. Medical schools do not expect incoming students to have already performed surgery, and law schools do not expect them to have participated in trials, but computer-science programs expect that students can write complex code in a C-like language when they walk in the door. Anybody who didn't decide to learn about computers at a young age is going to respond by concluding that this field is not for them.
The bias that keeps women from learning about technology starts at a young age. According to this study [PDF], over 60% of construction-kit gifts are given to boys (in this study, 9% went to girls, the rest were unknown). Schools like Carnegie-Mellon and Harvey Mudd have responded by creating a softer on-ramp that doesn't assume a lot of familiarity with the field and, as a result, they have succeeded in including a lot more people. Doing this isn't a mystery; we just need to make it easier to get started.
Changing culture with paper circuits
Huang's way to make it easier is to try to change the culture around electronics and, in particular, to do so by building circuits on paper rather than by using breadboards. Paper is the substrate, conductive tape is used for connections, and circuit elements are either taped on or soldered. The result is a lot closer to a real circuit board than any breadboard circuit; it demonstrates issues like wire crossings that don't arise with breadboards. That makes a paper circuit easier to move to a true circuit board once it has been finalized. Paper circuits are also flexible — literally. They can be incorporated into a pop-up book, for example.
There are other advantages to paper circuits. They are, for example, compatible with both surface-mount and through-hole components, while breadboards only handle the latter. Paper, he noted, natively supports comments — they can simply be written on the substrate. It's thin, flat, and light, and the substrate is free; breadboards have none of these attributes. On the other hand, with breadboards there is no soldering required and components can be reused, which is why breadboards tend to be favored in educational settings.
The soldering issue is one of the key barriers to mass adoption of paper circuits. In truth, surface-mount components can also be tricky; they tend to be tiny and can be hard to position correctly on a paper circuit. So Huang met up with Jie Qi, who has done a lot of work in this area, and developed the idea of creating stickers with components that could be easily applied to paper circuits. That, they thought, would make the whole idea much more approachable.
Huang and Qi ran a pitch on Crowd Supply to raise some money to develop this idea. They set their goal to an eminently achievable $1 just to see what they would get; when the time ran out, over $100,000 had been pledged. That put them into a position of running an "accidental startup" that they had to somehow make a proper company out of. One of the first things they decided is that they would not be providing "pinked-out" or watered-down technology. Instead, they would be providing rigorous technology that is accessible and fun. Design and technology would be seen as equals, each of which has value.
The people who initially funded this effort were 67% male. But the customers for the product on chibitronics.com were 74% female, and Amazon.com customers were close to that figure. The number of women has been growing over time; 78% of the customers are now women. Huang said that he would like to make things more balanced again and get more men, but he feels that it is a positive thing to have all of those women playing with the technology.
Love to Code
The next step is to move beyond circuits and add coding into the mix; that was the driving force behind the Love to Code project. This project, too, had some founding principles behind it. One was that engineering and design are equally important. We tend to measure achievement on a single axis, but that can exclude people who don't excel on that axis. By combining engineering and art, it's possible to create a two-dimensional space with a lot of ways for people to stand out.
Familiarity is another core part of Love to Code. Paper is a familiar material to most people. It is also expressive, allowing circuits to be made into all kinds of interesting works of art. But paper is also a technical material with many attractive characteristics.
Finally, simplicity was a key goal; there need to be few barriers to getting started. That led to some interesting challenges: how does one get compiled C code into a microcontroller, for example? By default, that is a difficult process requiring users to figure out details like what their serial port is named. It is frustrating and being able to do it doesn't make anybody better than anybody else. So they tried to find a universal interface that would be easy for everybody to use.
In that search, they concluded that ports on computers change frequently, but humans tend to stay the same over the years. In particular, they still have eyes and ears. As a result, there is an audio jack on almost every device — unless you get a modern phone that tries to lock you into a DRM ecosystem and you have given up your sound port, he said. Sound is capable of carrying data, so Love to Code uses it to upload compiled code. This code is played out of the audio port and into the device. The result is one of the few microcontrollers that can be programmed with a mobile phone.
There were many other challenges to overcome. The component stickers went through several generations of design until the problems were worked out. The problem of connecting a microcontroller to a paper circuit was eventually solved by attaching it to a plastic clip. The result is a usable controller that is suitable for classrooms. The curriculum had to be developed from the beginning; it starts with fun characters and introduces coding concepts in an approachable fashion. It makes the point that code isn't so scary.
The book is, naturally, available for download [PDF] under a CC-BY-SA license. All of the code and hardware designs can be found on the Chibitronics GitHub page. Everybody is encouraged to have a look and make it better.
Details
Huang spent a few minutes talking about how the hardware works. Sound is transmitted using AFSK modulation; it employs two tones like an old modem. Those frequencies are relatively high but were chosen to be "MP3 survivable". The demodulator on the microcontroller is based on linmodem; when running it takes 65% of the available CPU time, not leaving much for anything else.
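As an illustration of the two-tone AFSK idea (this is a generic sketch, not the Chibitronics modem; the frequencies, bit rate, and lack of framing or error correction are invented for the example), the following Python script turns a few bytes into a WAV file, one tone per bit:
```python
#!/usr/bin/env python3
"""Illustrative AFSK modulator: encode bytes as a two-tone audio signal.

The parameters below are made up for illustration and are not the ones
the Chibitronics modem actually uses.
"""
import math
import struct
import wave

SAMPLE_RATE = 48000     # samples per second
BAUD = 1200             # bits per second
FREQ_0 = 4800.0         # tone used for a 0 bit (illustrative)
FREQ_1 = 7200.0         # tone used for a 1 bit (illustrative)

def modulate(data: bytes) -> bytes:
    """Return 16-bit mono PCM samples encoding `data`, least-significant bit first."""
    samples = []
    phase = 0.0
    samples_per_bit = SAMPLE_RATE // BAUD
    for byte in data:
        for bit in range(8):
            freq = FREQ_1 if (byte >> bit) & 1 else FREQ_0
            step = 2 * math.pi * freq / SAMPLE_RATE
            for _ in range(samples_per_bit):
                samples.append(int(32767 * 0.8 * math.sin(phase)))
                phase += step          # keep the phase continuous across bit edges
    return struct.pack("<%dh" % len(samples), *samples)

def write_wav(path: str, data: bytes) -> None:
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(modulate(data))

if __name__ == "__main__":
    write_wav("program.wav", b"hello, microcontroller")
```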
Picking the microcontroller took some work. It needed to be cheap, open-source friendly, have a digital signal processor, an analog-to-digital converter, and so on. After looking at a number of candidates, they chose the Kinetis MKL02Z32VFK4. It has a 48MHz Cortex M0+ processor ("haha it's not vulnerable to Meltdown"), and 32KB of flash memory. The processor costs less than $1 when purchased in volume and there is no non-disclosure agreement to sign for documentation. The winning feature, though, was a fast GPIO port; it is used for USB low-speed emulation, allowing the controller to function as a USB keyboard.
A normal audio jack is too big to fit onto the microcontroller, so another solution was required. In the end, they created a hybrid USB cable that carries audio over the 5th pin. There is no channel back to the host on this cable, so there is no way for the microcontroller to request retransmission of corrupted data. Instead, the host transmits the whole thing three times.
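The article does not say how the receiver combines the three transmissions; a per-byte majority vote is one plausible scheme, and the following sketch (purely an assumption for illustration) shows the idea:
```python
def majority_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Combine three received copies of a download byte-by-byte.

    With no back channel for retransmission requests, taking the per-byte
    majority lets the receiver survive corruption that hits only one copy.
    If all three copies disagree on a byte, it cannot be recovered; this
    sketch just keeps the first copy's value in that case.
    """
    out = bytearray()
    for x, y, z in zip(a, b, c):
        out.append(y if y == z else x)   # x wins unless y and z agree otherwise
    return bytes(out)

# Example: the middle copy has one corrupted byte; the vote repairs it.
print(majority_vote(b"LED on", b"LXD on", b"LED on"))   # b'LED on'
```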
The microcontroller operating system is based on ChibiOS, which is a separate project not related to Chibitronics; "chibi" means "small and cute" in Japanese. Much of the needed support code has been added to ChibiOS to minimize the size of code downloads from the host. The whole system operates under the "patience of a child constraint", meaning that downloaded code must be small enough to arrive and run before the child gets bored. The smallest programs compress down to about 256 bytes of code.
Huang said that Love to Code is not just for kids; it's intended to be a complete "novice to startup" course. Thus, for example, threading is part of the curriculum. Threading, he noted, is easy (but concurrency is hard). The hardware was also chosen to be "China ready". Kits sold in the US often include components that are available there, but which cannot be had in China; that makes it hard to move a prototype circuit into production. Love to Code uses components available in China, avoiding the need to redesign a circuit around new parts.
The "experience layer" can be found at ltc.chibitronics.com. There are a lot of examples ready to go. A simplified form of C++ is used for the introductory material, but it can get "as hairy as you want it" later on. For the smallest kids who might have trouble finding curly braces on the keyboard, a partnership with Microsoft led to makecode.chibitronics.com, where programs can be created using a visual block language.
Conclusion
Huang concluded by reiterating that sharing source is just the first step toward inclusiveness. Saying "I pushed to Git" is inclusiveness in passive-aggressive form. Reaching out to the wider world is a much bigger project, but it's an important one to take on; open-source software was built on inclusiveness. It is not a "commit and forget" thing, it's about pulling, merging, forking, and accepting new ideas. One way or another, we have to empower more of society to understand and value open-source software, he said, or society may decide its value on its own, with results that we don't like.
The video of this talk is available.
[Your editor thanks the Linux Foundation and linux.conf.au for assisting with his travel to the event.]
The effect of Meltdown and Spectre in our communities
A late-breaking development in the computing world led to a somewhat hastily arranged panel discussion at this year's linux.conf.au in Sydney. The embargo for the Meltdown and Spectre vulnerabilities broke on January 4; three weeks later, Jonathan Corbet convened representatives from five separate parts of our community, from cloud to kernel to the BSDs and beyond. As Corbet noted in the opening, the panel itself was organized much like the response to the vulnerabilities themselves, which is why it didn't even make it onto the conference schedule until a few hours earlier.
Introductions
Corbet is, of course, the executive editor here at LWN and has been "writing a lot of wrong stuff on the internet" about the vulnerabilities. He came up with the idea of a panel to discuss how the episode had impacted the free-software community and how our community responded to it, rather than to recount the "gory technical details". The panelists were Jessie Frazelle, Benno Rice, Kees Cook, Katie McLaughlin, and Andrew "bunnie" Huang, who each introduced themselves in turn.
![Panel [Panel]](https://web.archive.org/web/20221119200451im_/https://static.lwn.net/images/2018/lca-panel-sm.jpg)
Frazelle works at Microsoft and is on the security team for Kubernetes; she said she "has done a few things with containers", which elicited some laughter. Rice is a FreeBSD developer and core team member who works for iXsystems, which develops FreeNAS. Cook works for Google on upstream kernel security; he found out about Meltdown and Spectre before the embargo date, so he has some knowledge and insight about that part of the puzzle. McLaughlin is at Divio, which is a Django cloud provider; her company did not find out about the vulnerabilities until the embargo broke, so she has been "having fun" since then. Huang makes and breaks hardware, he said, which may give him some perspective on "the inside guts".
Corbet started by asking each panelist to recount their experiences in responding to the vulnerabilities and to comment on how it impacted their particular niche. There was a lot of confusion, Frazelle said, which led to questions about "whether containers would save you from a hardware bug—the answer is 'no'". Fixing this process moving forward would obviously involve "not having an absolute shitshow of an embargo", she said.
Rice said that the BSD community's main gripe is that it didn't get notified of the problem until very late in the embargo period despite having relationships with many of the vendors involved. The BSD world found out roughly eleven days before the embargo broke (which would put it right at Christmas); he didn't think that was due to any "malice or nastiness on anyone's part". He hopes that this is something that only happens once in a lifetime, though that statement caused a chuckle around the room—and from Rice himself. It is, he hopes, a chance to learn from the mistakes made this time, so that it can be done better in the future.
Even within Google (which discovered the vulnerabilities), knowledge about Meltdown and Spectre was pretty well contained, Cook said. Once he knew, he had to be careful who he talked with about it; those that needed to know had to go through a "somewhat difficult" approval process before they could be informed. He tried to make sure that the kernel maintainers were among the informed group around the time of the Kernel Summit in late October, the x86 maintainers in particular; once that happened, the process of working around the problem in the kernel seemed to accelerate.
As a platform provider, figuring out the right response was tricky, McLaughlin said, but her company's clients have been understanding. Divio does not host "really human-life-critical systems", she said, but that does not apply to everyone out there who has to respond. She and her colleagues were blindsided by the disclosure and things are not "100% back to where they were" on their platform.
Huang said that he feels for the engineers who work on these chips, as there is an arms race in the industry to get more and more performance; the feature that was exploited in this case is one that provides a large performance gain. It will be interesting to see how things play out in the legal arena, he said; there are already class-action lawsuits against Intel for this and previous hardware bugs had rather large settlements ($400 million for the Pentium FDIV bug, for example). He is concerned that lawsuits will cause hardware makers to get even more paranoid about sharing chip information; they will worry that more information equates to worse legal outcomes (and settlements).
Embargoes
Corbet then turned to the embargo process itself, wondering if it made things worse in this case, as has been asserted in various places. It is clear that there are haves and have-nots, with the latter completely left out in the cold, without any kind of response ready, even if the embargo had gone its full length.
Rice believes that embargoes do serve a purpose, but that this situation led to a lot of panic; some were not informed because the hardware vendors were worried about premature disclosure. There is "a lovely bit of irony" in the fact that people worked out what was going on based on the merging of performance-degrading patches to Linux without Linus Torvalds's usual steadfast blocking of those kinds of patches. Without some kind of body that coordinates these kinds of embargoes (and disclosures), though, it is difficult to have any kind of consistency in how they are handled.
The embargo has been described as a "complete disaster", Cook said, but the real problems were with the notification piece; the embargo itself was relatively successful. The embargo only broke six days early after six months of being held under wraps. Those parts that could be, were worked on in the open, though that did lead to people realizing that something was going on: "Wow, Linus is happy with this, something is terribly wrong". The fact that it broke early made for a scramble, but he wondered what would have happened if it broke in July—it would have been months before there were any fixes.
![Some of the panel [Some of the panel]](https://web.archive.org/web/20221119200451im_/https://static.lwn.net/images/2018/lca-panel2-sm.jpg)
It is quite surprising that the embargo held for as long as it did without people at least starting rumors about the problems, Frazelle said. But it is clear that the small cloud providers were given no notice even though they purchase plenty of hardware from Intel. There is an "exclusive club", and the lines on who is in and out of the club are not at all clear. Corbet said he would hate to see a world where only the largest cloud providers get this kind of information, since it represents a competitive disadvantage for those who are not in the club. Rice noted that community projects are in an even worse position, since they don't directly have the vendor relationships that might allow them to get early notice of these hardware flaws.
An audience member directed a question at McLaughlin: how could the disclosure process have been handled better for smaller cloud providers like the one she works for? She said that her company is running on an operating system that still did not have all the fixes, but that once the embargo broke, the company was able to "flip some switches" to disable access to many things in its systems. The sites were still up, but in a read-only mode and "no one died". She is not sure that can really scale; other providers had large enough teams to build their own kernels, but many were left waiting for the upstream kernel (or their distribution's kernel) to incorporate fixes. It is a difficult problem, she said, and she doesn't really have a good answer on how to fix it.
Open hardware
Another audience question on alternatives to Intel got redirected into a question on open hardware. Corbet asked Huang if open hardware was "a path to salvation", but Huang was not optimistic on that front. "I have an opinion on that", he said, to laughter. All of the elements that make up Meltdown and Spectre (speculative execution and timing side channels) were already known, he said, and open hardware is not immune. He quoted Seymour Cray: "memory is like an orgasm, it's better when you don't have to fake it". Whenever we fake memory to get better performance, it opens up side channels. Open hardware has not changed the laws of physics, so it is just as vulnerable to these kinds of problems.
On the other hand, there is a class of hardware problems where being able to review the processor designs will help, he said. These problems have not been disclosed yet, but he seemed to indicate they will be coming to light. Being able to look at all of the bits of the registers and what they control, as well as various debug features and other things that are typically hidden inside the processor design will make those kinds of bugs easier to find.
Corbet asked about the possibility of reproducible builds in the open hardware world, but Huang said that is a difficult problem to solve too. There are doping attacks that can change a chip in ways that cannot be detected even with visual inspection. So even being able to show that a particular design is embodied in the silicon is no real defense. We are building chips with gates a few atoms wide at this point, which makes for an "extremely hard problem". In software, we have the advantage that we can build our tools from source code, but we cannot build chip fabrication facilities (fabs) from source; until we can build fabs from source, we are not going to have that same level of transparency, he said.
Speculative execution has been warned about as an area of concern for a long time, an audience member said, but those warnings were either not understood or were ignored. There are likely warnings about other things today that are likewise going unheeded; how does the industry ensure that we don't ignore warnings of risky features going forward? Corbet noted that writing in C or taking input into web forms have been deemed risky—sometimes we have to do risky things to make any progress.
We probably should be paying more attention to the more paranoid among us, Cook said. Timing side channels have been around for a long time, but there were no practical attacks, so they were often ignored. Waiting for a practical attack may mean that someone malicious comes up with one; instead we should look at ways to mitigate these "impractical" problems if there is a way to do so with an acceptable impact on the code. Huang said that as a hardware engineer, he finds it "insane that you guys think you can run secrets with non-secrets on the same piece of hardware"; that was met with a round of applause.
A question from Twitter was next: how should the recipients of early disclosure of these kinds of bugs be chosen? Corbet said that there is an existing process within the community for disclosure that works reasonably well, but it was not followed this time, for unclear reasons. Cook said that process is for software vulnerabilities, though, not for hardware bugs, which have not happened with the same kind of frequency as their software counterparts. The row hammer flaw was the most recent hardware vulnerability of this sort and it was handled awkwardly as well. Improving the process for hardware flaws is something we need to do, he said.
But Rice thinks that the process for hardware bugs should be similar to what is done for software. After all, the problems need to be fixed in software, Corbet agreed. The vulnerabilities also span multiple communities and an embargo is meant to allow a subset of the world to prepare fixes before "the stuff hits the rotating thing", Rice said, which did not work in this case.
Are containers able to help here? Corbet pointed to one of the early responses from Alan Cox who suggested that those affected should execute the plan (that they "most certainly have") to move their services when their cloud provider fails. He asked if that had happened: were people able to do that and did it help them?
One of the main selling points of Kubernetes and containers is that they avoid vendor lock-in, Frazelle said. She did not know if anyone used that to move providers. But many did like that the process of upgrading their kernel was easier using containers and orchestration. Services could be moved, the kernel upgraded and booted, then the services could be moved back, resulting in zero downtime.
McLaughlin agreed with that. Her company was able to migrate some of its services to other Amazon Web Services (AWS) regions that had kernel versions with the fixes applied. That is region-dependent, however, so it wasn't a complete fix, she said. In a brief comedic break, there was an exchange about hardware. McLaughlin: "containers are cool in some aspects but they need to run on hardware". Frazelle: "sadly". Corbet: "that hardware thing's a real pain". Huang: "sorry, guys". Audience: laughter.
The future
Next up was another audience question: "is Spectre and Meltdown just the first salvo of the next twenty years of our life being hell?" Cook was somewhat optimistic that the mitigation for Meltdown (kernel page-table isolation (KPTI) for Linux) would be a "gigantic hammer" that would fix a wide range of vulnerabilities in this area. With three variants (Meltdown plus two for Spectre), there is the expectation that more problems of this sort will be found as researchers focus on these areas. If KPTI doesn't mitigate some of the newer attacks, at least the kernel developers will have had some practice in working around these kinds of problems, he said.
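As an aside for readers wondering whether their own system has the Meltdown mitigation active: kernels from 4.15 onward report mitigation status under /sys/devices/system/cpu/vulnerabilities/, and a small script like the following (a sketch assuming a Linux system with that interface) will print what the kernel says:
```python
#!/usr/bin/env python3
"""Print the kernel's view of CPU-vulnerability mitigations.

Assumes a Linux kernel new enough (4.15+) to expose
/sys/devices/system/cpu/vulnerabilities/; older kernels simply lack
the directory.
"""
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

if not VULN_DIR.is_dir():
    print("kernel does not report vulnerability status (pre-4.15?)")
else:
    for entry in sorted(VULN_DIR.iterdir()):
        # Each file holds a one-line status; "Mitigation: PTI" in the
        # meltdown file indicates that KPTI is active.
        print(f"{entry.name:12} {entry.read_text().strip()}")
```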
Since timing side channels have been known for some time, the novelty with Meltdown and Spectre comes from combining that with speculative execution, Rice said. That means that people will be looking at all of these CPU performance optimizations closely to see what else can be shaken loose.
Corbet asked Huang if he thought that would lead hardware designers toward more defensive designs. The problem, of course, is that any defensive designs Huang can imagine are going to have a large impact on performance. Those purchasing hardware will have to decide if they are willing to take a "two to ten to twenty percent hit" for something that has rock-solid security. That kind of performance degradation impacts lots of things, including power budgets, maintainability, and so on, he said.
The last audience question asked if there were lessons that this hardware bug could teach us so that we do better when the inevitable next software vulnerability turns up. Rice said that, from his perspective, the software process is working better than the hardware one. It is not perfect, by any means, but is working reasonably well. Lessons are being learned, however, and he has had good conversations with those who did not alert the BSD community last time, so he is hopeful that situation will not repeat when the (inevitable) next hardware bug comes along.
Huang asked who embargoes are meant to protect against; is it script kiddies or state-level actors? Cook said "yes", but Corbet argued that script kiddies are last century's problem. They have been superseded by "political, commercial, or military interests". He said there have been rumors that these vulnerabilities were available for sale before the disclosure as well.
Huang pointed out that state-level actors are likely to have learned about the vulnerabilities at the same time as those inside the embargo did; those actors are monitoring the various communication channels that were used to disclose the flaws. So opening the bugs up to the community, to allow it to work together on fixes, would have been a more powerful response, he said. That, too, was an applause-worthy line.
The fact that some of the fixes could not be worked on in the open (the Spectre fixes in particular) made things worse, Corbet said. Once the embargo broke, it was clear those patches were not in good shape and some did not even fix what they purported to. The problem was, as Cook pointed out, that once the words "speculative execution" were mentioned, the cat was pretty much fully out of the bag. While there was a plausible, if highly suspect, cover story for the KPTI patches (defending kernel address-space layout randomization, or KASLR, against attacks that break it), that was not possible for Spectre. Normally that kind of work all happens behind closed doors; Meltdown was the exception here.
In closing, several of the panelists shared their takeaways from this episode. McLaughlin reiterated the need to make sure that systems are up to date. Fixes are great, but they need to actually be rolled out in order to make a difference. Cook suggested that people should be running a recent kernel; it turns out that it is difficult to backport these fixes to, say, 2.6 kernels (or even 3.x kernels). He did also want to share his perspective that, contrary to the "OMG 2018 has started off as the worst year ever" view, 2017 was actually the year with all the badness. 2018 is the year where we are fixing all of the problems, which is a good thing.
Rice further beat the drum for learning from this event and making the next one work better for everyone. Similarly, Frazelle restated her thinking with a grin: "containers sadly won't save you, but they will ease your pain in upgrading". That was met with both panel and audience laughter, as might be guessed.
Corbet rounded things out with a plea to the industry for more information regarding the timeline and what was happening behind the scenes, which is normally released shortly after an embargo is lifted. In particular, he would like to understand what was happening for the roughly three months after the vulnerabilities were found but before groups like the Linux kernel developers started to get involved. There are people at the conference who have publicly stated that they are not allowed to even say the names of the vulnerabilities. That really needs to change so that we can all learn from this one and be better prepared for the next.
A YouTube video of the panel is available.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]
QUIC as a solution to protocol ossification
The TCP protocol has become so ubiquitous that, to many people, the terms "TCP/IP" and "networking" are nearly synonymous. The fact that introducing new protocols (or even modifying existing protocols) has become nearly impossible tends to reinforce that situation. That is not stopping people from trying, though. At linux.conf.au 2018, Jana Iyengar, a developer at Google, discussed the current state of the QUIC protocol which, he said, is now used for about 7% of the traffic on the Internet as a whole.
QUIC ("quick UDP Internet connection") is, for now, intended for situations where HTTP is carried over TCP. It has been under development for several years (LWN first looked at it in 2013), and was first deployed at Google in 2014. The main use for QUIC now is to move data between Google services and either the Chrome browser or various mobile apps. Using QUIC causes a 15-18% drop in rebuffering in YouTube and a 3.6-8% drop in Google search latency, Iyengar said. Getting that kind of improvement out of applications that have already been aggressively optimized is "somewhat absurd".
Use of QUIC increased slowly during 2015 before suddenly dropping to zero in December. It seems that somebody found a bug that could result in some requests being transmitted unencrypted, so QUIC was shut down until the issue could be fixed. In August 2016, usage abruptly doubled when QUIC was enabled in the YouTube app on phones. If anybody ever doubted that mobile is the future of computing, he said, this should convince them otherwise. Summed up, 35% of Google's outbound traffic is carried over QUIC now.
The standard network stack, as used for the world-wide web, employs HTTP on top of the TLS cryptographic layer which, in turn, sits on top of TCP. QUIC replaces those components with a new protocol based on the UDP datagram protocol. From that base, QUIC builds a reliable connection-oriented protocol, complete with TCP-like congestion-control features. There is support for both encryption and HTTP within QUIC; it can combine the cryptographic and HTTP handshakes into a single packet.
Thus far, both development and deployment of QUIC have been done primarily by Google. An IETF working group was formed to standardize the protocol in 2016, though. Among other things, standardization will replace the current QUIC cryptographic layer with one based on TLS 1.3 which, Iyengar said, took a number of its ideas from the current QUIC implementation.
Accelerating HTTP
A typical web page has a long list of objects (HTML, CSS, images, etc.) that must be loaded from the server. The HTTP/1.x protocol only allows for a single object to be transmitted at a time; that can be a problem when a large object, which takes a long time to transmit, blocks the transmission of many other objects. This problem, referred to as "head-of-line blocking", increases the time it takes to present a usable web page to the reader. Implementations using HTTP/1.x tend to work around head-of-line blocking by establishing multiple connections in parallel, which has its own problems. Those connections are relatively expensive, compete with each other, and cannot be managed together by congestion-control algorithms and the like.
HTTP/2 was designed to address this problem using multiple "streams" built into a single connection. Multiple objects can be sent in parallel, each in its own stream, by multiplexing the streams onto the connection. That helps, but it creates a new problem: the loss of a single packet will stall transmission of all of the streams at once, creating new latency issues. This variant on the head-of-line-blocking problem is built into TCP itself and cannot be fixed with more tweaks at the HTTP level.
TCP suffers from other problems as well. Its connection-setup latency, involving a three-way handshake, is relatively high. Latency is a critical part of a user's experience with a web service, and TCP's connection setup can be a significant part of it. Middleboxes (routers between the endpoints of a connection) interfere with traffic and make it difficult to improve the protocol. They aren't supposed to be looking at TCP headers, but they do so anyway and make decisions based on what they see, often blocking traffic that looks in any way out of the norm. This "ossification" of the protocol makes it nearly impossible to change TCP itself. For example, TCP fast open has been available in the Linux kernel (and others) for years, but still is not really deployed because middleboxes will not allow it.
QUIC tries to resolve a number of these issues. The first time two machines talk over QUIC, a single round-trip is enough to establish the connection. For subsequent connections, cached information can be used to reduce that number to zero; the connection packet can be followed immediately by the request itself. HTTP streams map directly onto streams implemented in QUIC; a packet loss in one stream will not impact the others. The end result is the elimination of many sources of latency in typical interactions over the net.
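As a toy illustration of that stream mapping (this is not the QUIC wire format, whose encoding is considerably more involved; the header layout here is invented), each chunk of data can be tagged with a stream ID and offset so that a lost packet leaves a gap only in the streams whose data it carried:
```python
"""Illustrative framing of independent streams over one connection."""
import struct

HEADER = struct.Struct("!IQH")   # stream ID, byte offset, payload length

def encode_frame(stream_id: int, offset: int, payload: bytes) -> bytes:
    return HEADER.pack(stream_id, offset, len(payload)) + payload

def decode_frames(packet: bytes):
    """Yield (stream_id, offset, payload) tuples from one packet."""
    pos = 0
    while pos < len(packet):
        stream_id, offset, length = HEADER.unpack_from(packet, pos)
        pos += HEADER.size
        yield stream_id, offset, packet[pos:pos + length]
        pos += length

# Two streams multiplexed into a single UDP-sized packet; if this packet
# is lost, retransmission stalls only the streams whose data it carried.
packet = (encode_frame(1, 0, b"GET /index.html") +
          encode_frame(3, 0, b"GET /logo.png"))
for sid, off, data in decode_frames(packet):
    print(sid, off, data)
```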
Requirements, metrics, and implementations
The QUIC developers set out to create a protocol that was both deployable and evolvable. That dictated the use of UDP, which is able to get through middleboxes with a minimum of interference. UDP also facilitates the creation of a user-space implementation, which was another goal. (Iyengar didn't say this, but one reason to want such an implementation is to get the protocol deployed and updated quickly; many systems out there rarely receive kernel updates.) Low-latency connection establishment was a requirement, as was stream multiplexing. Beyond that, there was a desire for more flexible congestion control. This sort of work can be (and has been) done in the Linux kernel, but the bar for inclusion there is high. The QUIC developers wanted to be able to experiment with various algorithms and see how they worked.
One other important requirement was resilience to "NAT rebinding". Most connections to the Internet go through a network-address translation (NAT) box that hides the original address and port information. For TCP connections, the NAT box can see the SYN and FIN packets and know when a particular binding can be taken down. UDP itself has no "connection" concept, so NAT boxes carrying UDP traffic cannot associate it with a connection created by a higher-level protocol like QUIC. They thus have no indication of when a connection is no longer in use and instead have to rely on timers to decide when to tear down a specific port binding. As a result, a UDP port binding can be taken down while the QUIC connection using it is still active. The next UDP packet associated with that connection will cause a new binding to be established, making the traffic suddenly appear to come from a different port. QUIC packets must thus include the information needed to detect and handle such rebindings.
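To make the rebinding requirement concrete, here is a purely illustrative sketch (not the actual QUIC wire format) of a UDP server that indexes per-connection state by a connection ID carried in each packet rather than by the sender's address; when a NAT rebinding makes the same client appear from a new port, the server simply records the new address and carries on.

```go
// Illustrative only: not the QUIC wire format, just a demonstration of keying
// connection state on an ID in the packet rather than on the sender's
// address, so that NAT rebinding does not break the session.
package main

import (
	"encoding/binary"
	"log"
	"net"
)

type session struct {
	addr *net.UDPAddr // most recently seen source address for this connection
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 4433})
	if err != nil {
		log.Fatal(err)
	}
	sessions := map[uint64]*session{}
	buf := make([]byte, 1500)
	for {
		n, addr, err := conn.ReadFromUDP(buf)
		if err != nil || n < 8 {
			continue
		}
		// First eight bytes: a (hypothetical) connection ID chosen by the peer.
		cid := binary.BigEndian.Uint64(buf[:8])
		s, ok := sessions[cid]
		if !ok {
			s = &session{}
			sessions[cid] = s
		}
		// If the source address changed, the NAT probably rebound the flow;
		// update the recorded address and keep the session alive.
		s.addr = addr
		// Replies go to wherever the connection was last seen.
		if _, err := conn.WriteToUDP(buf[8:n], s.addr); err != nil {
			log.Print(err)
		}
	}
}
```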
A member of the audience asked why QUIC was implemented over UDP rather than directly on top of IP. Iyengar pointed to the SCTP protocol as an example of the problem with new IP-based protocols: it comes down to the middleboxes again. SCTP has been around for years, but middleboxes still do not recognize it and tend to block it. As a result, SCTP cannot be reliably used on the net. Actually deploying a new IP-based protocol, he said, is simply impossible on today's Internet. Additionally, working on top of UDP makes a user-space implementation easier.
As noted above, deployment of QUIC has led to significant improvements in performance for Google services. The large drop in search latency is mainly a result of eliminating round trips during connection setup; the biggest improvements are therefore seen by users on slow networks. A search done in South Korea is likely to show a 1.3% improvement in latency, but in India that improvement is over 13%. Iyengar said that people measurably spend more time watching videos when they are doing so over QUIC; that was presented as a good thing.
One key feature of QUIC is that the transport headers — buried inside the UDP packets — are encrypted. Beyond the obvious privacy benefits, encryption prevents ossification of the protocol by middleboxes, which can't make routing decisions based on information they can't understand. A few things have to be transmitted in clear text, though; a connection ID is required, for example, to find the key needed to decrypt the rest. The first byte of the clear data was a flags field which, he said, was promptly ossified by a middlebox vendor, leading to packets being dropped when a new flag was set.
That was a classic example of why changing network protocols is hard and what needs to be done to improve the situation. Middleboxes are the control points for the Internet as we know it now. The only defense against ossification of network protocols by middleboxes, he said at the conclusion of the talk, is encryption.
There were some questions from the audience regarding implementations. Most of them are still works in progress, he said. The quic-go implementation is coming along. There are implementations being done by Apple and Microsoft, and a certain amount of interoperability testing has been done on those. When asked about an open-source reference implementation in particular, Iyengar pointed to the Chromium browser, which is open source. Other implementations exist, but everybody is waiting for the IETF process to finish.
The video of this talk is available.
[Your editor would like to thank the Linux Foundation and linux.conf.au for assisting with his travel to this event.]
Containers from user space
In a linux.conf.au 2018 keynote called "Containers from user space" — an explicit reference to the cult film "Plan 9 from Outer Space" — Jessie Frazelle took the audience on a fast-moving tour of the past, present, and possible future of container technology. Describing the container craze as "amazing", she covered topics like the definition of a container, security, runtimes, container concepts in programming languages, multi-tenancy, and more.
Frazelle started by noting that she has recently moved to Microsoft — "selling out has been amazing" — where she works on open-source software, containers, and Kubernetes. She works between layers of abstraction and likes it there. There is fun to be had in pulling features out of one layer and putting them into another; as an example, consider how kernel features were lifted out and put into containers and have since been lifted again into Kubernetes.
But, she asked, what is a container? There is some confusion around this question, she said, because containers are not a real thing. In an operating system like Solaris, containers (Zones) are a first-class concept, but Linux containers are not. That is a feature, not a bug.
Containers are built on namespaces and control groups. Namespaces control what a process can see, while control groups regulate what it can use. Then a number of other technologies are layered on top. A security module like AppArmor or SELinux provides mandatory access control. She worked for a while adding seccomp support to Docker, preventing access to around 150 system calls. Building the whitelist took a long time but, in the end, it proved possible to block functionality like keyring management, clone(), and namespace management from containers. The no_new_privs flag helps to prevent privilege escalation. A lot of work was also done to limit capabilities within containers; she doesn't like how rkt and systemd-nspawn allow CAP_SYS_ADMIN.
On top of all that, she said, is "a bunch of duct tape" to hold it all together.
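For a sense of how thin the base layer is, here is a minimal sketch (not any particular runtime's code) that uses Go's syscall package to start a shell in its own UTS, PID, and mount namespaces; everything else that makes a usable container, from control groups to seccomp to the image format, is layered on separately.

```go
// Minimal namespace demo (Linux, run as root): not a container runtime,
// just the clone() flags that runtimes like runc build upon.
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// New hostname, PID, and mount namespaces for the child; leaving a
		// flag out (CLONE_NEWNET, say) shares that resource with the host.
		Cloneflags: syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```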
A variant on the container idea is the Intel Clear Containers (now Kata Containers) concept, which uses virtualization technology to provide isolation. She doesn't know whether these are true containers or not; they seem more like a virtual machine to her.
Lego bricks
Container technology had been around for quite some time before Docker showed up and the resulting hype got people excited. OpenVZ started in 2005; she is not sure if it is still maintained, but people are still using it. LXC, which came in 2008, was her first encounter with containers and, she said, can be blamed for the "container" term — but we shouldn't tell them that she said that.
Docker appeared in 2013, initially using LXC as its backend. Its own native runtime showed up later in v0.7. Google's lmctfy was released at about the same time; that project eventually refocused on libcontainer and working with Docker. The appearance of rkt in 2014 was the start of the container wars, while runc was an attempt to end them. Runc is part of the Open Container Initiative; it is good to have some governance, she said.
Another way of looking at the situation is to say that "containers are the new JavaScript frameworks" — there are so many of them.
We can think of containers as being like Lego bricks. Virtual machines and Solaris Zones are like having all of the bricks already glued together. You don't have to do much work to use them, they are just there waiting for use, but you only get one shape and that's "not much fun". Containers, instead, come with all of the pieces; you can build the Death Star with them, but you don't have to.
There are a number of advantages to this approach. Specific namespaces can be turned on and off, for example. If low-latency networking is needed, the network namespace can be disabled for a container and it can just run in the host namespace instead — without having to turn off any other container-oriented restrictions. It is possible to share resources between containers; every Kubernetes pod shares process-ID namespaces internally so that signals can be sent between containers. Everything is a tunable knob in Linux containers, meaning that it is easy to shoot oneself in the foot. Part of the appeal of systems like Docker is that they have been set up with sane defaults.
The Lego-like aspect of containers can be used to improve security by sandboxing applications beyond the default. As an example, she put up a site at contained.af that presents a root shell. It is a challenge: anybody who can find the contents of a special file hosted on the system running the container can send it to her for a special prize. Nobody has managed it, even though many have tried.
Why is that? The container is running with an extra-strict seccomp profile that renders most attacks useless; one would have to do a return-oriented programming (ROP) attack to get anywhere. And "nobody will waste their time doing a ROP attack on a shitty web app", especially when the only prize is her praise. The application itself was bashed out in a single day and is probably vulnerable, but anybody who is able to exploit a vulnerability still has to get past seccomp to get anywhere interesting.
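As an illustration of the general technique (though not the contained.af profile, and using a short deny-list for brevity where Docker's default profile is a long allow-list), the following sketch uses the libseccomp-golang bindings, which rely on the libseccomp C library, to make keyring- and namespace-related system calls fail with EPERM.

```go
// Sketch of installing a seccomp filter from Go using libseccomp-golang
// (github.com/seccomp/libseccomp-golang). Not the contained.af profile;
// just the mechanism: once loaded, the denied calls fail with EPERM.
package main

import (
	"fmt"
	"log"
	"syscall"

	seccomp "github.com/seccomp/libseccomp-golang"
)

func main() {
	filter, err := seccomp.NewFilter(seccomp.ActAllow) // allow by default for this demo
	if err != nil {
		log.Fatal(err)
	}
	denied := seccomp.ActErrno.SetReturnCode(int16(syscall.EPERM))
	for _, name := range []string{"keyctl", "add_key", "request_key", "unshare", "setns", "mount"} {
		id, err := seccomp.GetSyscallFromName(name)
		if err != nil {
			log.Fatal(err)
		}
		if err := filter.AddRule(id, denied); err != nil {
			log.Fatal(err)
		}
	}
	if err := filter.Load(); err != nil {
		log.Fatal(err)
	}
	// On systems that allow unprivileged user namespaces, this call would
	// normally succeed; under the filter it returns "operation not permitted".
	err = syscall.Unshare(syscall.CLONE_NEWUSER)
	fmt.Println("unshare after filter load:", err)
}
```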
Desktop containers and programming languages
Containers tend to be used in server applications, but Frazelle is interested in using containers on the desktop as well. She started by using runc and rootless containers as the base desktop system. Getting there required converting Dockerfiles over to runc spec files, so she wrote a tool called riddler to do that task. That got her sandboxed desktop applications, but she wanted something even more minimal.
The next step was to switch to CoreOS, a container-oriented distribution based on ChromeOS. It features a read-only /usr directory and forces the use of containers for almost everything. The integrity of the operating system can be verified at a number of levels; it has trusted platform module support and uses dm-verity, for example. She did have to add graphics drivers but, since the whole thing is based on Gentoo, she just had to "emerge the world" to get there.
What would happen, she asked, if we were to apply some of these principles to programming languages? It should be possible to take some Go code and create a set of seccomp filters from the code itself. The required lockdown could be worked out at build time; the Go language supports this kind of work nicely. Generating filters at run time has been tried and has failed: unit tests may not cover all of the code, so something is always missed, breaking the program in real use. Then users turn the whole thing off.
By moving the filter generation to build time it should be possible to ensure that all of the relevant code is examined. Go doesn't use the C library for its system calls; instead, it generates its own assembly that calls into the kernel. That process can be hijacked and used to collect information on which system calls the program needs; generation of the filter should then be easy.
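As a toy illustration of the build-time idea (and not the assembly-level hook described in the talk), the following sketch uses Go's standard go/ast package to list the syscall constants passed to syscall.Syscall-style calls in a source file; a real tool would analyze the generated assembly or the final binary, but the principle of collecting the list before the program ever runs is the same.

```go
// Toy build-time syscall inventory: scan a Go source file for calls to
// syscall.Syscall/RawSyscall and report the syscall-number argument.
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"log"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: syscalls <file.go>")
	}
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, os.Args[1], nil, 0)
	if err != nil {
		log.Fatal(err)
	}
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || len(call.Args) == 0 {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if !ok || pkg.Name != "syscall" {
			return true
		}
		switch sel.Sel.Name {
		case "Syscall", "Syscall6", "RawSyscall", "RawSyscall6":
			// The first argument is the syscall number, e.g. syscall.SYS_WRITE.
			fmt.Printf("found %s(%s, ...)\n", sel.Sel.Name, exprString(call.Args[0]))
		}
		return true
	})
}

// exprString gives a rough printable form of the syscall-number argument.
func exprString(e ast.Expr) string {
	switch v := e.(type) {
	case *ast.SelectorExpr:
		if id, ok := v.X.(*ast.Ident); ok {
			return id.Name + "." + v.Sel.Name
		}
	case *ast.BasicLit:
		return v.Value
	case *ast.Ident:
		return v.Name
	}
	return "?"
}
```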
There are a few challenges to overcome, of course. If the program calls execve() to run a new program, the seccomp filter is unlikely to meet that new program's needs. A similar problem arises if the program calls dlopen() to load code dynamically. Finally, Go provides a syscall.RawSyscall() function that bypasses the normal assembly generation. Even so, there is space for a cool proof-of-concept, and there will be better ways in the future. Rome wasn't built in a day.
Another possibility would be to use Metaparticle, which is a standard library used by applications to containerize themselves. It, too, could parse system calls and use the information to create a perfect seccomp filter.
Other crazy things
Another interesting "crazy thing" is SCONE, which uses Intel's SGX technology to create secure containers. This could be useful, she said, for people who don't trust the cloud. At the moment, if somebody has superuser access on the host machine, it's "game over". With SGX secure enclaves, that is no longer the case — or that is the promise anyway. Of course, she said, it doesn't help against side-channel attacks.
One of the objectives of the SCONE project was to keep the code as small as possible. After all, if everything is in the sandbox, then there is no sandbox. Performance was also an issue. Any thread that needs to leave the secure enclave for any reason (to make a system call, for example) must copy its arguments first; this will always be slow. Any cache miss requires that memory be decrypted before being loaded into the cache line. As a result there are some tradeoffs in the SCONE design.
The first approach taken was similar to Microsoft's Haven system. The entire container from the C library on up was placed inside the enclave, with a shielding layer interfacing with the host operating system. But this puts everything inside the enclave, defeating the goal of keeping things small. The second approach was to declare the C library untrusted and put it outside the enclave. But that was a little too small; it becomes too easy to compromise the container at the C-library level.
In the end, the SCONE developers went back to something closer to the first idea but put more work into the shielding layer. It encrypts I/O coming into and out of the enclave, for example. The result was a huge amount of code in the shielding layer, leading to a question: should one trust the shielding layer, or just trust the cloud instead? The answer is clear to her but, as she noted, she works for a cloud provider so she would think that way.
There are a lot of other compromises needed in the SCONE approach. It doesn't support fork() or clone(), for example, which will make many real-world applications hard to write. System calls outside the enclave are expensive, as are page faults. It's worth considering that the rather slower movement of data into and out of SGX enclaves might make side-channel attacks easier. All told, it "seems challenging" and she's not convinced that it's worth it. If you have so many trust issues, she said, just don't use the cloud.
Besides, this approach won't protect against Meltdown or Spectre; "that's funny".
The final topic was multi-tenancy, which is not currently supported in Kubernetes. Frazelle described two levels of multi-tenancy that will need to be supported. "Soft multi-tenancy" is when there are multiple users within the same organization that trust each other. Protections added for soft multi-tenancy are more about preventing accidents than thwarting attacks. Soft multi-tenancy is not difficult to support.
Hard multi-tenancy, instead, is needed where there are multiple users and any of them could be malicious. For Kubernetes or any other orchestrator, hard multi-tenancy provides a lot to think about. The task starts with the host operating system; one of the container-oriented systems should be considered. It may be worthwhile to consider using Kata Containers for their better containment.
Networking needs to start with a "deny all" policy and add exceptions for any communications that are needed between containers or pods. Domain name service should be namespaced like everything else, or turned off entirely. Running kube-dns as a sidecar can help in this regard. Authentication and authorization need a lot of thought. Master and system nodes need to be isolated on different machines. Access to host resources must be restricted. The rest is just "making everything else dumb to its surroundings".
She concluded by reiterating that containers are not "things" as such. They should, perhaps, have been called "boxes" — she likes that term better, anyway. Naming things is hard, but at least the container developers can't be blamed for "serverless". In the end, containers with sane defaults are sandboxes. Securing containers by default is just one piece of a brighter future; applying the same principles to other parts of the system will give us a more secure user space overall.
The video of this talk is available.
[Your editor thanks the Linux Foundation and linux.conf.au for assisting with his travel to the event.]
Page editor: Jonathan Corbet