clone3(), fchmodat4(), and fsinfo()
The kernel development community continues to propose new system calls at a high rate. Three ideas that are currently in circulation on the mailing lists are clone3(), fchmodat4(), and fsinfo(). In some cases, developers are just trying to make more flag bits available, but there is also some significant new functionality being discussed.
clone3()
The clone() system call creates a new process or thread; it is the actual machinery behind fork(). Unlike fork(), clone() accepts a flags argument to modify how it operates. Over time, quite a few flags have been added; most of these control what resources and namespaces are to be shared with the new child process. In fact, so many flags have been added that, when CLONE_PIDFD was merged for 5.2, the last available flag bit was taken. That puts an end to the extensibility of clone().
The natural solution is to clone the clone() system call into a new one that would be able to accept more flags. Christian Brauner, perhaps feeling guilty for having snagged the last flag for CLONE_PIDFD, set out to do this work. His first attempt was called clone6() but, after some discussion, it was downgraded to clone3(). (For the curious, there is a clone2() that appears to be of interest only on the ia64 architecture.) The prototype for this system call looks something like this:
    struct clone_args {
        u64 flags;
        int *pidfd;
        int *child_tid;
        int *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
    };

    int clone3(struct clone_args *args, size_t size);
The clone_args structure contains much of the information that was previously passed directly to clone() or crammed into the flags field. The new flags is wider (64 bits on all architectures) and regains some space due to the relocation of information like the exit signal number. That should provide enough flags to last, as they say, "for a while".
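To illustrate what "relocation of the exit signal" means, here is a minimal sketch of how a legacy clone() flags word could be split into a pure flags value and a separate exit signal. It relies on the CSIGNAL mask (the low byte of the legacy flags word) from <linux/sched.h>; the struct split_flags type and the split_legacy_flags() helper are hypothetical names invented for this example, not part of any kernel or libc API.

```c
#include <linux/sched.h>  /* CLONE_* flags and the CSIGNAL mask */
#include <signal.h>       /* SIGCHLD */
#include <stdint.h>

/* Hypothetical stand-in for the two clone_args fields discussed here. */
struct split_flags {
    uint64_t flags;        /* CLONE_* bits only */
    int      exit_signal;  /* signal delivered to the parent on exit */
};

/* Separate a legacy clone() flags word into its two components. */
static struct split_flags split_legacy_flags(unsigned long legacy)
{
    struct split_flags out;

    /* Legacy clone() keeps the exit signal in the low byte of flags. */
    out.exit_signal = (int)(legacy & CSIGNAL);

    /* Everything above that byte is a real CLONE_* flag bit. */
    out.flags = (uint64_t)(legacy & ~(unsigned long)CSIGNAL);
    return out;
}
```

Calling split_legacy_flags(CLONE_VM | CLONE_FS | SIGCHLD) would yield a flags value holding only the two CLONE_* bits and an exit_signal of SIGCHLD, which is roughly the separation that the clone_args layout makes explicit.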
The size argument is the size of the clone_args structure itself. Should there ever be a need to expand that structure in the future, the kernel will be able to tell whether any given user-space caller is using the new or the old version of the structure by examining size and do the right thing either way. So, with luck, there should be no need to create a clone4() anytime soon.
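The size-based versioning idea can be modeled in user space. The sketch below is not the kernel's code; it assumes one common convention (missing fields default to zero, and any bytes beyond the structure the receiver knows about must be zero) and uses hypothetical types args_v0 and args_v1 plus a hypothetical helper, copy_versioned_args(), to show how a receiver could accept both an older and a newer structure.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "original" layout, as an old caller would know it. */
struct args_v0 { uint64_t flags; uint64_t exit_signal; };

/* Hypothetical current layout: one field was appended later. */
struct args_v1 { uint64_t flags; uint64_t exit_signal; uint64_t new_field; };

/*
 * Copy a caller-supplied structure of `size` bytes into the current layout.
 * Returns 0 on success or a negative errno value.
 */
static int copy_versioned_args(struct args_v1 *dst, const void *src, size_t size)
{
    const unsigned char *bytes = src;
    size_t i;

    /* Anything smaller than the first published version is malformed. */
    if (size < sizeof(struct args_v0))
        return -EINVAL;

    /* A newer caller may pass a larger structure; that is acceptable
     * only if the part this receiver does not understand is all zero. */
    for (i = sizeof(*dst); i < size; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    /* Fields an older caller did not supply default to zero. */
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, size < sizeof(*dst) ? size : sizeof(*dst));
    return 0;
}
```

With this scheme, an old caller passing sizeof(struct args_v0) gets new_field set to zero, while a caller built against a future, larger layout still works as long as it leaves the unknown tail zeroed.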
This interface seems to be satisfactory to everybody involved, though Jann Horn did point out one significant problem: the seccomp mechanism is unable to examine system-call arguments that are passed in separate structures, so it will be unable to make decisions based on the flags given to clone3(). That, he said, means that code meant to be sandboxed with seccomp may not use clone3() at all. Kees Cook has suggested a new mechanism for fetching user-space data for system calls that could be used by seccomp, but nobody appears to be working on that idea currently.
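The seccomp limitation can be pictured with a small model. The following sketch does not use the real seccomp API; struct syscall_view, enum verdict, and check_no_new_userns() are hypothetical names, and the assumption that flags is the first raw argument of legacy clone() holds on x86-64 but not on every architecture. The point is only that a filter working from raw argument registers can test a flags word, but cannot look behind a pointer.

```c
#include <linux/sched.h>  /* CLONE_NEWUSER */
#include <stdint.h>

/* Hypothetical, simplified view of what a filter is handed:
 * the raw argument registers and nothing else. */
struct syscall_view {
    int      is_clone3;  /* 1 for clone3(), 0 for legacy clone() */
    uint64_t args[6];
};

enum verdict { ALLOW, DENY, CANNOT_DECIDE };

/* Policy: forbid creating a new user namespace. */
static enum verdict check_no_new_userns(const struct syscall_view *v)
{
    if (!v->is_clone3) {
        /* Legacy clone() on x86-64: the flags word is itself args[0],
         * so the bit can be tested directly. */
        return (v->args[0] & CLONE_NEWUSER) ? DENY : ALLOW;
    }

    /* clone3(): args[0] is only the address of a clone_args structure.
     * The filter cannot dereference it, so the flags stay invisible. */
    return CANNOT_DECIDE;
}
```

A sandbox that cannot tolerate the CANNOT_DECIDE case has little choice but to refuse clone3() outright, which matches the concern Horn raised.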
Meanwhile, clone3() is in linux-next, and so can be expected to appear in 5.3.
fchmodat4()
A look at the man page for fchmodat() reveals the following prototype:
int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);
The flags argument is documented to have one possible value: AT_SYMLINK_NOFOLLOW, which would cause fchmodat() to operate directly on a symbolic link rather than its target. There's only one little problem: fchmodat() as implemented in the kernel does not actually accept a flags argument. That is why the man page concludes with: "This flag is not currently implemented".
Palmer Dabbelt was motivated to action by a seemingly unpleasant experience: "I spent half of dinner last night being complained to by one of our hardware engineers about Linux's lack of support for the flags argument to fchmodat()". The result was a patch set implementing support for fchmodat4(), which has the same prototype as fchmodat() but which actually implements the flags argument.
This patch set seems uncontroversial, so there should be no real barrier to its merging, though it has not yet found its way into linux-next.
fsinfo()
The statfs() system call can be used to get certain types of information about a filesystem, including its format, block size, available free blocks, maximum file-name length, and so on. But it turns out that there is a lot more to know about a filesystem than that, and statfs() is unable to provide that information. It seems like a situation just begging for somebody to come along and implement statfs2(), but instead we get fsinfo() from David Howells.
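For reference, the information mentioned above maps onto a handful of struct statfs fields. This is a minimal sketch using the existing statfs() call; struct fs_summary and summarize_fs() are hypothetical names chosen for the example.

```c
#include <sys/vfs.h>  /* statfs(), struct statfs */

/* Hypothetical container for the fields named in the text. */
struct fs_summary {
    long               type;          /* filesystem magic number (format) */
    long               block_size;    /* block size in bytes */
    unsigned long long free_blocks;   /* blocks available to unprivileged users */
    long               max_name_len;  /* maximum file-name length */
};

/* Fill `out` for the filesystem containing `path`; returns 0 or -1. */
static int summarize_fs(const char *path, struct fs_summary *out)
{
    struct statfs st;

    if (statfs(path, &st) != 0)
        return -1;  /* errno is set by statfs() */

    out->type         = (long)st.f_type;
    out->block_size   = (long)st.f_bsize;
    out->free_blocks  = (unsigned long long)st.f_bavail;
    out->max_name_len = (long)st.f_namelen;
    return 0;
}
```

Everything statfs() reports fits in one fixed structure like this, which is why questions such as timestamp resolution or mount parameters have nowhere to go in the existing interface.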
The prototype for fsinfo() looks like this:
    struct fsinfo_params {
        __u32 at_flags;
        __u32 request;
        __u32 Nth;
        __u32 Mth;
        __u64 __reserved[3];
    };

    int fsinfo(int dfd, const char *filename,
               const struct fsinfo_params *params,
               void *buffer, size_t buf_size);
The dfd and filename arguments identify the filesystem about which information is needed. params is an optional array describing the requested information, while buffer and buf_size define the output buffer.
If params is null, the returned information will be essentially the same as what statfs() would provide. But it is possible to get more, including limits on the filesystem's capabilities, timestamp resolution, mount-time parameters, remote server information, and more. Once this patch set is applied, fsinfo() will also be able to return information about the system's mount topology.
This system call is complex, to say the least; there is not space here to try to describe how it all works. Fortunately, there is some good documentation provided with it. This patch provides a fair amount of information about what fsinfo() can do, liberally intermixed with API information for filesystem developers. But see also this patch for information on how the mount-topology queries work, and this one for the somewhat baroque mechanism used to format parameter values passed back to user space.
While there is clear value in the creation of an interface for extracting
arbitrary filesystem-related information from the kernel, the complexity of
the fsinfo() patch set has proved daunting to reviewers, who have
asked for it to be broken up in the past. Filesystem developers have, in
recent years, become more insistent that new features come with additions
to the xfstests suite as well; those have not yet been provided in
this case. fsinfo() has been circulating for a while — Howells posted
a version nearly one year ago — but chances are good that it will need
to circulate for a bit longer still before it's ready for the mainline.
clone3(), fchmodat4(), and fsinfo()
Posted Jul 5, 2019 16:55 UTC (Fri) by zblaxell (subscriber, #26385)
He did? What did they want it for? Nothing in Linux looks at symlink permissions.

clone3(), fchmodat4(), and fsinfo()
Posted Jul 5, 2019 18:32 UTC (Fri) by droundy (subscriber, #4559)

clone3(), fchmodat4(), and fsinfo()
Posted Jul 5, 2019 19:06 UTC (Fri) by zblaxell (subscriber, #26385)
[...] use cases use fchmod() on files that are already open (potentially with O_NOFOLLOW), or set umask so the file doesn't have the wrong permissions in the first place. Or I put on my grumpy sysadmin hat and hand the entire problem to chmod(1) which does a symlink check with (manageable) TOCTTOU problems. It's an API gap that is so old I can't see it's there any more.
So "chmod a symlink" wouldn't be expected to work literally--the call could be ignored or return an error instead. That seems sane.
The original patch was talking about some FUSE use case, but didn't say anything further. It sounded like someone was planning to do something (presumably evil) with lrwxr-xr-x symlinks, like maybe restrict readlink() access.

clone3(), fchmodat4(), and fsinfo()
Posted Jul 8, 2019 15:27 UTC (Mon) by cyphar (subscriber, #110703)
[1]: https://github.com/openSUSE/libpathrs
[2]: https://marc.info/?l=linux-api&m=156242513200869&w=2

clone3(), fchmodat4(), and fsinfo()
Posted Jul 18, 2019 14:44 UTC (Thu) by nix (subscriber, #2304)
I don't see how this could be considered anything but a userspace regression.
Please make these flags do-nothings if the sysadmin requests it.

clone3(), fchmodat4(), and fsinfo()
Posted Jul 17, 2019 0:35 UTC (Wed) by palmer (subscriber, #84061)

struct clone_args odd packing?
Posted Jul 6, 2019 21:17 UTC (Sat) by pr1268 (subscriber, #24648)
Am I the only one who finds the packing of struct clone_args unusual? For 64-bit architectures it's a single 32-bit field (exit_signal) packed between four and three 64-bit fields. When packing data structures, I was taught to put POD types in descending order of size. Or perhaps I was just obsessive-compulsive about this. Or both...

struct clone_args odd packing?
Posted Jul 6, 2019 22:15 UTC (Sat) by garloff (subscriber, #319)
Good catch!

struct clone_args odd packing?
Posted Jul 7, 2019 12:25 UTC (Sun) by brauner (subscriber, #109349)
    struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
    };
The kernel internal struct uses kernel internal types and packing doesn't matter here:
    struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
    };

struct clone_args odd packing?
Posted Jul 11, 2019 14:35 UTC (Thu) by mirabilos (subscriber, #84359)
The padding is just ridiculous. Reorder them like this…
    struct clone_args {
        u64 flags;
        int *pidfd;
        int *child_tid;
        int *parent_tid;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
        int exit_signal;
    };
… and you’re done.

struct clone_args odd packing?
Posted Jul 11, 2019 23:46 UTC (Thu) by brauner (subscriber, #109349)

struct clone_args odd packing?
Posted Jul 11, 2019 23:59 UTC (Thu) by mirabilos (subscriber, #84359)
The uapi structure takes 64 bytes on all architectures.
My better packing takes 36 bytes on ILP32 and 60 bytes on LP64 only.

Please
Posted Jul 12, 2019 0:19 UTC (Fri) by corbet (editor, #1)
One would think that we could have a technical discussion about structure layouts without resorting to preschool words like "stupidity". Please try to avoid doing that, OK?
The structure in question is, of course, a short-lived thing that will not exist in vast quantities in the system, so squeezing every possible byte out of it doesn't seem all that important.

struct clone_args odd packing?
Posted Jul 12, 2019 0:01 UTC (Fri) by mirabilos (subscriber, #84359)

struct clone_args odd packing?
Posted Jul 12, 2019 12:30 UTC (Fri) by brauner (subscriber, #109349)
[...] for all arches that require special handling for fork-like syscalls.