clone3(), fchmodat4(), and fsinfo()


By Jonathan Corbet
July 5, 2019

The kernel development community continues to propose new system calls at a high rate. Three ideas that are currently in circulation on the mailing lists are clone3(), fchmodat4(), and fsinfo(). In some cases, developers are just trying to make more flag bits available, but there is also some significant new functionality being discussed.

`clone3()`

The clone() system call creates a new process or thread; it is the actual machinery behind fork(). Unlike fork(), clone() accepts a flags argument to modify how it operates. Over time, quite a few flags have been added; most of these control what resources and namespaces are to be shared with the new child process. In fact, so many flags have been added that, when CLONE_PIDFD was merged for 5.2, the last available flag bit was taken. That puts an end to the extensibility of clone().

The natural solution is to clone the clone() system call into a new one that can accept more flags. Christian Brauner, perhaps feeling guilty for having snagged the last flag for CLONE_PIDFD, set out to do this work. His first attempt was called clone6() but, after some discussion, it was downgraded to clone3(). (For the curious, there is a clone2() that appears to be of interest only on the ia64 architecture.) The prototype for this system call looks something like this:

    struct clone_args {
        u64 flags;
        int *pidfd;
        int *child_tid;
        int *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
    };

    int clone3(struct clone_args *args, size_t size);

The clone_args structure contains much of the information that was previously passed directly to clone() or crammed into the flags field. The new flags field is wider (64 bits on all architectures) and regains some space due to the relocation of information like the exit signal number. That should provide enough flags to last, as they say, "for a while".

The size argument is the size of the clone_args structure itself. Should there ever be a need to expand that structure in the future, the kernel will be able to tell whether any given user-space caller is using the new or the old version of the structure by examining size and do the right thing either way. So, with luck, there should be no need to create a clone4() anytime soon.
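The size-based versioning described above can be sketched in ordinary C. This is an illustration of the general idea only, not the kernel's code: the structure `demo_args`, its two "versions", the function name, and the choice of error codes are all assumptions made for the example. An older caller passes a smaller size and the missing fields are treated as zero; a newer caller passing a larger structure is accepted only if the bytes the receiver does not understand are all zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument structure: "version 1" had only `flags`,
 * "version 2" appended `extra`. Neither is a real kernel structure. */
struct demo_args {
    uint64_t flags;
    uint64_t extra;
};

static_assert(sizeof(struct demo_args) == 16, "two 64-bit fields expected");

/* Copy a caller-supplied structure of `usize` bytes into `dst`.
 * Returns 0 on success or a negative errno value (an assumed convention). */
int copy_versioned_args(struct demo_args *dst, const void *src, size_t usize)
{
    const unsigned char *bytes = src;
    size_t ksize = sizeof(*dst);
    size_t ncopy = usize < ksize ? usize : ksize;

    /* Anything smaller than the first version cannot be valid. */
    if (usize < sizeof(uint64_t))
        return -EINVAL;

    /* Newer caller: fields we do not know about must be unused (zero). */
    for (size_t i = ksize; i < usize; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    /* Older caller: fields it did not supply default to zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, ncopy);
    return 0;
}
```

Under these assumptions, a caller built against the one-field version gets `extra == 0`, while a caller built against a hypothetical three-field version succeeds only as long as it leaves the third field unset.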

This interface seems to be satisfactory to everybody involved, though Jann Horn did point out one significant problem: the seccomp mechanism is unable to examine system-call arguments that are passed in separate structures, so it will be unable to make decisions based on the flags given to clone3(). That, he said, means that code meant to be sandboxed with seccomp may not use clone3() at all. Kees Cook has suggested a new mechanism for fetching user-space data for system calls that could be used by seccomp, but nobody appears to be working on that idea currently.

Meanwhile, clone3() is in linux-next, and so can be expected to appear in 5.3.
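For readers who want to experiment, here is a minimal sketch of a fork-like clone3() call made through syscall(), since a C-library wrapper cannot be assumed. Several things here are assumptions: the local structure uses the eight 64-bit fields of the user-space layout that Brauner describes in the comments, the fallback system-call number 435 is the value used on most architectures (it is not universal), and the function name is made up for the example. On a kernel or sandbox without clone3() the call is expected to fail, typically with ENOSYS.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435  /* assumption: correct on most, not all, architectures */
#endif

typedef uint64_t aligned_u64 __attribute__((aligned(8)));

/* Local copy of the user-space layout; named differently from
 * struct clone_args so it cannot clash with <linux/sched.h>. */
struct clone3_args_sketch {
    aligned_u64 flags;
    aligned_u64 pidfd;
    aligned_u64 child_tid;
    aligned_u64 parent_tid;
    aligned_u64 exit_signal;
    aligned_u64 stack;
    aligned_u64 stack_size;
    aligned_u64 tls;
};

static_assert(sizeof(struct clone3_args_sketch) == 64,
              "same size on 32-bit and 64-bit builds");

/* Create a child that exits with `exit_code`; return the status the parent
 * collects, or a negative errno value if clone3() or waitpid() fails. */
int run_child_with_clone3(int exit_code)
{
    struct clone3_args_sketch args;
    int status;
    long pid;

    memset(&args, 0, sizeof(args));
    args.exit_signal = SIGCHLD;  /* no longer squeezed into the flags word */

    pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid < 0)
        return -errno;
    if (pid == 0)
        _exit(exit_code);        /* child: behaves like the child of fork() */

    if (waitpid((pid_t)pid, &status, 0) < 0)
        return -errno;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -ECHILD;
}
```

With no flags set and no stack supplied, this behaves much like fork(); a call such as `run_child_with_clone3(7)` should return 7 where the system call is available and a negative errno value elsewhere.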

`fchmodat4()`

A look at the man page for fchmodat() reveals the following prototype:

    int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);

The flags argument is documented to have one possible value: AT_SYMLINK_NOFOLLOW, which would cause fchmodat() to operate directly on a symbolic link rather than its target. There's only one little problem: fchmodat() as implemented in the kernel does not actually accept a flags argument. That is why the man page concludes with: "This flag is not currently implemented".

Palmer Dabbelt was motivated to action by a seemingly unpleasant experience: "I spent half of dinner last night being complained to by one of our hardware engineers about Linux's lack of support for the flags argument to fchmodat()". The result was a patch set implementing support for fchmodat4(), which has the same prototype as fchmodat() but which actually implements the flags argument.

This patch set seems uncontroversial, so there should be no real barrier to its merging, though it has not yet found its way into linux-next.

`fsinfo()`

The statfs() system call can be used to get certain types of information about a filesystem, including its format, block size, available free blocks, maximum file-name length, and so on. But it turns out that there is a lot more to know about a filesystem than that, and statfs() is unable to provide that information. It seems like a situation just begging for somebody to come along and implement statfs2(), but instead we get fsinfo() from David Howells.
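As a point of comparison, the information that statfs() does provide can be read with a few lines of C. This is a small sketch; the function name is made up, and only a handful of the fields in struct statfs are printed.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/vfs.h>

/* Print a few of the fields statfs() reports for the filesystem
 * containing `path`. Returns 0 on success or a negative errno value. */
int print_fs_summary(const char *path)
{
    struct statfs sfs;

    assert(path != NULL);
    if (statfs(path, &sfs) != 0)
        return -errno;

    printf("filesystem type magic: 0x%lx\n", (unsigned long)sfs.f_type);
    printf("block size:            %lld\n", (long long)sfs.f_bsize);
    printf("free blocks:           %llu\n", (unsigned long long)sfs.f_bfree);
    printf("blocks for non-root:   %llu\n", (unsigned long long)sfs.f_bavail);
    printf("max file-name length:  %lld\n", (long long)sfs.f_namelen);
    return 0;
}
```

Calling `print_fs_summary("/")` reports on the root filesystem. Nothing in the structure covers items such as timestamp resolution or remote server details.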

The prototype for fsinfo() looks like this:

    struct fsinfo_params {
        __u32 at_flags;
        __u32 request;
        __u32 Nth;
        __u32 Mth;
        __u64 __reserved[3];
    };

    int fsinfo(int dfd, const char *filename,
               const struct fsinfo_params *params, void *buffer,
               size_t buf_size);

The dfd and filename arguments identify the filesystem about which information is needed. params is an optional array describing the requested information, while buffer and buf_size define the output buffer.

If params is null, the returned information will be essentially the same as what statfs() would provide. But it is possible to get more, including limits on the filesystem's capabilities, timestamp resolution, mount-time parameters, remote server information, and more. Once this patch set is applied, fsinfo() will also be able to return information about the system's mount topology.

This system call is complex, to say the least; there is not space here to try to describe how it all works. Fortunately, there is some good documentation provided with it. This patch provides a fair amount of information about what fsinfo() can do, liberally intermixed with API information for filesystem developers. But see also this patch for information on how the mount-topology queries work, and this one for the somewhat baroque mechanism used to format parameter values passed back to user space.

While there is clear value in the creation of an interface for extracting arbitrary filesystem-related information from the kernel, the complexity of the fsinfo() patch set has proved daunting to reviewers, who have asked for it to be broken up in the past. Filesystem developers have, in recent years, become more insistent that new features come with additions to the xfstests suite as well; those have not yet been provided in this case. fsinfo() has been circulating for a while — Howells posted a version nearly one year ago — but chances are good that it will need to circulate for a bit longer still before it's ready for the mainline.


clone3(), fchmodat4(), and fsinfo()

Posted Jul 5, 2019 16:55 UTC (Fri) by zblaxell (subscriber, #26385) [Link] (5 responses)

> "I spent half of dinner last night being complained to by one of our hardware engineers about Linux's lack of support for the flags argument to fchmodat()"

He did? What did they want it for? Nothing in Linux looks at symlink
permissions.

clone3(), fchmodat4(), and fsinfo()

Posted Jul 5, 2019 18:32 UTC (Fri) by droundy (subscriber, #4559) [Link] (3 responses)

Perhaps to avoid being tricked into changing the permissions of some other file? e.g. I'd wish for this if I were implementing chmod -R.

clone3(), fchmodat4(), and fsinfo()

Posted Jul 5, 2019 19:06 UTC (Fri) by zblaxell (subscriber, #26385) [Link]

Oh, right...I forgot Linux doesn't have a lchmod() either. All my own
use cases use fchmod() on files that are already open (potentially with
O_NOFOLLOW), or set umask so the file doesn't have the wrong permissions
in the first place. Or I put on my grumpy sysadmin hat and hand the
entire problem to chmod(1) which does a symlink check with (manageable)
TOCTTOU problems. It's an API gap that is so old I can't see it's there
any more.

So "chmod a symlink" wouldn't be expected to work literally--the call
could be ignored or return an error instead. That seems sane.

The original patch was talking about some FUSE use case, but didn't say
anything further. It sounded like someone was planning to do something
(presumably evil) with lrwxr-xr-x symlinks, like maybe restrict readlink()
access.

clone3(), fchmodat4(), and fsinfo()

Posted Jul 8, 2019 15:27 UTC (Mon) by cyphar (subscriber, #110703) [Link] (1 responses)

Coincidentally, I'm actually working on a userspace library that helps avoid TOCTOUs like these (as well as many others)[1]. I sent a patchset for openat2 over the weekend[2] -- but I will probably have to resend it next week so folks actually see it.

[1]: https://github.com/openSUSE/libpathrs
[2]: https://marc.info/?l=linux-api&m=156242513200869&w=2

clone3(), fchmodat4(), and fsinfo()

Posted Jul 18, 2019 14:44 UTC (Thu) by nix (subscriber, #2304) [Link]

My problem with libraries like this is that if people start making extensive use of them, they break my use cases. I make extensive use of bind-mounts, sometimes within source trees, sometimes under my own $HOME, often within larger groups of files that other packages may see as a conceptual whole, as part of what amounts to a hierarchical storage system (shifting some things to uncached storage, faster storage, etc, as needed). This has just worked for decades -- but if programs start insisting that some things not be mount points, this will suddenly break.

I don't see how this could be considered anything but a userspace regression.

Please make these flags do-nothings if the sysadmin requests it.

clone3(), fchmodat4(), and fsinfo()

Posted Jul 17, 2019 0:35 UTC (Wed) by palmer (subscriber, #84061) [Link]

https://github.com/sifive/wake/blob/7729a9266d4cfba93414d...

struct clone_args odd packing?

Posted Jul 6, 2019 21:17 UTC (Sat) by pr1268 (subscriber, #24648) [Link] (8 responses)

Am I the only one who finds the packing of struct clone_args unusual? For 64-bit architectures it's a single 32-bit field (exit_signal) packed between four and three 64-bit fields.

When packing data structures, I was taught to put POD types in descending order of size. Or perhaps I was just obsessive-compulsive about this. Or both...

struct clone_args odd packing?

Posted Jul 6, 2019 22:15 UTC (Sat) by garloff (subscriber, #319) [Link]

There are 32 bits unused after the exit_signal field on LP64 architectures, correct, due to the C alignment rules.
Good catch!

struct clone_args odd packing?

Posted Jul 7, 2019 12:25 UTC (Sun) by brauner (subscriber, #109349) [Link] (6 responses)

The uapi structure is correctly packed to allow 32 bit and 64 bit handling to be identical:

    struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
    };

The kernel internal struct uses kernel internal types and packing doesn't matter here:

    struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
    };

struct clone_args odd packing?

Posted Jul 11, 2019 14:35 UTC (Thu) by mirabilos (subscriber, #84359) [Link] (3 responses)

No, you’re not the only one, I also wanted to add a comment about this.

The padding is just ridiculous. Reorder them like this…

    struct clone_args {
        u64 flags;
        int *pidfd;
        int *child_tid;
        int *parent_tid;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
        int exit_signal;
    };

… and you’re done.

struct clone_args odd packing?

Posted Jul 11, 2019 23:46 UTC (Thu) by brauner (subscriber, #109349) [Link] (2 responses)

See https://lwn.net/Articles/792955/ .

struct clone_args odd packing?

Posted Jul 11, 2019 23:59 UTC (Thu) by mirabilos (subscriber, #84359) [Link] (1 responses)

That’s the comment I replied to, calling it out for stupidity.

The uapi structure takes 64 bytes on all architectures.

My better packing takes 36 bytes on ILP32 and 60 bytes on LP64 only.

Please

Posted Jul 12, 2019 0:19 UTC (Fri) by corbet (editor, #1) [Link]

One would think that we could have a technical discussion about structure layouts without resorting to preschool words like "stupidity". Please try to avoid doing that, OK?

The structure in question is, of course, a short-lived thing that will not exist in vast quantities in the system, so squeezing every possible byte out of it doesn't seem all that important.

struct clone_args odd packing?

Posted Jul 12, 2019 0:01 UTC (Fri) by mirabilos (subscriber, #84359) [Link] (1 responses)

The in-kernel one has implicit padding, which is a concern on at least m68k (with its 16-bit instead of natural alignment).

struct clone_args odd packing?

Posted Jul 12, 2019 12:30 UTC (Fri) by brauner (subscriber, #109349) [Link]

m68k doesn't have this syscall enabled. This is left to the individual maintainers
for all arches that require special handling for fork-like syscalls.