Commit Graph

325 Commits

Author SHA1 Message Date
bors
30e49a9ead Auto merge of #75272 - the8472:spec-copy, r=KodrAus
specialize io::copy to use copy_file_range, splice or sendfile

Fixes #74426.
Also covers #60689 but only as an optimization instead of an official API.

The specialization only covers std-owned structs so it should avoid the problems with #71091

Currently linux-only but it should be generalizable to other unix systems that have sendfile/sosplice and similar.

There is a bit of optimization potential around the syscall count. Right now it may end up doing more syscalls than the naive copy loop when doing short (<8KiB) copies between file descriptors.

The test case executes the following:

```
[pid 103776] statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=17, ...}) = 0
[pid 103776] write(4, "wxyz", 4)        = 4
[pid 103776] write(4, "iklmn", 5)       = 5
[pid 103776] copy_file_range(3, NULL, 4, NULL, 5, 0) = 5

```

0-1 `stat` calls to identify the source file type. 0 if the type can be inferred from the struct from which the FD was extracted
𝖬 `write` to drain the `BufReader`/`BufWriter` wrappers. only happen when buffers are present. 𝖬 ≾ number of wrappers present. If there is a write buffer it may absorb the read buffer contents first so only result in a single write. Vectored writes would also be an option but that would require more invasive changes to `BufWriter`.
𝖭 `copy_file_range`/`splice`/`sendfile` until file size, EOF or the byte limit from `Take` is reached. This should generally be *much* more efficient than the read-write loop and also have other benefits such as DMA offload or extent sharing.

## Benchmarks

```

OLD

test io::tests::bench_file_to_file_copy         ... bench:      21,002 ns/iter (+/- 750) = 6240 MB/s    [ext4]
test io::tests::bench_file_to_file_copy         ... bench:      35,704 ns/iter (+/- 1,108) = 3671 MB/s  [btrfs]
test io::tests::bench_file_to_socket_copy       ... bench:      57,002 ns/iter (+/- 4,205) = 2299 MB/s
test io::tests::bench_socket_pipe_socket_copy   ... bench:     142,640 ns/iter (+/- 77,851) = 918 MB/s

NEW

test io::tests::bench_file_to_file_copy         ... bench:      14,745 ns/iter (+/- 519) = 8889 MB/s    [ext4]
test io::tests::bench_file_to_file_copy         ... bench:       6,128 ns/iter (+/- 227) = 21389 MB/s   [btrfs]
test io::tests::bench_file_to_socket_copy       ... bench:      13,767 ns/iter (+/- 3,767) = 9520 MB/s
test io::tests::bench_socket_pipe_socket_copy   ... bench:      26,471 ns/iter (+/- 6,412) = 4951 MB/s
```
2020-11-14 12:01:55 +00:00
Thom Chiovoloni
55d7f736d8 Tighten the bounds on atomic Ordering in std::sys::unix::weak 2020-11-13 19:15:51 -08:00
The8472
bbfa92c82d Always handle EOVERFLOW by falling back to the generic copy loop
Previously EOVERFLOW handling was only applied for io::copy specialization
but not for fs::copy sharing the same code.

Additionally we lower the chunk size to 1GB since we have a user report
that older kernels may return EINVAL when passing 0x8000_0000
but smaller values succeed.
2020-11-13 22:38:27 +01:00
The8472
4854d418a5 do direct splice syscall and probe availability to get android builds to work
Android builds use feature level 14, the libc wrapper for splice is gated
on feature level 21+ so we have to invoke the syscall directly.
Additionally the emulator doesn't seem to support it so we also have to
add ENOSYS checks.
2020-11-13 22:38:27 +01:00
The8472
3dfc377aa1 move sendfile/splice/copy_file_range into kernel_copy module 2020-11-13 22:38:27 +01:00
The8472
888b1031bc limit visibility of copy offload helpers to sys::unix module 2020-11-13 22:38:27 +01:00
The8472
7f5d2722af move copy specialization into sys::unix module 2020-11-13 22:38:23 +01:00
The8472
46e7fbe60b reduce syscalls by inferring FD types based on source struct instead of calling stat()
also adds handling for edge-cases involving large sparse files where sendfile could fail with EOVERFLOW
2020-11-13 19:46:35 +01:00
The8472
5eb88fa5c7 hide unused exports on other platforms 2020-11-13 19:45:38 +01:00
The8472
16236470c1 specialize io::copy to use copy_file_range, splice or sendfile
Currently it only applies to linux systems. It can be extended to make use
of similar syscalls on other unix systems.
2020-11-13 19:45:27 +01:00
bors
55794e4396 Auto merge of #78965 - jryans:emscripten-threads-libc, r=kennytm
Update thread and futex APIs to work with Emscripten

This updates the thread and futex APIs in `std` to match the APIs exposed by
Emscripten. This allows threads to run on `wasm32-unknown-emscripten` and the
thread parker to compile without errors related to the missing `futex` module.

To make use of this, Rust code must be compiled with `-C target-feature=atomics`
and Emscripten must link with `-pthread`.

I have confirmed this works well locally when building multithreaded crates.
Attempting to enable `std` thread tests currently fails for seemingly obscure
reasons and Emscripten is currently disabled in CI, so further work is needed to
have proper test coverage here.
2020-11-12 05:52:17 +00:00
J. Ryan Stinnett
bf3be09ee8 Fix timeout conversion 2020-11-12 03:40:15 +00:00
J. Ryan Stinnett
951576051b Update thread and futex APIs to work with Emscripten
This updates the thread and futex APIs in `std` to match the APIs exposed by
Emscripten. This allows threads to run on `wasm32-unknown-emscripten` and the
thread parker to compile without errors related to the missing `futex` module.

To make use of this, Rust code must be compiled with `-C target-feature=atomics`
and Emscripten must link with `-pthread`.

I have confirmed this works well locally when building multithreaded crates.
Attempting to enable `std` thread tests currently fails for seemingly obscure
reasons and Emscripten is currently disabled in CI, so further work is needed to
have proper test coverage here.
2020-11-12 01:41:49 +00:00
Dylan DPC
a8beaa3b3c Rollup merge of #78878 - shepmaster:intersecting-ignores, r=Mark-Simulacrum
Avoid overlapping cfg attributes when both macOS and aarch64

r? ``@Mark-Simulacrum``
2020-11-09 01:13:48 +01:00
Dylan DPC
41134be153 Rollup merge of #78026 - sunfishcode:symlink-hard-link, r=dtolnay
Define `fs::hard_link` to not follow symlinks.

POSIX leaves it [implementation-defined] whether `link` follows symlinks.
In practice, for example, on Linux it does not and on FreeBSD it does.
So, switch to `linkat`, so that we can pick a behavior rather than
depending on OS defaults.

Pick the option to not follow symlinks. This is somewhat arbitrary, but
seems the less surprising choice because hard linking is a very
low-level feature which requires the source and destination to be on
the same mounted filesystem, and following a symbolic link could end
up in a different mounted filesystem.

[implementation-defined]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/link.html
2020-11-09 01:13:28 +01:00
Jake Goulding
b13817a795 Avoid overlapping cfg attributes when both macOS and aarch64 2020-11-08 09:43:51 -05:00
Mara Bos
eef9951e44 Rollup merge of #78572 - de-vri-es:bsd-cloexec, r=m-ou-se
Use SOCK_CLOEXEC and accept4() on more platforms.

This PR enables the use of `SOCK_CLOEXEC` and `accept4` on more platforms.

-----

Android uses the linux kernel, so it should also support it.

DragonflyBSD introduced them in 4.4 (December 2015):
https://www.dragonflybsd.org/release44/

FreeBSD introduced them in 10.0 (January 2014):
https://wiki.freebsd.org/AtomicCloseOnExec

Illumos introduced them in a commit in April 2013, not sure when it was released. It is quite possible that is has always been in Illumos:
5dbfd19ad5
https://illumos.org/man/3socket/socket
https://illumos.org/man/3socket/accept4

NetBSD introduced them in 6.0 (Oktober 2012) and 8.0 (July 2018):
https://man.netbsd.org/NetBSD-6.0/socket.2
https://man.netbsd.org/NetBSD-8.0/accept.2

OpenBSD introduced them in 5.7 (May 2015):
https://man.openbsd.org/socket https://man.openbsd.org/accept
2020-11-08 13:36:07 +01:00
Maarten de Vries
3bee37c290 Disable accept4 on Android. 2020-11-06 14:17:48 +01:00
LinkTed
ead7185db6 Fix docs for MacOs (again) 2020-11-04 19:45:48 +01:00
LinkTed
c779405686 Fix docs for MacOs (correction) 2020-11-03 18:28:04 +01:00
Ralf Jung
9f630af930 fix aliasing issue in unix sleep function 2020-10-31 16:26:06 +01:00
Maarten de Vries
59c6ae615e Use SOCK_CLOEXEC and accept4() on more platforms. 2020-10-30 14:20:10 +01:00
LinkTed
ea5e012ba7 Fix test cases for MacOs 2020-10-28 18:22:16 +01:00
Dan Gohman
6249cda78f Disable use of linkat on Android as well.
According to [the bionic status page], `linkat` has only been available
since API level 21. Since Android is based on Linux and Linux's `link`
doesn't follow symlinks, just use `link` on Android.

[the bionic status page]: https://android.googlesource.com/platform/bionic/+/master/docs/status.md
2020-10-24 09:43:31 -07:00
Tomasz Miąsko
21c29b1e95 Check that pthread mutex initialization succeeded
If pthread mutex initialization fails, the failure will go unnoticed unless
debug assertions are enabled. Any subsequent use of mutex will also silently
fail, since return values from lock & unlock operations are similarly checked
only through debug assertions.

In some implementations the mutex initialization requires a memory
allocation and so it does fail in practice.

Check that initialization succeeds to ensure that mutex guarantees
mutual exclusion.
2020-10-20 00:00:00 +00:00
Dan Gohman
ce00b3e2e0 Use link on platforms which lack linkat. 2020-10-18 07:47:32 -07:00
Dan Gohman
23a5c21415 Fix a typo in a comment. 2020-10-18 07:21:41 -07:00
LinkTed
79273fa30c Fix cannot find type ucred for MacOs by using fake definitions 2020-10-17 19:36:11 +02:00
bors
4cb7ef0f94 Auto merge of #77455 - asm89:faster-spawn, r=kennytm
Use posix_spawn() on unix if program is a path

Previously `Command::spawn` would fall back to the non-posix_spawn based
implementation if the `PATH` environment variable was possibly changed.
On systems with a modern (g)libc `posix_spawn()` can be significantly
faster. If program is a path itself the `PATH` environment variable is
not used for the lookup and it should be safe to use the
`posix_spawnp()` method. [1]

We found this, because we have a cli application that effectively runs a
lot of subprocesses. It would sometimes noticeably hang while printing
output. Profiling showed that the process was spending the majority of
time in the kernel's `copy_page_range` function while spawning
subprocesses. During this time the process is completely blocked from
running, explaining why users were reporting the cli app hanging.

Through this we discovered that `std::process::Command` has a fast and
slow path for process execution. The fast path is backed by
`posix_spawnp()` and the slow path by fork/exec syscalls being called
explicitly. Using fork for process creation is supposed to be fast, but
it slows down as your process uses more memory.  It's not because the
kernel copies the actual memory from the parent, but it does need to
copy the references to it (see `copy_page_range` above!).  We ended up
using the slow path, because the command spawn implementation in falls
back to the slow path if it suspects the PATH environment variable was
changed.

Here is a smallish program demonstrating the slowdown before this code
change:

```
use std::process::Command;
use std::time::Instant;

fn main() {
    let mut args = std::env::args().skip(1);
    if let Some(size) = args.next() {
        // Allocate some memory
        let _xs: Vec<_> = std::iter::repeat(0)
            .take(size.parse().expect("valid number"))
            .collect();

        let mut command = Command::new("/bin/sh");
        command
            .arg("-c")
            .arg("echo hello");

        if args.next().is_some() {
            println!("Overriding PATH");
            command.env("PATH", std::env::var("PATH").expect("PATH env var"));
        }

        let now = Instant::now();
        let child = command
            .spawn()
            .expect("failed to execute process");

        println!("Spawn took: {:?}", now.elapsed());

        let output = child.wait_with_output().expect("failed to wait on process");
        println!("Output: {:?}", output);
    } else {
        eprintln!("Usage: prog [size]");
        std::process::exit(1);
    }
    ()
}
```

Running it and passing different amounts of elements to use to allocate
memory shows that the time taken for `spawn()` can differ quite
significantly. In latter case the `posix_spawnp()` implementation is 30x
faster:

```
$ cargo run --release 10000000
...
Spawn took: 324.275µs
hello
$ cargo run --release 10000000 changepath
...
Overriding PATH
Spawn took: 2.346809ms
hello
$ cargo run --release 100000000
...
Spawn took: 387.842µs
hello
$ cargo run --release 100000000 changepath
...
Overriding PATH
Spawn took: 13.434677ms
hello
```

[1]: 5f72f9800b/posix/execvpe.c (L81)
2020-10-17 06:16:00 +00:00
Yuki Okushi
9abf81afa8 Rollup merge of #77900 - Thomasdezeeuw:fdatasync, r=dtolnay
Use fdatasync for File::sync_data on more OSes

Add support for the following OSes:
 * Android
 * FreeBSD: https://www.freebsd.org/cgi/man.cgi?query=fdatasync&sektion=2
 * OpenBSD: https://man.openbsd.org/OpenBSD-5.8/fsync.2
 * NetBSD: https://man.netbsd.org/fdatasync.2
 * illumos: https://illumos.org/man/3c/fdatasync
2020-10-17 05:36:45 +09:00
Dan Gohman
91a9f83dd1 Define fs::hard_link to not follow symlinks.
POSIX leaves it implementation-defined whether `link` follows symlinks.
In practice, for example, on Linux it does not and on FreeBSD it does.
So, switch to `linkat`, so that we can pick a behavior rather than
depending on OS defaults.

Pick the option to not follow symlinks. This is somewhat arbitrary, but
seems the less surprising choice because hard linking is a very
low-level feature which requires the source and destination to be on
the same mounted filesystem, and following a symbolic link could end
up in a different mounted filesystem.
2020-10-16 12:05:49 -07:00
Mara Bos
0f0257be10 Take some of sys/vxworks/process/* from sys/unix instead. 2020-10-16 06:22:05 +02:00
Mara Bos
408db0da85 Take sys/vxworks/{os,path,pipe} from sys/unix instead. 2020-10-16 06:22:00 +02:00
Mara Bos
71bb1dc2a0 Take sys/vxworks/{fd,fs,io} from sys/unix instead. 2020-10-16 06:19:00 +02:00
Mara Bos
ba483c51df Take sys/vxworks/args from sys/unix instead. 2020-10-16 06:19:00 +02:00
Mara Bos
dce405ae3d Take sys/vxworks/net from sys/unix instead. 2020-10-16 06:19:00 +02:00
Mara Bos
5d526f6eee Take sys/vxworks/thread from sys/unix instead. 2020-10-16 06:19:00 +02:00
Mara Bos
c8628f43bf Take sys/vxworks/stack_overflow from sys/unix instead. 2020-10-16 06:18:59 +02:00
Mara Bos
44a2af32cc Remove lifetime from StaticMutex and assume 'static.
StaticMutex is only ever used with as a static (as the name already
suggests). So it doesn't have to be generic over a lifetime, but can
simply assume 'static.

This 'static lifetime guarantees the object is never moved, so this is
no longer a manually checked requirement for unsafe calls to lock().
2020-10-14 09:52:03 +02:00
Thomas de Zeeuw
8c0c7ec4ec Use fdatasync for File::sync_data on more OSes
Add support for the following OSes:
 * Android
 * FreeBSD: https://www.freebsd.org/cgi/man.cgi?query=fdatasync&sektion=2
 * OpenBSD: https://man.openbsd.org/OpenBSD-5.8/fsync.2
 * NetBSD: https://man.netbsd.org/fdatasync.2
 * illumos: https://illumos.org/man/3c/fdatasync
2020-10-13 15:57:31 +02:00
LinkTed
d8c75d9f91 Fix unresolved imports for recv_vectored_with_ancillary_from, send_vectored_with_ancillary_to and SocketAncillary 2020-10-11 19:23:41 +02:00
bors
bc74dd711f Auto merge of #77727 - thomcc:mach-info-order, r=Amanieu
Avoid SeqCst or static mut in mach_timebase_info and QueryPerformanceFrequency caches

This patch went through a couple iterations but the end result is replacing a pattern where an `AtomicUsize` (updated with many SeqCst ops) guards a `static mut` with a single `AtomicU64` that is known to use 0 as a value indicating that it is not initialized.

The code in both places exists to cache values used in the conversion of Instants to Durations on macOS, iOS, and Windows.

I have no numbers to prove that this improves performance (It seems a little futile to benchmark something like this), but it's much simpler, safer, and in practice we'd expect it to be faster everywhere where Relaxed operations on AtomicU64 are cheaper than SeqCst operations on AtomicUsize, which is a lot of places.

Anyway, it also removes a bunch of unsafe code and greatly simplifies the logic, so IMO that alone would be worth it unless it was a regression.

If you want to take a look at the assembly output though, see https://godbolt.org/z/rbr6vn for x86_64, https://godbolt.org/z/cqcbqv for aarch64 (Note that this just the output of the mac side, but i'd expect the windows part to be the same and don't feel like doing another godbolt for it). There are several versions of this function in the godbolt:

- `info_new`: version in the current patch
- `info_less_new`: version in initial PR
- `info_original`: version currently in the tree
- `info_orig_but_better_orderings`: a version that just tries to change the original code's orderings from SeqCst to the (probably) minimal orderings required for soundness/correctness.

The biggest concern I have here is if we can use AtomicU64, or if there are targets that dont have it that this code supports. AFAICT: no. (If that changes in the future, it's easy enough to do something different for them)

r? `@Amanieu` because he caught a couple issues last time I tried to do a patch reducing orderings 😅

---

<details>
<summary>I rewrote this whole message so the original is inside here</summary>

I happened to notice the code we use for caching the result of mach_timebase_info uses SeqCst exclusively.

However, thinking a little more, it's actually pretty easy to avoid the static mut by packing the timebase info into an AtomicU64.

This entirely avoids needing to do the compare_exchange. The AtomicU64 can be read/written using Relaxed ops, which on current macos/ios platforms (x86_64/aarch64) have no overhead compared to direct loads/stores. This simplifies the code and makes it a lot safer too.

I have no numbers to prove that this improves performance (It seems a little futile to benchmark something like this), although it should do that on both targets it applies to.

That said, it also removes a bunch of unsafe code and simplifies the logic (arguably at least — there are only two states now, initialized or not), so I think it's a net win even without concrete numbers.

If you want to take a look at the assembly output though, see below. It has the new version, the original, and a version of the original with lower Orderings (which is still worse than the version in this PR)

- godbolt.org/z/obfqf9 x86_64-apple-darwin

- godbolt.org/z/Wz5cWc aarch64-unknown-linux-gnu (godbolt can't do aarch64-apple-ios but that doesn't matter here)

A different (and more efficient) option than this would be to just use the AtomicU64 and use the knowledge that after initialization the denominator should be nonzero... That felt like it's relying on too many things I'm not confident in, so I didn't want to do that.
</details>
2020-10-11 14:06:04 +00:00
LinkTed
64facfef51 Fix unresolved link to SocketAncillary 2020-10-10 15:19:13 +02:00
LinkTed
7b596f2e13 Fix libc is ambiguous for Windows 2020-10-10 15:19:13 +02:00
LinkTed
fc65f6a0ce Fix import errors for #[cfg(doc)] target 2020-10-10 15:19:13 +02:00
LinkTed
a81764731c Add fake definitions for Windows 2020-10-10 15:19:13 +02:00
LinkTed
d0069a0cc5 Fix imports for MacOs 2020-10-10 15:19:13 +02:00
LinkTed
1ae54e560a Change imports for cfg(doc) 2020-10-10 15:19:13 +02:00
LinkTed
e9bf69954c Remove passcred for emscripten 2020-10-10 15:19:13 +02:00
LinkTed
6b0c3dfe00 Remove unnecessary trailing semicolon 2020-10-10 15:19:13 +02:00