Commit Graph

22 Commits

Author SHA1 Message Date
bors
8e54a21139 Auto merge of #81238 - RalfJung:copy-intrinsics, r=m-ou-se
directly expose copy and copy_nonoverlapping intrinsics

This effectively un-does https://github.com/rust-lang/rust/pull/57997. That should help with `ptr::read` codegen in debug builds (and any other of these low-level functions that bottoms out at `copy`/`copy_nonoverlapping`), where the wrapper function will not get inlined. See the discussion in https://github.com/rust-lang/rust/pull/80290 and https://github.com/rust-lang/rust/issues/81163.

Cc `@bjorn3` `@therealprof`
2021-02-13 20:30:07 +00:00
Ralf Jung
13ffa43bbb rename raw_const/mut -> const/mut_addr_of, and stabilize them 2021-01-29 15:18:45 +01:00
Ralf Jung
18d12ad171 directly expose copy and copy_nonoverlapping intrinsics 2021-01-21 10:41:22 +01:00
Ralf Jung
712d065061 remove some outdated comments regarding debug assertions 2021-01-18 13:06:01 +01:00
Ralf Jung
a5b89a00cb extend comment
Co-authored-by: lcnr <bastian_kauschke@hotmail.de>
2021-01-02 16:58:38 +01:00
Ralf Jung
1862135351 implement ptr::write without dedicated intrinsic 2020-12-30 18:39:05 +01:00
Albin Hedman
0cea1c9206 Added reference to tracking issue #80377 2020-12-26 14:03:28 +01:00
Albin Hedman
7594d2a084 Constify ptr::read and ptr::read_unaligned 2020-12-26 02:25:08 +01:00
Dylan DPC
6cd02a85f1 Rollup merge of #77844 - RalfJung:zst-box, r=nikomatsakis
clarify rules for ZST Boxes

LLVM's rules around `getelementptr inbounds` with offset 0 are a bit annoying, and as a consequence we have no choice but say that a `Box<()>` pointing to previously allocated memory that has since been freed is UB. Clarify the docs to reflect this.

This is based on conversations on the LLVM mailing list.
* Here's my initial mail: https://lists.llvm.org/pipermail/llvm-dev/2019-February/130452.html
* The first email of the March part of that thread: https://lists.llvm.org/pipermail/llvm-dev/2019-March/130831.html
* First email of the April part: https://lists.llvm.org/pipermail/llvm-dev/2019-April/131693.html

The conclusion for me at least was that `getelementptr inbounds` with offset 0 is *not* the identity function, but can sometimes return `poison` even when the input is a regular pointer -- specifically, it returns `poison` when this pointer points into something that LLVM "knows has been deallocated", i.e., a former LLVM-managed allocation. It is however the identity function on pointers obtained by casting integers.

Note that there [are formal proposals](https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf) for LLVM semantics where `getelementptr inbounds` with offset 0 isn't quite the identity function but never returns `poison` (it affects the provenance of the pointer but in a way that doesn't matter if this pointer is never used for memory accesses), and indeed this is likely necessary to consistently describe LLVM semantics. But with the informal LLVM LangRef that we have right now, and with LLVM devs insisting otherwise, it seems unwise to rely on this.
2020-11-21 19:44:07 +01:00
Ralf Jung
a7677f7714 reference NonNull::dangling 2020-11-20 11:09:49 +01:00
Camelid
fee4f8feb0 Improve wording of core::ptr::drop_in_place docs
And two small intra-doc link conversions in `std::{f32, f64}`.
2020-10-29 20:09:29 -07:00
bors
69e68cf550 Auto merge of #75728 - nagisa:improve_align_offset_2, r=Mark-Simulacrum
Optimise align_offset for stride=1 further

`stride == 1` case can be computed more efficiently through `-p (mod
a)`. That, then translates to a nice and short sequence of LLVM
instructions:

    %address = ptrtoint i8* %p to i64
    %negptr = sub i64 0, %address
    %offset = and i64 %negptr, %a_minus_one

And produces pretty much ideal code-gen when this function is used in
isolation.

Typical use of this function will, however, involve use of
the result to offset a pointer, i.e.

    %aligned = getelementptr inbounds i8, i8* %p, i64 %offset

This still looks very good, but LLVM does not really translate that to
what would be considered ideal machine code (on any target). For example
that's the codegen we obtain for an unknown alignment:

    ; x86_64
    dec     rsi
    mov     rax, rdi
    neg     rax
    and     rax, rsi
    add     rax, rdi

In particular negating a pointer is not something that’s going to be
optimised for in the design of CISC architectures like x86_64. They
are much better at offsetting pointers. And so we’d love to utilize this
ability and produce code that's more like this:

    ; x86_64
    lea     rax, [rsi + rdi - 1]
    neg     rsi
    and     rax, rsi

To achieve this we need to give LLVM an opportunity to apply its
various peep-hole optimisations that it does during DAG selection. In
particular, the `and` instruction appears to be a major inhibitor here.
We cannot, sadly, get rid of this load-bearing operation, but we can
reorder operations such that LLVM has more to work with around this
instruction.

One such ordering is proposed in #75579 and results in LLVM IR that
looks broadly like this:

    ; using add enables `lea` and similar CISCisms
    %offset_ptr = add i64 %address, %a_minus_one
    %mask = sub i64 0, %a
    %masked = and i64 %offset_ptr, %mask
    ; can be folded with `gepi` that may follow
    %offset = sub i64 %masked, %address

…and generates the intended x86_64 machine code.
One might also wonder how the increased amount of code would impact a
RISC target. Turns out not much:

    ; aarch64 previous                 ; aarch64 new
    sub     x8, x1, #1                 add     x8, x1, x0
    neg     x9, x0                     sub     x8, x8, #1
    and     x8, x9, x8                 neg     x9, x1
    add     x0, x0, x8                 and     x0, x8, x9

    (and similarly for ppc, sparc, mips, riscv, etc)

The only target that seems to do worse is… wasm32.

Onto actual measurements – the best way to evaluate snipets like these
is to use llvm-mca. Much like Aarch64 assembly would allow to suspect,
there isn’t any performance difference to be found. Both snippets
execute in same number of cycles for the CPUs I tried. On x86_64,
we get throughput improvement of >50%!

Fixes #75579
2020-10-26 06:49:34 +00:00
Ralf Jung
defcd7ff47 stop relying on feature(untagged_unions) in stdlib 2020-10-16 11:33:35 +02:00
Ralf Jung
0f572a9810 explicitly talk about integer literals 2020-10-13 09:30:09 +02:00
Ralf Jung
c555aabc5b clarify rules for ZST Boxes 2020-10-12 10:32:11 +02:00
Camelid
884a1b4b9b Fix anchor links
#safety -> self#safety
2020-09-09 13:42:57 -07:00
Camelid
d24026bb6d Fix broken link
`write` is ambiguous because there's also a macro called `write`.

Also removed unnecessary and potentially confusing link to a function in
its own docs.
2020-09-08 19:24:57 -07:00
Camelid
325acefee4 Use intra-doc links in core::ptr
The only link that I did not change is a link to a function on the
`pointer` primitive because intra-doc links for the `pointer` primitive
don't work yet (see #63351).
2020-09-08 14:36:36 -07:00
Simonas Kazlauskas
4bfacffb90 Optimise align_offset for stride=1 further
`stride == 1` case can be computed more efficiently through `-p (mod
a)`. That, then translates to a nice and short sequence of LLVM
instructions:

    %address = ptrtoint i8* %p to i64
    %negptr = sub i64 0, %address
    %offset = and i64 %negptr, %a_minus_one

And produces pretty much ideal code-gen when this function is used in
isolation.

Typical use of this function will, however, involve use of
the result to offset a pointer, i.e.

    %aligned = getelementptr inbounds i8, i8* %p, i64 %offset

This still looks very good, but LLVM does not really translate that to
what would be considered ideal machine code (on any target). For example
that's the codegen we obtain for an unknown alignment:

    ; x86_64
    dec     rsi
    mov     rax, rdi
    neg     rax
    and     rax, rsi
    add     rax, rdi

In particular negating a pointer is not something that’s going to be
optimised for in the design of CISC architectures like x86_64. They
are much better at offsetting pointers. And so we’d love to utilize this
ability and produce code that's more like this:

    ; x86_64
    lea     rax, [rsi + rdi - 1]
    neg     rsi
    and     rax, rsi

To achieve this we need to give LLVM an opportunity to apply its
various peep-hole optimisations that it does during DAG selection. In
particular, the `and` instruction appears to be a major inhibitor here.
We cannot, sadly, get rid of this load-bearing operation, but we can
reorder operations such that LLVM has more to work with around this
instruction.

One such ordering is proposed in #75579 and results in LLVM IR that
looks broadly like this:

    ; using add enables `lea` and similar CISCisms
    %offset_ptr = add i64 %address, %a_minus_one
    %mask = sub i64 0, %a
    %masked = and i64 %offset_ptr, %mask
    ; can be folded with `gepi` that may follow
    %offset = sub i64 %masked, %address

…and generates the intended x86_64 machine code. One might also wonder
how the increased amount of code would impact a RISC target. Turns out
not much:

    ; aarch64 previous                 ; aarch64 new
    sub     x8, x1, #1                 add     x8, x1, x0
    neg     x9, x0                     sub     x8, x8, #1
    and     x8, x9, x8                 neg     x9, x1
    add     x0, x0, x8                 and     x0, x8, x9

    (and similarly for ppc, sparc, mips, riscv, etc)

The only target that seems to do worse is… wasm32.

Onto actual measurements – the best way to evaluate snippets like these
is to use llvm-mca. Much like Aarch64 assembly would allow to suspect,
there isn’t any performance difference to be found. Both snippets
execute in same number of cycles for the CPUs I tried. On x86_64,
we get throughput improvement of >50%, however!
2020-08-20 05:06:00 +03:00
Simonas Kazlauskas
5d22b18bf2 Improve codegen of align_offset when stride == 1
Previously checking for `pmoda == 0` would get LLVM to generate branchy
code, when, for `stride = 1` the offset can be computed without such a
branch by doing effectively a `-p % a`.

For well-known (constant) alignments, with the new ordering of these
conditionals, we end up generating 2 to 3 cheap instructions on x86_64:

    movq    %rdi, %rax
    negl    %eax
    andl    $7, %eax

instead of 5+ as previously.

For unknown alignments the new code also generates just 3 instructions:

    negq    %rdi
    leaq    -1(%rsi), %rax
    andq    %rdi, %rax
2020-08-16 21:31:48 +03:00
Simonas Kazlauskas
e7271da69a Improve align_offset at opt-level <= 1
At opt-level <= 1, the methods such as `wrapping_mul` are not being
inlined, causing significant bloating and slowdowns of the
implementation at these optimisation levels.

With use of these intrinsics, the codegen of this function at
-Copt_level=1 is the same as it is at -Copt_level=3.
2020-08-16 21:31:48 +03:00
mark
2c31b45ae8 mv std libs to library/ 2020-07-27 19:51:13 -05:00