initial implementation of the darwin_objc unstable feature
Tracking issue: https://github.com/rust-lang/rust/issues/145496
This feature makes it possible to reference Objective-C classes and selectors using the same ABI used by native Objective-C on Apple/Darwin platforms. Without it, Rust code interacting with Objective-C must resort to loading classes and selectors using costly string-based lookups at runtime. With it, these references can be loaded efficiently at dynamic load time.
r? ```@tmandry```
try-job: `*apple*`
try-job: `x86_64-gnu-nopt`
Add `-Zindirect-branch-cs-prefix`
Cc: ``@azhogin`` ``@Darksonn``
This goes on top of https://github.com/rust-lang/rust/pull/135927, i.e. please skip the first commit here. Please feel free to inherit it there.
In fact, I am not sure if there is any use case for the flag without `-Zretpoline*`. GCC and Clang allow it, though.
There is a `FIXME` for two `ignore`s in the test that I took from another test I did in the past -- they may be needed or not here since I didn't run the full CI. Either way, it is not critical.
Tracking issue: https://github.com/rust-lang/rust/issues/116852.
MCP: https://github.com/rust-lang/compiler-team/issues/868.
Fix `-Zregparm` for LLVM builtins
This fixes the issue where `-Zregparm=N` was not working correctly when calling LLVM intrinsics
By default on `x86-32`, arguments are passed on the stack. The `-Zregparm=N` flag allows the first `N` arguments to be passed in registers instead.
When calling intrinsics like `memset`, LLVM still passes parameters on the stack, which prevents optimizations like tail calls.
As proposed by ````@tgross35,```` I fixed this by setting the `NumRegisterParameters` LLVM module flag to `N` when the `-Zregparm=N` is set.
```rust
// compiler/rust_codegen_llvm/src/context.rs#375-382
if let Some(regparm_count) = sess.opts.unstable_opts.regparm {
llvm::add_module_flag_u32(
llmod,
llvm::ModuleFlagMergeBehavior::Error,
"NumRegisterParameters",
regparm_count,
);
}
```
[Here](https://rust.godbolt.org/z/YMezreo48) is a before/after compiler explorer.
Here is the final result for the code snippet in the original issue:
```asm
entrypoint:
push esi
mov esi, eax
mov eax, ecx
mov ecx, esi
pop esi
jmp memset ; Tail call parameters in registers
```
Fixes: https://github.com/rust-lang/rust/issues/145271
set
* Enforce the `-Zregparm=N` flag by setting the NumRegisterParameters
LLVM module flag * Add assembly tests verifying that the parameters are
passed in registers for reparm values 1, 2, and 3, for both LLVM
intrinsics and non-builtin functions * Add c_void type to minicore
gpu offload host code generation
r? ghost
This will generate most of the host side code to use llvm's offload feature.
The first PR will only handle automatic mem-transfers to and from the device.
So if a user calls a kernel, we will copy inputs back and forth, but we won't do the actual kernel launch.
Before merging, we will use LLVM's Info infrastructure to verify that the memcopies match what openmp offloa generates in C++. `LIBOMPTARGET_INFO=-1 ./my_rust_binary` should print that a memcpy to and later from the device is happening.
A follow-up PR will generate the actual device-side kernel which will then do computations on the GPU.
A third PR will implement manual host2device and device2host functionality, but the goal is to minimize cases where a user has to overwrite our default handling due to performance issues.
I'm trying to get a full MVP out first, so this just recognizes GPU functions based on magic names. The final frontend will obviously move this over to use proper macros, like I'm already doing it for the autodiff work.
This work will also be compatible with std::autodiff, so one can differentiate GPU kernels.
Tracking:
- https://github.com/rust-lang/rust/issues/131513
Autodiff batching
Enzyme supports batching, which is especially known from the ML side when training neural networks.
There we would normally have a training loop, where in each iteration we would pass in some data (e.g. an image), and a target vector. Based on how close we are with our prediction we compute our loss, and then use backpropagation to compute the gradients and update our weights.
That's quite inefficient, so what you normally do is passing in a batch of 8/16/.. images and targets, and compute the gradients for those all at once, allowing better optimizations.
Enzyme supports batching in two ways, the first one (which I implemented here) just accepts a Batch size,
and then each Dual/Duplicated argument has not one, but N shadow arguments. So instead of
```rs
for i in 0..100 {
df(x[i], y[i], 1234);
}
```
You can now do
```rs
for i in 0..100.step_by(4) {
df(x[i+0],x[i+1],x[i+2],x[i+3], y[i+0], y[i+1], y[i+2], y[i+3], 1234);
}
```
which will give the same results, but allows better compiler optimizations. See the testcase for details.
There is a second variant, where we can mark certain arguments and instead of having to pass in N shadow arguments, Enzyme assumes that the argument is N times longer. I.e. instead of accepting 4 slices with 12 floats each, we would accept one slice with 48 floats. I'll implement this over the next days.
I will also add more tests for both modes.
For any one preferring some more interactive explanation, here's a video of Tim's llvm dev talk, where he presents his work. https://www.youtube.com/watch?v=edvaLAL5RqU
I'll also add some other docs to the dev guide and user docs in another PR.
r? ghost
Tracking:
- https://github.com/rust-lang/rust/issues/124509
- https://github.com/rust-lang/rust/issues/135283
Lower BinOp::Cmp to llvm.{s,u}cmp.* intrinsics
Lowers `mir::BinOp::Cmp` (`three_way_compare` intrinsic) to the corresponding LLVM `llvm.{s,u}cmp.i8.*` intrinsics.
These are the intrinsics mentioned in https://github.com/rust-lang/rust/pull/118310, which are now available in LLVM 19.
I couldn't find any follow-up PRs/discussions about this, please let me know if I missed something.
r? `@scottmcm`
Clean up various LLVM FFI things in codegen_llvm
cc ```@ZuseZ4``` I touched some autodiff parts
The major change of this PR is [bfd88ce](bfd88cead0) which makes `CodegenCx` generic just like `GenericBuilder`
The other commits mostly took advantage of the new feature of making extern functions safe, but also just used some wrappers that were already there and shrunk unsafe blocks.
best reviewed commit-by-commit
`rustc_codegen_llvm` relied on `Deref` impls where `Deref::Target` was
or contained an extern type - in my experimental implementation of
rust-lang/rfcs#3729, this isn't possible as the `Target` associated
type's `?Sized` bound cannot be relaxed backwards compatibly (unless we
come up with some way of doing this).
In later pull requests with the rust-lang/rfcs#3729 implementation,
breakage like this could only occur for nightly users relying on the
`extern_types` feature.
Upstreaming this to avoid needing to keep carrying this patch locally,
and I think it'll necessarily need to change eventually.
nvptx64: update default alignment to match LLVM 21
This changed in llvm/llvm-project@91cb8f5d32. The commit itself is mostly about some intrinsic instructions, but as an aside it also mentions something about addrspace for tensor memory, which I believe is what this string is telling us.
`@rustbot` label: +llvm-main