`Split*::as_str` refactor
I made this patch almost a year ago, so the rename and the behavior change are in one commit, sorry 😅
This fixes #84974, as it's required to make other changes work.
This PR
- Renames `as_str` method of string `Split*` iterators to `remainder` (it seems like the `as_str` name was confusing to users)
- Makes `remainder` return `Option<&str>`, to distinguish between "the iterator is exhausted" and "the tail is empty"; this was [required on the tracking issue](https://github.com/rust-lang/rust/issues/77998#issuecomment-832696619)
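
The difference between "exhausted" and "empty tail" can be sketched with a hand-rolled splitter. `CommaSplit` below is a hypothetical stand-in written for illustration, not the standard library's `Split` implementation; it only assumes the semantics described in the list above.

```rust
/// Hypothetical stand-in for a `Split`-like iterator over ',' separators.
/// It is not the std implementation; it only mirrors the described semantics.
struct CommaSplit<'a> {
    /// `Some(tail)` while there is still something to yield, `None` once exhausted.
    rest: Option<&'a str>,
}

impl<'a> CommaSplit<'a> {
    fn new(s: &'a str) -> Self {
        Self { rest: Some(s) }
    }

    /// Returns the part of the string not yet yielded,
    /// or `None` if the iterator is exhausted.
    fn remainder(&self) -> Option<&'a str> {
        self.rest
    }
}

impl<'a> Iterator for CommaSplit<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let rest = self.rest?;
        match rest.find(',') {
            Some(i) => {
                // Keep everything after the separator as the new tail.
                self.rest = Some(&rest[i + 1..]);
                Some(&rest[..i])
            }
            None => {
                // Last piece: after yielding it, the iterator is exhausted.
                self.rest = None;
                Some(rest)
            }
        }
    }
}
```

With the input `"a,"`, `remainder()` is `Some("")` after the first `next()` (an empty tail is still pending) and `None` only after that empty piece has been yielded too. A plain `&str` return type could not tell these two states apart.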
r? `@m-ou-se`
Previously, the `str.lines()` docstring stated that lines are split at line
endings, but not whether those endings are included in the returned lines. This
new version of the docstring states this explicitly, so readers no longer need
to consult the doctests to answer this FAQ.
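
As a quick illustration of the behavior the docstring now spells out, the helper below (`collect_lines` is a name made up for this sketch) gathers what `str::lines` yields:

```rust
/// Collects the lines of `text` exactly as `str::lines` yields them.
/// `collect_lines` is a hypothetical helper used only for this illustration.
fn collect_lines(text: &str) -> Vec<&str> {
    // Both "\n" and "\r\n" act as line endings, and neither is
    // included in the returned slices.
    text.lines().collect()
}
```

For `"foo\nbar\r\nbaz\n"` this returns `["foo", "bar", "baz"]`: the endings are stripped, and the final newline does not produce a trailing empty line.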
This commit
- Renames `Split*::{as_str -> remainder}` as it seems less confusing
- Makes `remainder` return Option<&str> to distinguish between
"iterator is exhausted" and "the tail is empty"
Add slice to the stack allocated string comment
Clarify that the "stack allocated string" is not a string but a string slice.
`@rustbot` label +A-docs
- bump simd compare to 32bytes
- import small slice compare code from memmem crate
- try a few different probe bytes to avoid degenerate cases
- but special-case 2-byte needles
Based on Wojciech Muła's "SIMD-friendly algorithms for substring searching"[0]
The two-way algorithm is Big-O efficient but it needs to preprocess the needle
to find a "critical factorization" of it. This additional work is significant
for short needles. Additionally it mostly advances needle.len() bytes at a time.
The SIMD-based approach used here on the other hand can advance based on its
vector width, which can exceed the needle length. Pathological cases still
exist, but because it is limited to small needles the worst-case blowup is also small.
Benchmarks taken on a Zen2:
```
16CGU, OLD:
test str::bench_contains_short_short ... bench: 27 ns/iter (+/- 1)
test str::bench_contains_short_long ... bench: 667 ns/iter (+/- 29)
test str::bench_contains_bad_naive ... bench: 131 ns/iter (+/- 2)
test str::bench_contains_bad_simd ... bench: 130 ns/iter (+/- 2)
test str::bench_contains_equal ... bench: 148 ns/iter (+/- 4)
16CGU, NEW:
test str::bench_contains_short_short ... bench: 8 ns/iter (+/- 0)
test str::bench_contains_short_long ... bench: 135 ns/iter (+/- 4)
test str::bench_contains_bad_naive ... bench: 130 ns/iter (+/- 2)
test str::bench_contains_bad_simd ... bench: 292 ns/iter (+/- 1)
test str::bench_contains_equal ... bench: 3 ns/iter (+/- 0)
1CGU, OLD:
test str::bench_contains_short_short ... bench: 30 ns/iter (+/- 0)
test str::bench_contains_short_long ... bench: 713 ns/iter (+/- 17)
test str::bench_contains_bad_naive ... bench: 131 ns/iter (+/- 3)
test str::bench_contains_bad_simd ... bench: 130 ns/iter (+/- 3)
test str::bench_contains_equal ... bench: 148 ns/iter (+/- 6)
1CGU, NEW:
test str::bench_contains_short_short ... bench: 10 ns/iter (+/- 0)
test str::bench_contains_short_long ... bench: 111 ns/iter (+/- 0)
test str::bench_contains_bad_naive ... bench: 135 ns/iter (+/- 3)
test str::bench_contains_bad_simd ... bench: 274 ns/iter (+/- 2)
test str::bench_contains_equal ... bench: 4 ns/iter (+/- 0)
```
[0] http://0x80.pl/articles/simd-strfind.html#sse-avx2
The `from_str` implementation from the example had an `unwrap` that would make it panic on invalid input strings. Instead of panicking, it now returns an error to better reflect the intended behavior of the `FromStr` trait.
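
A minimal sketch of the non-panicking style, assuming a made-up `"x,y"` input format; `Point` and `ParsePointError` here are illustrative and not necessarily the types from the documentation example:

```rust
use std::str::FromStr;

/// Illustrative type; the "x,y" text format is an assumption of this sketch.
#[derive(Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

/// Error returned when the input is not of the form "x,y".
#[derive(Debug, PartialEq)]
struct ParsePointError;

impl FromStr for Point {
    type Err = ParsePointError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Every fallible step maps to `ParsePointError` instead of unwrapping.
        let (x, y) = s.split_once(',').ok_or(ParsePointError)?;
        let x = x.trim().parse::<i32>().map_err(|_| ParsePointError)?;
        let y = y.trim().parse::<i32>().map_err(|_| ParsePointError)?;
        Ok(Point { x, y })
    }
}
```

With this shape, `"1,2".parse::<Point>()` yields `Ok(Point { x: 1, y: 2 })`, while malformed input such as `"oops"` yields `Err(ParsePointError)` rather than a panic.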
Replace most uses of `pointer::offset` with `add` and `sub`
As the PR title says, this replaces `pointer::offset` in the compiler and standard library with `pointer::add` and `pointer::sub`. This generally makes code cleaner, easier to grasp and removes (or, well, hides) integer casts.
This is generally trivially correct: `.offset(-constant)` is just `.sub(constant)`, `.offset(usized as isize)` is just `.add(usized)`, etc. However, in some cases we need to be careful with the signs of things.
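
The kind of rewrite involved looks roughly like the following. The helper functions are hypothetical and exist only to put the two spellings side by side:

```rust
/// Old style: a `usize` index has to be cast to `isize` for `offset`.
fn read_with_offset(data: &[u32], idx: usize) -> u32 {
    assert!(idx < data.len());
    // SAFETY: `idx` was just checked to be in bounds.
    unsafe { *data.as_ptr().offset(idx as isize) }
}

/// New style: `add` takes the `usize` directly, no cast needed.
fn read_with_add(data: &[u32], idx: usize) -> u32 {
    assert!(idx < data.len());
    // SAFETY: `idx` was just checked to be in bounds.
    unsafe { *data.as_ptr().add(idx) }
}

/// Stepping backwards: `.sub(n)` replaces `.offset(-(n as isize))`.
fn read_stepping_back(data: &[u32], idx: usize, n: usize) -> u32 {
    assert!(n <= idx && idx < data.len());
    // SAFETY: `idx` is in bounds and `idx - n` cannot underflow.
    unsafe { *data.as_ptr().add(idx).sub(n) }
}
```

The sign-related care mentioned above applies where the original offset was a genuinely signed value; such a value cannot be passed to `add` or `sub` without first deciding which direction it goes.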
r? `@scottmcm`
_split off from #100746_
Expose `Utf8Lossy` as `Utf8Chunks`
This PR changes the feature for `Utf8Lossy` from `str_internals` to `utf8_lossy` and improves the API. This is done to eventually expose the API as stable.
Proposal: rust-lang/libs-team#54
Tracking Issue: #99543
Stabilize checked slice->str conversion functions
This PR stabilizes the following APIs as `const` functions in Rust 1.63:
```rust
// core::str
pub const fn from_utf8(v: &[u8]) -> Result<&str, Utf8Error>;
impl Utf8Error {
pub const fn valid_up_to(&self) -> usize;
pub const fn error_len(&self) -> Option<usize>;
}
```
Note that the `from_utf8_mut` function is not stabilized, as unique references (`&mut _`) are [unstable in const context].
FCP: https://github.com/rust-lang/rust/issues/91006#issuecomment-1134593095
[unstable in const context]: https://github.com/rust-lang/rust/issues/57349
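
A small sketch of what the const-stable functions allow, assuming a toolchain of Rust 1.63 or later; the constant names are invented for this example:

```rust
/// Validated at compile time: a non-UTF-8 literal here would fail the build.
const GREETING: &str = match std::str::from_utf8(b"hello") {
    Ok(s) => s,
    Err(_) => panic!("GREETING is not valid UTF-8"),
};

/// `Utf8Error::valid_up_to` is also callable in const context.
/// The byte 0xFF can never appear in valid UTF-8.
const VALID_PREFIX_LEN: usize = match std::str::from_utf8(&[b'a', b'b', 0xFF]) {
    Ok(_) => 3,
    Err(e) => e.valid_up_to(),
};
```

Here `GREETING` is `"hello"` and `VALID_PREFIX_LEN` is `2`, both computed during compilation.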
Add implicit call to from_str via parse in documentation
The documentation mentions "FromStr’s from_str method is often used implicitly,
through str’s parse method. See parse’s documentation for examples.".
It may be nicer to show that in the code example as well.
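
For illustration, the explicit and implicit forms side by side (`parse_both_ways` is a hypothetical helper, not part of the documentation change):

```rust
use std::num::ParseIntError;
use std::str::FromStr;

/// Parses `s` twice: once by calling `FromStr::from_str` explicitly and
/// once through `str::parse`, which calls `from_str` under the hood.
fn parse_both_ways(s: &str) -> (Result<i32, ParseIntError>, Result<i32, ParseIntError>) {
    let explicit = i32::from_str(s);
    let implicit = s.parse::<i32>();
    (explicit, implicit)
}
```

Both results are always equal; `parse` additionally does not require `FromStr` to be imported at the call site.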
In the documentation comment for `std::str::rfind`, say "last" instead
of "rightmost" to describe the match that `rfind` finds. This follows the
spirit of #30459, for which `trim_left` and `trim_right` were replaced by
`trim_start` and `trim_end` to be more clear about how they work on
text which is displayed right-to-left.
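
A short sketch of why "last" is the more accurate word: the index returned is the byte offset of the final match in the string's logical order, whatever direction the text is rendered in. `first_and_last` is a made-up helper.

```rust
/// Returns the byte offsets of the first and last match of `needle`.
/// Hypothetical helper for illustration only.
fn first_and_last(haystack: &str, needle: &str) -> (Option<usize>, Option<usize>) {
    (haystack.find(needle), haystack.rfind(needle))
}
```

For `"abcabc"` and `"bc"` this gives `(Some(1), Some(4))`: `rfind` reports the match that starts latest in the byte sequence.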
Warn on unused `#[doc(hidden)]` attributes on trait impl items
[Zulip conversation](https://rust-lang.zulipchat.com/#narrow/stream/266220-rustdoc/topic/.E2.9C.94.20Validy.20checks.20for.20.60.23.5Bdoc.28hidden.29.5D.60).
Whether an associated item in a trait impl is shown or hidden in the documentation entirely depends on the corresponding item in the trait declaration. Rustdoc completely ignores `#[doc(hidden)]` attributes on impl items. No error or warning is emitted:
```rust
pub trait Tr { fn f(); }
pub struct Ty;
impl Tr for Ty { #[doc(hidden)] fn f() {} }
// ^^^^^^^^^^^^^^ ignored by rustdoc and currently
// no error or warning issued
```
This may lead users to the wrong belief that the attribute has an effect. In fact, several such cases are found in the standard library (I've removed all of them in this PR).
There does not seem to exist any incentive to allow this in the future either: Impl'ing a trait for a type means the type *fully* conforms to its API. Users can add `#[doc(hidden)]` to the whole impl if they want to hide the implementation or add the attribute to the corresponding associated item in the trait declaration to hide the specific item. Hiding an implementation of an associated item does not make much sense: The associated item can still be found on the trait page.
This PR emits the warn-by-default lint `unused_attribute` for this case with a future-incompat warning.
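
The two placements that do have an effect, as described above, can be sketched like this (the trait and type names are illustrative):

```rust
pub trait Tr {
    /// Hidden for every implementor, because rustdoc consults the
    /// trait declaration, not the impl item.
    #[doc(hidden)]
    fn f() -> u32;

    fn g() -> u32;
}

pub struct Ty;

// No attribute on the impl items: `f` is already hidden via the trait.
impl Tr for Ty {
    fn f() -> u32 { 1 }
    fn g() -> u32 { 2 }
}

pub struct Other;

// Hides this whole implementation from the generated documentation.
#[doc(hidden)]
impl Tr for Other {
    fn f() -> u32 { 3 }
    fn g() -> u32 { 4 }
}
```

Neither placement changes runtime behavior; both only affect what rustdoc renders.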
`@rustbot` label T-compiler T-rustdoc A-lint
Remove `#[rustc_deprecated]`
This removes `#[rustc_deprecated]` and introduces diagnostics to point users in the right direction (that being `#[deprecated]`). All uses of `#[rustc_deprecated]` have been converted. CI is expected to fail initially; this requires #95958, which includes converting `stdarch`.
I plan on following up in a short while (maybe a bootstrap cycle?) removing the diagnostics, as they're only intended to be short-term.
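
For reference, the user-facing attribute that remains looks like this in ordinary crate code. The function names and version string are invented for the sketch, and it does not show the standard library's internal stability attributes:

```rust
/// Deprecated entry point kept for compatibility (hypothetical example).
#[deprecated(since = "0.2.0", note = "use `new_name` instead")]
pub fn old_name() -> u32 {
    new_name()
}

/// Replacement that callers are pointed to by the deprecation note.
pub fn new_name() -> u32 {
    42
}
```

Callers of `old_name` get a `deprecated` lint warning that includes the note, unless they opt out with `#[allow(deprecated)]`.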
Some masks were defined as
```rust
const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
```
where it was assumed that `usize` is never wider than 64 bits, which is currently true.
To make those constants valid on a hypothetical 128-bit target, they have been redefined in a `usize`-width-agnostic way:
```rust
const NONASCII_MASK: usize = usize::from_ne_bytes([0x80; size_of::<usize>()]);
```
There are already some cases where Rust anticipates the possibility of supporting 128-bit targets, such as not implementing `From<usize>` for `u64`.
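
A minimal sketch of how such a mask is typically used, and a check that the two definitions agree on today's targets. `word_has_non_ascii` is a hypothetical helper, not a function from the standard library.

```rust
use std::mem::size_of;

/// One 0x80 byte per byte of `usize`, whatever the pointer width is.
const NONASCII_MASK: usize = usize::from_ne_bytes([0x80; size_of::<usize>()]);

/// Returns true if any byte packed into `word` has its high bit set,
/// i.e. is not ASCII. Hypothetical helper for illustration.
fn word_has_non_ascii(word: usize) -> bool {
    (word & NONASCII_MASK) != 0
}
```

On 32-bit and 64-bit targets the new constant equals `0x80808080_80808080u64 as usize`, since the cast truncates to the pointer width; only on a wider `usize` would the old definition leave the upper bytes unmasked.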
Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that `&str`s are valid UTF-8."
This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do.
If user code wants to create an unsafe `&str` pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.
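
One way to honor the precondition is to establish validity before the unchecked call. The sketch below assumes an ASCII-only fast path; `ascii_to_str` is a hypothetical helper.

```rust
/// Converts `bytes` to `&str` without a full UTF-8 validation pass,
/// but only when the precondition is known to hold.
/// Hypothetical helper for illustration.
fn ascii_to_str(bytes: &[u8]) -> Option<&str> {
    if bytes.is_ascii() {
        // SAFETY: every ASCII byte sequence is well-formed UTF-8, so the
        // precondition of `from_utf8_unchecked` is met at the call itself.
        Some(unsafe { std::str::from_utf8_unchecked(bytes) })
    } else {
        None
    }
}
```

The important point is that the justification must hold at the call to `from_utf8_unchecked`, not merely before some later use of the resulting `&str`.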
Correct safety reasoning in `str::make_ascii_{lower,upper}case()`
I don't understand why the previous comment was used (it was inserted in #66564), but it doesn't explain why these functions are safe, only why `str::as_bytes{_mut}()` are safe.
If someone thinks they make perfect sense, I'm fine with closing this PR.
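
For context, the reasoning such a safety comment needs to carry can be sketched as follows. This is an illustration of the argument, not the exact wording or code adopted in the PR, and `lowercase_ascii_in_place` is a made-up name.

```rust
/// Lowercases ASCII letters in place, leaving all other bytes untouched.
/// Hypothetical re-implementation for illustration.
fn lowercase_ascii_in_place(s: &mut str) {
    // SAFETY: changing the case of an ASCII letter replaces one ASCII byte
    // with another ASCII byte, and bytes >= 0x80 are never modified, so the
    // buffer is still valid UTF-8 when the mutable borrow ends.
    let bytes = unsafe { s.as_bytes_mut() };
    bytes.make_ascii_lowercase();
}
```

The comment justifies the specific mutation performed, which is what a caller of `as_bytes_mut` is obliged to argue.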
Bump bootstrap compiler to 1.61.0 beta
This PR bumps the bootstrap compiler to the 1.61.0 beta. The first commit changes the stage0 compiler, the second commit applies the "mechanical" changes and the third and fourth commits apply changes explained in the relevant comments.
r? `@Mark-Simulacrum`
This updates the standard library's documentation to use the new syntax. The
documentation is worthwhile to update as it should be more idiomatic
(particularly for features like this, which are nice for users to get acquainted
with). The general codebase is likely more hassle than benefit to update: it'll
hurt git blame, and generally updates can be done by folks updating the code if
(and when) that makes things more readable with the new format.
A few places in the compiler and library code are updated (mostly just due to
already having been done when this commit was first authored).