Because of an ambiguity with const closures, the parser needs to ensure
that for a const item, the `const` keyword isn't followed by a `move` or
`static` keyword, as that would indicate a const closure:
```rust
fn main() {
    const move // ...
}
```
This check did not take raw identifiers into account, so it was unable
to distinguish between `const move` and `const r#move`. The
latter is obviously not a const closure, so it should be allowed as a
const item.
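For example, a const item named with a raw identifier should parse fine
(illustrative snippet; the item name is arbitrary):
```rust
// `r#move` is a raw identifier here, not the `move` keyword, so this is a
// plain const item rather than the start of a const closure.
#[allow(non_upper_case_globals)]
const r#move: usize = 1;

fn main() {
    let _ = r#move;
}
```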
This fixes the check in the parser to only treat `const ...` as a const
closure if it's followed by the *proper keyword*, and not a raw
identifier.
Additionally, this adds a large test that exercises all raw identifiers in
all kinds of positions, including `const`, to prevent issues like this
one from occurring again.
The size of this struct depends on the alignment of `u128`: for example,
powerpc64le and s390x have 8-byte alignment and end up with only 280 bytes.
Our 64-bit tier-1 arches all agree with each other, though, so let's just
assert on those.
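As a small illustration of why the size differs (hypothetical struct; the
real one is much larger):
```rust
// `u128` has 16-byte alignment on x86_64/aarch64 but only 8-byte alignment
// on powerpc64le/s390x, so the padding needed around a `u128` field changes
// the overall struct size.
struct Example {
    a: u8,
    b: u128,
}

fn main() {
    println!("align_of::<u128>() = {}", std::mem::align_of::<u128>());
    println!("size_of::<Example>() = {}", std::mem::size_of::<Example>());
}
```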
The parser pushes a `TokenType` to `Parser::expected_token_types` on
every call to the various `check`/`eat` methods, and clears it on every
call to `bump`. Some of those `TokenType` values are full tokens that
require cloning and dropping. This is a *lot* of work for something
that is only used in error messages and it accounts for a significant
fraction of parsing execution time.
This commit overhauls `TokenType` so that `Parser::expected_token_types`
can be implemented as a bitset. This requires changing `TokenType` to a
C-style parameterless enum, and adding `TokenTypeSet` which uses a
`u128` for the bits. (The new `TokenType` has 105 variants.)
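Roughly, the shape is as follows (a minimal sketch with made-up variant
names, not the actual rustc definitions):
```rust
// A C-style, parameterless enum: each variant maps to a small discriminant.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum TokenType {
    Eq,
    Star,
    OpenParen,
    // ... the real enum has ~100 more variants, all parameterless ...
}

// A set of `TokenType`s stored as bits in a `u128`, so insert/contains/clear
// are cheap bit operations instead of pushes, clones, and drops.
#[derive(Clone, Copy, Default)]
struct TokenTypeSet(u128);

impl TokenTypeSet {
    fn insert(&mut self, t: TokenType) {
        self.0 |= 1u128 << (t as u32);
    }
    fn contains(&self, t: TokenType) -> bool {
        self.0 & (1u128 << (t as u32)) != 0
    }
    fn clear(&mut self) {
        self.0 = 0;
    }
}

fn main() {
    let mut expected = TokenTypeSet::default();
    expected.insert(TokenType::Star);
    assert!(expected.contains(TokenType::Star));
    expected.clear();
    assert!(!expected.contains(TokenType::OpenParen));
}
```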
The new types `ExpTokenPair` and `ExpKeywordPair` are now arguments to
the `check`/`eat` methods. This is for maximum speed. The elements in
the pairs are always statically known; e.g. a
`token::BinOp(token::Star)` is always paired with a `TokenType::Star`.
So we now compute `TokenType`s in advance and pass them in to
`check`/`eat` rather than the current approach of constructing them on
insertion into `expected_token_types`.
Values of these pair types can be produced by the new `exp!` macro,
which is used at every `check`/`eat` call site. The macro is for
convenience, allowing any pair to be generated from a single identifier.
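In sketch form (stand-in types and a simplified macro; the real definitions
in `rustc_parse` differ in detail):
```rust
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq)]
enum TokenKind { Star, Eq }

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum TokenType { Star, Eq }

// Both elements are statically known at the call site, so `check`/`eat`
// never have to construct a `TokenType` themselves.
struct ExpTokenPair<'a> {
    tok: &'a TokenKind,
    token_type: TokenType,
}

// `exp!(Star)` builds the pair from a single identifier.
macro_rules! exp {
    ($name:ident) => {
        ExpTokenPair { tok: &TokenKind::$name, token_type: TokenType::$name }
    };
}

fn main() {
    let pair = exp!(Star);
    assert!(*pair.tok == TokenKind::Star);
    let _ = pair.token_type;
}
```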
The ident/keyword filtering in `expected_one_of_not_found` is no longer
necessary. It was there to account for some sloppiness in
`TokenKind`/`TokenType` comparisons.
The existing `TokenType` is moved to a new file `token_type.rs`, and all
its new infrastructure is added to that file. There is more boilerplate
code than I would like, but I can't see how to make it shorter.
The `bra`/`ket` naming convention is used in a handful of spots in the parser for
delimiters. It confused me when I first saw it a long time ago, and I've
never liked it. A web search says "Bra-ket notation" exists in linear
algebra but the terminology has zero prior use in a programming context,
as far as I can tell.
This commit changes it to `open`/`close`, which is consistent with the
rest of the compiler.
The most significant change is to `check_keyword`: it now only pushes to
`expected_token_types` if the keyword check fails, which matches how all
the other `check` methods work.
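In sketch form (stand-in fields; not the actual parser code):
```rust
struct Parser {
    token_is_kw: bool,                        // stand-in for `self.token.is_keyword(kw)`
    expected_token_types: Vec<&'static str>,  // stand-in for the real set
}

impl Parser {
    // Record the expectation only when the check fails, like the other
    // `check` methods; previously it was recorded unconditionally.
    fn check_keyword(&mut self, kw: &'static str) -> bool {
        if self.token_is_kw {
            true
        } else {
            self.expected_token_types.push(kw);
            false
        }
    }
}

fn main() {
    let mut p = Parser { token_is_kw: false, expected_token_types: Vec::new() };
    assert!(!p.check_keyword("fn"));
    assert_eq!(p.expected_token_types, ["fn"]);
}
```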
The remainder are just tweaks to make these methods more consistent with
each other.
`rustc_span::symbol` defines some things that are re-exported from
`rustc_span`, such as `Symbol` and `sym`. But it doesn't re-export some
closely related things such as `Ident` and `kw`. So you can do `use
rustc_span::{Symbol, sym}` but you have to do `use
rustc_span::symbol::{Ident, kw}`, which is inconsistent for no good
reason.
This commit re-exports `Ident`, `kw`, and `MacroRulesNormalizedIdent`,
and changes many `rustc_span::symbol::` qualifiers in `compiler/` to
`rustc_span::`. This is a net reduction of 200+ lines of code, mostly
because many files with two `use rustc_span` items can be reduced to
one.
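For example:
```rust
// Before (two `use rustc_span` items were needed):
//     use rustc_span::symbol::{Ident, kw};
//     use rustc_span::{Symbol, sym};
//
// After (the re-exports make a single item enough):
use rustc_span::{Ident, Symbol, kw, sym};
```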
- Move it to `rustc_parse`, which is the only crate that uses it. This
lets us remove all the `pub` markers from it.
- Change `next_ref` and `look_ahead` to `get` and `bump`, which work
better for the `rustc_parse` uses (see the sketch after this list).
- This requires adding a `TokenStream::get` method, which is simple.
- In `TokenCursor`, we currently duplicate the
`DelimSpan`/`DelimSpacing`/`Delimiter` from the surrounding
`TokenTree::Delimited` in the stack. This isn't necessary so long as
we don't prematurely move past the `Delimited`, and avoiding the
duplication is a small perf win on a very hot code path.
- In `parse_token_tree`, we clone the relevant `TokenTree::Delimited`
instead of constructing an identical one from pieces.
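A rough sketch of the `get`/`bump` pairing on a simple cursor (stand-in
types; the real cursor walks a `TokenStream` of token trees):
```rust
struct Cursor<'a, T> {
    items: &'a [T],
    index: usize,
}

impl<'a, T> Cursor<'a, T> {
    // Peek at the current element without advancing.
    fn get(&self) -> Option<&'a T> {
        self.items.get(self.index)
    }

    // Advance to the next element.
    fn bump(&mut self) {
        self.index += 1;
    }
}

fn main() {
    let tokens = ["fn", "main", "(", ")"];
    let mut cursor = Cursor { items: &tokens, index: 0 };
    while let Some(tok) = cursor.get() {
        println!("{tok}");
        cursor.bump();
    }
}
```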
Pasted metavariables are wrapped in invisible delimiters, which
pretty-print as empty strings, and changing that can break some proc
macros. But error messages saying "expected identifier, found ``" are
bad. So this commit adds support for metavariables in `TokenDescription`
so they print as "metavariable" in error messages, instead of "``".
It's not used meaningfully yet, but will be needed to get rid of
interpolated tokens.
`to_lowercase` allocates, but `eq_ignore_ascii_case` doesn't. This path
is hot enough that this makes a small but noticeable difference in
benchmarking.
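For instance (both comparisons agree on ASCII input, but only the first
allocates):
```rust
fn slow(a: &str, b: &str) -> bool {
    // Allocates two fresh `String`s on every call.
    a.to_lowercase() == b.to_lowercase()
}

fn fast(a: &str, b: &str) -> bool {
    // Compares in place, with no allocation.
    a.eq_ignore_ascii_case(b)
}

fn main() {
    assert_eq!(slow("SeLF", "self"), fast("SeLF", "self"));
}
```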
By using `token_descr`, as is done for many other errors, we can get
slightly better descriptions in error messages, e.g.
"macro expansion ignores token `let` and any following" becomes
"macro expansion ignores keyword `let` and any tokens following".
This will be more important once invisible delimiters start being
mentioned in error messages -- without this commit, that leads to error
messages such as "error at ``" because invisible delimiters are
pretty printed as an empty string.
Fix `break_last_token`.
It currently doesn't handle the three-char tokens `>>=` and `<<=` correctly. These can be broken twice, resulting in three individual tokens. This is a latent bug that currently doesn't cause any problems, but does cause problems for #124141, because that PR increases the usage of lazy token streams.
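A simplified view of the breaking step (string stand-ins and a hypothetical
helper name; the real code works on `TokenKind` values):
```rust
// Breaking splits off the first character of a compound token and leaves the
// rest behind; `>>=` and `<<=` can be broken twice, yielding three tokens.
fn break_first_char(tok: &str) -> Option<(&str, &str)> {
    match tok {
        ">>=" => Some((">", ">=")),
        "<<=" => Some(("<", "<=")),
        ">=" => Some((">", "=")),
        "<=" => Some(("<", "=")),
        ">>" => Some((">", ">")),
        "<<" => Some(("<", "<")),
        _ => None,
    }
}

fn main() {
    let (a, rest) = break_first_char(">>=").unwrap();
    let (b, c) = break_first_char(rest).unwrap();
    assert_eq!((a, b, c), (">", ">", "="));
}
```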
r? `@petrochenkov`
Fix double handling in `collect_tokens`
Double handling of AST nodes can occur in `collect_tokens`. This is when an inner call to `collect_tokens` produces an AST node, and then an outer call to `collect_tokens` produces the same AST node. This can happen in a few places, e.g. expression statements where the statement delegates `HasTokens` and `HasAttrs` to the expression. It will also happen more after #124141.
This PR fixes some double handling cases that cause problems, including #129166.
r? `@petrochenkov`
By keeping track of attributes that have been previously processed.
This fixes the `macro-rules-derive-cfg.stdout` test, and is necessary
for #124141 which removes nonterminals.
Also shrink the `SmallVec` inline size used in `IntervalSet`. An inline
size of 2 gives slightly better perf than 4 now that there's an `IntervalSet` in
`Parser`, which is cloned reasonably often.
This commit does the following.
- Renames `collect_tokens_trailing_token` as `collect_tokens`, because
(a) it's annoyingly long, and (b) the `_trailing_token` bit is less
accurate now that its types have changed.
- In `collect_tokens`, adds an `Option<CollectPos>` argument and a
`UsePreAttrPos` in the return type of `f`. These are used in
`parse_expr_force_collect` (for vanilla expressions) and in
`parse_stmt_without_recovery` (for two different cases of expression
statements). Together these are enough to fix all the problems
with token collection and assoc expressions. The changes to the
`stringify.rs` test demonstrate some of these.
- Adds a new test. The code in this test was causing an assertion
failure prior to this commit, due to an invalid `NodeRange`.
The extra complexity is annoying, but necessary to fix the existing
problems.
This pre-existing type is suitable for use with the return value of the
`f` parameter in `collect_tokens_trailing_token`. The more descriptive
name will be useful because the next commit will add another boolean
value to the return value of `f`.
Fix bug in `Parser::look_ahead`.
The special case was failing to handle invisible delimiters on one path.
Fixes (but doesn't close until beta backported) #128895.
r? `@davidtwco`
When collecting tokens there are two kinds of range:
- a range relative to the parser's full token stream (which we get when
we are parsing);
- a range relative to a single AST node's token stream (which we use
within `LazyAttrTokenStreamImpl` when replacing tokens).
These are currently both represented with `Range<u32>` and it's easy to
mix them up -- until now I hadn't properly understood the difference.
This commit introduces `ParserRange` and `NodeRange` to distinguish
them. This also requires splitting `ReplaceRange` in two, giving the new
types `ParserReplacement` and `NodeReplacement`. (These latter two names
reduce the overloading of the word "range".)
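In sketch form, the new types are just newtypes over `Range<u32>`, so mixing
them up becomes a type error (illustrative; the real definitions carry more
impls):
```rust
use std::ops::Range;

/// A range of token positions relative to the parser's full token stream.
#[allow(dead_code)]
struct ParserRange(Range<u32>);

/// A range of token positions relative to a single AST node's token stream.
#[allow(dead_code)]
struct NodeRange(Range<u32>);

fn main() {
    let _parser_relative = ParserRange(10..15);
    let _node_relative = NodeRange(0..5);
}
```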
The commit also rewrites some comments to be clearer.
The end result is a little more verbose, but much clearer.
Adding details, clarifying lots of little things, etc. In particular,
the commit works through an example in detail. I find this very helpful, because
it's taken me a long time to understand how this code works.