36 Commits

Author SHA1 Message Date
Sage Mitchell
2b328ea5ee Address feedback from PR #101401 2022-09-04 08:07:53 -07:00
Sage Mitchell
4a3e169da7 Make char::is_lowercase and char::is_uppercase const
Implements #101400.
2022-09-04 08:07:53 -07:00
bors
ce36e88256 Auto merge of #100497 - kadiwa4:remove_clone_into_iter, r=cjgillot
Avoid cloning a collection only to iterate over it

`@rustbot` label: +C-cleanup
2022-08-28 18:31:08 +00:00
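A minimal sketch of the kind of cleanup this describes, assuming a hypothetical helper `total_len` that is not taken from the repository. Borrowing the collection yields references, so no copy is allocated just to drive the loop.

```rust
/// Sums the lengths of all names by borrowing the slice.
/// `total_len` is a hypothetical helper used only for illustration.
pub fn total_len(names: &[String]) -> usize {
    let mut total = 0;
    // A pre-cleanup version might have looped over `names.to_vec()` or
    // `names.clone()`, copying every `String` only to read it once.
    // Iterating over the borrowed slice yields `&String` items instead.
    for name in names {
        total += name.len();
    }
    total
}
```

Calling `total_len(&names)` leaves `names` usable afterwards, since nothing was moved or copied.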
Yuki Okushi
e31bedc9cf Rollup merge of #100924 - est31:closure_to_fn_ptr, r=Mark-Simulacrum
Smaller improvements of tidy and the unicode generator
2022-08-27 13:14:19 +09:00
est31
754b3e7567 Change hint to correct path 2022-08-23 19:06:27 +02:00
est31
0a6af989f6 Simplify unicode_downloads.rs
Reduce duplication by moving fetching logic into a dedicated function.
2022-08-23 19:04:07 +02:00
KaDiWa
4eebcb9910 avoid cloning and then iterating 2022-08-13 16:16:52 +02:00
Bruce A. MacNaughton
5d048eb69d add #inline 2022-07-20 16:13:54 -07:00
Bruce A. MacNaughton
89ace470dc formatted 2022-07-19 18:03:18 -07:00
Bruce A. MacNaughton
d4819632e2 working updates 2022-07-19 17:35:19 -07:00
T-O-R-U-S
72a25d05bf Use implicit capture syntax in format_args
This updates the standard library's documentation to use the new syntax. The
documentation is worthwhile to update as it should be more idiomatic
(particularly for features like this, which are nice for users to get acquainted
with). The general codebase is likely more hassle than benefit to update: it'll
hurt git blame, and generally updates can be done by folks updating the code if
(and when) that makes things more readable with the new format.

A few places in the compiler and library code are updated (mostly just due to
already having been done when this commit was first authored).
2022-03-10 10:23:40 -05:00
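For illustration, the difference between the older positional style and implicit capture might look like the sketch below. The function `greeting` and its parameters are hypothetical and not taken from the updated documentation.

```rust
/// Formats a greeting twice to compare both styles, then returns a
/// right-aligned label. `greeting` is a hypothetical example function.
pub fn greeting(name: &str, width: usize) -> String {
    // Older style: the argument is passed positionally.
    let explicit = format!("Hello, {}!", name);
    // Implicit capture: the identifier in scope is named inside the braces.
    let implicit = format!("Hello, {name}!");
    assert_eq!(explicit, implicit);
    // Captured identifiers can also supply format parameters such as width.
    format!("[{name:>width$}]")
}
```

Only plain identifiers can be captured this way; expressions such as `{name.len()}` still need an explicit argument.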
Josh Stone
6b0b417299 Let unicode-table-generator fail gracefully for bitsets
The "Alphabetic" property in Unicode 14 grew too big for the bitset
representation, panicking "cannot pack 264 into 8 bits". However, we
were already choosing the skiplist for that anyway, so this doesn't need
to be a hard failure. That panic is now a returned `Err`, and then in
`emit_codepoints` we automatically defer to skiplist.
2021-10-06 17:35:49 -07:00
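The change from a panic to a returned `Err` with a fallback can be sketched roughly as follows. All names here (`pack_index`, `Encoding`, `choose_encoding`) are hypothetical stand-ins rather than the generator's real API, and the sketch ignores the size comparison the generator performs between encodings.

```rust
/// Hypothetical stand-in for the bitset packing step: returns an `Err`
/// instead of panicking when an index does not fit into 8 bits.
pub fn pack_index(index: usize) -> Result<u8, String> {
    if index <= u8::MAX as usize {
        Ok(index as u8)
    } else {
        Err(format!("cannot pack {index} into 8 bits"))
    }
}

/// Hypothetical result of choosing an encoding for one property.
#[derive(Debug, PartialEq)]
pub enum Encoding {
    Bitset(Vec<u8>),
    Skiplist,
}

/// Tries the bitset representation first and defers to the skiplist
/// when any index fails to pack.
pub fn choose_encoding(indices: &[usize]) -> Encoding {
    let packed: Result<Vec<u8>, String> =
        indices.iter().map(|&i| pack_index(i)).collect();
    match packed {
        Ok(bytes) => Encoding::Bitset(bytes),
        Err(_) => Encoding::Skiplist,
    }
}
```

Collecting into `Result<Vec<u8>, String>` stops at the first failing index, so one oversized value is enough to select the skiplist.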
Josh Stone
e159d42a9a Redo #81358 in unicode-table-generator 2021-10-06 15:45:17 -07:00
Mark Rousskov
c746be2219 Migrate to 2021 2021-09-20 22:21:42 -04:00
Jade
3cf820e17d rfc3052: Remove authors field from Cargo manifests
Since RFC 3052 soft deprecated the authors field anyway, hiding it from
crates.io, docs.rs, and making Cargo not add it by default, and it is
not generally up to date/useful information, we should remove it from
crates in this repo.
2021-07-29 14:56:05 -07:00
Matthias Krüger
ba6b4274b5 unicode_table_generator: fix clippy::writeln_empty_string, clippy::useless_format, clippy::for_kv_map 2020-08-24 00:43:50 +02:00
Izzy Swart
b809f453ca Fix typo "biset" -> "bitset" 2020-08-06 16:13:29 -07:00
mark
2c31b45ae8 mv std libs to library/ 2020-07-27 19:51:13 -05:00
Lzu Tao
fff822fead Migrate to numeric associated consts 2020-06-10 01:35:47 +00:00
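As a small illustration of this migration, the sketch below uses the associated constant `u32::MAX` where older code would have written the module-level `std::u32::MAX`. The function `saturating_double` is hypothetical.

```rust
/// Doubles `x`, clamping at the largest `u32`.
/// `saturating_double` is a hypothetical example function.
pub fn saturating_double(x: u32) -> u32 {
    // Older spelling: `std::u32::MAX` (a constant in the `std::u32` module).
    // Newer spelling: `u32::MAX`, an associated constant on the type itself.
    if x > u32::MAX / 2 {
        u32::MAX
    } else {
        x * 2
    }
}
```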
Pyfisch
7f4048c710 Store UNICODE_VERSION as a tuple
Remove the UnicodeVersion struct containing
major, minor and update fields and replace it with
a 3-tuple containing the version number.
As the value of each field is limited to 255
use u8 to store them.
2020-04-11 12:56:25 +02:00
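With the tuple representation, a consumer can destructure the version directly. The sketch below assumes a toolchain where `char::UNICODE_VERSION` is available as an associated constant; the helper `unicode_version_string` is hypothetical.

```rust
/// Renders the Unicode version supported by `char` as "major.minor.update".
/// `unicode_version_string` is a hypothetical helper.
pub fn unicode_version_string() -> String {
    // The constant is a 3-tuple of `u8`, so it destructures in one step.
    let (major, minor, update): (u8, u8, u8) = char::UNICODE_VERSION;
    format!("{major}.{minor}.{update}")
}
```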
Mark Rousskov
ad679a7f43 Update the documentation comment 2020-03-27 19:02:23 -04:00
Mark Rousskov
b6bc906004 Remove separate encoding for a single nonzero-mapping byte
In practice, for the two data sets that still use the bitset encoding (uppercase
and lowercase) this is not a significant win, so just drop it entirely. It costs
us about 5 bytes, and the complexity is nontrivial.
2020-03-27 19:02:23 -04:00
Mark Rousskov
9c1ceece20 Add skip list based implementation for smaller encoding
This arranges for the sparser sets (everything except lower and uppercase) to be
encoded in a significantly smaller context. However, it is also a performance
trade-off (roughly 3x slower than the bitset encoding). The 40% size reduction
is deemed to be sufficiently important to merit this performance loss,
particularly as it is unlikely that this code is hot anywhere (and if it is,
paying the memory cost for a bitset that directly represents the data seems
worthwhile).

Alphabetic     : 1599 bytes     (- 937 bytes)
Case_Ignorable : 949 bytes      (- 822 bytes)
Cased          : 359 bytes      (- 429 bytes)
Cc             : 9 bytes        (-  15 bytes)
Grapheme_Extend: 813 bytes      (- 675 bytes)
Lowercase      : 863 bytes
N              : 419 bytes      (- 619 bytes)
Uppercase      : 776 bytes
White_Space    : 37 bytes       (-  46 bytes)
Total table sizes: 5824 bytes   (-3543 bytes)
2020-03-27 19:02:23 -04:00
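To give a feel for why a run-based encoding trades lookup speed for size, here is a deliberately simplified sketch. It is not the generator's actual skip list format: it assumes a hypothetical `runs` slice of alternating run lengths, starting with a run of code points outside the set, and scans it linearly.

```rust
/// Returns whether `needle` falls inside the set described by `runs`.
/// `runs` holds alternating run lengths: even positions count code points
/// outside the set, odd positions count code points inside it.
/// This layout is an assumption for illustration only.
pub fn contains(runs: &[u32], needle: u32) -> bool {
    let mut end = 0u32; // exclusive end of the run being examined
    for (i, &len) in runs.iter().enumerate() {
        end += len;
        if needle < end {
            // Odd-indexed runs are the "inside the set" runs.
            return i % 2 == 1;
        }
    }
    // Everything past the last run is outside the set.
    false
}
```

For example, `[9, 5, 18, 1]` describes the set containing 9 through 13 and 32 using four numbers. A bitset answers the same query with a single indexed load, which is the speed the commit message says is being given up.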
Mark Rousskov
33b9e6f5cf Add richer printing 2020-03-24 16:24:47 -04:00
Mark Rousskov
af243d4d91 Avoid relying on const parameters to function
LLVM seems to at least sometimes optimize better when the length comes directly
from the `len()` of the array vs. an equivalent integer.

Also, this allows easier copy/pasting of the function into compiler explorer for
experimentation.
2020-03-21 18:01:50 -04:00
Mark Rousskov
a7ec6f8fe0 Arrange for zero to be canonical
We find that it is common for large ranges of chars to be false -- and that
means that it is plausibly common for us to ask about a word that is entirely
empty. Therefore, we should make sure that we do not need to rotate bits or
otherwise perform some operation to map to the zero word; canonicalize it first
if possible.
2020-03-21 17:53:18 -04:00
Mark Rousskov
233ab2f168 Push the byte of LAST_CHUNK_MAP into the array
This optimizes slightly better.

Alphabetic     : 2536 bytes
Case_Ignorable : 1771 bytes
Cased          : 788 bytes
Cc             : 24 bytes
Grapheme_Extend: 1488 bytes
Lowercase      : 863 bytes
N              : 1038 bytes
Uppercase      : 776 bytes
White_Space    : 83 bytes
Total table sizes: 9367 bytes  (-18 bytes; 2 bytes per set)
2020-03-21 17:51:40 -04:00
Mark Rousskov
5f71d98f90 Deduplicate test and primary range_search definitions
This ensures that what we test is what we get for final results as well.
2020-03-21 15:21:31 -04:00
Mark Rousskov
7b29b70d6e Add a right shift mapping
This saves far fewer bytes, and is likely not the best operator to choose.
But for now, it works -- a better choice may arise later.

Alphabetic     : 2538 bytes   (- 84 bytes)
Case_Ignorable : 1773 bytes   (- 30 bytes)
Cased          : 790 bytes    (- 18 bytes)
Cc             : 26 bytes     (-  6 bytes)
Grapheme_Extend: 1490 bytes   (- 18 bytes)
Lowercase      : 865 bytes    (- 36 bytes)
N              : 1040 bytes   (- 24 bytes)
Uppercase      : 778 bytes    (- 60 bytes)
White_Space    : 85 bytes     (-  6 bytes)
Total table sizes: 9385 bytes (-282 bytes)
2020-03-21 12:14:26 -04:00
Mark Rousskov
b0e121d9d5 Shrink bitset words through functional mapping
Previously, all words in the (deduplicated) bitset would be stored raw -- a full
64 bits (8 bytes). Now, those words that are equivalent to others through a
specific mapping are stored separately and "mapped" to the original when
loading; this shrinks the table sizes significantly, as each mapped word is
stored in 2 bytes (a 4x decrease from the previous).

The new encoding is also potentially non-optimal: the "mapped" byte is
frequently repeated, as in practice many mapped words use the same base word.

Currently we only support two forms of mapping: rotation and inversion. Note
that these are both guaranteed to map transitively if at all, and supporting
mappings for which this is not true may require a more interesting algorithm for
choosing the optimal pairing.

Updated sizes:

Alphabetic     : 2622 bytes     (-  414 bytes)
Case_Ignorable : 1803 bytes     (-  330 bytes)
Cased          : 808 bytes      (-  126 bytes)
Cc             : 32 bytes
Grapheme_Extend: 1508 bytes     (-  252 bytes)
Lowercase      : 901 bytes      (-   84 bytes)
N              : 1064 bytes     (-  156 bytes)
Uppercase      : 838 bytes      (-   96 bytes)
White_Space    : 91 bytes       (-    6 bytes)
Total table sizes: 9667 bytes   (-1,464 bytes)
2020-03-21 11:22:00 -04:00
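The two mappings named above, rotation and inversion, can be sketched as a search for a base word that reproduces a given 64-bit word. The types and functions below (`Mapping`, `find_mapping`, `apply`) are hypothetical, and the real generator's choice of base words and its byte layout are not shown.

```rust
/// A way to derive one 64-bit word from a stored base word.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Mapping {
    Rotate(u32),
    Invert,
}

/// Rebuilds the original word from its base, as would happen when loading.
pub fn apply(base: u64, mapping: Mapping) -> u64 {
    match mapping {
        Mapping::Rotate(n) => base.rotate_left(n),
        Mapping::Invert => !base,
    }
}

/// Finds the first `(base index, mapping)` pair that reproduces `word`.
pub fn find_mapping(bases: &[u64], word: u64) -> Option<(usize, Mapping)> {
    for (idx, &base) in bases.iter().enumerate() {
        if !base == word {
            return Some((idx, Mapping::Invert));
        }
        for n in 1..64 {
            if base.rotate_left(n) == word {
                return Some((idx, Mapping::Rotate(n)));
            }
        }
    }
    None
}
```

A word found this way needs only a base index and a mapping descriptor instead of its full 8 bytes, which is where the 2-bytes-per-mapped-word figure in the message comes from.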
Mark Rousskov
6c7691a37b Pre-pop zero chunks before mapping LAST_CHUNK_MAP
This avoids wasting a small amount of space for some of the data sets.

The chunk resizing is caused by but not directly related to changes in this
commit.

Alphabetic     : 3036 bytes
Case_Ignorable : 2133 bytes    (- 3 bytes)
Cased          : 934 bytes
Cc             : 32 bytes
Grapheme_Extend: 1760 bytes    (-14 bytes)
Lowercase      : 985 bytes
N              : 1220 bytes    (- 5 bytes)
Uppercase      : 934 bytes
White_Space    : 97 bytes
Total table sizes: 11131 bytes (-22 bytes)
2020-03-20 18:38:08 -04:00
Mark Rousskov
580a6342ef Generate tests for Unicode property data
Currently the test file takes a while to compile -- 30 seconds or so -- but
since it's not going to be committed, and is just for local testing, that seems
fine.
2020-03-20 12:11:13 -04:00
Mark Rousskov
7c4baedb3a Dynamically choose best chunk size
Try chunk sizes between 1 and 64, selecting the one which minimizes the number
of bytes used. 16, the previous constant, turned out to be a rather good choice,
with 5/9 of the datasets still using it.

Alphabetic     : 3036 bytes    (- 19 bytes)
Case_Ignorable : 2136 bytes
Cased          : 934 bytes
Cc             : 32 bytes      (- 11 bytes)
Grapheme_Extend: 1774 bytes
Lowercase      : 985 bytes
N              : 1225 bytes    (- 41 bytes)
Uppercase      : 934 bytes
White_Space    : 97 bytes      (- 43 bytes)
Total table sizes: 11153 bytes (-114 bytes)
2020-03-20 12:11:13 -04:00
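The search over chunk sizes can be sketched as below, under a simplified and hypothetical cost model: each distinct chunk is stored once at `chunk_size` bytes, plus one byte per chunk position to say which distinct chunk appears there. The generator's real size accounting may differ.

```rust
use std::collections::HashSet;

/// Estimated bytes needed when `word_indices` is split into chunks of
/// `chunk_size` and identical chunks are stored only once.
/// The cost model is an assumption for illustration.
pub fn encoded_size(word_indices: &[u8], chunk_size: usize) -> usize {
    let chunks: Vec<&[u8]> = word_indices.chunks(chunk_size).collect();
    let unique: HashSet<&[u8]> = chunks.iter().copied().collect();
    // Distinct chunks (a short final chunk is counted as if padded),
    // plus a one-byte index per chunk position.
    unique.len() * chunk_size + chunks.len()
}

/// Tries every chunk size from 1 to 64 and returns `(size, bytes)` for the
/// cheapest one; ties go to the smaller chunk size.
pub fn best_chunk_size(word_indices: &[u8]) -> Option<(usize, usize)> {
    (1..=64)
        .map(|size| (size, encoded_size(word_indices, size)))
        .min_by_key(|&(_, bytes)| bytes)
}
```

Because the candidate range is small and fixed, an exhaustive search like this is cheap compared with generating the tables themselves.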
Mark Rousskov
903f67d599 Avoid re-fetching Unicode data
If the unicode-downloads folder already exists, we likely just fetched the data,
so don't make any further network requests. Unicode versions are released rarely
enough that this doesn't matter much in practice.
2020-03-20 12:11:13 -04:00
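The check described here amounts to testing for the download directory before doing any network work. A minimal sketch, assuming a hypothetical `fetch_if_missing` helper that takes the actual download step as a closure:

```rust
use std::path::Path;

/// Runs `fetch` only when `dir` does not exist yet.
/// Returns `true` if a fetch was performed.
/// `fetch_if_missing` is a hypothetical helper, not the generator's API.
pub fn fetch_if_missing(dir: &Path, fetch: impl FnOnce()) -> bool {
    if dir.exists() {
        // The data was most likely downloaded on an earlier run.
        return false;
    }
    fetch();
    true
}
```

With this shape, picking up a new Unicode release means deleting the directory by hand, which fits the observation that releases are rare.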
Matthias Krüger
d3e5177f81 Use .next() instead of .nth(0) on iterators. 2020-03-03 03:15:03 +01:00
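Both calls return the first element of an iterator; `.next()` just states the intent directly. A small sketch with a hypothetical `first_field` helper that takes the leading field of a semicolon-separated line:

```rust
/// Returns the text before the first `;` in `line`.
/// `first_field` is a hypothetical helper used only for illustration.
pub fn first_field(line: &str) -> Option<&str> {
    // Equivalent to `line.split(';').nth(0)`, but clearer about the intent.
    line.split(';').next()
}
```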
Mark Rousskov
064f8885d5 Add unicode table generator 2020-01-14 19:11:15 -05:00