Arrange for zero to be canonical

We find that it is common for large ranges of chars to be false -- and that
means that it is plausibly common for us to ask about a word that is entirely
empty. Therefore, we should make sure that we do not need to rotate bits or
otherwise perform some operation to map to the zero word; canonicalize it first
if possible.
This commit is contained in:
Mark Rousskov
2020-03-21 17:20:57 -04:00
parent 233ab2f168
commit a7ec6f8fe0
2 changed files with 243 additions and 254 deletions

View File

@@ -301,7 +301,21 @@ impl Canonicalized {
Canonicalized(usize),
}
while let Some((&to, _)) = mappings.iter().max_by_key(|m| m.1.len()) {
// Map 0 first, so that it is the first canonical word.
// This is realistically not inefficient because 0 is not mapped to by
// anything else (a shift pattern could do it, but would be wasteful).
//
// However, 0s are quite common in the overall dataset, and it is quite
// wasteful to have to go through a mapping function to determine that
// we have a zero.
//
// FIXME: Experiment with choosing most common words in overall data set
// for canonical when possible.
while let Some((&to, _)) = mappings
.iter()
.find(|(&to, _)| to == 0)
.or_else(|| mappings.iter().max_by_key(|m| m.1.len()))
{
// Get the mapping with the most entries. Currently, no mapping can
// only exist transitively (i.e., there is no A, B, C such that A
// does not map to C and but A maps to B maps to C), so this is