Add a simple markdown parser for formatting rustc --explain

Currently, the output of `rustc --explain foo` displays the raw markdown in a
pager. This is acceptable, but using actual formatting makes it easier to
understand.

This patch consists of three major components:

1.  A markdown parser. This is an extremely simple non-backtracking recursive
    implementation that requires normalization of the final token stream
2.  A utility to write the token stream to an output buffer
3.  Configuration within rustc_driver_impl to invoke this combination for
    `--explain`. Like the current implementation, it first attempts to print to
    a pager with a fallback colorized terminal, and standard print as a last
    resort.

    If color is disabled, or if the output does not support it, or if printing
    with color fails, it will write the raw markdown (which matches current
    behavior).

    Pagers known to support color are: `less` (with `-r`), `bat` (aka `catbat`),
    and `delta`.

The markdown parser does not support the entire markdown specification, but
should support the following with reasonable accuracy:

-   Headings, including formatting
-   Comments
-   Code, inline and fenced block (no indented block)
-   Strong, emphasis, and strikethrough formatted text
-   Links, anchor, inline, and reference-style
-   Horizontal rules
-   Unordered and ordered list items, including formatting

This parser and writer should be reusable by other systems if ever needed.
This commit is contained in:
Trevor Gross
2022-12-19 12:09:40 -06:00
parent 8aed93d912
commit 6a1c10bd85
15 changed files with 1408 additions and 19 deletions

View File

@@ -0,0 +1,50 @@
# H1 Heading [with a link][remote-link]
H1 content: **some words in bold** and `so does inline code`
## H2 Heading
H2 content: _some words in italic_
### H3 Heading
H3 content: ~~strikethrough~~ text
#### H4 Heading
H4 content: A [simple link](https://docs.rs) and a [remote-link].
---
A section break was above. We can also do paragraph breaks:
(new paragraph) and unordered lists:
- Item 1 in `code`
- Item 2 in _italics_
Or ordered:
1. Item 1 in **bold**
2. Item 2 with some long lines that should wrap: Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Aenean ac mattis nunc. Phasellus elit quam,
pulvinar ac risus in, dictum vehicula turpis. Vestibulum neque est, accumsan
in cursus sit amet, dictum a nunc. Suspendisse aliquet, lorem eu eleifend
accumsan, magna neque sodales nisi, a aliquet lectus leo eu sem.
---
## Code
Both `inline code` and code blocks are supported:
```rust
/// A rust enum
#[derive(Debug, PartialEq, Clone)]
enum Foo {
/// Start of line
Bar
}
```
[remote-link]: http://docs.rs

View File

@@ -0,0 +1,35 @@
H1 Heading ]8;;http://docs.rs\with a link]8;;\
H1 content: some words in bold and so does inline code
H2 Heading
H2 content: some words in italic
H3 Heading
H3 content: strikethrough text
H4 Heading
H4 content: A ]8;;https://docs.rs\simple link]8;;\ and a ]8;;http://docs.rs\remote-link]8;;\.
--------------------------------------------------------------------------------------------------------------------------------------------
A section break was above. We can also do paragraph breaks:
(new paragraph) and unordered lists:
* Item 1 in code
* Item 2 in italics
Or ordered:
1. Item 1 in bold
2. Item 2 with some long lines that should wrap: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ac mattis nunc. Phasellus
elit quam, pulvinar ac risus in, dictum vehicula turpis. Vestibulum neque est, accumsan in cursus sit amet, dictum a nunc. Suspendisse
aliquet, lorem eu eleifend accumsan, magna neque sodales nisi, a aliquet lectus leo eu sem.
--------------------------------------------------------------------------------------------------------------------------------------------
Code
Both inline code and code blocks are supported:
/// A rust enum
#[derive(Debug, PartialEq, Clone)]
enum Foo {
/// Start of line
Bar
}

View File

@@ -0,0 +1,312 @@
use super::*;
use ParseOpt as PO;
#[test]
fn test_parse_simple() {
let buf = "**abcd** rest";
let (t, r) = parse_simple_pat(buf.as_bytes(), STG, STG, PO::None, MdTree::Strong).unwrap();
assert_eq!(t, MdTree::Strong("abcd"));
assert_eq!(r, b" rest");
// Escaping should fail
let buf = r"**abcd\** rest";
let res = parse_simple_pat(buf.as_bytes(), STG, STG, PO::None, MdTree::Strong);
assert!(res.is_none());
}
#[test]
fn test_parse_comment() {
let opt = PO::TrimNoEsc;
let buf = "<!-- foobar! -->rest";
let (t, r) = parse_simple_pat(buf.as_bytes(), CMT_S, CMT_E, opt, MdTree::Comment).unwrap();
assert_eq!(t, MdTree::Comment("foobar!"));
assert_eq!(r, b"rest");
let buf = r"<!-- foobar! \-->rest";
let (t, r) = parse_simple_pat(buf.as_bytes(), CMT_S, CMT_E, opt, MdTree::Comment).unwrap();
assert_eq!(t, MdTree::Comment(r"foobar! \"));
assert_eq!(r, b"rest");
}
#[test]
fn test_parse_heading() {
let buf1 = "# Top level\nrest";
let (t, r) = parse_heading(buf1.as_bytes()).unwrap();
assert_eq!(t, MdTree::Heading(1, vec![MdTree::PlainText("Top level")].into()));
assert_eq!(r, b"\nrest");
let buf1 = "# Empty";
let (t, r) = parse_heading(buf1.as_bytes()).unwrap();
assert_eq!(t, MdTree::Heading(1, vec![MdTree::PlainText("Empty")].into()));
assert_eq!(r, b"");
// Combo
let buf2 = "### Top `level` _woo_\nrest";
let (t, r) = parse_heading(buf2.as_bytes()).unwrap();
assert_eq!(
t,
MdTree::Heading(
3,
vec![
MdTree::PlainText("Top "),
MdTree::CodeInline("level"),
MdTree::PlainText(" "),
MdTree::Emphasis("woo"),
]
.into()
)
);
assert_eq!(r, b"\nrest");
}
#[test]
fn test_parse_code_inline() {
let buf1 = "`abcd` rest";
let (t, r) = parse_codeinline(buf1.as_bytes()).unwrap();
assert_eq!(t, MdTree::CodeInline("abcd"));
assert_eq!(r, b" rest");
// extra backticks, newline
let buf2 = "```ab\ncd``` rest";
let (t, r) = parse_codeinline(buf2.as_bytes()).unwrap();
assert_eq!(t, MdTree::CodeInline("ab\ncd"));
assert_eq!(r, b" rest");
// test no escaping
let buf3 = r"`abcd\` rest";
let (t, r) = parse_codeinline(buf3.as_bytes()).unwrap();
assert_eq!(t, MdTree::CodeInline(r"abcd\"));
assert_eq!(r, b" rest");
}
#[test]
fn test_parse_code_block() {
let buf1 = "```rust\ncode\ncode\n```\nleftovers";
let (t, r) = parse_codeblock(buf1.as_bytes());
assert_eq!(t, MdTree::CodeBlock { txt: "code\ncode", lang: Some("rust") });
assert_eq!(r, b"\nleftovers");
let buf2 = "`````\ncode\ncode````\n`````\nleftovers";
let (t, r) = parse_codeblock(buf2.as_bytes());
assert_eq!(t, MdTree::CodeBlock { txt: "code\ncode````", lang: None });
assert_eq!(r, b"\nleftovers");
}
#[test]
fn test_parse_link() {
let simple = "[see here](docs.rs) other";
let (t, r) = parse_any_link(simple.as_bytes(), false).unwrap();
assert_eq!(t, MdTree::Link { disp: "see here", link: "docs.rs" });
assert_eq!(r, b" other");
let simple_toplevel = "[see here](docs.rs) other";
let (t, r) = parse_any_link(simple_toplevel.as_bytes(), true).unwrap();
assert_eq!(t, MdTree::Link { disp: "see here", link: "docs.rs" });
assert_eq!(r, b" other");
let reference = "[see here] other";
let (t, r) = parse_any_link(reference.as_bytes(), true).unwrap();
assert_eq!(t, MdTree::RefLink { disp: "see here", id: None });
assert_eq!(r, b" other");
let reference_full = "[see here][docs-rs] other";
let (t, r) = parse_any_link(reference_full.as_bytes(), false).unwrap();
assert_eq!(t, MdTree::RefLink { disp: "see here", id: Some("docs-rs") });
assert_eq!(r, b" other");
let reference_def = "[see here]: docs.rs\nother";
let (t, r) = parse_any_link(reference_def.as_bytes(), true).unwrap();
assert_eq!(t, MdTree::LinkDef { id: "see here", link: "docs.rs" });
assert_eq!(r, b"\nother");
}
const IND1: &str = r"test standard
ind
ind2
not ind";
const IND2: &str = r"test end of stream
1
2
";
const IND3: &str = r"test empty lines
1
2
not ind";
#[test]
fn test_indented_section() {
let (t, r) = get_indented_section(IND1.as_bytes());
assert_eq!(str::from_utf8(t).unwrap(), "test standard\n ind\n ind2");
assert_eq!(str::from_utf8(r).unwrap(), "\nnot ind");
let (txt, rest) = get_indented_section(IND2.as_bytes());
assert_eq!(str::from_utf8(txt).unwrap(), "test end of stream\n 1\n 2");
assert_eq!(str::from_utf8(rest).unwrap(), "\n");
let (txt, rest) = get_indented_section(IND3.as_bytes());
assert_eq!(str::from_utf8(txt).unwrap(), "test empty lines\n 1\n 2");
assert_eq!(str::from_utf8(rest).unwrap(), "\n\nnot ind");
}
const HBT: &str = r"# Heading
content";
#[test]
fn test_heading_breaks() {
let expected = vec![
MdTree::Heading(1, vec![MdTree::PlainText("Heading")].into()),
MdTree::PlainText("content"),
]
.into();
let res = entrypoint(HBT);
assert_eq!(res, expected);
}
const NL1: &str = r"start
end";
const NL2: &str = r"start
end";
const NL3: &str = r"start
end";
#[test]
fn test_newline_breaks() {
let expected =
vec![MdTree::PlainText("start"), MdTree::ParagraphBreak, MdTree::PlainText("end")].into();
for (idx, check) in [NL1, NL2, NL3].iter().enumerate() {
let res = entrypoint(check);
assert_eq!(res, expected, "failed {idx}");
}
}
const WRAP: &str = "plain _italics
italics_";
#[test]
fn test_wrap_pattern() {
let expected = vec![
MdTree::PlainText("plain "),
MdTree::Emphasis("italics"),
MdTree::Emphasis(" "),
MdTree::Emphasis("italics"),
]
.into();
let res = entrypoint(WRAP);
assert_eq!(res, expected);
}
const WRAP_NOTXT: &str = r"_italics_
**bold**";
#[test]
fn test_wrap_notxt() {
let expected =
vec![MdTree::Emphasis("italics"), MdTree::PlainText(" "), MdTree::Strong("bold")].into();
let res = entrypoint(WRAP_NOTXT);
assert_eq!(res, expected);
}
const MIXED_LIST: &str = r"start
- _italics item_
<!-- comment -->
- **bold item**
second line [link1](foobar1)
third line [link2][link-foo]
- :crab:
extra indent
end
[link-foo]: foobar2
";
#[test]
fn test_list() {
let expected = vec![
MdTree::PlainText("start"),
MdTree::ParagraphBreak,
MdTree::UnorderedListItem(vec![MdTree::Emphasis("italics item")].into()),
MdTree::LineBreak,
MdTree::UnorderedListItem(
vec![
MdTree::Strong("bold item"),
MdTree::PlainText(" second line "),
MdTree::Link { disp: "link1", link: "foobar1" },
MdTree::PlainText(" third line "),
MdTree::Link { disp: "link2", link: "foobar2" },
]
.into(),
),
MdTree::LineBreak,
MdTree::UnorderedListItem(
vec![MdTree::PlainText("🦀"), MdTree::PlainText(" extra indent")].into(),
),
MdTree::ParagraphBreak,
MdTree::PlainText("end"),
]
.into();
let res = entrypoint(MIXED_LIST);
assert_eq!(res, expected);
}
const SMOOSHED: &str = r#"
start
### heading
1. ordered item
```rust
println!("Hello, world!");
```
`inline`
``end``
"#;
#[test]
fn test_without_breaks() {
let expected = vec![
MdTree::PlainText("start"),
MdTree::ParagraphBreak,
MdTree::Heading(3, vec![MdTree::PlainText("heading")].into()),
MdTree::OrderedListItem(1, vec![MdTree::PlainText("ordered item")].into()),
MdTree::ParagraphBreak,
MdTree::CodeBlock { txt: r#"println!("Hello, world!");"#, lang: Some("rust") },
MdTree::ParagraphBreak,
MdTree::CodeInline("inline"),
MdTree::PlainText(" "),
MdTree::CodeInline("end"),
]
.into();
let res = entrypoint(SMOOSHED);
assert_eq!(res, expected);
}
const CODE_STARTLINE: &str = r#"
start
`code`
middle
`more code`
end
"#;
#[test]
fn test_code_at_start() {
let expected = vec![
MdTree::PlainText("start"),
MdTree::PlainText(" "),
MdTree::CodeInline("code"),
MdTree::PlainText(" "),
MdTree::PlainText("middle"),
MdTree::PlainText(" "),
MdTree::CodeInline("more code"),
MdTree::PlainText(" "),
MdTree::PlainText("end"),
]
.into();
let res = entrypoint(CODE_STARTLINE);
assert_eq!(res, expected);
}

View File

@@ -0,0 +1,90 @@
use std::io::BufWriter;
use std::path::PathBuf;
use termcolor::{BufferWriter, ColorChoice};
use super::*;
use crate::markdown::MdStream;
const INPUT: &str = include_str!("input.md");
const OUTPUT_PATH: &[&str] = &[env!("CARGO_MANIFEST_DIR"), "src","markdown","tests","output.stdout"];
const TEST_WIDTH: usize = 80;
// We try to make some words long to create corner cases
const TXT: &str = r"Lorem ipsum dolor sit amet, consecteturadipiscingelit.
Fusce-id-urna-sollicitudin, pharetra nisl nec, lobortis tellus. In at
metus hendrerit, tincidunteratvel, ultrices turpis. Curabitur_risus_sapien,
porta-sed-nunc-sed, ultricesposuerelacus. Sed porttitor quis
dolor non venenatis. Aliquam ut. ";
const WRAPPED: &str = r"Lorem ipsum dolor sit amet, consecteturadipiscingelit. Fusce-id-urna-
sollicitudin, pharetra nisl nec, lobortis tellus. In at metus hendrerit,
tincidunteratvel, ultrices turpis. Curabitur_risus_sapien, porta-sed-nunc-sed,
ultricesposuerelacus. Sed porttitor quis dolor non venenatis. Aliquam ut. Lorem
ipsum dolor sit amet, consecteturadipiscingelit. Fusce-id-urna-
sollicitudin, pharetra nisl nec, lobortis tellus. In at metus hendrerit,
tincidunteratvel, ultrices turpis. Curabitur_risus_sapien, porta-sed-nunc-
sed, ultricesposuerelacus. Sed porttitor quis dolor non venenatis. Aliquam
ut. Sample link lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet,
consecteturadipiscingelit. Fusce-id-urna-sollicitudin, pharetra nisl nec,
lobortis tellus. In at metus hendrerit, tincidunteratvel, ultrices turpis.
Curabitur_risus_sapien, porta-sed-nunc-sed, ultricesposuerelacus. Sed porttitor
quis dolor non venenatis. Aliquam ut. ";
#[test]
fn test_wrapping_write() {
WIDTH.with(|w| w.set(TEST_WIDTH));
let mut buf = BufWriter::new(Vec::new());
let txt = TXT.replace("-\n","-").replace("_\n","_").replace('\n', " ").replace(" ", "");
write_wrapping(&mut buf, &txt, 0, None).unwrap();
write_wrapping(&mut buf, &txt, 4, None).unwrap();
write_wrapping(
&mut buf,
"Sample link lorem ipsum dolor sit amet. ",
4,
Some("link-address-placeholder"),
)
.unwrap();
write_wrapping(&mut buf, &txt, 0, None).unwrap();
let out = String::from_utf8(buf.into_inner().unwrap()).unwrap();
let out = out
.replace("\x1b\\", "")
.replace('\x1b', "")
.replace("]8;;", "")
.replace("link-address-placeholder", "");
for line in out.lines() {
assert!(line.len() <= TEST_WIDTH, "line length\n'{line}'")
}
assert_eq!(out, WRAPPED);
}
#[test]
fn test_output() {
// Capture `--bless` when run via ./x
let bless = std::env::var("RUSTC_BLESS").unwrap_or_default() == "1";
let ast = MdStream::parse_str(INPUT);
let bufwtr = BufferWriter::stderr(ColorChoice::Always);
let mut buffer = bufwtr.buffer();
ast.write_termcolor_buf(&mut buffer).unwrap();
let mut blessed = PathBuf::new();
blessed.extend(OUTPUT_PATH);
if bless {
std::fs::write(&blessed, buffer.into_inner()).unwrap();
eprintln!("blessed output at {}", blessed.display());
} else {
let output = buffer.into_inner();
if std::fs::read(blessed).unwrap() != output {
// hack: I don't know any way to write bytes to the captured stdout
// that cargo test uses
let mut out = std::io::stdout();
out.write_all(b"\n\nMarkdown output did not match. Expected:\n").unwrap();
out.write_all(&output).unwrap();
out.write_all(b"\n\n").unwrap();
panic!("markdown output mismatch");
}
}
}