from dataclasses import dataclass

@dataclass
class Example:
    name: str
    mmark: str
    pandoc: str
Convert Hugo mmark LaTeX into Pandoc
I’ve recently migrated from Hugo to Quarto and one of the hardest steps was converting the equations in Hugo’s legacy mmark
format to Quarto. This notebook shows how I converted the equations without changing equations inside code blocks (see fix_tex.py
in my hugo2quarto repository for an executable version of this).
The problem
The (deprecated) version of mmark in Hugo uses an unusual syntax for TeX. It’s not documented (except in the code, e.g. inline math), but some empirical rules for mmark are:
- $$...$$ inside a paragraph starts inline math (even with whitespace surrounding …)
- $$...$$ after a paragraph starts a math block (even with whitespace surrounding …)
- A $ sign not followed by another $ sign is just a normal $ sign (a \$ should also render as a normal $ sign)
- Math isn’t rendered in inline code/code blocks
In Pandoc it’s documented:
Anything between two $ characters will be treated as TeX math. The opening $ must have a non-space character immediately to its right, while the closing $ must have a non-space character immediately to its left, and must not be followed immediately by a digit. Thus, $20,000 and $30,000 won’t parse as math. If for some reason you need to enclose text in literal $ characters, backslash-escape them and they won’t be treated as math delimiters. For display math, use $$ delimiters. (In this case, the delimiters may be separated from the formula by whitespace. However, there can be no blank lines between the opening and closing $$ delimiters.)
In summary:
- $...$ starts inline TeX (no space is allowed directly inside the delimiters)
- $$...$$ starts a math block
- A \$ sign is rendered as a normal $ sign
- Math isn’t rendered in inline code/code blocks
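As a rough illustration of the Pandoc rules summarised above, the sketch below approximates the inline-math delimiter check with a single regular expression. The names PANDOC_INLINE_MATH and find_inline_math are hypothetical, and the regex is only an approximation of the quoted rules, not Pandoc’s actual parser (it ignores code spans, for instance).

```python
import re

# Approximation of the quoted rule (an assumption, not Pandoc's real parser):
#  - the opening $ is not backslash-escaped and is followed by a non-space character
#  - the closing $ is preceded by a non-space character and not followed by a digit
PANDOC_INLINE_MATH = re.compile(r"(?<!\\)\$(?=\S)(.+?)(?<=\S)\$(?!\d)")

def find_inline_math(text: str) -> list[str]:
    """Return the bodies of the spans this approximation treats as inline math."""
    return [m.group(1) for m in PANDOC_INLINE_MATH.finditer(text)]

print(find_inline_math("And $x=2$ here"))       # ['x=2']
print(find_inline_math("$20,000 and $30,000"))  # []
print(find_inline_math("And $ x = 2 $ here"))   # []
```

The last two calls show why mmark text cannot be passed through untouched: prices are not math in Pandoc, and neither is a formula padded with spaces.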
The final script implementing this is in my hugo2quarto repository as fix_tex.py; the rest of this notebook walks through how it was developed.
Tests
The result should be a function that takes mmark code and returns pandoc code.
Since there is a set of rules, the best way to check the implementation is with some examples. Each Example will have a descriptive name, the mmark input and the expected pandoc output.
We’ll generate a bunch of examples that satisfy the above rules.
Sometimes there are multiple possibilities, like with $20,000 to $30,000, but we will just pick a simple rule to transform them (escaping every $).
There’s a bunch of other cases we won’t check (like indented code blocks and HTML blocks) because they don’t occur in the Skeptric code.
# Raw strings (r"...") keep TeX backslashes such as \b and \t literal.
examples = [
    Example("Inline",
            "And $$x=2$$",
            "And $x=2$"),
    Example("Inline Space",
            "And $$ x = 2 $$",
            "And $x = 2$"),
    Example("Block",
            "And\n\n$$x=2$$\n",
            "And\n\n$$x=2$$\n"),
    Example("Block space",
            "And\n\n$$ x = 2 $$\n",
            "And\n\n$$x = 2$$\n"),
    Example("Block multiline",
            r"""
$$\begin{align}
& \text{maximize} && \mathbf{c}^\mathrm{T} \mathbf{x}\\
& \text{subject to} && A \mathbf{x} \le \mathbf{b}, \\
& && \mathbf{x} \ge \mathbf{0}, \\
\end{align}
$$
""",
            r"""
$$\begin{align}
& \text{maximize} && \mathbf{c}^\mathrm{T} \mathbf{x}\\
& \text{subject to} && A \mathbf{x} \le \mathbf{b}, \\
& && \mathbf{x} \ge \mathbf{0}, \\
\end{align}
$$
"""),
    Example("Literal $", "It costs $20,000", r"It costs \$20,000"),
    Example("Two Literal $", "$20,000 to $30,000", r"\$20,000 to \$30,000"),
    Example("Inline code", "And `$x+=1`", "And `$x+=1`"),
    Example("Inline code double $", "As TeX `$$x=2$$`", "As TeX `$$x=2$$`"),
    Example("Inline code with escape", r"And `\$x=2`", r"And `\$x=2`"),
    Example("Fenced code",
            """\n```\n$x+=1\n```\n""",
            """\n```\n$x+=1\n```\n"""),
    Example("Fenced code double $",
            """\n```latex\n$$x==2$$\n```\n""",
            """\n```latex\n$$x==2$$\n```\n"""),
    Example("Indented code blocks",
            "\n" + r" %>% mutate_if(is.character, function(x) gsub('\\$', '\\\\$', x))",
            "\n" + r" %>% mutate_if(is.character, function(x) gsub('\\$', '\\\\$', x))"),
    Example("After intended code blocks",
            "Like so\n $x = 2\nfor $30",
            "Like so\n $x = 2\nfor \\$30"),
]
Check the names are unique
assert len(set([e.name for e in examples])) == len(examples)
Now we can test our examples by checking our transformation function and returning the failures.
def test(f, examples=examples):
    """Yield a record for every example where f(mmark) differs from the expected pandoc."""
    for example in examples:
        data = example.mmark
        result = f(data)
        expected = example.pandoc
        if result != expected:
            yield {'name': example.name, 'data': data, 'result': result, 'expected': expected}
If we return the empty string all tests should fail
assert len(list(test(lambda x: ''))) == len(examples)
A lot of the time the input is unchanged; the identity function will only have a few failures
list(test(lambda x: x))
[{'name': 'Inline',
'data': 'And $$x=2$$',
'result': 'And $$x=2$$',
'expected': 'And $x=2$'},
{'name': 'Inline Space',
'data': 'And $$ x = 2 $$',
'result': 'And $$ x = 2 $$',
'expected': 'And $x = 2$'},
{'name': 'Block space',
'data': 'And\n\n$$ x = 2 $$\n',
'result': 'And\n\n$$ x = 2 $$\n',
'expected': 'And\n\n$$x = 2$$\n'},
{'name': 'Literal $',
'data': 'It costs $20,000',
'result': 'It costs $20,000',
'expected': 'It costs \\$20,000'},
{'name': 'Two Literal $',
'data': '$20,000 to $30,000',
'result': '$20,000 to $30,000',
'expected': '\\$20,000 to \\$30,000'},
{'name': 'After intended code blocks',
'data': 'Like so\n $x = 2\nfor $30',
'result': 'Like so\n $x = 2\nfor $30',
'expected': 'Like so\n $x = 2\nfor \\$30'}]
Strategy
We will use a simple Deterministic Finite Automaton (DFA) to handle the transitions between the different states:
- In default state just yield characters, and look for transitions to other states
- In inline_code or block_code just yield characters until the end of the code
- In inline_math or block_math transform the delimiters and strip surrounding whitespace, leaving the rest of the input unchanged
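To make the idea concrete before the full implementation, here is a toy two-state automaton. It is only an illustration: ToyMode and shout_outside_code are hypothetical names, and the transformation (upper-casing text outside inline code) stands in for the real delimiter rewriting.

```python
from enum import Enum, auto

class ToyMode(Enum):
    DEFAULT = auto()
    INLINE_CODE = auto()

def shout_outside_code(text: str) -> str:
    """Upper-case everything except the contents of `inline code`."""
    mode = ToyMode.DEFAULT
    out = []
    for ch in text:
        if ch == "`":
            # A backtick toggles between the two states and is emitted unchanged.
            mode = ToyMode.INLINE_CODE if mode is ToyMode.DEFAULT else ToyMode.DEFAULT
            out.append(ch)
        elif mode is ToyMode.DEFAULT:
            out.append(ch.upper())  # transform only in the default state
        else:
            out.append(ch)          # pass code through untouched
    return "".join(out)

print(shout_outside_code("run `ls -l` now"))  # RUN `ls -l` NOW
```

The real converter follows the same shape, with more states and with regular expressions rather than single characters triggering the transitions.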
Why not a parser?
A good solution would be to use one of the many Markdown parsers like Marko, Mistletoe or even Pandoc itself. These can all produce Markdown and can be extended, which would allow us to parse mmark maths.
The problem is that they are all destructive parsers: they don’t preserve things like whitespace, and even an identity parse changes the Markdown significantly. This makes the git diffs much bigger and the results harder to check (and I caught a lot of bugs checking the git diffs).
So we’re forced to write our own.
Implementation
States
We will create a Mode for each state
from enum import Enum, auto

class Mode(Enum):
    DEFAULT = auto()        # Default (paragraph mode)
    INLINE_CODE = auto()    # Inside an inline code
    BLOCK_CODE = auto()     # Inside a code block
    INLINE_MATH = auto()    # Inside inline math
    BLOCK_MATH = auto()     # Inside block math
    INDENTED_CODE = auto()  # Inside an indented code block
Transitions
We transition between the states when we hit certain sequences of tokens.
The below diagram shows the main transitions.
We will capture the transitions in an Action object which has:
- an input_mode where it applies
- a match_re, a regular expression on which to trigger the action
- an output_mode to transition to on match
- an output string to emit on a match, by default the matched string itself
There is also an implicit default action that consumes the next character, stays in the current mode and outputs the consumed character.
import re
from typing import Optional

@dataclass
class Action:
    input_mode: Mode
    match_re: str
    output_mode: Mode
    output: Optional[str] = None

    def __post_init__(self):
        # Compile once so the pattern can be reused at every position
        self.pattern = re.compile(self.match_re)

    def match(self, s: str, idx: int = 0) -> Optional[dict]:
        """Try to match at position idx; return the text to emit and the number of characters consumed."""
        match = self.pattern.match(s, idx)
        if match:
            match_str = match.group(0)
            len_match_str = len(match_str)
            assert len_match_str > 0
            return {'output': self.output or match_str, 'size': len_match_str}
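The match method depends on a compiled pattern accepting a start position: Pattern.match(s, idx) only succeeds if the pattern matches starting exactly at idx, so the string never has to be sliced. A small standalone check of that behaviour (the variable names are just for illustration):

```python
import re

pattern = re.compile(r"\$\$ *")
text = "And $$ x = 2 $$"

# At position 0 the text starts with "And", so there is no match.
print(pattern.match(text, 0))     # None

# At position 4 the pattern matches "$$" plus the trailing space.
m = pattern.match(text, 4)
print(repr(m.group(0)), m.end())  # '$$ ' 7
```

The end position (or equivalently the length of the matched text) tells the caller how far to advance.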
Now the transitions can be defined as a list of Actions
# Order matters: within a mode the first matching action wins.
# Raw strings avoid invalid escape sequences such as "\$" and "\S".
actions = [
    Action(Mode.DEFAULT, "\n```", Mode.BLOCK_CODE),
    Action(Mode.DEFAULT, "`", Mode.INLINE_CODE),
    Action(Mode.DEFAULT, "\n ", Mode.INDENTED_CODE),
    Action(Mode.DEFAULT, r"\n\$\$ *", Mode.BLOCK_MATH, "\n$$"),
    Action(Mode.DEFAULT, r"\$\$ *", Mode.INLINE_MATH, "$"),
    Action(Mode.DEFAULT, r"\$", Mode.DEFAULT, r"\$"),
    Action(Mode.BLOCK_CODE, "```", Mode.DEFAULT),
    Action(Mode.INLINE_CODE, "`", Mode.DEFAULT),
    Action(Mode.INLINE_MATH, r" *\$\$", Mode.DEFAULT, "$"),
    Action(Mode.BLOCK_MATH, r" *\$\$", Mode.DEFAULT, "$$"),
    Action(Mode.INDENTED_CODE, r"\n {,3}\S", Mode.DEFAULT),
]
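The order of the list is significant, because a bare \$ pattern would also match the first character of $$. The following cut-down sketch shows the first-match-wins idea on its own; RULES and first_rule are hypothetical names and not part of the converter.

```python
import re

# Hypothetical, cut-down rule list: (pattern, label). Earlier rules take priority.
RULES = [
    (re.compile(r"\n\$\$"), "block math"),
    (re.compile(r"\$\$"), "inline math"),
    (re.compile(r"\$"), "literal dollar"),
]

def first_rule(text: str, idx: int = 0):
    """Return the label of the first rule matching at idx, or None."""
    for pattern, label in RULES:
        if pattern.match(text, idx):
            return label
    return None

print(first_rule("\n$$x$$"))  # block math
print(first_rule("$$x$$"))    # inline math
print(first_rule("$20"))      # literal dollar
print(first_rule("x"))        # None
```

If the literal-dollar rule were listed first, "$$x$$" would be classified as a literal dollar and every formula would end up escaped.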
Parsing
Now we need to find the matching action and pattern and update the mode and output.
If there is no matching pattern in this mode then we just consume one token and continue.
import logging

def parse(s):
    mode = Mode.DEFAULT
    idx = 0
    output = []

    while idx < len(s):
        logging.debug('Mode: %s, Last output: %s, Next chars: %s' % (mode, output[-1:], s[idx:idx+5].replace('\n', '\\n')))
        last_idx = idx
        for action in actions:
            if action.input_mode != mode:
                continue
            match = action.match(s, idx)
            if match:
                logging.debug('Match: %s' % action)
                mode = action.output_mode
                idx += match['size']
                output.append(match['output'])
                break
        else:
            # No action matched in this mode: emit one character unchanged
            output.append(s[idx])
            idx += 1

        assert idx > last_idx, "Infinite loop"
    return ''.join(output)
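The fallback in parse uses Python’s for ... else construct: the else branch runs only when the loop finishes without hitting break, which here means that no action matched. A minimal standalone illustration (first_even is a hypothetical helper, unrelated to the converter):

```python
def first_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            result = n
            break  # leaving via break skips the else clause
    else:
        # Runs only when the loop finished without a break
        result = None
    return result

print(first_even([3, 5, 8, 9]))  # 8
print(first_even([3, 5, 7]))     # None
```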
Example
Let’s run through an example with logging on to see how it works
logging.getLogger().setLevel('DEBUG')
mmark = examples[1].mmark
mmark
'And $$ x = 2 $$'
parse(mmark)
DEBUG:root:Mode: Mode.DEFAULT, Last output: [], Next chars: And $
DEBUG:root:Mode: Mode.DEFAULT, Last output: ['A'], Next chars: nd $$
DEBUG:root:Mode: Mode.DEFAULT, Last output: ['n'], Next chars: d $$
DEBUG:root:Mode: Mode.DEFAULT, Last output: ['d'], Next chars: $$ x
DEBUG:root:Mode: Mode.DEFAULT, Last output: [' '], Next chars: $$ x
DEBUG:root:Match: Action(input_mode=<Mode.DEFAULT: 1>, match_re='\\$\\$ *', output_mode=<Mode.INLINE_MATH: 4>, output='$')
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['$'], Next chars: x = 2
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['x'], Next chars: = 2
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: [' '], Next chars: = 2 $
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['='], Next chars: 2 $$
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: [' '], Next chars: 2 $$
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['2'], Next chars: $$
DEBUG:root:Match: Action(input_mode=<Mode.INLINE_MATH: 4>, match_re=' *\\$\\$', output_mode=<Mode.DEFAULT: 1>, output='$')
'And $x = 2$'
logging.getLogger().setLevel('INFO')
Run tests
All the tests pass
list(test(parse))
[]
assert not list(test(parse))