Roman Numerals with Python and Regular Expressions

python

Published

January 24, 2024

We’re going convert integers to roman numerals in standard form and back again in Python, as well as detect them with regular expressions.

Roman Numerals use symbols for different values of thousands, hundreds, tens and units. For each of these they have a special symbol for 5 and use a subtractive notation for 4 and 9:

	Thousands	Hundreds	Tens	Units
1	M	C	X	I
2	MM	CC	XX	II
3	MMM	CCC	XXX	III
4		CD	XL	IV
5		D	L	V
6		DC	LX	VI
7		DCC	LXX	VII
8		DCCC	LXXX	VIII
9		CM	XC	IX

They are always written from largest to smallest for example:

MCMXLVIII = M + CM + XL + V + III = 1000 + 900 + 40 + 5 + 3 = 1948
MMMCMXCIX = MMM + CM + XC + IX = 3000 + 900 + 90 + 9 = 3999

We can easily convert a number to Roman Numerals by breaking it into each of these pieces from largest to smallest, being careful to include the subtractive values:

ROMAN_NUMERALS = dict(
    M=1000,
    CM=900,
    D=500,
    CD=400,
    C=100,
    XC=90,
    L=50,
    XL=40,
    X=10,
    IX=9,
    V=5,
    IV=4,
    I=1,
)

def int_to_roman_numeral(n: int) -> str:
    if not (0 <= n < 4000 and int(n) == n):
        raise ValueError("Expected an integer between 0 and 3999")
    ans = []
    for numeral, base in ROMAN_NUMERALS.items():
        count, n = divmod(n, base)
        ans += count * numeral
    assert n == 0
    return ''.join(ans)

', '.join([int_to_roman_numeral(i) for i in range(1, 25)])

'I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI, XXII, XXIII, XXIV'

It agrees with our calculations before, and we have assigned 0 to an empty string.

assert int_to_roman_numeral(1948) == 'MCMXLVIII'
assert int_to_roman_numeral(3999) == 'MMMCMXCIX'
assert int_to_roman_numeral(0) == ''

We can convert a Roman Numeral back into an integer by converting the subtractive values into an appropriate number of units and then add the values together:

from collections import Counter

def roman_numeral_to_int(numeral: str) -> int:
    """Expand roman numerals"""
    numeral_expanded = (
        numeral
        .replace('CM', 'C' * 9)
        .replace('CD', 'C' * 4)
        .replace('XC', 'X' * 9)
        .replace('XL', 'X' * 4)
        .replace('IX', 'I' * 9)
        .replace('IV', 'I' * 4)
    )
    return sum([count * ROMAN_NUMERALS[n] for 
                n, count in Counter(numeral_expanded).items()])

assert roman_numeral_to_int("") == 0
assert roman_numeral_to_int("MCMXLVIII") == 1948
assert roman_numeral_to_int("MMMCMXCIX") == 3999

We can check that for all valid integers if we convert to a Roman Numeral and back again we get back our input:

for i in range(4000):
    assert roman_numeral_to_int(int_to_roman_numeral(i)) == i

However our roman_numeral_to_int will work on invalid roman numerals:

roman_numeral_to_int("IM")

We can write a simple regular expression to check a Roman Numeral

import re

roman_numeral_re = re.compile("""
(M{0,3})         # Thousands
(CM|CD|D?C{0,3}) # Hundreds
(XC|XL|L?X{0,3}) # Tens
(IX|IV|V?I{0,3}) # Units
$""", re.VERBOSE
)

def is_roman_numeral(x):
    return roman_numeral_re.match(x) is not None

assert is_roman_numeral("")
assert is_roman_numeral("I")
assert is_roman_numeral("MCMXLVIII")
assert is_roman_numeral("MMMCMXCIX")
assert not is_roman_numeral("IM")

And we can check all of our generated Roman Numerals are valid:

for i in range(4000):
    assert is_roman_numeral(int_to_roman_numeral(i))

But how can we be sure our regular expression doesn’t match anything else?

Regular expression generation

We gen generate all the possible things that match this finite regular expression by following all transitions in the corresponding automata. While we could write our own recursive descent parser we’ll use Python’s private regular expression parser in Python 3.11 (the API may break across versions).

import re._constants as sre
import re._parser as sre_parse

Let’s start with a simpler example:

sre_parse.parse('AB')

[(LITERAL, 65), (LITERAL, 66)]

The parse returns a list of items which are to be concatenated together. In this case each of the literals has a character code.

In general each subexpression could return multiple results, so we return a list of all things that match. For a literal expression this is a list containing the single character:

def generate_re_literal(arg) -> list[str]:
    return [chr(arg)]

We’ll have a handler for each kind of expression that was parsed, starting with literal

from typing import Any, Callable

GENERATE_RE_HANDLERS: dict[int, Callable[Any, list[str]]] = {
    sre.LITERAL: generate_re_literal
}

Then to generat all regular expressions we generate all possible combinations of the subitems (with itertools.product), and concatenate them together with (''.join):

import itertools

def generate_all_re(s, flags=None) -> list[str]:
    if isinstance(s, re._parser.SubPattern):
        parse = s
    elif isinstance(s, re.Pattern):
        if flags is None:
            flags = s.flags
        parse = sre_parse.parse(s.pattern, flags)
    elif isinstance(s, str):
        if flags is None:
            flags = re.NOFLAG
        parse = sre_parse.parse(s, flags)
    else:
        raise ValueError("Unknown type %s" % type(s))
    
    generations = [GENERATE_RE_HANDLERS[node](args) for node, args in parse]
    return [''.join(items) for items in itertools.product(*generations)]

Then we can check our simple example:

assert generate_all_re('AB') == ['AB']

Branches

A branch will generate either the pattern on the left or the pattern on the right:

sre_parse.parse('A|BC')

[(BRANCH, (None, [[(LITERAL, 65)], [(LITERAL, 66), (LITERAL, 67)]]))]

To handle this we need to get the generations from each branch and flatten them into a single list:

def generate_branch(args) -> str:
    _, branches = args
    return [generation for branch in branches for generation in generate_all_re(branch)]

GENERATE_RE_HANDLERS[sre.BRANCH] = generate_branch

assert generate_all_re('A|BC') == ['A', 'BC']

Repeats

A finite repeat allows the same subexpression a specified number of times:

sre_parse.parse('M{0,3}')

[(MAX_REPEAT, (0, 3, [(LITERAL, 77)]))]

To handle this we generate the subexpression and then return each repetition of the generations

def generate_re_max_repeat(args):
    min_repeat, max_repeat, pattern = args
    generated = generate_all_re(pattern)
    return [g * n_repeat for g in generated for n_repeat in range(min_repeat, max_repeat + 1)]

GENERATE_RE_HANDLERS[sre.MAX_REPEAT] = generate_re_max_repeat

assert generate_all_re('M{0,3}') == ['', 'M', 'MM', 'MMM']

Groups

For our purpose we can simply ignore groups; they’re just used for parsing

sre_parse.parse('(M)')

[(SUBPATTERN, (1, 0, 0, [(LITERAL, 77)]))]

def generate_subpattern(args):
    _, _, _, pattern = args
    return generate_all_re(pattern)

GENERATE_RE_HANDLERS[sre.SUBPATTERN] = generate_subpattern

assert generate_all_re('(M)') == ['M']

End token

For generating the end token we will just generate the empty string:

sre_parse.parse('A$')

[(LITERAL, 65), (AT, AT_END)]

def generate_at(args):
    return ['']

GENERATE_RE_HANDLERS[sre.AT] = generate_at

assert generate_all_re('A$') == ['A']

Generating the Roman Numerals

We now have everything we need to generate all the Roman Numerals from the regular expression. We can check there are 4000 all distinct:

roman_numerals = generate_all_re(roman_numeral_re)
assert len(roman_numerals) == 4000
assert len(set(roman_numerals)) == 4000

roman_numerals[:5]

['CMXCIX', 'CMXCIV', 'CMXC', 'CMXCI', 'CMXCII']

They should all match the regular expression:

for numeral in roman_numerals:
    assert is_roman_numeral(numeral)

And if we convert them to integers and back again we get the same result:

for numeral in roman_numerals:
    assert int_to_roman_numeral(roman_numeral_to_int(numeral)) == numeral

Non-empty Roman Numerals

What if we just want to find the non-empty Roman Numerals? We can modify our regular expression to exclude the case where all the parts are empty:

nonempty_roman_numeral_re = re.compile("""
(?:
(?:M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))    #  Has a unit
| (?:M{0,3}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3}))                  # OR Has a ten
| (?:M{0,3}(CM|C?D|D?C{1,3}))                                  # OR Has a hundred
| M{1,3}                                                       # OR Has a thousand
)$""", flags=re.VERBOSE)

assert not nonempty_roman_numeral_re.match('')
assert not nonempty_roman_numeral_re.match('VX')
assert nonempty_roman_numeral_re.match('X')
assert nonempty_roman_numeral_re.match('MCMXLVIII')
assert nonempty_roman_numeral_re.match('MMMCMXCIX')

This matches precisely the set of Roman Numerals from 1 to 3999:

nonempty_roman_numerals = generate_all_re(nonempty_roman_numeral_re)

assert len(nonempty_roman_numerals) == 3999
assert len(set(nonempty_roman_numerals)) == 3999
assert set(nonempty_roman_numerals).symmetric_difference(roman_numerals) == {''}

Composing Regular Expressions

The regular expression above is starting to get a bit gnarly. We can make it easier to write using Python classes and magic methods. To start with let’s create a function that wraps a regular expression in a non-capturing group:

def group_re(re: str) -> str:
    return "(?:" + re + ")"

assert group_re("a") == "(?:a)"

Then we can create a class R (for Regex) where we can:

Concatenate by multiplying objects (A * B)
Branch by OR-ing objects (A | B)
Having 0 or 1 matches with maybe: A.maybe()
Having a specified number of matches with repeat: A.repeat(0,3)

class R:
    def __init__(self, pattern: str):
        self.pattern = pattern

    def __mul__(self, other):
        if isinstance(other, str): other = R(other)
        return R(group_re(self.pattern) + group_re(other.pattern))

    def __rmul__(self, other):
        if isinstance(other, str): other = R(other)
        return other * self

    def __or__(self, other):
        if isinstance(other, str): other = R(other)
        return R(group_re(self.pattern) + "|" + group_re(other.pattern))

    def __ror__(self, other):
        if isinstance(other, str): other = R(other)
        return other | self

    def __eq__(self, other):
        return self.pattern == other.pattern

    def __repr__(self):
        return f'R({self.pattern})'

    def maybe(self):
        return R(group_re(self.pattern) + "?")

    def repeat(self, min, max):
        return R(group_re(self.pattern) + "{%s,%s}" % (min, max))

    def match(self, s, flags=re.NOFLAG):
        return re.match(group_re(self.pattern) + "$", s, flags)

    def finditer(self, s, flags=re.NOFLAG):
        return re.finditer(self.pattern, s, flags)

Using this notation we can then make regular expressions for a roman numeral between 0 and 9

unit = "IX" | (R("I").maybe() * "V") | (R("V").maybe() * R("I").repeat(1,3))
unit

R((?:(?:IX)|(?:(?:(?:I)?)(?:V)))|(?:(?:(?:V)?)(?:(?:I){1,3})))

Similarly we can create expressions for tens, hundreds and thousands, then combine them to get the non-empty roman numerals. The resulting expression is an eye-sore because we’re doing a lot of unnecesary grouping:

ten  = "XC" | (R("X").maybe() * "L") | (R("L").maybe() * R("X").repeat(1,3))
hund = "CM" | (R("C").maybe() * "D") | (R("D").maybe() * R("C").repeat(1,3))
thou = R("M").repeat(1,3)

nonempty_roman_numerals_re_2 = unit |  (ten * unit.maybe()) | (hund * ten.maybe() * unit.maybe()) | (thou * hund.maybe() * ten.maybe() * unit.maybe())
nonempty_roman_numerals_re_2

R((?:(?:(?:(?:(?:IX)|(?:(?:(?:I)?)(?:V)))|(?:(?:(?:V)?)(?:(?:I){1,3})))|(?:(?:(?:(?:XC)|(?:(?:(?:X)?)(?:L)))|(?:(?:(?:L)?)(?:(?:X){1,3})))(?:(?:(?:(?:IX)|(?:(?:(?:I)?)(?:V)))|(?:(?:(?:V)?)(?:(?:I){1,3})))?)))|(?:(?:(?:(?:(?:CM)|(?:(?:(?:C)?)(?:D)))|(?:(?:(?:D)?)(?:(?:C){1,3})))(?:(?:(?:(?:XC)|(?:(?:(?:X)?)(?:L)))|(?:(?:(?:L)?)(?:(?:X){1,3})))?))(?:(?:(?:(?:IX)|(?:(?:(?:I)?)(?:V)))|(?:(?:(?:V)?)(?:(?:I){1,3})))?)))|(?:(?:(?:(?:(?:M){1,3})(?:(?:(?:(?:CM)|(?:(?:(?:C)?)(?:D)))|(?:(?:(?:D)?)(?:(?:C){1,3})))?))(?:(?:(?:(?:XC)|(?:(?:(?:X)?)(?:L)))|(?:(?:(?:L)?)(?:(?:X){1,3})))?))(?:(?:(?:(?:IX)|(?:(?:(?:I)?)(?:V)))|(?:(?:(?:V)?)(?:(?:I){1,3})))?)))

This generates precisely the non-empty roman numerals:

nonempty_roman_numerals_2 = generate_all_re(nonempty_roman_numerals_re_2.pattern)

assert set(nonempty_roman_numerals_2) == set(nonempty_roman_numerals)

Although the way it is formed it can match some expressions in multiple ways:

len(nonempty_roman_numerals_2), len(set(nonempty_roman_numerals_2))

(20583, 3999)

Conclusion

Roman Numerals are a little funny but not terribly difficult to parse in Python. The approach here was inspired by property based testing especially of regular expressions to check that int_to_roman_numeral and roman_numeral_to_int are inverses, and the roman_numeral_re covers the range of int_to_roman_numeral.