Roman Numerals with Python and Regular Expressions
python
Published
January 24, 2024
We’re going convert integers to roman numerals in standard form and back again in Python, as well as detect them with regular expressions.
Roman Numerals use symbols for different values of thousands, hundreds, tens and units. For each of these they have a special symbol for 5 and use a subtractive notation for 4 and 9:
Thousands
Hundreds
Tens
Units
1
M
C
X
I
2
MM
CC
XX
II
3
MMM
CCC
XXX
III
4
CD
XL
IV
5
D
L
V
6
DC
LX
VI
7
DCC
LXX
VII
8
DCCC
LXXX
VIII
9
CM
XC
IX
They are always written from largest to smallest for example:
MCMXLVIII = M + CM + XL + V + III = 1000 + 900 + 40 + 5 + 3 = 1948
MMMCMXCIX = MMM + CM + XC + IX = 3000 + 900 + 90 + 9 = 3999
We can easily convert a number to Roman Numerals by breaking it into each of these pieces from largest to smallest, being careful to include the subtractive values:
ROMAN_NUMERALS =dict( M=1000, CM=900, D=500, CD=400, C=100, XC=90, L=50, XL=40, X=10, IX=9, V=5, IV=4, I=1,)def int_to_roman_numeral(n: int) ->str:ifnot (0<= n <4000andint(n) == n):raiseValueError("Expected an integer between 0 and 3999") ans = []for numeral, base in ROMAN_NUMERALS.items(): count, n =divmod(n, base) ans += count * numeralassert n ==0return''.join(ans)', '.join([int_to_roman_numeral(i) for i inrange(1, 25)])
'I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI, XXII, XXIII, XXIV'
It agrees with our calculations before, and we have assigned 0 to an empty string.
We can convert a Roman Numeral back into an integer by converting the subtractive values into an appropriate number of units and then add the values together:
from collections import Counterdef roman_numeral_to_int(numeral: str) ->int:"""Expand roman numerals""" numeral_expanded = ( numeral .replace('CM', 'C'*9) .replace('CD', 'C'*4) .replace('XC', 'X'*9) .replace('XL', 'X'*4) .replace('IX', 'I'*9) .replace('IV', 'I'*4) )returnsum([count * ROMAN_NUMERALS[n] for n, count in Counter(numeral_expanded).items()])assert roman_numeral_to_int("") ==0assert roman_numeral_to_int("MCMXLVIII") ==1948assert roman_numeral_to_int("MMMCMXCIX") ==3999
We can check that for all valid integers if we convert to a Roman Numeral and back again we get back our input:
for i inrange(4000):assert roman_numeral_to_int(int_to_roman_numeral(i)) == i
However our roman_numeral_to_int will work on invalid roman numerals:
roman_numeral_to_int("IM")
1001
We can write a simple regular expression to check a Roman Numeral
And we can check all of our generated Roman Numerals are valid:
for i inrange(4000):assert is_roman_numeral(int_to_roman_numeral(i))
But how can we be sure our regular expression doesn’t match anything else?
Regular expression generation
We gen generate all the possible things that match this finite regular expression by following all transitions in the corresponding automata. While we could write our own recursive descent parser we’ll use Python’s private regular expression parser in Python 3.11 (the API may break across versions).
import re._constants as sreimport re._parser as sre_parse
Let’s start with a simpler example:
sre_parse.parse('AB')
[(LITERAL, 65), (LITERAL, 66)]
The parse returns a list of items which are to be concatenated together. In this case each of the literals has a character code.
In general each subexpression could return multiple results, so we return a list of all things that match. For a literal expression this is a list containing the single character:
Then to generat all regular expressions we generate all possible combinations of the subitems (with itertools.product), and concatenate them together with (''.join):
import itertoolsdef generate_all_re(s, flags=None) ->list[str]:ifisinstance(s, re._parser.SubPattern): parse = selifisinstance(s, re.Pattern):if flags isNone: flags = s.flags parse = sre_parse.parse(s.pattern, flags)elifisinstance(s, str):if flags isNone: flags = re.NOFLAG parse = sre_parse.parse(s, flags)else:raiseValueError("Unknown type %s"%type(s)) generations = [GENERATE_RE_HANDLERS[node](args) for node, args in parse]return [''.join(items) for items in itertools.product(*generations)]
Then we can check our simple example:
assert generate_all_re('AB') == ['AB']
Branches
A branch will generate either the pattern on the left or the pattern on the right:
for numeral in roman_numerals:assert is_roman_numeral(numeral)
And if we convert them to integers and back again we get the same result:
for numeral in roman_numerals:assert int_to_roman_numeral(roman_numeral_to_int(numeral)) == numeral
Non-empty Roman Numerals
What if we just want to find the non-empty Roman Numerals? We can modify our regular expression to exclude the case where all the parts are empty:
nonempty_roman_numeral_re = re.compile("""(?:(?:M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})) # Has a unit| (?:M{0,3}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})) # OR Has a ten| (?:M{0,3}(CM|C?D|D?C{1,3})) # OR Has a hundred| M{1,3} # OR Has a thousand)$""", flags=re.VERBOSE)assertnot nonempty_roman_numeral_re.match('')assertnot nonempty_roman_numeral_re.match('VX')assert nonempty_roman_numeral_re.match('X')assert nonempty_roman_numeral_re.match('MCMXLVIII')assert nonempty_roman_numeral_re.match('MMMCMXCIX')
This matches precisely the set of Roman Numerals from 1 to 3999:
The regular expression above is starting to get a bit gnarly. We can make it easier to write using Python classes and magic methods. To start with let’s create a function that wraps a regular expression in a non-capturing group:
def group_re(re: str) ->str:return"(?:"+ re +")"assert group_re("a") =="(?:a)"
Then we can create a class R (for Regex) where we can:
Concatenate by multiplying objects (A * B)
Branch by OR-ing objects (A | B)
Having 0 or 1 matches with maybe: A.maybe()
Having a specified number of matches with repeat: A.repeat(0,3)
class R:def__init__(self, pattern: str):self.pattern = patterndef__mul__(self, other):ifisinstance(other, str): other = R(other)return R(group_re(self.pattern) + group_re(other.pattern))def__rmul__(self, other):ifisinstance(other, str): other = R(other)return other *selfdef__or__(self, other):ifisinstance(other, str): other = R(other)return R(group_re(self.pattern) +"|"+ group_re(other.pattern))def__ror__(self, other):ifisinstance(other, str): other = R(other)return other |selfdef__eq__(self, other):returnself.pattern == other.patterndef__repr__(self):returnf'R({self.pattern})'def maybe(self):return R(group_re(self.pattern) +"?")def repeat(self, min, max):return R(group_re(self.pattern) +"{%s,%s}"% (min, max))def match(self, s, flags=re.NOFLAG):return re.match(group_re(self.pattern) +"$", s, flags)def finditer(self, s, flags=re.NOFLAG):return re.finditer(self.pattern, s, flags)
Using this notation we can then make regular expressions for a roman numeral between 0 and 9
unit ="IX"| (R("I").maybe() *"V") | (R("V").maybe() * R("I").repeat(1,3))unit
Similarly we can create expressions for tens, hundreds and thousands, then combine them to get the non-empty roman numerals. The resulting expression is an eye-sore because we’re doing a lot of unnecesary grouping:
Roman Numerals are a little funny but not terribly difficult to parse in Python. The approach here was inspired by property based testing especially of regular expressions to check that int_to_roman_numeral and roman_numeral_to_int are inverses, and the roman_numeral_re covers the range of int_to_roman_numeral.