SpaCy and Prodigy are handy tools for natural language processing in Python, but they're a pain to install in a reproducible way, say with a Makefile. Downloading a SpaCy model with python -m spacy download will always re-download the model, which can cost a lot of time and bandwidth for large models. Prodigy is a paid product and can't be installed from PyPI.

My solution for this is to generate a file containing the links to all the SpaCy models.

# Reproducible builds with pip-tools

One problem with model pipelines is that a change in a dependency can suddenly break them or change their behaviour. When you're trying to run the same code in different environments this can be really hard to debug; as an example, I've had Pandas break its handling of nullable fields between releases. A good way to handle this is to pin the versions of all dependencies (that is, use one specific version of each), and have a separate process that periodically upgrades the dependencies and checks the tests still pass. The pip-tools library provides a good way to do this.

You specify your dependencies in a requirements.in file, run pip-compile to generate a pinned requirements.txt file (optionally with --upgrade to upgrade dependencies), and pip-sync to install them into your environment. To make things even more secure you can pass --generate-hashes, which records the hash of every file so pip can verify nothing has changed. It also works with pyproject.toml and other ways of specifying inputs. There are other solutions to this, like Poetry, but pip-tools is less opinionated and simpler for a standalone project that isn't being published to PyPI.
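Concretely, the cycle looks something like this (the packages and version bounds are only examples, and the pip-compile/pip-sync invocations are shown as comments since they need pip-tools installed):

```shell
# A loose, human-edited requirements.in (bounds here are just examples):
cat > requirements.in <<'EOF'
pandas>=1.4,<1.6
spacy>=3.4,<3.5
EOF

# Pin everything (recording file hashes) into requirements.txt:
#   pip-compile --generate-hashes requirements.in
# Later, move to newer versions that still satisfy the bounds:
#   pip-compile --upgrade --generate-hashes requirements.in
# Make the current environment exactly match the pins:
#   pip-sync requirements.txt
```

Only requirements.in is edited by hand; requirements.txt is generated and committed, so every environment installs exactly the same versions.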

## Keeping credentials secret with pip-tools

To make builds reproducible, pip-compile writes all the find-links URLs into the requirements.txt file, which should be committed. That's a problem because our Prodigy link contains our secret license key, but it can be turned off with --no-emit-find-links.
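For example, a Prodigy requirements.in can carry the licensed wheel index in the style Prodigy's install instructions use (the XXXX key below is a placeholder, and the version bounds are just examples); compiled with the flag, the secret-bearing URL never lands in the committed file:

```shell
# requirements.in keeps the secret; it is never committed verbatim
# into requirements.txt when --no-emit-find-links is used.
cat > requirements.in <<'EOF'
--find-links https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy
prodigy>=1.11,<1.12
EOF

# The committed requirements.txt then pins versions without the URL:
#   pip-compile --no-emit-find-links requirements.in
```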

## Stop trying to reinstall things already installed

For URL or file dependencies, pip-sync will uninstall and reinstall them every time it's run, which makes each invocation slower.

A workaround for this is to use --find-links with the build directory itself, and pass the package names into pip-compile.

Here's an example workflow. You could have a requirements-model.in.raw file containing the model wheel URLs:

```
https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0-py3-none-any.whl
```

Then in the Makefile you could download all the wheels into build/ and produce a requirements-model.in file containing just the package names:

```make
requirements-model.in: requirements-model.in.raw
	mkdir -p build/
	cd build/ && xargs -n 1 curl -sSLO < ../requirements-model.in.raw
	sed -E 's|.*/||; s|-[0-9].*||' requirements-model.in.raw > requirements-model.in
```

Then when we want to install the dependencies we can pip-compile the files (pointing find-links at the build directory so the package names resolve to the downloaded wheels) and run pip-sync:

```make
requirements.txt: requirements.in requirements-model.in
	python -m piptools compile --no-emit-find-links --generate-hashes \
	    --find-links ./build/ \
	    --output-file requirements.txt \
	    requirements.in requirements-model.in

install: requirements.txt
	python -m piptools sync --find-links ./build/ requirements.txt
```

# Is there an easier way?

All these workarounds exist because there's no way for pip to map a package version to a URL. If we had each SpaCy model's name linked to its package in an HTML page (like Prodigy does) then pip could resolve the links itself.

We can easily build this page because all the models are on the spacy-models releases page. Using the GitHub releases API we can do it with a bit of shell scripting.

```sh
curl -s \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/explosion/spacy-models/releases | \
grep -o 'https://[^"]*/releases/download/[^"]*' | \
awk 'BEGIN {print "<!DOCTYPE html><html><body>"}
     {name = $0; gsub(".*/", "", name)
      print "<a href=\"" $0 "\">" name "</a><br/>"}
     END {print "</body></html>"}' \
> spacy_model_links.html
```

Or equivalently in Python we can get all the URLs using the API.

```python
import json
import urllib.request
from typing import List

def get_spacy_model_links() -> List[str]:
    url = "https://api.github.com/repos/explosion/spacy-models/releases"
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(req) as response:
        data = json.load(response)
    links = [
        asset["browser_download_url"]
        for release in data
        for asset in release["assets"]
    ]
    return links
```

Then we can combine the links into an HTML document that pip can understand:

```python
def spacy_model_links_to_html(links: List[str]) -> str:
    html = ["<!DOCTYPE html><html><body>"]
    for link in links:
        # The anchor text is the wheel filename; pip matches the
        # package name and version from the filename in the URL.
        name = link.rsplit("/", 1)[-1]
        html.append(f'<a href="{link}">{name}</a><br/>')
    html.append("</body></html>")
    return "\n".join(html)
```

## Simplified process

Now that we have the links we can treat a SpaCy model like any other dependency. We just have a single requirements.in file and can specify a version such as en_core_web_trf>=3.4.0,<3.5.0. It even checks and resolves the model's dependencies correctly.

Our Makefile is now relatively simple:

```make
requirements.txt: requirements.in
	python -m piptools compile -q --no-emit-find-links \
	    --find-links spacy_model_links.html \
	    --output-file requirements.txt requirements.in
```

My question is why SpaCy doesn't provide something like this. If this HTML links page were republished somewhere with every release, I could just point to it with find-links (assuming there's a good reason the models aren't on PyPI). But this simple solution of generating the file ourselves works until they provide a better way.