This is the essence of the refactoring process: small changes and testing after each change. If I try to do too much, making a mistake will force me into a tricky debugging episode that can take a long time. Small changes, enabling a tight feedback loop, are the key to avoiding that mess.

Martin Fowler, Refactoring: Improving the Design of Existing Code

You've got a Python analytics process and have to make a change to how it works.
Many analytics codebases consist of a pipeline of steps: getting data, extracting features, training models, and evaluating results and diagnostics. The best way to structure the code isn't obvious, and if you're having trouble importing files, getting module-not-found errors, or tinkering with PYTHONPATH, it's likely you haven't got it right. One way I've seen many data analytics pipelines structured is as a series of numbered steps:
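A minimal sketch of that layout, with hypothetical step names: each stage lives in its own numbered script, and a driver runs the scripts in lexicographic (hence numeric) order.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical pipeline directory: each stage is a standalone script,
# and the numeric prefix encodes the run order.
pipeline = Path(tempfile.mkdtemp())
(pipeline / "01_get_data.py").write_text("print('getting data')")
(pipeline / "02_train_model.py").write_text("print('training model')")

# The driver just runs the scripts in sorted (numbered) order.
outputs = []
for step in sorted(pipeline.glob("*.py")):
    result = subprocess.run(
        [sys.executable, str(step)],
        capture_output=True, text=True, check=True,
    )
    outputs.append(result.stdout.strip())
print(outputs)  # ['getting data', 'training model']
```

The numbering makes the run order obvious at a glance, but it couples the structure to a strictly linear flow, which is part of why this layout gets awkward as pipelines grow.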
I used to think the whole point of software verifications, like types and tests, was to ensure a piece of software worked as specified. Consequently, if a piece of software already worked, there wasn't much point in adding automated tests; sure, we might find a few edge cases that didn't work, but the ones that impacted end users would already have surfaced in bug reports. I now think the primary benefit of verifications is making software easier to change without losing quality to regressions.
I am a very recent convert to automatic refactoring tools. I thought they were for languages like Java that have a lot of boilerplate, and overkill for something like Python. I still liked the concept of refactoring, but I just moved the code around with Vim motions or sed. But then I came up against a giant Data Science codebase that was a wall of instructions like this: import pandas as pd, import datetime, df = pd.
One way of refactoring legacy code is to use diff tests: checking what changes when you change the code. While it's easy to diff files, it's a little less obvious how to do this with SQL pipelines. Fortunately there are a few techniques. For exact matching you can use UNION ALL to count the rows that don't occur in both datasets; for approximate matching you can use a join to check whether the differences are within some bounds.
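Here is a sketch of both techniques, using SQLite for illustration and hypothetical table and column names (`old_output` and `new_output` standing in for two versions of a pipeline's output):

```python
import sqlite3

# Hypothetical example: the old and new pipeline outputs, materialised
# as two tables with the same schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE old_output (id INTEGER, score REAL);
CREATE TABLE new_output (id INTEGER, score REAL);
INSERT INTO old_output VALUES (1, 0.5), (2, 0.7), (3, 0.9);
INSERT INTO new_output VALUES (1, 0.5), (2, 0.7), (3, 0.95);
""")

# Exact diff: stack both tables with UNION ALL and keep rows that don't
# appear exactly twice (assumes no duplicate rows within each table).
exact_diff = conn.execute("""
    SELECT id, score, COUNT(*) AS n
    FROM (
        SELECT id, score FROM old_output
        UNION ALL
        SELECT id, score FROM new_output
    )
    GROUP BY id, score
    HAVING COUNT(*) <> 2
""").fetchall()
print(exact_diff)  # only id 3 differs, so its old and new rows show up

# Approximate diff: join on the key and flag rows outside a tolerance.
approx_diff = conn.execute("""
    SELECT o.id, o.score AS old_score, n.score AS new_score
    FROM old_output o
    JOIN new_output n ON o.id = n.id
    WHERE ABS(o.score - n.score) > 0.01
""").fetchall()
print(approx_diff)  # [(3, 0.9, 0.95)]
```

The UNION ALL trick works in any SQL dialect; the tolerance in the join version is where you encode how much numeric drift your refactoring is allowed to introduce.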
When making changes to code, tests are a great way to make sure you haven't inadvertently introduced regressions. They let you make changes much faster and with more confidence, knowing the tests will catch many careless mistakes. But what do you do when you're working with a legacy codebase that doesn't have any tests? One method is creating diff tests: testing how your changes impact the output. For batch model training or an ETL pipeline there's typically a natural way to do this.
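A minimal sketch of a diff test, assuming a hypothetical `transform` step and a pandas DataFrame baseline captured from the pipeline before any changes were made:

```python
import pandas as pd

def transform(df):
    # Stand-in for the real legacy pipeline step under test.
    return df.assign(total=df["a"] + df["b"])

def test_diff_against_baseline():
    # Run the current code on a fixed, representative input...
    fixed_input = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
    result = transform(fixed_input)
    # ...and compare against the output captured before refactoring.
    baseline = pd.DataFrame({"a": [1, 2], "b": [10, 20], "total": [11, 22]})
    pd.testing.assert_frame_equal(result, baseline)

test_diff_against_baseline()
```

In practice the baseline would be a saved artifact (a Parquet or CSV file from a known-good run) rather than a literal in the test, but the idea is the same: any behavioural change in the refactored code shows up as a diff.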
When making changes to an unfamiliar model training pipeline I find it really useful to understand the dataflow. Analytics workflows run as a series of transformations, each taking some inputs and producing some outputs (or, in the case of mutation, an input that is also an output). Seeing this dataflow gives a big-picture overview of what is happening and makes it easier to understand the impact of changes. Generally you can view the process as a directed and (hopefully) acyclic graph.
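One lightweight way to make that graph explicit (the step names here are hypothetical) is to record each step's upstream dependencies and let the standard library's graphlib order them:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps whose
# outputs it consumes.
dataflow = {
    "extract_features": {"get_data"},
    "train_model": {"extract_features"},
    "evaluate": {"train_model", "extract_features"},
}

# A topological order is a valid execution order for the DAG;
# TopologicalSorter raises CycleError if the graph has a cycle.
order = list(TopologicalSorter(dataflow).static_order())
print(order)  # get_data first, evaluate last
```

Once the graph is written down you can also answer impact questions directly, e.g. which downstream steps need to be re-run when `extract_features` changes.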
A lot of analytics code I've read is one very long procedural chain. These can be hard to follow because the only way to really know what's going on at any point is to insert a probe and inspect the inputs and outputs at that stage. Breaking these chains into functions is a really useful way of making the code easier to understand, change, and find bugs in. In Refactoring, Martin Fowler notes that whenever a block of code has (or needs) a comment describing what it does, that's a good opportunity to extract that code into a function.
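A toy illustration of that extract-function move, with hypothetical names: each function replaces a block of the procedural chain that previously needed a comment to explain itself.

```python
def parse_prices(lines):
    # Was the block commented "# parse the raw price column".
    return [float(line.split(",")[1]) for line in lines]

def drop_outliers(prices, limit=1000.0):
    # Was the block commented "# remove obviously bad readings".
    return [p for p in prices if 0 < p <= limit]

def average(prices):
    # Was the block commented "# compute the mean".
    return sum(prices) / len(prices)

raw = ["a,10.0", "b,20.0", "c,99999.0"]
result = average(drop_outliers(parse_prices(raw)))
print(result)  # 15.0
```

The comments become function names, so the top-level flow now reads as a sentence, and each stage can be probed or unit-tested in isolation.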