Hurdles in Contributing to Open Source
programming

Often in programming it's not the code itself that is hard; it's the environment and systems around it. I found that today when trying to contribute to an open source repository. I was working on some code and using the excellent data-science-types to type check some Pandas code with mypy. But for some reason I was getting a weird error when using read_feather to read some data I had just written with to_feather, so I switched from to_feather to to_pickle, which doesn't do as much conversion.

  • Edward Ross
Open Source Licenses for Data Processing Code
programming

When a program primarily sources and transforms data, copyleft licenses add very little protection over other open source licenses. Because of this I've licensed my open data processing code as MIT: more complex licenses would prevent other people from using it, without adding much sharing. There are three main license types used in open source: MIT, Apache and GPL (with the BSD family somewhere between MIT and Apache).

  • Edward Ross
What Is a Better Programming Approach?
programming

When you solve a problem in code you will use some programming approach, and the approach you choose can make a big impact on your efficiency. I talk about approach rather than language because it's more than just the language. A project will typically only use a subset of the language (especially for massive languages like C++), some set of libraries, and develop patterns in the language for working with those libraries.

  • Edward Ross
Managing Python Versions with asdf
programming

I was recently trying to run a pipenv script, but it gave an error that it required Python 3.7, which wasn't installed. Unfortunately I was on Ubuntu 20.04, which has Python 3.8 as default and no access to earlier versions in the repositories. However pipenv gave a useful hint: pyenv and asdf not found. The asdf tool allows you to configure multiple versions of applications in common interactive shells (Bash, Zsh, and Fish).

  • Edward Ross
Git: One VCS to Rule Them All
programming

When I started as a professional developer there were a number of competing version control systems. However Git seems to have almost entirely won this battle. One of the most popular centralised version control systems was Subversion (SVN), which was largely an improvement on Concurrent Versions System (CVS). But distributed version control systems, starting with Git, became really popular. With a centralised system you have to lock files on the central server when editing and unlock them when you're finished, to make sure no one else interferes with your work.

  • Edward Ross
Using find and xargs
programming

Sometimes you want to feed a bunch of files to a program, and this is often easily done with find and xargs. Suppose you have an executable doit that you want to execute on all Python files in src/; you can do this directly with find:

    find src/ -name '*.py' -exec doit {} \;

You can use xargs for this as well, but if there's a chance that a path could contain a space somewhere it's best to use -print0 with find and -0 with xargs to separate all arguments with nulls (rather than spaces):
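
Why null separators matter can be sketched in Python. This is only an illustration of the splitting behaviour, not how find or xargs are implemented, and the paths are made up: splitting on whitespace breaks a path that contains a space, while splitting on the null character recovers every path intact.

```python
from typing import List

# Hypothetical paths; the second one contains a space.
paths = ["src/main.py", "src/my module.py"]


def split_on_whitespace(stream: str) -> List[str]:
    """Split the way a whitespace-delimited argument list would be split."""
    return stream.split()


def split_on_nul(stream: str) -> List[str]:
    """Split a null-terminated stream, like the output of find -print0."""
    return [path for path in stream.split("\0") if path]


# One path per line, similar to what find prints by default.
newline_separated = "\n".join(paths)
# Each path followed by a null character, similar to find -print0.
nul_separated = "".join(path + "\0" for path in paths)

print(split_on_whitespace(newline_separated))
# ['src/main.py', 'src/my', 'module.py']
print(split_on_nul(nul_separated))
# ['src/main.py', 'src/my module.py']
```

A null character cannot appear inside a file path, which is what makes it a safe separator.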

  • Edward Ross
Type Checking Beautiful Soup
python

Static type checking in Python can quickly verify whether your code is open to certain bugs. But it only works if it knows the types of external libraries. I've already introduced how to add type stubs for libraries without type annotations. But what if we have a complex library like BeautifulSoup that uses a lot of recursion and magic methods, and operates on unknown data? With some small changes to your code you can make it typecheck with BeautifulSoup.

  • Edward Ross
Structuring a Project Like a Kaggle Competition
data

Analytics projects are messy. It's rarely clear at the start how to frame the business problem, whether a given approach will actually work, and if you can get it adopted by your partners. However once you have a framing the modelling part can be iterated on quickly by structuring the project like a Kaggle Competition. The modelling part of analytics projects will go smoothly only if you have clear evaluation criteria.

  • Edward Ross
Code Structure Reflecting Function
programming

I've been trying to extract job ads from Common Crawl. However I've been stuck on how to structure the code. Thinking through the relationships really helped me do this. The architecture of the pipeline is a set of methods that fetch source data, extract the structured data and normalise it into a common form to be combined. I previously had these methods all written in one large file, adding each extractor to a dictionary, which was a headache to look at.
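
One possible way to lay out a fetch, extract and normalise pipeline is sketched below. The excerpt does not show the actual code, so every name here (Extractor, EXTRACTORS, the example source and its fields) is hypothetical; the point is only that each source bundles its three steps together and a small registry combines them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Extractor:
    """One data source: how to fetch it, extract records and normalise them."""

    fetch: Callable[[], Iterable[str]]
    extract: Callable[[str], dict]
    normalise: Callable[[dict], dict]

    def run(self) -> List[dict]:
        # Apply the three stages in order to every raw item from the source.
        return [self.normalise(self.extract(raw)) for raw in self.fetch()]


# A made-up source standing in for a real crawl.
def fetch_example() -> Iterable[str]:
    return ["Data Analyst|melbourne", "Engineer|SYDNEY"]


def extract_example(raw: str) -> dict:
    title, location = raw.split("|")
    return {"title": title, "location": location}


def normalise_example(record: dict) -> dict:
    return {
        "title": record["title"].strip(),
        "location": record["location"].title(),
    }


# Registry of sources; in a larger project each entry could live in its own module.
EXTRACTORS: Dict[str, Extractor] = {
    "example": Extractor(fetch_example, extract_example, normalise_example),
}


def run_all(extractors: Dict[str, Extractor]) -> List[dict]:
    """Combine the normalised records from every source into one list."""
    combined = []
    for name, extractor in extractors.items():
        for record in extractor.run():
            combined.append({"source": name, **record})
    return combined


jobs = run_all(EXTRACTORS)
print(jobs)
```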

  • Edward Ross
Which /bin/sh
programming

I tried to run a shell script and got this error:

    set: Illegal option -o pipefail

I had a quick look and the first line was #!/bin/sh. The -o pipefail option isn't valid across POSIX shells, so I would expect that to fail. More specifically, on modern Ubuntu /bin/sh is dash, which doesn't support these bash-like constructions. But /bin/sh is very different on different systems; on some it is bash, on others it's ash (from which dash is derived), and on others it's ksh or something else.

  • Edward Ross
Operating a Tower of Hacks
programming

Remember after you run the update process to run the fix script on the production database. But run it twice because it only fixes some of the rows the first time. Oh, and don't use the old importer tool in the import directory, use the one in the scripts directory now. You already used the old one? It's ok, just manually alter the production database with this gnarly query. Ah right, I see the filler table it uses is corrupted, let's just copy it from a backup.

  • Edward Ross
Run Webserver Without Root
programming

You've written your web application or API and you now want to deploy it to a server. You don't want to run it as root, because if someone finds a vulnerability in the server then it will be trivial for them to take over the system. However only root has permission to run applications on ports 80 and 443. There are a few ways around this, but only a couple that make sense for an interpreted language (like Python, as opposed to a compiled binary).

  • Edward Ross
Unhappy Path Programming
programming

When programming it's easy to think about the happy path. The path along which you get well-formed valid data, all your requests return successfully and everything works on your target platform. When you're in this mindset it's easy to just check it works in one case and assume everything is alright. But the majority of real work in programming is the unhappy paths. While you always need to be thinking about how things could go wrong, it's much more important in web programming.

  • Edward Ross
Updating a Python Project: Whatcar
whatcar

The hardest part of programming isn't learning the language itself, it's getting familiar with the gotchas of the ecosystem. I recently updated my whatcar car classifier in Python after leaving it for a year and hit a few roadblocks along the way. Because I'm familiar with Python I knew enough heuristics to work through them quickly, but it takes experience with running into problems to get there. I thought I had done a good job of making it reproducible by creating a Dockerfile for it.

  • Edward Ross
Choosing a Static Site Generator
blog

Static website generators fill a useful niche between handcoding all your HTML and running a server. However there's a plethora of site generators and it's hard to choose between them. My recommendation is simple: if you're writing a blog use Jekyll (if you don't want to use something like Wordpress). Static website generators compile input assets into a set of static HTML, CSS and Javascript files that can be deployed almost anywhere.

  • Edward Ross
Learning Hugo by Editing Themes
programming

One of the hardest parts of learning something new is motivation. This is why one of the best ways to learn programming is editing code; it's goal driven so motivation is built in. I've successfully used this to start learning how to write Hugo themes. Now that I've got a reasonable collection of posts, over 250, I would like to understand what content people are actually accessing on this website to get an idea of what would be useful.

  • Edward Ross
Manually Triggering Github Actions
programming

I have been publishing this website using Github Actions with Hugo on push and on a daily schedule. I recently received an error notification via email from Github, and wanted to check whether it was an intermittent error. Unfortunately I couldn't find any way to rerun it manually; I would have to push again or wait. Fortunately there's a way to enable manual reruns with workflow_dispatch. There's a Github blog post on enabling manual triggers with workflow_dispatch.

  • Edward Ross
Finding Files Installed in Ubuntu and Debian
programming

My bashrc file sources the git prompt helper to show the branch I'm on in the prompt. Unfortunately it's quite old and was pointing to the wrong file. How do I find where it is?

    dpkg -L git | grep prompt

On Debian and its derivatives such as Ubuntu you can use apt to manage packages (e.g. apt upgrade, apt install). However apt is just a thin layer over dpkg that does useful things like resolving dependencies and downloading files.

  • Edward Ross
Programming Languages to Learn in 2020
programming

"A language that doesn't affect the way you think about programming, is not worth knowing." Alan Perlis

I spend a lot of time programming in Python and SQL, some time in Bash and R (or at least tidyverse), and a little in Java and Javascript/HTML/CSS. This set of tools is actually pretty versatile for getting things done, but is fairly narrow from a programming concept perspective. Once in a while I think it's useful to broaden the programming frame to understand different ways of doing things, even if you still stick to the same few languages.

  • Edward Ross
Comment to Function
programming

A lot of analytics code I've read is a very long procedural chain. These can be hard to follow because the only way to really know what's going on at any point is to insert a probe to inspect the inputs and outputs at that stage. Breaking these into functions is a really useful way of making the code easier to understand, change and find bugs in. In Martin Fowler's Refactoring he mentions that whenever there's a block of code that has (or requires) a comment to describe what it does, that's a good opportunity to package that code into a function.
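
As a minimal sketch of that refactoring (the price data and function names are invented for illustration), the commented block in the first version becomes a named function in the second, and the comment becomes the function name:

```python
from statistics import mean
from typing import List, Optional


# Before: the comment explains what the block does.
def average_price_before(prices: List[Optional[float]]) -> float:
    # remove missing and negative prices
    cleaned = [p for p in prices if p is not None and p >= 0]
    return mean(cleaned)


# After: the comment has turned into a function that can be tested on its own.
def remove_invalid_prices(prices: List[Optional[float]]) -> List[float]:
    """Drop missing (None) and negative prices."""
    return [p for p in prices if p is not None and p >= 0]


def average_price_after(prices: List[Optional[float]]) -> float:
    return mean(remove_invalid_prices(prices))


sample = [10.0, None, -5.0, 20.0]
print(average_price_before(sample))  # 15.0
print(average_price_after(sample))   # 15.0
```

The extracted function also gives a natural place to inspect inputs and outputs, without inserting a probe into the middle of a long chain.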

  • Edward Ross
Teaching Programming by Editing Code
programming

I've had a few discussions with people, especially analysts, about how to learn programming. Generally I encourage them to find a project they want to accomplish and try to learn programming on the way. However I really struggle to find resources to recommend because they tend to spend a lot of time teaching programming concepts from scratch. I wonder if a better way to teach these things would be to start with code that's close to what they want to accomplish, and get them to edit it.

  • Edward Ross
Maybe Monad in Python
python

A monad in languages like Haskell is used as a particular way to extend the domain of a function beyond where it was originally defined. You can think of them as a generalised form of function composition; they are a way of taking one type of function and getting another function. A very useful case is the maybe monad used for dealing with missing data. Suppose you've got some useful function that parses a date: parse_date('2020-08-22') == datetime(2020,8,22).
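
A rough sketch of the idea in Python, under some loud assumptions: None stands in for a missing value (so this is not a full monad, since it cannot represent a nested "maybe"), the maybe helper is a hypothetical name, and the body of parse_date is invented because the excerpt only shows how it is called.

```python
from datetime import datetime
from typing import Callable, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def maybe(f: Callable[[A], B]) -> Callable[[Optional[A]], Optional[B]]:
    """Lift f so that it also accepts None, passing the None straight through."""

    def lifted(x: Optional[A]) -> Optional[B]:
        return None if x is None else f(x)

    return lifted


def parse_date(text: str) -> datetime:
    # Assumed implementation for illustration only.
    return datetime.strptime(text, "%Y-%m-%d")


maybe_parse_date = maybe(parse_date)

print(maybe_parse_date("2020-08-22"))  # 2020-08-22 00:00:00
print(maybe_parse_date(None))          # None
```

The lifted function has a larger domain than the original: it is defined on strings and on None, so missing data flows through without a check at every call site.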

  • Edward Ross
Scheduling Github Actions
programming

I use Github actions to publish daily articles via Hugo. I had set it up to publish on push, but sometimes I future date articles to have a backlog. This means that they won't be published until my next commit or manual publish action. To fix this I've set up a scheduled action to run just after 8am in UTC+10 (close to my timezone in Melbourne, Australia) every day. By default Hugo will not publish articles with a future date, so it's easy to keep a backlog by setting the date in front matter to a future date.
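
Scheduled Github Actions workflows take a cron expression in UTC, so the local time has to be converted. The sketch below does that arithmetic for a fixed UTC+10 offset; the helper name and the choice of five minutes past the hour for "just after 8am" are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset of UTC+10 (no daylight saving adjustment).
LOCAL = timezone(timedelta(hours=10))


def daily_utc_cron(hour: int, minute: int, tz: timezone = LOCAL) -> str:
    """Return a daily cron expression in UTC for a local wall-clock time."""
    # The date is arbitrary because a fixed offset is the same every day.
    local_time = datetime(2020, 1, 1, hour, minute, tzinfo=tz)
    utc_time = local_time.astimezone(timezone.utc)
    return f"{utc_time.minute} {utc_time.hour} * * *"


print(daily_utc_cron(8, 5))  # 5 22 * * *
```

So 8:05am in UTC+10 is 22:05 UTC on the previous day, which is the time to put in the schedule.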

  • Edward Ross
Using Local Github Actions
programming

I've been using Github Actions to publish this website for almost a month. The experience has been great; whenever I push a commit it gets consistently published within minutes, without me thinking about it. However I have one concern: I'm passing my rsync credentials into an external action. I've specified a tag in my YAML (uses: wei/rclone@v1), but it would be easy for the author to move this tag to another commit that sends my private credentials to their personal server.

  • Edward Ross
Heuristics for Active Open Source Project
programming

When evaluating whether to use an open source project I generally want to know how active the project is. A project doesn't need to be active to be usable; mature and stable projects don't need to change much to be reliable. But if a project has problems or is missing essential features, or is in an evolving ecosystem (like any web project or kernel drivers), it's important to know how fast it changes.

  • Edward Ross
Using Github Actions with Hugo
programming

I really like the idea of having a process triggered automatically when I push code. Github actions gives a way to do this with Github repositories, and this article was first published with a Github action. While convenient for simple things Github actions seem hard to customise, heavyweight to configure and give me security concerns. My workflow for publishing this website used to be commit and push the changes and run a deploy script.

  • Edward Ross
Powershell Debugging with Write-Warning
programming

I had to debug some Powershell, without knowing anything about it. I found Write-Warning was the right tool for printline debugging. This was enough to resolve my issue. I first tried Write-Output but apparently it doesn't work inside a function which I found misleading for a while (at first I thought that it wasn't getting to the function). Write-Warning worked straight away and I could see in bright yellow what was going on.

  • Edward Ross
Using emacs dumb-jump with evil
emacs

Dumb-jump is a fantastic emacs package for code navigation. It jumps to the definition of a function/class/variable by searching for regular expressions that look like a definition using ag, ripgrep or git-grep/grep. Because it is so simple it works in over 40 languages (including oddities like SQL, LaTeX and Bash) and is easy to extend. While it is slower and less accurate than ctags, for medium sized projects it's fast enough and requiring no setup makes it much more useful in practice.
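
The regular expression approach can be sketched in Python. This is a toy illustration, not dumb-jump's actual rules or implementation (which shells out to tools like ag or ripgrep): the three patterns below are made up and only cover simple Python definitions.

```python
import re
from typing import List, Tuple


def definition_patterns(name: str) -> List["re.Pattern[str]"]:
    """Regexes for lines that look like a definition of name (illustrative only)."""
    escaped = re.escape(name)
    return [
        re.compile(rf"^\s*def\s+{escaped}\s*\("),  # function definition
        re.compile(rf"^\s*class\s+{escaped}\b"),   # class definition
        re.compile(rf"^\s*{escaped}\s*=[^=]"),     # variable assignment
    ]


def find_definitions(name: str, source: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that look like definitions of name."""
    patterns = definition_patterns(name)
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in patterns):
            hits.append((lineno, line.strip()))
    return hits


SOURCE = """\
import os

def load(path):
    return open(path).read()

result = load("a.txt")
print(load("b.txt"))
"""

print(find_definitions("load", SOURCE))    # [(3, 'def load(path):')]
print(find_definitions("result", SOURCE))  # [(6, 'result = load("a.txt")')]
```

Calls to load are ignored because they don't match any definition pattern, which is the whole trick: no parsing or index is needed, only text search.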

  • Edward Ross