Offline SQL Formatting with sqlformat
sql

Offline SQL Formatting with sqlformat

It's polite to format your SQL before you share it around. You want to be able to do it in context, and not upload your private SQL to some random website. The sqlformat command of the Python package sqlparse is a great tool for the job. You can install sqlformat in Debian derivatives such as Ubuntu with sudo apt install sqlformat. Alternatively with any system with Python you can install it via pip install sqlparse, just make sure you have the binary in your path (e.

Finding Hugo Blogs with BigQuery
sql

Finding Hugo Blogs with BigQuery

I want to find examples of other Hugo blogs, but they're not really easy to search for. Unless someone put "Hugo" in the descrption (which is actually common) there's no real defining files. However there's a set of files that are in a lot of Hugo Blogs and we can search them in Github with the GHArcive BigQuery Export. The strategy is that most Hugo blogs will contain a /themes folder, a /content folder a /static folder and a config.

Diffing in SQL
sql

Diffing in SQL

One way of refactoring legacy code is to use diff tests; checking what changes when you change the code. While it can be easy to diff files, it's a little less obvious how to do this with SQL pipelines. Fortunately there are a few different techniques to do this. For exact matching you can use union all to find the number of rows that don't occur in both datasets. For approximate matching you can use a join to check whether the differences are within some bounds.

Filling Gaps in SQL
sql

Filling Gaps in SQL

It's common for there to be gaps or missing values in an SQL table. For example you may have daily traffic by source, but on some low volume days around Christmas there are no values in the low traffic sources. Missing values can really complicate some calculations like moving averages, and some times you need a way of filling them in. This is straightforward with a cross join. You need all the possible variables you're filling in, and the value to fill.

Binning data in SQL
sql

Binning data in SQL

Generally when combining datasets you want to join them on some key. But sometimes you really want a range lookup like Excel's VLOOKUP. A common example is binning values; you want to group values into custom ranges. While you could do this with a giant CASE statement, it's much more flexible to specify in a separate table (for regular intervals you can do it with some integer division gymnastics). It is possible to implement VLOOKUP in SQL by using window functions to select the right rows.

SQL Views for hiding business logic
sql

SQL Views for hiding business logic

The longer I work with a database the more I learn the dark corners of the dataset. Make sure you exclude the rows created by the test accounts listed in another table. Don't use the create_date field, use the real_create_date_v2 instead, unless it's not there, then just use create_date. Make sure you only get data from the latest snapshot for the key. Very quickly I end up with complex spaghetti SQL, which either contains monstrous subqueries or a chain of CREATE TEMPORARY TABLE.

Calculating percentages in Presto
SQL

Calculating percentages in Presto

One trick I use all the time is calculating percentages in SQL by dividing with the count. Percentages quickly tell me how much coverage I've got when looking at the top few rows. However Presto uses integer division so doing the naive thing will always give you 0 or 1. There's a simple trick to work around this: replace count(*) with sum(1e0). Suppose for example you want to calculate the percentage of a column that is not null; you might try something like

Moving Averages in SQL
SQL

Moving Averages in SQL

Moving averages can help smooth out the noise to reveal the underlying signal in a dataset. As they lag behind the actual signal they tradeoff timeliness for increased precision in the underlying signal. You could use them for reporting metrics or for alerting in cases where it's more important to be sure there is a change than it is to catch any change early. It's typically better to have a 7 day moving average than weekly reporting for important metrics because you'll see changes earlier.

Getting most recent value in Presto with max_by
Presto

Getting most recent value in Presto with max_by

Presto and the AWS managed alternative Amazon Athena have some powerful aggregation functions that can make writing SQL much easier. A common problem is getting the most recent status of a transaction log. The max_by function (and its partner min_by) makes this a breeze. Suppose you have a table tracking user login activity over time like this: country user_id time status AU 1 2020-01-01 08:00 logged-in CN 2 2020-01-01 09:00 logged-in AU 1 2020-01-01 12:00 logged-out AU 1 2020-01-01 13:00 logged-in CN 2 2020-01-01 14:00 logged-out You need to find out which users are currently logged in and out, which requires you to find their most recent status.