It's polite to format your SQL before you share it around. You want to be able to do it in context, without uploading your private SQL to some random website. The sqlformat command of the Python package sqlparse is a great tool for the job. You can install sqlformat on Debian derivatives such as Ubuntu with sudo apt install sqlformat. Alternatively, on any system with Python you can install it via pip install sqlparse; just make sure you have the binary in your path (e.
I want to find examples of other Hugo blogs, but they're not easy to search for. Unless someone put "Hugo" in the description (which is actually common) there are no real defining files. However, there is a set of files found in a lot of Hugo blogs, and we can search for them on GitHub with the GH Archive BigQuery export. The strategy is that most Hugo blogs will contain a /themes folder, a /content folder, a /static folder and a config.
One way of refactoring legacy code is to use diff tests: checking what changes when you change the code. While it's easy to diff files, it's a little less obvious how to do this with SQL pipelines. Fortunately there are a few different techniques for doing it. For exact matching you can use a union all to count the rows that don't occur in both datasets. For approximate matching you can use a join to check whether the differences are within some bounds.
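The exact-matching technique can be sketched as follows (using SQLite through Python's sqlite3 here for a runnable example; the table and column names are made up, and it assumes neither output contains internal duplicate rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE old_pipeline (id INTEGER, total REAL);
CREATE TABLE new_pipeline (id INTEGER, total REAL);
INSERT INTO old_pipeline VALUES (1, 10.0), (2, 20.0), (3, 30.0);
INSERT INTO new_pipeline VALUES (1, 10.0), (2, 25.0), (3, 30.0);
""")

# Stack both outputs with UNION ALL and keep the groups that don't
# occur exactly twice: those rows appear in only one of the outputs.
mismatches = con.execute("""
SELECT id, total, COUNT(*) AS n
FROM (
    SELECT * FROM old_pipeline
    UNION ALL
    SELECT * FROM new_pipeline
)
GROUP BY id, total
HAVING COUNT(*) <> 2
""").fetchall()
print(mismatches)  # each row listed here differs between the two outputs
```

An empty result means the two pipelines produced identical rows.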
When doing a left join in SQL, any filtering of the joined table after the join will turn it into an inner join. However there are some easy ways to do the filtering first. Suppose you've got some tables related to a website. The pages table describes the different pages on the site:

page_id | page_name
--------|----------
1       | home
2       | checkout
3       | terms

Another table, pageviews, describes the daily activity:
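One easy way to filter first is to move the condition into the ON clause. A sketch of the difference (SQLite through Python's sqlite3; the pageviews columns are assumed for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pages (page_id INTEGER, page_name TEXT);
INSERT INTO pages VALUES (1, 'home'), (2, 'checkout'), (3, 'terms');
CREATE TABLE pageviews (page_id INTEGER, view_date TEXT, views INTEGER);
INSERT INTO pageviews VALUES
  (1, '2020-01-01', 100), (1, '2020-01-02', 90), (2, '2020-01-01', 7);
""")

# Filtering in WHERE after a LEFT JOIN drops the unmatched rows
# (terms has NULL view_date), silently making it an inner join.
inner_ish = con.execute("""
SELECT p.page_name, v.views
FROM pages p
LEFT JOIN pageviews v ON p.page_id = v.page_id
WHERE v.view_date = '2020-01-01'
""").fetchall()

# Putting the filter in the ON clause filters pageviews *before*
# the join, so pages with no matching views are kept with NULL.
outer = con.execute("""
SELECT p.page_name, v.views
FROM pages p
LEFT JOIN pageviews v
  ON p.page_id = v.page_id AND v.view_date = '2020-01-01'
""").fetchall()
print(inner_ish)  # terms is missing
print(outer)      # terms kept, with NULL views
```

A subquery or CTE that filters pageviews before joining works just as well.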
Sometimes you may want to experiment with sessions and need to hand-roll your own in SQL. There's a good Mode blog post on how to do this. If you're using Postgres or Greenplum you may be able to use Apache MADlib's Sessionize for the basic case. This blog post gives a very brief summary of how to do it, with some examples in Presto/Athena. The idea of a session is to capture a continuous unit of user activity.
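The standard hand-rolled approach works the same way in most dialects: flag the events where the gap to the previous event exceeds a timeout, then take a running sum of the flags to number the sessions. A sketch (SQLite via Python's sqlite3 for runnability; the events table, epoch-second timestamps, and the 30 minute timeout are all assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
con.executescript("""
CREATE TABLE events (user_id INTEGER, ts INTEGER);  -- ts in epoch seconds
INSERT INTO events VALUES
  (1, 0), (1, 600), (1, 4000),   -- user 1: the 3400s gap starts session 2
  (2, 100), (2, 200);
""")

# A new session starts when the gap to the previous event exceeds the
# 1800s timeout (or there is no previous event); the running sum of
# those flags gives each event its session number.
rows = con.execute("""
WITH gaps AS (
    SELECT user_id, ts,
           CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)
                     > 1800
                OR LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
                THEN 1 ELSE 0 END AS new_session
    FROM gaps_source
), gaps_source AS (SELECT * FROM events)
SELECT user_id, ts,
       SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM gaps
ORDER BY user_id, ts
""").fetchall()
print(rows)
```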
A challenge of data analytics is that the data can change as well as the code. The systems producing and collecting data are often changed, which can lead to missing or corrupt data. These problems can easily corrupt reports and machine learning systems. Worst of all, the data may be lost permanently. So if you're going to use some data, it's important to check it regularly to catch the worst kinds of mistakes as early as possible.
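Even very cheap checks catch a lot: did the expected partition arrive, are key fields null, are counts in a sane range. A sketch of what such a check might look like (SQLite via Python's sqlite3; the daily_traffic table and the specific checks are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE daily_traffic (day TEXT, source TEXT, visits INTEGER);
INSERT INTO daily_traffic VALUES
  ('2020-01-01', 'search', 120),
  ('2020-01-01', 'email', 30),
  ('2020-01-02', 'search', 115);
""")

def check_day(con, day):
    """Cheap checks to run after each load: the day's data arrived,
    no NULL keys, no negative counts. (SUMs are None on an empty day.)"""
    n_rows, n_null, n_neg = con.execute("""
        SELECT COUNT(*),
               SUM(source IS NULL),
               SUM(visits < 0)
        FROM daily_traffic WHERE day = ?
    """, (day,)).fetchone()
    return {"has_rows": n_rows > 0, "null_keys": n_null, "negative": n_neg}

print(check_day(con, "2020-01-02"))
print(check_day(con, "2020-01-03"))  # missing day: has_rows is False
```

In practice you'd schedule something like this after every load and alert on failures.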
It's common for there to be gaps or missing values in an SQL table. For example you may have daily traffic by source, but on some low volume days around Christmas there are no rows for the low traffic sources. Missing values can really complicate calculations like moving averages, and sometimes you need a way of filling them in. This is straightforward with a cross join. You need all the possible combinations you're filling in, and the value to fill with.
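A sketch of the technique (SQLite via Python's sqlite3; the traffic table is made up): cross join every day with every source to build the complete grid, then left join the real data and fill the holes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE traffic (day TEXT, source TEXT, visits INTEGER);
INSERT INTO traffic VALUES
  ('2019-12-24', 'search', 50),
  ('2019-12-24', 'email', 2),
  ('2019-12-25', 'search', 40);   -- no email row on the 25th
""")

# Build every (day, source) combination, left join the real data,
# and COALESCE the missing values to 0.
rows = con.execute("""
WITH days AS (SELECT DISTINCT day FROM traffic),
     sources AS (SELECT DISTINCT source FROM traffic)
SELECT d.day, s.source, COALESCE(t.visits, 0) AS visits
FROM days d
CROSS JOIN sources s
LEFT JOIN traffic t ON t.day = d.day AND t.source = s.source
ORDER BY d.day, s.source
""").fetchall()
print(rows)  # email on 2019-12-25 now appears with 0 visits
```

For a full date range (including days with no rows at all) you'd generate the days from a calendar table or a sequence function rather than DISTINCT.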
When checking my work in SQL, one of the first things I do is confirm that a column I expect to be unique actually is. Many tables have a unique key at their grain: for session level data it's a session_id, for user level data it's a user_id, and for daily data it's a date. It's generally a useful thing to check because all it takes is one bad join to end up with a bunch of duplicated (or dropped) rows.
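The check itself is two short queries: compare the total count to the distinct count, and if they differ, list the offending keys. A sketch (SQLite via Python's sqlite3; the sessions table is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sessions (session_id INTEGER, user_id INTEGER);
INSERT INTO sessions VALUES (1, 10), (2, 10), (2, 11);  -- 2 is duplicated
""")

# If session_id is unique these two counts agree.
total, distinct = con.execute(
    "SELECT COUNT(*), COUNT(DISTINCT session_id) FROM sessions"
).fetchone()

# Any duplicated keys show up here with their multiplicity,
# which is usually the fastest way to debug the bad join.
dupes = con.execute("""
    SELECT session_id, COUNT(*) AS n
    FROM sessions
    GROUP BY session_id
    HAVING COUNT(*) > 1
""").fetchall()
print(total, distinct)  # 3 2
print(dupes)            # [(2, 2)]
```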
Generally when combining datasets you want to join them on some key. But sometimes you really want a range lookup like Excel's VLOOKUP. A common example is binning values; you want to group values into custom ranges. While you could do this with a giant CASE statement, it's much more flexible to specify in a separate table (for regular intervals you can do it with some integer division gymnastics). It is possible to implement VLOOKUP in SQL by using window functions to select the right rows.
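One way to sketch the window-function approach (SQLite via Python's sqlite3; the bins and orders tables are made up): join each value to every bin at or below it, then keep only the highest such bin per row, which is the same row VLOOKUP's range match would pick.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bins (low INTEGER, label TEXT);
INSERT INTO bins VALUES (0, 'small'), (10, 'medium'), (100, 'large');
CREATE TABLE orders (order_id INTEGER, amount INTEGER);
INSERT INTO orders VALUES (1, 3), (2, 42), (3, 250);
""")

# Rank the candidate bins for each order by their lower bound,
# descending, and keep the highest bound that still fits.
rows = con.execute("""
SELECT order_id, amount, label
FROM (
    SELECT o.order_id, o.amount, b.label,
           ROW_NUMBER() OVER (PARTITION BY o.order_id
                              ORDER BY b.low DESC) AS rn
    FROM orders o
    JOIN bins b ON o.amount >= b.low
)
WHERE rn = 1
ORDER BY order_id
""").fetchall()
print(rows)  # [(1, 3, 'small'), (2, 42, 'medium'), (3, 250, 'large')]
```

Because the bins live in a table, changing the ranges is a data change rather than a change to the query.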
The longer I work with a database the more I learn the dark corners of the dataset. Make sure you exclude the rows created by the test accounts listed in another table. Don't use the create_date field; use real_create_date_v2 instead, unless it's not there, in which case just use create_date. Make sure you only get data from the latest snapshot for the key. Very quickly I end up with complex spaghetti SQL, which either contains monstrous subqueries or a chain of CREATE TEMPORARY TABLE statements.
One trick I use all the time is calculating percentages in SQL by dividing by the count. Percentages quickly tell me how much coverage I've got when looking at the top few rows. However Presto uses integer division, so doing the naive thing will always give you 0 or 1. There's a simple trick to work around this: replace count(*) with sum(1e0). Suppose, for example, you want to calculate the percentage of a column that is not null; you might try something like
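A runnable sketch of both the naive version and the fix (using SQLite through Python's sqlite3, which also does integer division; the emails table is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE emails (user_id INTEGER, email TEXT);
INSERT INTO emails VALUES (1, 'a@x.com'), (2, NULL), (3, 'c@x.com'), (4, NULL);
""")

# Naive version: COUNT returns integers, so 2 / 4 truncates to 0.
naive = con.execute(
    "SELECT COUNT(email) / COUNT(*) FROM emails"
).fetchone()[0]

# The trick: SUM(1e0) counts the rows as a float, so the
# division happens in floating point and keeps the fraction.
fixed = con.execute(
    "SELECT COUNT(email) / SUM(1e0) FROM emails"
).fetchone()[0]
print(naive, fixed)  # 0 0.5
```

Casting either side to double works too; sum(1e0) is just less typing.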
Moving averages can help smooth out the noise to reveal the underlying signal in a dataset. Because they lag behind the actual signal, they trade off timeliness for increased precision. You could use them for reporting metrics, or for alerting in cases where it's more important to be sure there is a change than to catch any change early. For important metrics it's typically better to have a 7 day moving average than weekly reporting, because you'll see changes earlier.
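A 7 day moving average is a one-line window function in most dialects. A sketch (SQLite via Python's sqlite3; the metrics table is made up, and it assumes one row per day with no gaps — fill the gaps first otherwise):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE metrics (day TEXT, visits INTEGER);
INSERT INTO metrics VALUES
  ('2020-01-01', 10), ('2020-01-02', 20), ('2020-01-03', 30),
  ('2020-01-04', 40), ('2020-01-05', 50), ('2020-01-06', 60),
  ('2020-01-07', 70), ('2020-01-08', 80);
""")

# Average the current row and the 6 rows before it: a 7 day window.
# The first 6 days average over fewer rows (a warm-up period).
rows = con.execute("""
SELECT day, visits,
       AVG(visits) OVER (ORDER BY day
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7
FROM metrics
ORDER BY day
""").fetchall()
print(rows[-2:])
```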
Presto, and the AWS managed alternative Amazon Athena, have some powerful aggregation functions that can make writing SQL much easier. A common problem is getting the most recent status from a transaction log. The max_by function (and its partner min_by) makes this a breeze. Suppose you have a table tracking user login activity over time like this:

country | user_id | time             | status
--------|---------|------------------|-----------
AU      | 1       | 2020-01-01 08:00 | logged-in
CN      | 2       | 2020-01-01 09:00 | logged-in
AU      | 1       | 2020-01-01 12:00 | logged-out
AU      | 1       | 2020-01-01 13:00 | logged-in
CN      | 2       | 2020-01-01 14:00 | logged-out

You need to find out which users are currently logged in or out, which requires finding their most recent status.
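In Presto this is just SELECT user_id, max_by(status, time) FROM logins GROUP BY user_id. For a runnable sketch with the same data, SQLite (via Python's sqlite3) can emulate it using its documented quirk that a bare column next to a single MAX() aggregate is taken from the row where the maximum occurred:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE logins (country TEXT, user_id INTEGER, time TEXT, status TEXT);
INSERT INTO logins VALUES
  ('AU', 1, '2020-01-01 08:00', 'logged-in'),
  ('CN', 2, '2020-01-01 09:00', 'logged-in'),
  ('AU', 1, '2020-01-01 12:00', 'logged-out'),
  ('AU', 1, '2020-01-01 13:00', 'logged-in'),
  ('CN', 2, '2020-01-01 14:00', 'logged-out');
""")

# The bare `status` column is taken from the row with MAX(time) in
# each group -- the same result as Presto's max_by(status, time).
rows = con.execute("""
SELECT user_id, status, MAX(time) AS last_seen
FROM logins
GROUP BY user_id
ORDER BY user_id
""").fetchall()
print(rows)
```

So user 1 is currently logged in (as of 13:00) and user 2 is logged out (as of 14:00).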