Filling Gaps in SQL

sql
Published

July 14, 2020

It’s common for there to be gaps or missing values in an SQL table. For example you may have daily traffic by source, but on some low volume days around Christmas there are no values in the low traffic sources. Missing values can really complicate some calculations like moving averages, and some times you need a way of filling them in. This is straightforward with a cross join.

You need all the possible variables you’re filling in, and the value to fill. For example you might have the daily table containing each device, traffic source and date the number of visitors.

device source date visitors
desktop direct 2019-12-23 100
desktop paid 2019-12-23 3
mobile any 2019-12-23 32
desktop direct 2019-12-24 40
mobile any 2019-12-24 18
desktop direct 2019-12-26 80
desktop paid 2019-12-26 2
mobile any 2019-12-26 23

There are some missing combinations of valid (device, source) and date with visitors, and we want to fill them with 0s. We could get all the possible combinations of device and source with a simple query:

select device, source
from daily
group by 1, 2

Similarly we could generate all possible dates; if some may be missing for every device and source it’s best to use something like generate series:

SELECT date
FROM generate_series(
  (select min(date) from daily),
  (select max(date) from daily),
  '1 day') as dates(date)

Then we can generate all possible combinations with a cross join. The exact syntax may vary by database; I’ll assume here that separating with commas does the trick (at least Presto and Postgres).

select device, source, date
from (
  select device, source
  from daily
  group by 1, 2
), 
 generate_series(
  (select min(date) from daily),
  (select max(date) from daily),
  '1 day') as dates(date)

Finally we can fill in the data from the underlying table with a coalesce, using the default value of 0 for missing entries:

select device, source, date, 
       coalesce(daily.users, 0) as users
from (
  select device, source, date
  from (
    select device, source
    from daily
    group by 1, 2
  ), 
   generate_series(
    (select min(date) from daily),
    (select max(date) from daily),
    '1 day') as dates(date)
) xjoin
left join daily
  on xjoin.device = daily.device 
  and xjoin.source = daily.source
  and xjoin.date = daily.date

Finally it’s always good to verify the table is actually as expected. Suppose we materialised the above query into daily_full:

select device, source,
       min(date) as first_date, max(date) as last_date, 
       count(distinct date) as dates,
       count(*) as rows,
       count(users) as user_rows

We would expect the last 3 columns to be the same and equal to the number of days between first date and last date (which you could add with the database specific function to calculate the difference between dates).

This is a useful pattern for filling in missing values in tables, and can be used in more complex scenarios where the filled value depends on the parameters.