SQL: Aggregation by Date
In previous post about group by top 5, we’re trying to find the top 5 buyers on a daily basis and we’re using the date bucket technique to aggregate data for each day, like this:
WITH buckets AS (SELECT '2023-01-01T00:00:00Z'::TIMESTAMP as date
UNION
SELECT '2023-01-02T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-03T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-04T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-05T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-06T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-07T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-08T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-09T00:00:00Z'::TIMESTAMP
UNION
SELECT '2023-01-10T00:00:00Z'::TIMESTAMP),
[...]
FROM buckets b
LEFT JOIN groups g ON b.date >= g.date AND b.date + INTERVAL '-1 day' < g.date
GROUP BY b.date, g.email
ORDER BY date;
This query works. But the problem is we have to do a JOIN between buckets
and groups
. If groups
is a very big table then the query will be too slow
to run. For big data in general, not moving data around is the key to boost
the query performance.
Let’s make the above query better by not JOINing the big table with others (hence not moving it). Instead try to aggregate the data in-place.
WITH dates AS (SELECT id,
DATE_TRUNC('day', date) as date,
email,
amount
FROM orders),
ranks AS (SELECT email AS email,
ROW_NUMBER() OVER (ORDER BY COUNT(id) DESC) AS email_rank
FROM dates
GROUP BY email),
groups AS (SELECT id,
CASE WHEN r.email_rank <= 5 THEN o.email ELSE 'Other' END AS email,
date
FROM dates o
LEFT JOIN ranks r ON o.email = r.email)
SELECT date,
email,
COUNT(id) AS count
FROM groups
GROUP BY date, email
ORDER BY date;
The result should be the same, but without any JOINs, hence it runs way faster.