Big Data Aggregation Strategy
We collect data in raw form and then display it in aggregated form. That is the most basic shape of data analytics, and writing data aggregation queries should be a trivial task. The catch is that once the data gets big, we can no longer run real-time aggregations on the raw data. The solution is to pre-aggregate ahead of time.
This post sketches out a simple plan that should suit many use cases that need pre-aggregation.
1. Examination worker
This worker examines the raw data for objects that the aggregations need to run on. It should be a cron job that fires on a fixed interval, chosen based on your requirements for aggregation freshness. The examination worker creates a `queued` task in the task queue table for the aggregation worker to pick up.
A not-so-minor issue here is that raw data may arrive before or after the examination time. So once in a while, we need to trigger a back-fill job that hunts for the missing data.
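The examination step can be sketched roughly as below, using SQLite for brevity. The `events` and `task_queue` tables and their columns are illustrative assumptions, not something the post prescribes:

```python
import sqlite3

def enqueue_aggregation_tasks(conn, since):
    """Examine raw data and queue one aggregation task per object
    that received new events after `since`. Table and column names
    are hypothetical."""
    cur = conn.execute(
        "SELECT DISTINCT object_id FROM events WHERE created_at >= ?",
        (since,),
    )
    for (object_id,) in cur.fetchall():
        conn.execute(
            "INSERT INTO task_queue (object_id, status) VALUES (?, 'queued')",
            (object_id,),
        )
    conn.commit()
```

In a real deployment this function body would be the job the cron scheduler invokes, with `since` set to the previous run's timestamp.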
2. Aggregation worker
This worker pulls tasks from the task queue table, changes the status from `queued` to `started`, and then runs the actual aggregation job. Once done, it changes `started` to `completed`. The status change acts as an optimistic lock, which we enforce by putting the expected status in the `WHERE` condition when updating the value. Something like this:
```sql
UPDATE task_queue
SET status = 'completed'
WHERE id = $1
  AND status = 'started'
```
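The same pattern guards the claim step: a worker wins the task only if the row is still `queued` when its `UPDATE` lands. A minimal sketch, again with SQLite and an assumed table layout:

```python
import sqlite3

def claim_task(conn, task_id):
    """Atomically move a task from 'queued' to 'started'.
    Returns True only if this worker won the race; a losing
    worker's UPDATE finds the status already changed and
    matches zero rows."""
    cur = conn.execute(
        "UPDATE task_queue SET status = 'started' "
        "WHERE id = ? AND status = 'queued'",
        (task_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

Checking `rowcount` is what makes the lock optimistic: no row is ever held locked between a read and a write; the write itself is the test.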
For the aggregation data, new aggregated rows replace the old ones. We handle the switch with two extra columns: `valid` and `aggregation_time`.
The invalidation will look like this:
```sql
UPDATE agg_table
SET valid = CASE WHEN aggregation_time < $1 THEN FALSE ELSE TRUE END
```
3. Clean-up worker
This worker works like a vacuum, cleaning out completed tasks and invalidated aggregation rows.
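The vacuum pass itself is two deletes. A sketch with the same assumed table names as above:

```python
import sqlite3

def vacuum_pass(conn):
    """Delete finished tasks and superseded aggregation rows.
    Returns the number of rows removed, which is handy for logging."""
    removed = 0
    removed += conn.execute(
        "DELETE FROM task_queue WHERE status = 'completed'"
    ).rowcount
    removed += conn.execute(
        "DELETE FROM agg_table WHERE NOT valid"
    ).rowcount
    conn.commit()
    return removed
```

Like the examination worker, this can run on a relaxed cron schedule; nothing correctness-critical depends on it, only table size.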