10
3.3 Partitioned Tables
Concepts
Partitioning physically splits a very large table ("fact table") into smaller pieces, based on user-
specified criteria.
Why use partitioning?
Partitioning enables rolling window operations (maintenance). For example, if a table is
partitioned by a timestamp column – let's say one partition per day – then you can implement
DELETE by dropping a daily partition.
Since JethroData format is append-only, dropping a partition is the only way to delete data.
In many databases, partitioning is considered a critical performance tool as it allows them to scan
less data when they full scan a table. However, in JethroData,, all queries use the indexes to read
only the relevant rows, so a query will access the same number of rows regardless of if it is
partitioned or not. So, for JethroData partitioning is not a major performance feature.
Partitioning also enables better scalability. If a query needs to access a specific subset of the table
based on the partitioning key (for example, a single day), partitioning helps to isolate it from the
actual size of the table – the query will consider a similarly sized subset of the table regardless of
the table's retention (one month / year / decade of data).
Partitioning Types – JethroData only supports range partitioning. Also, it only support a single
partitioning key. Additional options are not needed as partitioning is used mostly for rolling window
operations, not as a tool to minimize I/O (we use the indexes for that!)
How to choose a partition key? The partitioning key should be the main timestamp column that is
used both for maintenance (keeping data for n days) and in queries. That column data type is typically
TIMESTAMP, but occasionally it is a generated key (usually INT) from a date dimension.
In most cases, the large tables hold events (calls, messages, page views, clicks etc) and the likely
partitioning key is the event timestamp – when did the event happen.
How big should each partition be? Generally, you should align the partition range to the retention
policy. For example:
If you plan to keep data for 12 months, purging once a month, start with monthly partitions.
If you plan to keep data for 60 days, purging once a day, start with daily partitions
However, it is also recommended to aim for a typical partition size of a few billion rows. Many small
partitions are inefficient and may overwhelm the HDFS NameNode with too many small files. A few
extra-large partitions of many billions of rows each are harder to maintain – for example, harder to
correct the data after loading one bad file. So,
If you have a few billion rows per month, partition by month.
If you have a few billion rows per day, partition by day.
If you have a few billion rows per hour, partition by hour.
Partition key and NULL. If the partition key has NULL values, JethroData will create a dedicated
partition for those rows. This is part of the normal processing.