Andrey · Originally published at sydelov.com

Time Series Systems: Architecture, Storage Models, and Engineering Principles

Time as a First-Class Dimension

Time-series systems treat time as a primary design axis. Ingestion, storage layout, indexing, compression, and query execution are all built around the assumption that data arrives as a stream of observations ordered in time.

A time-series record carries the state of a process at a specific instant: CPU load on a node, vibration amplitude on a turbine, voltage on a feeder line, latency of a REST endpoint. New observations arrive as append-only writes, often at fixed sampling intervals or triggered by events. Historical data remains largely immutable, while queries scan contiguous time ranges: “last 5 minutes”, “previous 30 days”, “compare this week with last week”. The resulting workload differs sharply from OLTP (random point reads and writes) and from classic OLAP (batch analytics over slowly changing fact tables).

Several properties follow from this workload:

  • Write-heavy, read-many — continuous ingestion of measurements at high rate, with dashboards, alerts, and analytical jobs reading from the accumulated history.
  • Strong temporal locality — recent data receives the majority of queries, which encourages multi-tier storage (hot, warm, cold) and downsampling of older segments.
  • Time-bounded queries — virtually every query carries an explicit time predicate, so indexes and storage blocks are organized around time windows rather than arbitrary key ranges.

Operational behavior of real systems adds additional constraints. Data arrives out of order because of network delays, clock drift, batching on devices, or intermittent connectivity. Retries generate duplicates. Retention policies remove old segments continuously. High-cardinality labels (tenant identifiers, instance identifiers, request paths) expand the logical number of series even when the raw write rate remains moderate. Building a robust time-series platform therefore means controlling cardinality, lifecycle, and ingest pressure, not only maximizing throughput.

Several families of time-oriented data rely on these ideas:

  • Metrics — regularly sampled or event-driven numeric measurements with a compact value vector (for example cpu_usage{host, region} or temperature{device_id}).
  • Events and logs — discrete occurrences with richer payloads, anchored in time but optimized for text search and filtering.
  • Traces — correlated spans describing end-to-end requests, with their own indexing and aggregation patterns.

Metrics-style and sensor-style time series involve large volumes of numeric samples indexed by time and a structured set of labels. The same architectural principles, however, underpin observability stacks, industrial IoT platforms, and financial tick stores once time becomes a first-class axis in system design.

Data Model and Semantics

A time-series system stores observations of a process as it evolves in time. The minimal record shape:

  • t — timestamp. Wall-clock time or monotonic time from the device, often with sub-second precision.
  • Labels (dimensions, tags). Key–value pairs that identify the source and context: device_id, region, service, sensor_type. Together they define a logical series.
  • Values. One or more numeric measurements: temperature, voltage, latency, counters, gauges.

Two schema flavors dominate:

  • Narrow schema. One metric per row: time, labels…, value.
  • Wide schema. Several metrics in the same row: time, labels…, temperature, pressure, humidity.

Narrow layouts scale well with sparse data and high cardinality; wide layouts simplify joins between related metrics and reduce per-row overhead. The choice affects compression, indexing, and the kinds of queries that remain cheap.
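
To make the contrast concrete, here is a minimal sketch of the two flavors as Go record types; the field names are illustrative rather than taken from any particular database:

package schema

import "time"

// NarrowRow: one metric per row; friendly to sparse data and high cardinality.
type NarrowRow struct {
    Time   time.Time
    Labels map[string]string
    Metric string // e.g. "temperature"
    Value  float64
}

// WideRow: several related metrics share one row and one label set,
// reducing per-row overhead and keeping cross-metric reads join-free.
type WideRow struct {
    Time        time.Time
    Labels      map[string]string
    Temperature float64
    Pressure    float64
    Humidity    float64
}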

Semantics matter as much as schema. Counters, gauges, histograms, and derived rates behave differently under downsampling, interpolation, and aggregation. A meaningful storage model preserves those semantics instead of treating everything as a generic float.

Query Workloads and Algorithms

Time-series workloads concentrate around a small set of query shapes:

  • Point-in-time lookups. “Value of metric X at time t” or “last known value before t”.
  • Range aggregates. Sums, averages, percentiles, minima/maxima over [t_start, t_end].
  • Downsampled timelines. Resampled series with aggregation over fixed windows, for example 1-minute buckets.
  • Group-by over labels. Aggregates per service, region, customer, etc.
  • Top-N queries. “Highest error_rate per service in the last 5 minutes.”

Efficient evaluation relies on two ideas:

  • Windowed aggregation. Precomputing or incrementally maintaining aggregates over fixed windows reduces repeated scans of raw samples (a sketch follows this list).
  • Label filtering. Indexes over labels (inverted indexes, symbol tables) narrow the set of series before scanning time ranges.
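
The first idea can be made concrete with a small incremental aggregator. A minimal Go sketch, assuming samples arrive one at a time and that eviction of stale windows happens elsewhere:

package window

import "time"

// Agg is an incrementally maintained aggregate for one fixed window.
type Agg struct {
    Count int64
    Sum   float64
}

// Windows buckets samples into fixed-size windows keyed by window start.
type Windows struct {
    Size time.Duration
    Aggs map[time.Time]*Agg
}

func NewWindows(size time.Duration) *Windows {
    return &Windows{Size: size, Aggs: map[time.Time]*Agg{}}
}

// Observe folds one sample into its window without rescanning raw data.
func (w *Windows) Observe(t time.Time, v float64) {
    start := t.Truncate(w.Size)
    a, ok := w.Aggs[start]
    if !ok {
        a = &Agg{}
        w.Aggs[start] = a
    }
    a.Count++
    a.Sum += v
}

// Avg returns the average for the window starting at start, if any.
func (w *Windows) Avg(start time.Time) (float64, bool) {
    a, ok := w.Aggs[start]
    if !ok || a.Count == 0 {
        return 0, false
    }
    return a.Sum / float64(a.Count), true
}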

Uneven sampling, gaps, and late data require explicit rules. Interpolation strategy, handling of missing buckets, and conflict resolution for out-of-order samples all shape query correctness and performance.
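
One such rule, last-observation-carried-forward for missing buckets, fits in a few lines. A Go sketch under the assumption that samples are already aligned to bucket boundaries; linear interpolation or explicit gap markers are equally valid policies:

package gaps

import "time"

// FillForward materializes a fixed-step timeline from sparse samples,
// carrying the last observed value across missing buckets.
func FillForward(samples map[time.Time]float64, start, end time.Time, step time.Duration) []float64 {
    out := make([]float64, 0, int(end.Sub(start)/step)+1)
    last, seen := 0.0, false
    for t := start; !t.After(end); t = t.Add(step) {
        if v, ok := samples[t]; ok {
            last, seen = v, true
        }
        if !seen {
            out = append(out, 0) // leading gap: a real engine would emit an explicit marker
            continue
        }
        out = append(out, last)
    }
    return out
}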

Core Data Structures

At the storage layer, time organizes both layout and indexing.

  • Time-partitioned chunks. Data is grouped into blocks for specific intervals (for example, two hours or one day). Each chunk holds many series, with local indexes and compression metadata.
  • Series index. A mapping from label sets to internal series identifiers. Implementation often combines a symbol table for label values with a hash or tree over label combinations.
  • Label inverted index. For each label key and value, a list of matching series IDs. Range and regex filters translate to operations on these lists.
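
The last two structures can be sketched together: a postings map from label pairs to series IDs, plus sorted-list intersection for conjunctive filters. A minimal Go version; real engines keep these lists compressed and persisted:

package index

type seriesID uint64

// InvertedIndex maps label key -> value -> sorted postings list of series IDs.
type InvertedIndex struct {
    postings map[string]map[string][]seriesID
}

// Lookup intersects postings lists for a set of equality matchers.
func (ix *InvertedIndex) Lookup(matchers map[string]string) []seriesID {
    var result []seriesID
    first := true
    for k, v := range matchers {
        list := ix.postings[k][v]
        if first {
            result = append([]seriesID(nil), list...)
            first = false
            continue
        }
        result = intersect(result, list)
    }
    return result
}

// intersect merges two sorted postings lists, keeping common IDs.
func intersect(a, b []seriesID) []seriesID {
    var out []seriesID
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        switch {
        case a[i] == b[j]:
            out = append(out, a[i])
            i++
            j++
        case a[i] < b[j]:
            i++
        default:
            j++
        }
    }
    return out
}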

Many engines use a log-structured or LSM-style design. New samples land in memory or in small on-disk segments; background compaction merges segments, sorts by time, and rewrites compressed chunks. This pattern aligns well with append-heavy writes and long scans over sorted time ranges.

Compression and Encoding

Time-series data compresses extremely well when encoded with awareness of time and measurement properties.

Key techniques:

  • Delta and delta-of-delta for timestamps. Adjacent timestamps differ by a small increment, so storing differences often collapses to a few bits (sketched after this list, together with XOR value encoding).
  • XOR or delta encoding for values. Consecutive floating-point samples change smoothly; encoding the XOR against the previous value or the difference in value yields long runs of small integers.
  • Run-length and bit-packing. Boolean or low-cardinality fields compress as runs or densely packed bit fields.
  • Dictionary encoding for labels. Repeated label values (hostnames, regions) map to short integer IDs.
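
The timestamp and value techniques reduce to simple arithmetic over neighbors. A Go sketch of the residual computation, leaving out the bit-packing that a Gorilla-style encoder applies to the small outputs:

package encode

import "math"

// DeltaOfDelta stores the change in successive timestamp deltas; with
// regular sampling the output collapses to long runs of zeros.
// The first element carries the raw starting timestamp.
func DeltaOfDelta(ts []int64) []int64 {
    out := make([]int64, len(ts))
    var prev, prevDelta int64
    for i, t := range ts {
        delta := t - prev
        out[i] = delta - prevDelta
        prev, prevDelta = t, delta
    }
    return out
}

// XORValues encodes each float against its predecessor; smooth series
// yield XOR words with long runs of zero bits that pack tightly.
func XORValues(vals []float64) []uint64 {
    out := make([]uint64, len(vals))
    var prev uint64
    for i, v := range vals {
        bits := math.Float64bits(v)
        out[i] = bits ^ prev
        prev = bits
    }
    return out
}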

Compression interacts with query patterns:

  • Heavy compression shrinks storage and I/O but increases CPU cost during decoding.
  • Chunk size influences cache behavior and scan efficiency: large chunks favor sequential analytics, smaller chunks favor highly selective queries on recent data.
  • Encoding format constrains future features (for example, adding histograms or high-precision types).

A coherent design starts from these foundations: data model, query shapes, core structures, and compression strategy. Subsequent architectural choices in ingestion, clustering, and database selection inherit constraints from this layer.

System Architecture of Time-Series Platforms

A time-series platform is a pipeline: data enters through ingestion endpoints, flows into mutable in-memory and on-disk structures, is rewritten into long-lived compressed segments, and finally surfaces through query engines and APIs. The functional blocks stay roughly the same across databases; the interesting part is how they interact under load and failure.

Ingestion Path
Time-series data enters the system through a narrow set of entry points:

  • edge agents or SDKs pushing measurements,
  • scrapers pulling metrics endpoints,
  • message buses streaming events into dedicated writers.

Behind these interfaces sits an ingest subsystem that converts unstructured traffic into ordered, batched writes. The path usually includes an in-memory buffer, a write-ahead log, and a queue of pending chunk updates. The primary goals: accept data at wire speed, preserve ordering within a series as far as clocks allow, and tolerate retries without corrupting state.

At the edge, this is typically a compact, specialized writer whose design pressures grow with the ingestion rate: batch sizing, flush timing, and failure handling all become load-bearing decisions. A minimal batching loop in Go:

// A batching edge writer. writeBatchToTSDB stands in for the real TSDB
// client call; everything else is self-contained.
package main

import "time"

// Sample is one observation: a timestamp, identifying labels, and a value.
type Sample struct {
    Time   time.Time
    Labels map[string]string
    Value  float64
}

// writeBatchToTSDB is a stub for the storage write API.
func writeBatchToTSDB(batch []Sample) {}

func ingestLoop(input <-chan Sample, batchSize int, flushInterval time.Duration) {
    var batch []Sample
    ticker := time.NewTicker(flushInterval)
    defer ticker.Stop()

    flush := func() {
        if len(batch) == 0 {
            return
        }
        // single call into TSDB write API
        writeBatchToTSDB(batch)
        batch = batch[:0]
    }

    for {
        select {
        case s, ok := <-input:
            if !ok {
                flush()
                return
            }
            batch = append(batch, s)
            if len(batch) >= batchSize {
                flush()
            }
        case <-ticker.C:
            flush()
        }
    }
}


This kind of loop hides several architectural decisions:

  • how large a batch can grow before latency becomes unacceptable;
  • how the writer reacts when writeBatchToTSDB slows down or returns errors;
  • where back-pressure is applied (dropping samples, buffering in memory, or pushing responsibility back to upstream systems).
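
As one concrete back-pressure policy, a bounded channel with drop-oldest semantics can sit in front of the loop above. A sketch reusing the Sample type from the earlier snippet; blocking the producer or signaling upstream are equally valid choices with different delivery guarantees:

// offer attempts a non-blocking send; when the buffer is full it evicts
// the oldest sample and retries once, counting anything lost.
func offer(buf chan Sample, s Sample, dropped *uint64) bool {
    select {
    case buf <- s:
        return true
    default:
    }
    select {
    case <-buf: // buffer full: drop the oldest sample
        *dropped++
    default:
    }
    select {
    case buf <- s:
        return true
    default:
        *dropped++ // concurrent writers refilled the buffer; drop s
        return false
    }
}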

On the database side, ingestion eventually lands in a write-ahead log and a set of in-memory or append-only on-disk structures. Out-of-order samples force the engine to decide whether to rewrite small regions of a chunk, maintain a side buffer for late data, or reject writes beyond a time horizon. Duplicate detection depends on the internal series identifier and timestamp equality rules. These details determine how the platform behaves under flaky network links, intermittent power on devices, and replay scenarios after outages.
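
One way to frame the out-of-order decision is as a per-sample classification against the newest accepted timestamp for the series. A Go sketch of that policy space, with the horizon as a configurable assumption:

package ingest

import "time"

// disposition says what to do with an incoming sample.
type disposition int

const (
    appendInOrder disposition = iota // at or past the newest timestamp: fast path
    bufferLate                       // late but within the horizon: side buffer
    rejectTooOld                     // beyond the accepted out-of-order horizon
)

func classify(sampleTime, newest time.Time, horizon time.Duration) disposition {
    switch {
    case !sampleTime.Before(newest):
        return appendInOrder
    case newest.Sub(sampleTime) <= horizon:
        return bufferLate
    default:
        return rejectTooOld
    }
}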

Storage Layout and Lifecycle

In time-series databases, data follows a structured lifecycle that optimizes storage and access based on its age and usage patterns.

After ingestion, samples move through progressively cheaper storage tiers: mutable, high-performance buffers for recent data, then compressed archives for history. Partitioning, compaction, and downsampling keep active data fast to write and read while holding storage costs and retention behavior predictable.

For example, in Prometheus, ingested samples start in an in-memory head block backed by a write-ahead log (WAL) for durability. Every two hours, data flushes to immutable disk blocks, which are then compacted into larger ranges (up to 31 days) to reduce file count and enhance query efficiency. Retention policies delete blocks older than a configurable period, such as 15 days, balancing storage costs.

Similarly, InfluxDB uses a WAL for initial writes, an in-memory cache for immediate queries, and time-structured merge (TSM) files for persistent, compressed storage. Compaction merges these into optimized files, while series keys enable logical partitioning for high-cardinality scalability.

Partitioning by time and, optionally, by tenant or label group controls both scalability and retention. A partition key might include day or hour plus a logical shard derived from labels. Retention policies then operate at the partition level: delete, move to cheaper storage, or rewrite into downsampled form. Coordination between the lifecycle engine and the ingest path ensures that data does not disappear while still being referenced by queries or caches.

This tiered approach not only minimizes operational overhead but also supports real-world applications like monitoring large-scale infrastructure, where recent metrics require sub-second access while historical trends prioritize economical archiving.

Query Path

Queries enter from dashboards, alerting rules, ad-hoc analytical tools, or custom applications. Despite variation in query languages (SQL, PromQL-style selectors, proprietary DSLs), the internal path follows a similar sequence: parse → plan → select series → scan and aggregate chunks.

A simple range query against a metrics store illustrates the mapping:

SELECT
    time_bucket('1 minute', ts) AS bucket,
    region,
    avg(cpu_usage)              AS cpu_avg
FROM cpu_metrics
WHERE
    ts BETWEEN NOW() - INTERVAL '1 hour' AND NOW()
    AND service = 'api-gateway'
GROUP BY bucket, region
ORDER BY bucket, region;

Execution proceeds roughly as follows:

  1. Translate label filters (service = 'api-gateway') and dimensions (region) into a set of internal series identifiers using the label index.
  2. Intersect that set with partitions covering the requested time range.
  3. For each relevant chunk, read compressed blocks, decode only columns needed for the query (ts, region, cpu_usage), and push partial aggregates as close to the scan as possible.
  4. Merge partials across chunks and partitions, then sort or re-order results for the client.

The same pattern appears in a PromQL-like expression such as:

avg_over_time(cpu_usage_seconds_total{service="api-gateway"}[5m])

Here the engine must choose how to slide the 5-minute window across chunks, how much data to keep in memory for overlapping windows, and how to reuse intermediate results for multiple concurrent queries hitting the same time ranges.

Latency and throughput depend on several engineering decisions along this path: granularity of partitions, size of compressed blocks, layout of label indexes, and placement of caches for metadata, index pages, and recent chunks. The query engine encodes assumptions about windowed operations, label cardinality, and the ratio of hot to cold queries, and those assumptions feed back into storage layout and lifecycle policies.

Storage Models in Practice: Comparing Major Time-Series Databases

Time-series databases share a uniform conceptual record format, yet their storage models differ markedly based on integration with underlying engines, file layouts, and query strategies. Some adopt custom log-structured columnar formats with time-partitioned chunks, while others extend relational engines using time-aware partitioning, custom indexes, and compressed columnar segments in standard tablespaces.

TimescaleDB: Time Series on Relational Foundations

TimescaleDB runs as a PostgreSQL extension that injects time-series semantics into the engine through catalog metadata, planner hooks, and background workers. Dedicated catalog tables describe hypertables, chunks, compression settings, and policies. Planner and executor hooks use this metadata to rewrite plans for inserts and selects, and background workers execute compression and retention policies over time.

Extension Catalog and Hypertable Metadata

A hypertable begins as a regular PostgreSQL table. Additional metadata in TimescaleDB’s catalogs marks it as time-partitioned and, optionally, space-partitioned. Creation proceeds in two steps:

CREATE TABLE sensor_readings (
    ts          TIMESTAMPTZ       NOT NULL,
    device_id   TEXT              NOT NULL,
    region      TEXT              NOT NULL,
    temperature DOUBLE PRECISION  NOT NULL,
    pressure    DOUBLE PRECISION  NOT NULL
);

SELECT create_hypertable(
    relation            => 'sensor_readings',
    time_column_name    => 'ts',
    partitioning_column => 'device_id',
    number_partitions   => 4
);

The CREATE TABLE statement uses pure PostgreSQL syntax; nothing in the DDL specifies a special engine. The create_hypertable function inserts rows into TimescaleDB’s internal catalogs, records that sensor_readings is partitioned by ts and device_id, and creates initial chunk tables. From PostgreSQL’s point of view, these chunks are ordinary tables; the extension layer maintains the mapping between the logical hypertable and its physical children.

A conceptual diagram of the relationship:

+-------------------------------+
|        TimescaleDB Catalog    |
|-------------------------------|
| hypertable(id, relid, ...)    |
| chunk(id, hypertable_id, ...) |
| chunk_constraint(...)         |
| compression_settings(...)     |
| policies(...)                 |
+---------------+---------------+
                |
                | maps hypertable -> chunks
                v
        +------------------------+
        |  pg_class / pg_tables  |
        |------------------------|
        | sensor_readings        |  (base relid, marked as hypertable in catalog)
        | sensor_readings_1_1    |  (chunk: time range A, hash partition 1)
        | sensor_readings_1_2    |  (chunk: time range A, hash partition 2)
        | sensor_readings_2_1    |  (chunk: time range B, hash partition 1)
        | ...                    |
        +------------------------+


PostgreSQL system catalogs (pg_class, pg_attribute, etc.) hold the physical tables, while TimescaleDB’s catalogs hold the logical model and partitioning rules.

Chunk Routing and Execution Path

When an INSERT targets sensor_readings, the normal PostgreSQL parser and planner build an insert plan against the base relation. TimescaleDB’s planner hook then identifies the relation as a hypertable by consulting its catalogs and replaces the target with a chunk dispatch node.

At execution time, this node performs the routing logic:

  1. Reads the partitioning configuration for sensor_readings from the catalogs.
  2. Extracts values of the time column (ts) and space partitioning column (device_id) from each row.
  3. Computes the target time partition and hash bucket.
  4. Locates or creates the corresponding chunk table for that combination.
  5. Forwards the row into that chunk’s insert executor state.

This logic is implemented in the executor, not as SQL triggers, so bulk inserts and COPY use the same mechanism without per-row trigger overhead. Chunks remain regular PostgreSQL tables; they appear in pg_class and can be queried explicitly, but typical workloads reference only the hypertable and rely on the dispatcher and planner to navigate chunks.
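
The routing computation itself is compact. A hypothetical Go rendering of steps 2-4, with the chunk interval, partition count, and naming scheme invented for this illustration (TimescaleDB's real catalog lookups and chunk names differ):

package routing

import (
    "fmt"
    "hash/fnv"
    "time"
)

// chunkFor maps a row to its (time slice, hash bucket) chunk.
func chunkFor(ts time.Time, deviceID string, chunkInterval time.Duration, numPartitions int) string {
    slice := ts.Truncate(chunkInterval).Unix() // start of the time partition
    h := fnv.New32a()
    h.Write([]byte(deviceID)) // space-partitioning key
    bucket := int(h.Sum32() % uint32(numPartitions))
    return fmt.Sprintf("sensor_readings_%d_%d", slice, bucket)
}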

On disk, each chunk has its own heap and btree indexes. Chunk boundaries (for example, one-day time ranges combined with four hash partitions by device_id) determine:

  • which chunk tables a query touches for a given time interval,
  • the number of files involved in maintenance operations (vacuum, reindex, analyze),
  • the granularity of retention operations (dropping complete chunks).

Consider a range query that selects a subset of columns:

SELECT
    ts,
    device_id,
    region,
    temperature,
    pressure
FROM sensor_readings
WHERE
    ts BETWEEN now() - interval '7 days' AND now();

The planner expands this into a union over only those chunk tables whose time ranges intersect the 7-day window. Within each chunk, standard PostgreSQL indexes on ts, device_id, or region apply, and the executor merges results back into a single stream.

Compression Pipeline and Columnar Layout

TimescaleDB can convert older row-oriented chunks into a compressed, columnar representation while keeping the SQL interface unchanged. Configuration first enables compression on the hypertable, then adds a policy:

ALTER TABLE sensor_readings
    SET (timescaledb.compress,
         timescaledb.compress_segmentby = 'device_id',
         timescaledb.compress_orderby   = 'ts');

SELECT add_compression_policy(
    hypertable     => 'sensor_readings',
    compress_after => INTERVAL '7 days'
);

The ALTER TABLE statement stores compression settings in the extension catalogs. The add_compression_policy call creates a policy entry. A background worker periodically scans the hypertable’s chunks, finds those older than compress_after, and rewrites each selected chunk into a compressed form:

  • rows are grouped by device_id (segment-by key) inside the compressed chunk;
  • within each group, rows are ordered by ts;
  • each column is stored as a compressed array or block, using encodings such as delta coding for timestamps and dictionary or delta/delta-of-delta encodings for labels and values.

After compression, metadata flags the original row-oriented chunk as compressed, and the new compressed representation becomes the authoritative storage for that time range. Queries still reference sensor_readings; the planner injects specialized scan nodes for compressed chunks, which decode only the required columns and often perform aggregation directly on decoded vectors. Point lookups over deep history incur decompression overhead but remain valid SQL operations.

Trade-Offs of TimescaleDB in Real Usage

Strengths

  • Single relational engine for operational data and time series. Hypertables can live next to transactional tables, dimension tables, and reference data in the same PostgreSQL cluster and participate in the same SQL queries.
  • Mature ecosystem and tooling. Backups, monitoring, client libraries, migrations, connection pooling, and operational playbooks follow PostgreSQL practices instead of a new technology stack.
  • Expressive analytics over time-series data. Complex joins, window functions, CTEs, and advanced SQL constructs apply directly to hypertables, which benefits use cases where time series are tightly coupled with business entities.
  • Declarative lifecycle management. Retention, compression, and data tiering are configured via policies and executed by background workers, without custom cron jobs or external compaction services.

Constraints

  • Shared resource budget with the PostgreSQL core. High ingest rates, heavy analytical queries, autovacuum, index maintenance, and compression jobs compete for the same CPU, memory, and I/O, so capacity planning and tuning become part of day-to-day operations.
  • Scaling pattern aligned with PostgreSQL rather than purpose-built TSDBs. Vertical scaling and careful partitioning carry a large share of the load; horizontal scaling options exist but follow PostgreSQL clustering patterns and operational complexity, not a lightweight single-binary TSDB model.
  • Sensitivity to schema and chunk design. Poor choices for time partition size, space partitioning keys, and index layout can degrade both ingest and query performance, and recovery from an unsuitable design often requires data migration.
  • Historical access profile. Compressed chunks are optimized for scans and aggregates across longer ranges; deployments that rely on frequent random point lookups over deep history must accept additional latency or introduce auxiliary stores for that access pattern.

InfluxDB IoT: Measurement/Tag Model and TSM Engine

InfluxDB represents time series through a line protocol built around three layers: measurement, tags, and fields. A single sample can be written as:

temperature,device_id=sensor-17,region=eu-west value=21.7 1733306400000000000

Logical structure:

  • Measurement — temperature. Groups series that share a semantic domain and a default retention policy.
  • Tags — device_id, region. Stored as indexed strings; together with the measurement they form the series identity and the filter dimensions.
  • Fields — value. Stored as typed, non-indexed values (float, integer, boolean, string).
  • Timestamp — nanosecond-precision epoch value carried in the write.

Internally, a series key is derived from (measurement, ordered-tag-set). For IoT-style workloads this might look like:

measurement = "temperature"
tags        = {device_id="sensor-17", region="eu-west"}
series_key  = "temperature,device_id=sensor-17,region=eu-west"

Every distinct series key becomes an entry in the series index and maps to a compact series identifier used by the storage engine.
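
The derivation is easy to make concrete: sort the tag keys, then join the measurement and ordered tag pairs. A Go sketch of the idea (InfluxDB's actual byte-level encoding differs):

package series

import (
    "sort"
    "strings"
)

// seriesKey builds a canonical key so identical tag sets always map to
// the same series, regardless of the order tags were written in.
func seriesKey(measurement string, tags map[string]string) string {
    keys := make([]string, 0, len(tags))
    for k := range tags {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    var b strings.Builder
    b.WriteString(measurement)
    for _, k := range keys {
        b.WriteString(",")
        b.WriteString(k)
        b.WriteString("=")
        b.WriteString(tags[k])
    }
    return b.String()
}

With the tags from the example above, this returns "temperature,device_id=sensor-17,region=eu-west".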

Shards, TSM Files, and WAL

On disk, InfluxDB groups data into shard groups defined by retention policy. For example, a retention policy with a 7-day shard group duration creates a new shard group every 7 days. Each shard maps to a directory containing:

  • a write-ahead log (WAL) for durable recent writes;
  • several TSM (Time-Structured Merge) files that hold immutable, compressed blocks.

Conceptual layout:

/data/influxdb/
  ├── meta/              (cluster and retention metadata)
  └── data/
      └── iot_db/
          └── rp_7d/
              ├── shard_100/       (2025-11-01 .. 2025-11-07)
              │    ├── 000001.tsm
              │    ├── 000002.tsm
              │    ├── wal/
              │    │    ├── 000001.wal
              │    │    └── ...
              │    └── index/
              └── shard_101/       (2025-11-08 .. 2025-11-14)
                   └── ...


New writes enter the WAL as append-only records and an in-memory index. Periodically, the engine flushes accumulated points from memory into TSM files:

  • data is sorted by (series_key, timestamp);
  • values and timestamps are packed into compressed blocks;
  • a per-file index maps series keys and time ranges to block offsets.

This pattern resembles an LSM tree optimized specifically for time-keyed data: immutable on-disk files, periodic compaction, and append-focused WAL.

TSM Block Layout and Compression

The TSM (Time-Structured Merge) engine stores each series as a set of time-ordered blocks. For a given (measurement, tags, field) combination, InfluxDB maintains:

  • a sequence of compressed timestamp blocks;
  • a parallel sequence of compressed value blocks;
  • index entries describing [min_time, max_time] and file offsets.

A simplified view for one series:

series: temperature,device_id=sensor-17,region=eu-west,field=value

TSM file:
  [block 1]  timestamps: t0..t999   (delta / delta-of-delta encoding)
  [block 1]  values:     v0..v999   (XOR / RLE / type-specific encoding)
  [block 2]  timestamps: t1000..t1999
  [block 2]  values:     v1000..v1999
  ...
index:
  key="temperature,device_id=sensor-17,region=eu-west field=value"
  ├─ entry: [t0,    t999]   -> offset A
  ├─ entry: [t1000, t1999]  -> offset B
  └─ ...

Encoding strategies exploit the structure of time and measurements:

  • timestamps are strictly increasing per series; storing deltas and second-order deltas yields small integers;
  • neighbouring numeric values tend to vary smoothly; XOR-encoding or delta-encoding compresses them into short bit sequences;
  • boolean or low-cardinality values use run-length encoding or bit-packing.

The engine reads blocks directly into memory and decodes them in vector form, which matches range scans and aggregate queries over contiguous intervals.

Series Index and Cardinality

Efficient tag filtering depends on a series index that tracks all known series keys and provides mapping to internal IDs. Conceptually:

series_id( "temperature,device_id=sensor-17,region=eu-west" ) = 42
series_id( "temperature,device_id=sensor-18,region=eu-west" ) = 43
...

On top of this mapping, an inverted index lets the engine answer tag-based predicates such as region='eu-west' AND device_id='sensor-17' without scanning every series. A simplified structure:

tag_index("region")["eu-west"]      -> {42, 43, 44, ...}
tag_index("device_id")["sensor-17"] -> {42}

The intersection {42} yields the series ID for that filter. Query planning becomes a two-step process:

  1. Use tag indexes to resolve a set of matching series IDs.
  2. Use shard and TSM indexes to locate blocks for each series within the requested time range.

The same structure exposes the cost of cardinality. Each distinct combination of measurement and tags consumes memory and index space. IoT deployments with unbounded tag values (for example, embedding request IDs or random UUIDs into tags) inflate the series index and degrade performance even when the raw write rate is moderate.

Query Path: From Line Protocol to Aggregates

At the API level, IoT-style ingestion often sends line protocol through HTTP or UDP. A minimal example:

curl -i -XPOST 'http://localhost:8086/write?db=iot_db&rp=rp_7d' \
  --data-binary '
temperature,device_id=sensor-17,region=eu-west value=21.7 1733306400000000000
temperature,device_id=sensor-18,region=eu-west value=22.1 1733306400000000000
'

Internally, the write path:

  1. Parses each line into (measurement, tags, fields, timestamp).
  2. Looks up or creates the series ID using the series index.
  3. Appends a WAL record for durability.
  4. Adds the point to in-memory buffers associated with the target shard and series.

Later, queries operate on the logical measurement and tags. For example, a 5-minute average per region:

SELECT mean(value) AS temperature_avg
FROM temperature
WHERE time >= now() - 30m
GROUP BY time(5m), region

Execution performs:

  • tag-index lookup for relevant regions;
  • shard selection for the 30-minute window;
  • TSM index lookup for matching series and blocks;
  • block reads and decompression;
  • aggregation over decoded vectors.

Although the query surface depends on the InfluxDB version (InfluxQL or Flux), the storage engine remains centered on TSM blocks keyed by series ID and time.

IoT-Oriented Characteristics

In an IoT context, the TSM/measurement–tag–field model leads to specific behaviour:

  • ingestion can sustain high write rates as long as series cardinality remains controlled and shard durations fit available I/O;
  • retention policies map directly to shard group lifecycles, allowing coarse-grained deletion of expired telemetry;
  • compression ratios improve with regular sampling intervals and stable tag sets, reducing storage for dense sensor deployments;
  • read performance aligns with time-range queries over tags, which matches dashboards, alerting rules, and aggregate analytics on sensor fleets.

These properties make the engine a good fit for device telemetry and infrastructure metrics where data arrives as continuous streams of simple numeric measurements tagged by a modest, well-curated set of dimensions.

Trade-Offs of InfluxDB and the TSM Engine in Real Usage

Strengths

  • Line protocol and TSM layout match append-heavy telemetry streams: high ingest throughput as long as write paths remain sequential and shard layout fits disk bandwidth.
  • Measurement/tag/field model encodes query dimensions directly, so tag filters translate cleanly into series and shard selection without an additional schema layer.
  • WAL + immutable TSM files and compaction form a predictable operational model: clear separation between recent mutable data and long-lived compressed blocks.
  • Compression tuned for monotonic timestamps and slowly varying numeric values yields high density for regular IoT workloads and infrastructure metrics.
  • Retention and shard duration settings express lifecycle policies directly in storage layout; dropping expired telemetry becomes a bounded, shard-level operation instead of large DELETE statements.

Constraints

  • Series cardinality becomes a hard resource limit: unbounded tag combinations (for example, embedding unique IDs in tags) inflate the series index and erode both ingest and query performance.
  • Query flexibility is shaped by the measurement/tag/field model; complex relational joins and deeply nested analytics require external processing or integration with a relational engine.
  • WAL, shard compaction, and TSM file rewrites consume I/O in the background; sizing and scheduling of compaction jobs affect write latency and read tail latency during load spikes.
  • Workloads dominated by wide scans across many measurements with irregular tagging schemes can degrade TSM locality and reduce compression efficiency, increasing both storage footprint and scan cost.
  • Operational scaling favors careful cardinality control, shard sizing, and node layout; horizontal expansion exists but does not remove the need for explicit design around series index size, shard distribution, and compaction capacity.

Prometheus and VictoriaMetrics: Metrics at Large Scale

Prometheus and VictoriaMetrics form a common pairing: Prometheus handles scraping, alerting, and short-term storage; VictoriaMetrics absorbs long-term history and large-cardinality workloads. Both rely on a label-based metrics model, but their storage engines and scaling strategies differ.

Metrics Data Model and Label Semantics

Prometheus represents every time series as a metric name and a set of labels. A single sample has the logical shape

(name, labels, t, v)

where

  • name encodes the metric family, for example http_requests_total or node_cpu_seconds_total;
  • labels is a set of key–value pairs such as {method="GET", code="200", instance="api-1", job="frontend"};
  • t is a timestamp;
  • v is a floating-point value.

A concrete example in exposition form:

http_requests_total{
    job="frontend",
    instance="10.0.0.17:8080",
    method="GET",
    code="200"
} = 41729 @ 1733306400.000

The pair (name, sorted_labels) defines a logical series. Internally, Prometheus maps each series identity to a numeric series ID and maintains label indexes that allow efficient selection of series by label matchers such as job="frontend" or code=~"5..".

VictoriaMetrics adopts the same logical model on the wire: it accepts Prometheus remote write, understands metric names and labels, and builds its own internal series IDs and label indexes.

Scrape-Based Ingestion and Remote Write

Prometheus ingestion revolves around scraping HTTP endpoints that export current metric values in a text format. A scrape target exposes lines like:

# HELP temperature_celsius Current temperature in Celsius
# TYPE temperature_celsius gauge
temperature_celsius{sensor="A",region="eu-west"} 21.7
temperature_celsius{sensor="B",region="eu-west"} 22.1

The scrape loop runs per target on a fixed interval:

  1. Fetch /metrics over HTTP.
  2. Parse each line into (name, labels, value, scrape_time); a parsing sketch follows this list.
  3. Attach job and instance labels from scrape configuration.
  4. Append samples to the local TSDB.
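
That parsing step is small. A Go sketch for the simple gauge case, ignoring comments, escaping, per-sample timestamps, and histogram conventions:

package scrape

import (
    "strconv"
    "strings"
)

// parseLine handles the simple case of `name{k="v",...} value`.
func parseLine(line string) (name string, labels map[string]string, value float64, ok bool) {
    line = strings.TrimSpace(line)
    if line == "" || strings.HasPrefix(line, "#") {
        return "", nil, 0, false
    }
    labels = map[string]string{}
    rest := ""
    if i := strings.Index(line, "{"); i >= 0 {
        j := strings.Index(line, "}")
        if j < i {
            return "", nil, 0, false
        }
        name = line[:i]
        for _, pair := range strings.Split(line[i+1:j], ",") {
            if kv := strings.SplitN(pair, "=", 2); len(kv) == 2 {
                labels[strings.TrimSpace(kv[0])] = strings.Trim(strings.TrimSpace(kv[1]), `"`)
            }
        }
        rest = line[j+1:]
    } else if sp := strings.IndexAny(line, " \t"); sp >= 0 {
        name, rest = line[:sp], line[sp:]
    } else {
        return "", nil, 0, false
    }
    fields := strings.Fields(rest)
    if len(fields) == 0 {
        return "", nil, 0, false
    }
    v, err := strconv.ParseFloat(fields[0], 64)
    if err != nil {
        return "", nil, 0, false
    }
    return name, labels, v, true
}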

A minimal configuration for scraping and remote write:

scrape_configs:
  - job_name: "iot-gateway"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "iot-gateway-1:9100"
          - "iot-gateway-2:9100"

remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
    queue_config:
      capacity: 50000
      max_shards: 16
      max_samples_per_send: 10000

The local TSDB persists samples in a write-ahead log and in-memory structures; the remote_write subsystem batches the same samples and streams them to VictoriaMetrics. Back-pressure and queue settings in remote_write control how Prometheus behaves when the remote endpoint slows down or becomes unavailable.

Prometheus Local TSDB: Head Block, On-Disk Blocks, WAL

Prometheus ships with an embedded TSDB tuned for scrape-based metrics and moderate retention. The storage layout revolves around:

  • a head block in memory for recent samples;
  • a write-ahead log (WAL) for durability;
  • a directory of 2-hour blocks for older data.

Conceptual directory structure:

/data/prometheus/
  ├── wal/
  │   ├── 00000001
  │   ├── 00000002
  │   └── ...
  ├── chunks_head/          (memory-mapped chunks for the head block)
  ├── 01HF6Z2Q6G3T7N3J9F0Z/
  │   ├── chunks            (compressed sample chunks)
  │   ├── index             (label and series index)
  │   └── meta.json         (block metadata: time range, stats, version)
  └── 01HF6Z97JH2TQ8C9XKB5/
      └── ...

The ingestion path:

  1. New samples enter the head block, which stores recent series state and chunks in memory and on disk.
  2. Each batch of writes appends records to the WAL segments.
  3. When the head block reaches a size or time threshold, the engine cuts a new block:
  • series data and chunks for that time window move from head to an immutable block directory;
  • an index is built mapping label combinations to series IDs and time ranges;
  • WAL segments that cover this range can be deleted after a safe margin.

Inside a block, each series holds a sequence of chunks. For each chunk:

  • timestamps are stored as deltas with respect to the first timestamp;
  • values are encoded with the Gorilla XOR scheme or similar float-optimized compression;
  • an index maps (name, labels) to a postings list of series IDs and to chunk references.

A simplified view of the index relationships:

label_index:
  job="frontend"    -> postings list { series_id 1, 2, 5, ... }
  region="eu-west"  -> { 1, 3, 4, ... }
  code="500"        -> { 2, 7, 9, ... }

series_index:
  series_id 1 -> metric "http_requests_total",
                  labels {job="frontend", region="eu-west", code="200", ...}

chunk_index:
  series_id 1 -> [chunk_ref_1, chunk_ref_2, ...]


Query execution resolves label matchers into postings lists, intersects them, and then walks the chunk references for the resulting series IDs within the relevant time range.

The head block behaves similarly but keeps its data structures in memory with WAL-backed durability. Short-range queries near “now” largely hit the head block and decode chunks that are still mutable.

Query Execution in Prometheus

PromQL expressions describe aggregations and transformations over ranges of samples. For example, a success rate per service over the last 5 minutes:

sum by (service) (
  rate(http_requests_total{job="frontend", code=~"2.."}[5m])
)
/
sum by (service) (
  rate(http_requests_total{job="frontend"}[5m])
)

Evaluation proceeds in several stages:

  1. Label matchers (for job and code) resolve to sets of series IDs using the label index across the head block and relevant on-disk blocks.
  2. For each matched series, the engine fetches chunks intersecting the [now-5m, now] window, decodes timestamps and values into vectors, and computes per-series rates.
  3. Grouping (sum by (service)) aggregates those per-series rates across label combinations that share service.
  4. The division of two grouped expressions yields the success rate.

The TSDB is optimized for this pattern: label-based series selection, short-range windowing, vectorized chunk decoding, and aggregations over a small set of time windows. Long-range ad-hoc analytics across months of data can run on Prometheus, but cost and cardinality constraints push those workloads into a downstream store such as VictoriaMetrics.

VictoriaMetrics as Long-Term Store and Aggregation Layer

VictoriaMetrics accepts remote write from Prometheus and uses its own LSM-like storage engine to scale to higher cardinality and longer retention. The external view remains compatible with Prometheus metrics and PromQL; internally, the engine organizes data into partitions, index trees, and compressed value blocks.

A minimal remote write setup from Prometheus was shown earlier. On the VictoriaMetrics side, a single-node deployment runs as one binary with a data directory layout roughly of the form:

/data/victoria-metrics/
  ├── data/
  │   ├── 2025_11/      (time partition)
  │   │   ├── index.bin
  │   │   ├── data-part-0.bin
  │   │   ├── data-part-1.bin
  │   │   └── ...
  │   └── 2025_12/
  └── metadata/

Key ideas:

  • Incoming remote write batches carry (name, labels, samples) in Prometheus wire format.
  • VictoriaMetrics assigns an internal series ID, updates an inverted index that maps label key–value pairs to series IDs, and appends compressed samples to data parts.
  • Data parts are immutable once written. Background merges coalesce smaller parts into fewer, larger ones, reducing index fan-out and improving scan locality.

The inverted label index resembles the logical structure from Prometheus but is built for large scale and heavy remote write ingestion:

label_index("job")["frontend"]     -> postings list of series IDs
label_index("region")["eu-west"]   -> postings list of series IDs
label_index("service")["billing"]  -> postings list of series IDs

A PromQL query coming into VictoriaMetrics follows a path similar to Prometheus:

  1. Parse the expression and extract label matchers and time windows.
  2. Resolve matchers via the inverted index to series IDs.
  3. Locate data parts and offset ranges for those series and time intervals.
  4. Read and decode compressed value streams, apply aggregations and functions, and return the result.

Because VictoriaMetrics is dedicated to metrics storage and query, the engine concentrates on efficient postings-list handling, large series-index footprints, and compaction across time partitions. It can serve as:

  • the long-term store behind one or more Prometheus instances;
  • a central query endpoint for cluster-wide metrics;
  • a single-binary TSDB for high-cardinality metrics feeds.

Trade-Offs of Prometheus + VictoriaMetrics in Real Usage

Strengths

  • Scrape-based ingestion removes the need for client libraries to push metrics; instrumented services expose /metrics, and Prometheus controls collection timing and labeling.
  • Local Prometheus TSDB is tightly integrated with alerting rules and service discovery, which simplifies short-term monitoring and incident response.
  • VictoriaMetrics absorbs long retention and high cardinality without overloading individual Prometheus instances, and can aggregate metrics from many clusters or tenants.
  • Both components speak Prometheus metrics and PromQL, so dashboards, alerting, and tooling can point at either local Prometheus or VictoriaMetrics with consistent query semantics.
  • Remote write provides a clear separation of concerns: Prometheus focuses on scraping, rule evaluation, and near-real-time queries; VictoriaMetrics focuses on storage density and long-range analytics.

Constraints

  • Reliability of remote write becomes part of the data path. Network issues or back-pressure on VictoriaMetrics can cause queue growth in Prometheus and, in extreme cases, dropped samples if buffers fill.
  • The label-centric model pushes careful label design into the critical path: uncontrolled label cardinality (for example, encoding high-cardinality user IDs in labels) degrades both series indexes and query latency.
  • Prometheus local TSDB is tuned for moderate retention; pushing long-term storage or multi-month analytics into Prometheus alone increases disk footprint and compaction pressure.
  • Running VictoriaMetrics at scale introduces its own operational envelope: capacity planning, compaction tuning, and index sizing must reflect ingestion rates and cardinality, especially in multi-tenant setups.
  • Distributed deployments with several Prometheus instances and a shared VictoriaMetrics cluster require a clear strategy for deduplication, replica labels, and query routing if the same metrics arrive from multiple scrapers.

QuestDB: Columnar, Partitioned, and SQL-First

QuestDB runs as a dedicated time-series engine with a relational surface. Tables look like ordinary SQL tables; underneath, data lives in partitioned columnar files with a write path and query layer tuned for time-ordered appends and vectorized scans.

Data Model and Time/Partition Layout

QuestDB stores time series as tables with a designated timestamp column and optional partitioning by a time granularity. A typical schema for sensor data:

CREATE TABLE sensor_readings (
    ts          TIMESTAMP,          -- designated timestamp
    device_id   SYMBOL,             -- dictionary-encoded string
    region      SYMBOL,
    temperature DOUBLE,
    pressure    DOUBLE
) TIMESTAMP(ts)
  PARTITION BY DAY;

Key elements in this definition:

  • TIMESTAMP(ts) tells QuestDB which column defines time ordering within each table.
  • PARTITION BY DAY splits the table into physical partitions per day (hour/month/year are also valid choices).
  • SYMBOL types act as dictionary-encoded strings: values are stored as integer codes with a separate symbol table, which reduces storage for tags like device_id and region.
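
The dictionary mechanics behind SYMBOL are small enough to sketch. A Go illustration of string-to-code assignment; QuestDB's on-disk .sym format is more involved:

package symbols

// Table assigns stable integer codes to strings, the essence of
// dictionary encoding for SYMBOL-style columns.
type Table struct {
    codes  map[string]int32
    values []string
}

func NewTable() *Table {
    return &Table{codes: map[string]int32{}}
}

// Code returns the existing code for s, or assigns the next free one.
func (t *Table) Code(s string) int32 {
    if c, ok := t.codes[s]; ok {
        return c
    }
    c := int32(len(t.values))
    t.codes[s] = c
    t.values = append(t.values, s)
    return c
}

// Value decodes a code back into its string.
func (t *Table) Value(c int32) string { return t.values[c] }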

Logically, the table behaves like a regular SQL table. Physically, QuestDB maps it to a directory structure:

/root/db/sensor_readings/
  ├── 2025-11-01/        (partition for this day)
  │    ├── ts.d          (column file for ts)
  │    ├── device_id.d   (column file for device_id codes)
  │    ├── region.d
  │    ├── temperature.d
  │    ├── pressure.d
  │    └── _meta         (partition metadata)
  ├── 2025-11-02/
  │    └── ...
  └── _meta              (table metadata)

Each partition directory contains one file per column. Columns append new values at the tail; partitions are created and sealed according to the PARTITION BY policy. This layout aligns retention (dropping entire partitions) and scan patterns (time-range queries visit only a subset of directories).

Columnar Storage and Append Path

Within a partition, columns are stored in fixed-size frames, so appends write sequential blocks to a small number of files. For the sensor_readings example, an incoming batch of rows might be laid out as:

ts.d:          [t0][t1][t2]...[tn]
device_id.d:   [id0][id1][id2]...[idn]
region.d:      [r0][r1][r2]...[rn]
temperature.d: [v0][v1][v2]...[vn]
pressure.d:    [p0][p1][p2]...[pn]

Dictionary-encoded SYMBOL columns maintain a separate symbol table:

device_id.sym:
  0 -> "sensor-17"
  1 -> "sensor-18"
  2 -> "sensor-19"
  ...

region.sym:
  0 -> "eu-west"
  1 -> "eu-central"
  ...

The write path appends integer codes (for symbols) and binary representations (for timestamps and numeric values). Data stays ordered by ts within each partition, which simplifies time-range filtering and improves cache behavior during scans.

QuestDB keeps recent partitions memory-mapped and uses the OS page cache aggressively. Appending rows translates to writing at the end of column files and occasionally extending symbol tables. In the original design there is no WAL in the classic sense: durability comes from writing directly to the column files, with configuration knobs for commit lag and out-of-order (O3) handling; newer QuestDB versions add WAL-backed tables for ingestion.

Ingestion Interfaces: SQL, Line Protocol, and PG Wire

QuestDB exposes several ingestion paths that feed the same columnar storage:

SQL inserts

INSERT INTO sensor_readings (
    ts, device_id, region, temperature, pressure
) VALUES (
    now(), 'sensor-17', 'eu-west', 21.7, 1013.2
);

This form is convenient for ad-hoc tests and small integrations, but at scale, batch-oriented and streaming interfaces dominate.

InfluxDB line protocol (ILP)

QuestDB implements a TCP and UDP receiver that accepts Influx line protocol. A client can stream sensor data like:

temperature,device_id=sensor-17,region=eu-west value=21.7 1733306400000000000
temperature,device_id=sensor-18,region=eu-west value=22.1 1733306400000000000

Mapped against the earlier schema:

  • measurement → table name (temperature mapped to sensor_readings or a dedicated table, depending on configuration),
  • tags (device_id, region) → SYMBOL columns,
  • field (value) → DOUBLE column (temperature),
  • timestamp → ts.

The ILP receiver parses lines, converts tags to symbol IDs, assigns each row to the correct partition by time, and appends values to the corresponding column files.

PostgreSQL wire protocol

QuestDB also speaks the PostgreSQL wire protocol. Applications can connect using PostgreSQL drivers and issue INSERT and SELECT statements without custom client libraries. From the client perspective, QuestDB looks like a PostgreSQL-like server; under the hood, it runs its own engine.

Each of these ingestion paths converges in the same mechanism: mapping incoming rows to (partition, row index) and appending to column files, while maintaining symbol tables and metadata.

Query Execution and Vectorized Scans

Queries run through a SQL planner that understands the partitioned, columnar nature of the data. A typical analytics query:

SELECT
    ts,
    device_id,
    region,
    temperature
FROM sensor_readings
WHERE
    ts BETWEEN '2025-11-01T00:00:00Z' AND '2025-11-01T01:00:00Z'
    AND region = 'eu-west'
ORDER BY ts;

Execution steps:

  • Partition pruning. The WHERE ts BETWEEN … predicate restricts the scan to partitions whose time ranges intersect the requested interval. With PARTITION BY DAY, only the 2025-11-01 partition is touched.
  • Symbol filter resolution. The filter region = 'eu-west' resolves once in the symbol table to a small set of integer codes (for example, region_id = 0).
  • Columnar scan. The engine walks the ts.d and region.d column files in that partition, selects rows where region_id matches the filter and ts falls within the requested window, and pulls the corresponding values from device_id.d and temperature.d.
  • Vectorized processing. Operations such as comparisons, aggregations, and functions run over column vectors instead of row-by-row loops, exploiting CPU cache and SIMD where possible.

For grouped aggregates, a query might look like:

SELECT
    ts,
    region,
    avg(temperature) AS temperature_avg
FROM sensor_readings
WHERE ts >= dateadd('h', -1, now())
SAMPLE BY 1m;

Here the engine:

  • restricts to recent partitions;
  • computes minute buckets from the designated timestamp via SAMPLE BY;
  • groups by (bucket, region_id) in memory;
  • scans only ts, region, and temperature columns, avoiding irrelevant data.

The combination of time-based partitioning, symbol-based dictionaries, and columnar layout keeps the physical work bounded to the necessary partitions and columns.

Trade-Offs of QuestDB in Real Usage

Strengths

  • Columnar, time-partitioned layout aligns with analytical queries over recent and historical time ranges and supports high scan throughput with relatively simple hardware.
  • SYMBOL types and dictionary encoding compress label-like dimensions effectively and accelerate equality filters, which matches metrics and IoT-style tagging patterns.
  • SQL-first surface and PostgreSQL wire protocol simplify integration with existing tools and drivers, while ILP support enables direct ingestion from ecosystems that already emit Influx line protocol.
  • Partition-level retention maps cleanly to storage: dropping old partitions removes both data and index structures without large transactional deletes.

Constraints

  • Design favors append-heavy, time-ordered writes; highly out-of-order ingestion or frequent updates to existing rows complicate partition management and can degrade performance.
  • Query engine focuses on time-series analytics; workloads dominated by complex transactional patterns, heavy multi-row updates, or strong relational constraints demand a separate OLTP database.
  • Performance and storage efficiency depend on careful choices for partitioning granularity, symbol usage, and schema stability; frequent schema changes or poorly chosen partition sizes increase file fragmentation and operational overhead.

  • Ecosystem and operational model differ from mainstream PostgreSQL or Prometheus-based stacks; teams often introduce QuestDB as a specialized component rather than a universal replacement for existing databases.

Time-series systems all solve the same problem: storing and querying data where time is the main axis. Each engine takes samples with a timestamp, labels, and values, then groups them into time-bounded chunks or partitions, indexes by labels, and compresses ordered data. Names and internals differ — hypertables, shard groups with TSM files, block stores with label postings, time-partitioned columnar tables — but the core ideas stay the same: organize by time ranges, keep label cardinality under control, compress dense intervals, and make recent data cheap to write and fast to read.

For engineering work, the starting point is not the database brand but the workload shape. Ingestion rate and bursts, label design, how much relational context is needed, retention for raw and rolled-up data, and the operational effort the team can support — together these define what kind of engine fits and how to configure it. With a clear view of internal models and storage layouts, choosing between a relational extension, a metrics-first stack, an IoT-focused engine, or a columnar time-series database becomes a concrete architectural decision instead of a catalog of product names.
