Gowtham Potureddi


Data Warehouse Design for Data Engineering Interviews: A Beginner's Guide to Fact Tables, Star Schemas, and Grain

Data warehouse design is the discipline of laying out tables so analytical questions are fast, correct, and easy to ask. A well-designed enterprise data warehouse turns "what was revenue by region last quarter?" into a sub-second query; a badly designed one turns the same question into a 30-minute, many-join query whose numbers disagree with finance. For data-engineering interviews, the same handful of concepts — fact tables, dimension tables, grain, star schema, SCD — show up in every loop and every system-design round.

This guide is a beginner-friendly walk through data warehouse design from first principles. We start with OLTP vs OLAP and why the two need fundamentally different schemas, then build out the Kimball data warehouse mental model — fact tables, dimensions, the star schema vs snowflake schema trade-off, grain, surrogate keys, slowly changing dimensions, partitioning, and the six-step design process — with worked examples and an interview-style problem in each section. We also place the warehouse next to its neighbours — data warehouse vs data lake, data warehouse vs data mart, data lakehouse vs data warehouse — so you can defend the design choice in a round, not just memorise the diagram.

If you want hands-on reps after you read, explore practice →, drill SQL problems →, browse ETL practice →, or open ETL System Design for Data Engineering Interviews → for a structured path.





1. Why data warehouse design matters

OLTP vs OLAP, and why the warehouse needs its own shape

The single most important sentence in data warehouse design: the OLTP database that runs your application is shaped wrong for analytics. Operational databases (PostgreSQL, MySQL) are normalised, row-stored, and tuned for single-row writes; warehouses (Snowflake, Amazon Redshift, Google BigQuery) are denormalised, columnar, and tuned for full-table scans. A data engineer's first job is recognising which shape a workload needs and building the data warehouse architecture accordingly.

Pro tip: In a system-design round, your first sentence about any analytical request is "this is an OLAP workload, so I'd model it as a fact table at this grain with these dimensions, and run it on a columnar warehouse like Snowflake or BigQuery." That sentence packs grain, schema choice, and warehouse selection into one beat — interviewers love it.

OLTP design — normalised, transactional, single-row optimised

The OLTP invariant: operational databases are heavily normalised (3NF) to prevent update anomalies; rows are stored together so single-row reads and writes are fast; the workload is many small transactions per second. PostgreSQL and MySQL are the canonical examples. They are the right tool for the write side of the world — the user clicking "Buy" — and the wrong tool for the analytical question that follows.

  • Normalised — each fact lives in exactly one place; customers, orders, addresses are separate tables.
  • Row-stored — fetching one row of 30 columns is one disk seek.
  • High write throughput — millisecond INSERT / UPDATE / DELETE.
  • Indexes for point lookups — find customer 42 in O(log N).
  • ACID transactions — money cannot disappear between debit and credit.

Worked example. An OLTP order schema:

| table | rows per record | typical operation |
| --- | --- | --- |
| customers | 1 per customer | UPDATE … SET address = … |
| orders | 1 per order | INSERT … VALUES (…) |
| order_items | 1 per order line | INSERT … VALUES (…) |
| payments | 1 per payment | UPDATE … SET status = 'paid' |

Step-by-step.

  1. A user clicks "Place order"; the app opens a transaction.
  2. INSERT INTO orders writes the order header; INSERT INTO order_items writes the line items.
  3. UPDATE inventory SET qty = qty - 1 decrements stock.
  4. INSERT INTO payments records the charge attempt.
  5. COMMIT makes everything visible atomically; the whole transaction takes ~10–30 ms.

Worked-example solution. OLTP table for orders (Postgres):

CREATE TABLE orders (
    order_id    BIGSERIAL    PRIMARY KEY,
    customer_id BIGINT       NOT NULL REFERENCES customers(customer_id),
    placed_at   TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    total       NUMERIC(14,2) NOT NULL
);
CREATE INDEX ON orders (customer_id);
CREATE INDEX ON orders (placed_at);

Rule of thumb: if the workload is "millisecond writes for a live application," it is OLTP — normalise it, index it, and stop. Analytics goes somewhere else.

OLAP design — denormalised, columnar, scan-optimised

The OLAP invariant: analytical workloads scan many rows and few columns; the right shape is columnar storage with denormalised fact tables and pre-joined dimensions, so a single SELECT can answer a business question without locking the OLTP database. Snowflake, BigQuery, and Redshift store each column as its own compressed file — a 100 M-row aggregation reads ~5% of the bytes that an OLTP row scan would read.

  • Denormalised — fact tables carry foreign keys to dimensions; dimensions carry pre-joined descriptive context.
  • Columnar storage — each column is its own file; analytical scans skip irrelevant columns.
  • Few transactions — batch ELT loads commit thousands of rows at once.
  • No row-level locks — long-running analytical queries don't block writers.
  • Aggregation-friendly — GROUP BY over millions of rows runs in seconds.

Worked example. An OLAP fact + dimension schema:

| table | grain | typical query |
| --- | --- | --- |
| fact_orders | one row per order line | SUM(revenue) GROUP BY month |
| dim_customer | one row per customer (history) | join for city, segment |
| dim_product | one row per product | join for category, brand |
| dim_date | one row per calendar day | join for month, quarter |

Step-by-step.

  1. The analytical question is "revenue by category by month for the last quarter."
  2. The query selects category (from dim_product), month (from dim_date), and SUM(revenue) (from fact_orders).
  3. The warehouse reads only the three columns it needs; everything else is skipped.
  4. Partition pruning on date_id skips ~95% of fact rows.
  5. The full aggregation returns in 2–5 seconds over a 100 M-row fact.

Worked-example solution. OLAP star-shaped query:

SELECT p.category,
       d.month,
       SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date    d ON d.date_id    = f.date_id
WHERE d.year = 2026
GROUP BY p.category, d.month
ORDER BY d.month, p.category;

Rule of thumb: if the workload is "scan many rows, return an aggregate, run on a schedule for humans to read," it is OLAP — denormalise it, build it as a fact + dim schema, and put it in a columnar warehouse.

Where the warehouse fits — vs database, data lake, data mart, lakehouse

The placement invariant: a database holds the live application state (OLTP); a data warehouse holds modelled analytical history (OLAP, star schemas); a data lake holds raw files (sometimes pre-warehouse); a data mart is a subject-area subset of a warehouse; a data lakehouse merges lake storage with warehouse-style ACID tables on top. Picking the right placement is half the design.

  • Database (OLTP) — Postgres / MySQL; live application.
  • Data warehouse (OLAP) — Snowflake / Redshift / BigQuery; star/snowflake schemas for analytics.
  • Data lake — S3 / GCS / ADLS holding raw Parquet / JSON / CSV; cheaper but unstructured.
  • Data mart — subject-area subset (e.g., mart_marketing); business-team-owned.
  • Data lakehouse — Iceberg / Delta / Hudi on top of object storage; ACID + warehouse semantics on lake files.

Worked example. A modern company's three-tier stack:

tier system purpose
OLTP Postgres live orders, users, payments
Lake (raw) S3 + Parquet event firehose, schema-flexible
Warehouse (modelled) Snowflake star schemas for finance/BI
Mart MART_FINANCE schema in Snowflake finance-team-only view

Step-by-step.

  1. The app writes to Postgres; transactional reads stay there.
  2. A CDC pipeline streams Postgres changes into the S3 data lake as raw Parquet.
  3. Daily ELT (dbt or Spark) models the raw lake data into star-shaped fact/dim tables in Snowflake.
  4. Finance reads from MART_FINANCE (a curated subset); marketing reads from MART_MARKETING.
  5. The warehouse is the modelled truth; the lake is the raw archive; the mart is the consumer-facing slice.

Worked-example solution. A subject-area data mart on top of a warehouse:

CREATE SCHEMA mart_finance;

CREATE TABLE mart_finance.daily_revenue AS
SELECT d.date,
       p.category,
       SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date    d ON d.date_id    = f.date_id
GROUP BY d.date, p.category;

Rule of thumb: when asked "data warehouse vs data lake vs data mart" in an interview, sketch the three boxes in a line — lake (raw) → warehouse (modelled) → mart (consumer slice) — and name a tool for each.

Common beginner mistakes

  • Running analytical queries against the OLTP database — slows the live application and gives stale, locked-row answers.
  • Treating the data lake as a warehouse — raw files can be queried but have no grain, schema, or referential integrity until you model them.
  • Skipping the dimensional model — putting everything in one wide table (OBT) works until two analysts disagree on customer_segment because it was hard-coded twice.
  • Building a single "warehouse" without subject-area marts — every team has to learn every table.
  • Conflating Kimball (bottom-up, star-schema marts) and Inmon (top-down, normalised EDW) — both work; pick one and be consistent.

Data Warehouse Interview Question on When to Build a Warehouse vs Query Postgres

A growing startup has 50 M orders in Postgres. The CFO wants a monthly revenue report joining orders, customers, products, and regions. The current report runs on Postgres and takes 4 hours. Decide whether to (a) optimise Postgres, (b) build a data warehouse, or (c) build a data lake, and defend the choice.

Solution Using a Kimball Star Schema on a Cloud Warehouse with Daily ELT

Code solution.

-- Snowflake (or Redshift / BigQuery) — modelled warehouse
CREATE TABLE fact_orders (
    order_id    NUMBER(38,0),
    customer_id NUMBER(38,0) NOT NULL,
    product_id  NUMBER(38,0) NOT NULL,
    date_id     NUMBER(38,0) NOT NULL,
    region_id   NUMBER(38,0) NOT NULL,
    revenue     NUMBER(14,2) NOT NULL,
    quantity    NUMBER       NOT NULL
)
CLUSTER BY (date_id);

CREATE TABLE dim_customer (customer_id NUMBER(38,0) PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE dim_product  (product_id  NUMBER(38,0) PRIMARY KEY, name TEXT, category TEXT, brand TEXT);
CREATE TABLE dim_date     (date_id     NUMBER(38,0) PRIMARY KEY, date DATE, month INT, year INT);
CREATE TABLE dim_region   (region_id   NUMBER(38,0) PRIMARY KEY, region TEXT, country TEXT);

-- daily ELT runs in the warehouse
INSERT INTO fact_orders
SELECT order_id, customer_id, product_id,
       TO_NUMBER(TO_CHAR(placed_at,'YYYYMMDD')),
       region_id, total, qty
FROM stage.orders WHERE load_date = CURRENT_DATE;

Step-by-step trace.

| step | choice | result |
| --- | --- | --- |
| 1 | option (a) optimise Postgres | indexes help but the workload conflicts with the live app |
| 2 | option (c) data lake | only raw files; no grain; analysts re-implement joins every report |
| 3 | option (b) build a warehouse + star schema | one modelled source of truth; sub-second BI |
| 4 | daily ELT lands new orders | freshness = T-1 day, which is fine for monthly CFO report |
| 5 | the 4-hour report becomes a 3-second BI query | finance happy; OLTP unaffected |

Output: the monthly report drops from 4 hours to 3 seconds; the OLTP Postgres is no longer fighting the analyst; the warehouse becomes the source of truth for every downstream BI / ML / finance use case.

Why this works — concept by concept:

  • Separation of OLTP and OLAP — live app stays fast; analytics moves to a columnar engine.
  • Star schema — fact_orders at the centre, dim_customer / dim_product / dim_date / dim_region around it; queries are simple joins.
  • Daily ELT — extract from Postgres, load to warehouse, transform with SQL inside the warehouse.
  • CLUSTER BY (date_id) — co-locates partitions by date so monthly filters prune ~95% of the fact.
  • Surrogate keys (customer_id numeric) — stable identifiers that survive business-key changes.
  • Cost — O(rows in last month) on a clustered scan; an OLTP scan would be O(rows in fact_orders) with row-level locks.
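
To make the payoff concrete, here is roughly what the CFO's report looks like against this star schema. This is a sketch: the column names follow the DDL above, and the year filter stands in for whatever period finance asks about.

SELECT r.region,
       d.month,
       SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_region r ON r.region_id = f.region_id
JOIN dim_date   d ON d.date_id   = f.date_id
WHERE d.year = 2026              -- clustering on date_id prunes most of the fact scan
GROUP BY r.region, d.month
ORDER BY d.month, r.region;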

Inline CTA: drill the ETL practice page and the SQL aggregation topic for grain-correct rollups.



2. Fact tables — measurable business events

The numeric heart of the warehouse — what happened, when, and how much

A fact table stores measurable business events. Every row is an event — an order placed, a click recorded, a payment processed — and every column is either a measure (numeric quantity: revenue, units, duration) or a foreign key to a dimension that gives the event business context (which customer, which product, which day). Fact tables are usually the largest tables in a warehouse — millions to billions of rows — and they are the focus of every analytical query.

Diagram: a normalised, row-stored OLTP database with short single-row queries on one side, and an OLAP star-schema warehouse with a central fact table, denormalised dimensions, and a large GROUP BY query on the other, connected by an ELT arrow labelled 'load + model'.

Pro tip: When you walk through a fact-table design in an interview, say the grain in the first sentence and name the measures and foreign keys in the next two. "One row per order line. Measures: revenue, quantity, discount. FKs: customer, product, date, region." That structure signals you actually know what you're doing.

Transaction fact tables — one row per business event

The transaction-fact invariant: a transaction fact table stores one row per atomic business event at its natural grain; the row records the measures of that event and foreign keys to every dimension that gave it context; this is the most common and most interview-asked fact type. Order lines, payments, clicks, ad impressions — all transaction facts.

  • One row per event — never aggregate; the warehouse can always roll up later, never roll down.
  • Numeric measures — revenue, quantity, discount, tax.
  • Foreign keys — customer_id, product_id, date_id, region_id.
  • Degenerate dimensions — operational IDs (order_number, transaction_id) stored on the fact row.
  • Append-mostly — new events arrive; old events rarely change.

Worked example. A sales transaction fact with 5 sample rows:

| sale_id | customer_id | product_id | date_id | revenue | quantity |
| --- | --- | --- | --- | --- | --- |
| 1 | 1001 | 50 | 20260510 | 200.00 | 2 |
| 2 | 1002 | 51 | 20260510 | 100.00 | 1 |
| 3 | 1001 | 52 | 20260510 | 350.00 | 1 |
| 4 | 1003 | 50 | 20260511 | 100.00 | 1 |
| 5 | 1002 | 51 | 20260511 | 100.00 | 1 |

Step-by-step.

  1. Each row is one order line; grain is "one row per (order, product line)."
  2. Measures revenue and quantity are numeric, additive, and aggregate cleanly with SUM.
  3. FKs customer_id, product_id, date_id link to dimensions that describe who, what, when.
  4. A GROUP BY date_id, customer_id rolls up to per-day-per-customer revenue.
  5. The same fact answers "revenue by customer," "revenue by product," "revenue by day" — different GROUP BY clauses.

Worked-example solution. A transaction fact DDL:

CREATE TABLE fact_sales (
    sale_id     NUMBER(38,0) PRIMARY KEY,
    customer_id NUMBER(38,0) NOT NULL,
    product_id  NUMBER(38,0) NOT NULL,
    date_id     NUMBER(38,0) NOT NULL,
    revenue     NUMBER(14,2) NOT NULL,
    quantity    NUMBER       NOT NULL
)
CLUSTER BY (date_id);
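
As steps 4 and 5 above note, rollups are plain SQL against this grain. A sketch of the per-day, per-customer rollup using the columns from the DDL above:

SELECT date_id,
       customer_id,
       SUM(revenue)  AS revenue,
       SUM(quantity) AS units
FROM fact_sales
GROUP BY date_id, customer_id;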

Rule of thumb: if you are tempted to write the fact at a coarser grain than the event, ask first — almost always the right grain is the finest one and any rollup is a SQL query.

Periodic snapshot fact tables — state at fixed intervals

The snapshot-fact invariant: a periodic snapshot fact stores the state of a process at fixed time intervals (end of day, end of month); each row records the level of measures (inventory on hand, account balance) at that snapshot moment; useful when the process is continuous and you want a series of point-in-time photos. Inventory levels, account balances, headcount.

  • One row per (snapshot date, entity) — e.g., one row per (day, product) for inventory.
  • Semi-additive measures — balances don't add across time (you can't sum yesterday's + today's inventory to get a meaningful number), but they aggregate across other dimensions.
  • Fixed cadence — daily, weekly, monthly snapshot.
  • History as time series — easy to query "balance over time."

Worked example. Daily inventory snapshot:

| date_id | product_id | on_hand_units |
| --- | --- | --- |
| 20260510 | 50 | 120 |
| 20260510 | 51 | 85 |
| 20260511 | 50 | 118 |
| 20260511 | 51 | 80 |
| 20260512 | 50 | 115 |

Step-by-step.

  1. Every night at midnight, an ETL job snapshots the current inventory for every product.
  2. Each row is one (date, product) combination with the on-hand count at snapshot time.
  3. SUM across products is meaningful ("total units across catalogue today"); SUM across days is not (yesterday's units + today's units is meaningless).
  4. The fact answers "inventory trend for product 50 over time" via a single-column scan.
  5. Snapshot growth is bounded — one row per (day, product) — so 5 years × 10 k products = 18 M rows, manageable.

Worked-example solution. Inventory snapshot DDL:

CREATE TABLE fact_inventory_snapshot (
    date_id        NUMBER(38,0) NOT NULL,
    product_id     NUMBER(38,0) NOT NULL,
    on_hand_units  NUMBER       NOT NULL,
    PRIMARY KEY (date_id, product_id)
)
CLUSTER BY (date_id);
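
Because on_hand_units is semi-additive (step 3 above), the safe rollups are across products within one snapshot day, or along time for one product. A sketch of both against the table above:

-- additive across products: total units on hand for one snapshot day
SELECT date_id, SUM(on_hand_units) AS total_units
FROM fact_inventory_snapshot
WHERE date_id = 20260512
GROUP BY date_id;

-- a time series for one product: never SUM this across days
SELECT date_id, on_hand_units
FROM fact_inventory_snapshot
WHERE product_id = 50
ORDER BY date_id;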

Rule of thumb: snapshot facts answer "what was the state on day X"; transaction facts answer "what happened on day X." Pick the right shape for the question.

Accumulating snapshot fact tables — process lifecycle in one row

The accumulating-snapshot invariant: an accumulating snapshot fact stores one row per process instance (one order, one application, one shipment) and updates that row as the process moves through its lifecycle; ideal when the process has a finite, well-defined sequence of milestones. Order fulfilment (ordered → packed → shipped → delivered), loan application (submitted → reviewed → approved → funded).

  • One row per process instance — one order's entire lifecycle in a single row.
  • Multiple date columns — ordered_date_id, packed_date_id, shipped_date_id, delivered_date_id.
  • Multiple status columns — boolean flags for each milestone.
  • Row updates over time — same row, different fields filled in as the process advances.
  • Trend analysis on durations — delivered_date - ordered_date = fulfilment lead time.

Worked example. An order-lifecycle fact:

| order_id | ordered_date | packed_date | shipped_date | delivered_date |
| --- | --- | --- | --- | --- |
| 1001 | 2026-05-10 | 2026-05-10 | 2026-05-11 | 2026-05-13 |
| 1002 | 2026-05-10 | 2026-05-11 | NULL | NULL |
| 1003 | 2026-05-11 | NULL | NULL | NULL |

Step-by-step.

  1. When an order is placed, a new fact row is inserted with ordered_date set and the rest NULL.
  2. When the warehouse packs the order, the same row is updated with packed_date.
  3. When the courier picks it up, shipped_date is filled.
  4. When the customer signs for delivery, delivered_date is filled.
  5. Analysts can now ask "average days from order to delivery" with one simple subtraction.

Worked-example solution. Accumulating snapshot DDL:

CREATE TABLE fact_order_fulfilment (
    order_id        NUMBER(38,0) PRIMARY KEY,
    customer_id     NUMBER(38,0) NOT NULL,
    product_id      NUMBER(38,0) NOT NULL,
    ordered_date    DATE         NOT NULL,
    packed_date     DATE,
    shipped_date    DATE,
    delivered_date  DATE
);
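
Step 5 above ("average days from order to delivery") then reduces to one aggregate over delivered orders. A sketch against the DDL above; date subtraction returning whole days is assumed, which holds in Postgres and Snowflake:

SELECT AVG(delivered_date - ordered_date) AS avg_days_to_deliver
FROM fact_order_fulfilment
WHERE delivered_date IS NOT NULL;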

Rule of thumb: accumulating snapshots fit a finite, well-known lifecycle; for open-ended workflows (support tickets, leads), prefer transaction facts at the state-change grain.

Common beginner mistakes

  • Mixing grains in one fact table — a row that's sometimes per-order, sometimes per-line, sometimes per-day silently breaks every aggregate.
  • Storing aggregated measures and re-aggregating ("sum of average") — answers diverge from the row-level truth.
  • Adding customer_name as a fact column — that belongs in the dimension; if it changes, every fact row drifts.
  • Forgetting date_id — the most-asked filter in every analytical query.
  • Treating snapshot facts as additive — summing balances across time is almost always wrong.

Data Warehouse Interview Question on Picking the Right Fact-Table Shape

A team is building a warehouse for an online learning platform. They need to answer (a) "how many lessons were completed per day per course?" (b) "what is the current number of active subscribers per course?" and (c) "what is the average days-to-completion per learner per course?" Propose three fact tables — one per question — and pick the right type for each.

Solution Using Transaction + Periodic Snapshot + Accumulating Snapshot

Code solution.

-- (a) transaction fact — one row per lesson completion event
CREATE TABLE fact_lesson_completion (
    completion_id NUMBER(38,0) PRIMARY KEY,
    learner_id    NUMBER(38,0) NOT NULL,
    course_id     NUMBER(38,0) NOT NULL,
    lesson_id     NUMBER(38,0) NOT NULL,
    date_id       NUMBER(38,0) NOT NULL,
    duration_sec  NUMBER       NOT NULL
);

-- (b) periodic snapshot — one row per (day, course) with active subscribers
CREATE TABLE fact_course_subscribers (
    date_id      NUMBER(38,0) NOT NULL,
    course_id    NUMBER(38,0) NOT NULL,
    active_subs  NUMBER       NOT NULL,
    PRIMARY KEY (date_id, course_id)
);

-- (c) accumulating snapshot — one row per (learner, course) lifecycle
CREATE TABLE fact_course_completion (
    learner_id     NUMBER(38,0) NOT NULL,
    course_id      NUMBER(38,0) NOT NULL,
    started_date   DATE         NOT NULL,
    midway_date    DATE,
    finished_date  DATE,
    PRIMARY KEY (learner_id, course_id)
);

Step-by-step trace.

| question | fact type | why |
| --- | --- | --- |
| (a) lessons completed per day per course | transaction | one row per event; GROUP BY date, course rolls up |
| (b) active subscribers per course right now | periodic snapshot | one row per (day, course); semi-additive count |
| (c) average days-to-completion | accumulating snapshot | one row per learner-course lifecycle |
| each fact | at its natural grain | rollups are SQL, never re-modelling |
| dimensions | shared dim_learner, dim_course, dim_date | conformed across all three facts |

Output: the three analytical questions become three small SQL queries against three correctly-shaped facts, each with its own grain. The conformed dimensions mean a join from any fact to any dimension produces the same answer about "what is Course 42?".

Why this works — concept by concept:

  • One fact type per business question — picking the wrong shape costs you a re-model; picking the right one costs nothing.
  • Transaction fact at the event grain — never aggregate at write time; rollups are SQL.
  • Periodic snapshot for state — balance / count / level metrics need a fixed-cadence row.
  • Accumulating snapshot for finite lifecycles — durations and milestone counts in one row.
  • Conformed dimensions — same dim_learner joins to all three facts.
  • Cost — O(events) for the transaction fact, O(days × courses) for the snapshot, O(learner-course pairs) for the accumulating; all bounded and queryable.
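
For reference, the three questions then become three small queries against the tables above. A sketch; the "latest snapshot" subquery assumes every course is snapshotted daily:

-- (a) lessons completed per day per course
SELECT date_id, course_id, COUNT(*) AS completions
FROM fact_lesson_completion
GROUP BY date_id, course_id;

-- (b) active subscribers per course as of the latest snapshot
SELECT course_id, active_subs
FROM fact_course_subscribers
WHERE date_id = (SELECT MAX(date_id) FROM fact_course_subscribers);

-- (c) average days-to-completion per course, finished learners only
SELECT course_id, AVG(finished_date - started_date) AS avg_days
FROM fact_course_completion
WHERE finished_date IS NOT NULL
GROUP BY course_id;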

Inline CTA: sharpen fact-shape choice on the aggregation practice topic and the ETL topic.



3. Dimension tables — descriptive context

The "who, what, where, when" that gives facts business meaning

A dimension table stores the descriptive attributes that put facts into business context. If fact_sales says "100 units of product 50 sold on 2026-05-10," the dimension tables tell you that product 50 is a "Wireless Mouse" in category "Accessories", that the sale was on a Monday in May, and that the customer is a "Premium" segment in "Bangalore". Dimensions are smaller than facts but heavily joined — every analytical query touches one or more.

Pro tip: Every dimension answers a "by" question — revenue by category, clicks by region, sign-ups by referral source. When you sketch a star schema, label each dimension with the "by" it enables. That single habit catches missing dimensions before you write a line of SQL.

Conformed dimensions — same definition shared across facts

The conformed-dimension invariant: a conformed dimension is one dimension table joined to multiple fact tables with identical column definitions; "Customer 42" means the same thing whether queried from fact_orders or fact_support_tickets. Conformed dimensions are what turn a collection of subject-area marts into an enterprise data warehouse.

  • One dim_customer — same customer_id and same attributes across the warehouse.
  • One dim_date — every fact joins to it; one source of truth for "month," "quarter," "fiscal year."
  • Cross-mart consistency — finance and marketing see the same customer name.
  • No re-modelling per mart — analysts never re-derive "what is Customer 42?".
  • Cross-fact analytics — same customer's orders and tickets can be joined safely.

Worked example. A conformed dim_customer shared by three facts:

| fact | join key | what the dim adds |
| --- | --- | --- |
| fact_orders | customer_id | name, segment, city |
| fact_support_tickets | customer_id | same name, segment, city |
| fact_app_sessions | customer_id | same name, segment, city |

Step-by-step.

  1. Marketing wants "revenue by city" and joins fact_orders to dim_customer.
  2. Support wants "ticket count by segment" and joins fact_support_tickets to dim_customer.
  3. Product wants "active sessions by city" and joins fact_app_sessions to dim_customer.
  4. All three teams use the same dimension; the answers about "Customer 42 lives in Bangalore" are identical.
  5. If Customer 42 moves to Hyderabad, one SCD2 update in dim_customer keeps all three facts honest.

Worked-example solution. Conformed dimension DDL:

CREATE TABLE dim_customer (
    customer_id   NUMBER(38,0) PRIMARY KEY,
    customer_name TEXT         NOT NULL,
    segment       TEXT,
    city          TEXT,
    country       TEXT,
    sign_up_date  DATE
);

Rule of thumb: if two analysts give different answers for the same "customer," check that they're joining the same dimension. Conformed dimensions are how you stop that argument.

Slowly Changing Dimensions (preview) — handling attribute change

The SCD preview invariant: dimension attributes change over time (a customer's city, a product's category); SCD types are the canonical patterns for handling that change; SCD2 is the interview favourite. Full treatment is in Section 6 — for now, know that dimensions are not purely static.

  • SCD Type 1 — overwrite; lose history.
  • SCD Type 2 — add new row with valid_from / valid_to; keep history.
  • SCD Type 3 — add a previous_city column; keep one prior value.
  • Most common in production — Type 2 for important attributes, Type 1 for unimportant ones.
  • Surrogate key — required for SCD2 since the business key isn't unique anymore.

Worked example. Customer 42 moves cities:

| customer_sk | customer_id | city | valid_from | valid_to | is_current |
| --- | --- | --- | --- | --- | --- |
| 1 | 42 | Hyderabad | 2025-01-01 | 2026-03-14 | FALSE |
| 2 | 42 | Bangalore | 2026-03-15 | NULL | TRUE |

Step-by-step.

  1. Customer 42 originally lives in Hyderabad; one row with is_current = TRUE.
  2. On 2026-03-15, the customer moves; the old row is closed (valid_to set, is_current = FALSE).
  3. A new row is inserted for Bangalore with valid_from = 2026-03-15 and is_current = TRUE.
  4. Historical fact joins use WHERE sale_date BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31').
  5. Current-state queries use WHERE is_current = TRUE.

Worked-example solution. SCD2 dimension with surrogate key:

CREATE TABLE dim_customer (
    customer_sk   NUMBER(38,0) PRIMARY KEY,
    customer_id   NUMBER(38,0) NOT NULL,
    customer_name TEXT,
    city          TEXT,
    valid_from    DATE,
    valid_to      DATE,
    is_current    BOOLEAN
);
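
Steps 4 and 5 above describe the two join patterns. A sketch of both, assuming a fact_sales table that carries customer_id, sale_date, and revenue (illustrative names, not part of the DDL above):

-- historical: attach each sale to the attributes that were valid on the sale date
SELECT c.city, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_customer c
  ON  c.customer_id = f.customer_id
  AND f.sale_date BETWEEN c.valid_from AND COALESCE(c.valid_to, DATE '9999-12-31')
GROUP BY c.city;

-- current state: only the open row per customer
SELECT c.city, COUNT(*) AS customers
FROM dim_customer c
WHERE c.is_current = TRUE
GROUP BY c.city;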

Rule of thumb: if an attribute is queried historically, SCD2 it; if it's only ever shown as "current," SCD1 is fine.

Date dimensions — the most-joined dim in every warehouse

The date-dim invariant: dim_date has one row per calendar date with pre-computed columns for day, week, month, quarter, year, fiscal year, is_weekend, is_holiday; every fact has a date_id FK; analysts never compute date math at query time. It is the single most reused dimension in the warehouse.

  • One row per calendar day — 5 years × 365 = 1,825 rows; trivially small.
  • Pre-computed columns — day_of_week, week_of_year, month_name, quarter, fiscal_year, is_weekend, is_business_day, is_holiday.
  • date_id as integer YYYYMMDD — sortable, partition-friendly, indexable.
  • Reusable across every fact — orders, clicks, payments, sessions all join here.
  • Always populate the full range upfront — no gaps in the calendar.

Worked example. A small slice of dim_date:

| date_id | date | day_name | month | quarter | year | is_weekend |
| --- | --- | --- | --- | --- | --- | --- |
| 20260510 | 2026-05-10 | Sunday | 5 | 2 | 2026 | TRUE |
| 20260511 | 2026-05-11 | Monday | 5 | 2 | 2026 | FALSE |
| 20260512 | 2026-05-12 | Tuesday | 5 | 2 | 2026 | FALSE |

Step-by-step.

  1. A monthly revenue report joins fact_orders to dim_date on date_id.
  2. GROUP BY dim_date.month, dim_date.year returns one row per (year, month).
  3. A "weekend-only" filter is WHERE dim_date.is_weekend = TRUE — no EXTRACT(DOW …) needed.
  4. A fiscal-year report uses GROUP BY dim_date.fiscal_year — analysts never have to remember fiscal-month logic.
  5. The whole dim is small enough to broadcast — every join is essentially free.

Worked-example solution. Date-dimension generation:

-- populate the date dimension (Postgres syntax; Snowflake / BigQuery variants differ slightly)
INSERT INTO dim_date (date_id, date, day_name, month, quarter, year, is_weekend)
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INT              AS date_id,
    d                                         AS date,
    TRIM(TO_CHAR(d, 'Day'))                   AS day_name,
    EXTRACT(MONTH FROM d)                     AS month,
    CEIL(EXTRACT(MONTH FROM d) / 3.0)         AS quarter,
    EXTRACT(YEAR FROM d)                      AS year,
    EXTRACT(DOW FROM d) IN (0, 6)             AS is_weekend
FROM (SELECT generate_series('2024-01-01'::date, '2030-12-31'::date, INTERVAL '1 day')::date AS d) g;
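
Once the dimension is populated, the pre-computed flags keep calendar logic out of queries. A sketch of a weekend-only revenue rollup, reusing fact_orders from earlier:

SELECT d.year, d.month, SUM(f.revenue) AS weekend_revenue
FROM fact_orders f
JOIN dim_date d ON d.date_id = f.date_id
WHERE d.is_weekend = TRUE
GROUP BY d.year, d.month
ORDER BY d.year, d.month;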

Rule of thumb: every warehouse you build should have a dim_date on day one — even before the first fact table. Generating it later is busywork.

Common beginner mistakes

  • Storing descriptive columns directly on the fact table — fact_orders.customer_name works until the name changes and yesterday's revenue drifts.
  • Skipping conformed dimensions — every team builds their own customer table; analyst answers diverge.
  • Building one giant "junk" dimension — combining unrelated flags into one row instead of two clear dimensions.
  • Forgetting dim_date — analysts write EXTRACT(MONTH FROM date_col) everywhere; partition pruning suffers.
  • Treating dimensions as immutable — they change; pick an SCD type before the first row lands.

Data Warehouse Interview Question on Conformed Dimensions Across Two Marts

The marketing mart and the finance mart each have their own customer table. Marketing's customer.segment says "Premium" for customer 42; finance's says "Tier 1". The CEO asks "how many premium customers paid in April?" and gets two different answers. Propose a fix.

Solution Using a Single Conformed dim_customer with Both Attributes

Code solution.

-- One enterprise-wide dim, joined by both marts
CREATE TABLE dim_customer (
    customer_id    NUMBER(38,0) PRIMARY KEY,
    customer_name  TEXT,
    marketing_seg  TEXT,                  -- "Premium" / "Standard"
    finance_tier   TEXT,                  -- "Tier 1" / "Tier 2"
    city           TEXT,
    sign_up_date   DATE
);

-- Marketing mart joins for segment
SELECT marketing_seg, COUNT(DISTINCT f.customer_id)
FROM fact_orders f
JOIN dim_customer c ON c.customer_id = f.customer_id
GROUP BY 1;

-- Finance mart joins for tier on the same dim
SELECT finance_tier, SUM(f.revenue)
FROM fact_payments f
JOIN dim_customer c ON c.customer_id = f.customer_id
GROUP BY 1;

Step-by-step trace.

| step | observation |
| --- | --- |
| 1 | mkt and fin each have their own customer table |
| 2 | mkt.customer.segment ≠ fin.customer.tier |
| 3 | CEO asks one question |
| 4 | conform to one dim_customer with both columns |
| 5 | both marts join the same dim; labels match across the board |

Output: the CEO's question returns one answer regardless of which mart the analyst queries. Future cross-mart questions ("are our Tier-1 finance customers also Premium marketing?") become a single SQL join.

Why this works — concept by concept:

  • One conformed dim — every team joins the same dim_customer; no parallel truths.
  • Both attributes side-by-side — marketing keeps its segment, finance keeps its tier, both visible on the same row.
  • Cross-mart analytics — "Tier 1 + Premium" customers are now one WHERE clause away.
  • Single update path — when customer 42's segment changes, you update one place.
  • Faster reviews — the CEO never sees diverging numbers for the "same" filter.
  • Cost — one dim, one join per query; the duplicated table cost disappears.
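
The cross-mart follow-up from the output above ("are our Tier-1 finance customers also Premium marketing?") then needs nothing more than one filter on the conformed dimension. A sketch; the label values come from the question:

SELECT COUNT(*) AS tier1_premium_customers
FROM dim_customer
WHERE finance_tier  = 'Tier 1'
  AND marketing_seg = 'Premium';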

Inline CTA: drill cross-table modelling on the SQL practice page and the aggregation topic.



4. Star schema vs snowflake schema

The canonical model choice — flat dimensions or normalised hierarchy

The star schema vs snowflake schema decision is the single most-tested data-modelling question in interviews. A star schema keeps every dimension flat — one table per business entity, with all hierarchical attributes denormalised onto the row. A snowflake schema (the modelling pattern, not the cloud warehouse) normalises dimensions into sub-dimensions, saving space at the cost of more joins. Most modern warehouses prefer star — the query simplicity and performance almost always outweigh the storage savings.

Star-schema diagram: a central fact_sales table with measures revenue and quantity, joined to dim_customer, dim_product, dim_date, and dim_store arranged like the points of a star.

Pro tip: When asked "star schema vs snowflake schema," answer in one sentence: "Star for query speed and simplicity, snowflake for storage savings on huge dimensions — and 90% of the time, star wins." Then offer a one-clause justification per side and stop talking.

Star schema — flat dimensions, simple joins, fast queries

The star invariant: a star schema has one fact table at the centre joined to N denormalised dimension tables; each dimension carries every attribute it needs as a column on a single row; queries are one-hop joins from fact to dim; the shape looks like a star with the fact at the centre and dimensions as the points. It is the default Kimball recommendation and the default modern-warehouse shape.

  • One fact, N dimensions — typical warehouse has 1 fact and 4–10 dimensions.
  • Flat dimensions — dim_product carries category, subcategory, brand, supplier all on one row.
  • One-hop joins — fact → dim, never dim → sub-dim.
  • Query simplicity — joins are obvious; analysts write SQL without help.
  • Performance — columnar warehouses optimise star joins natively.

Worked example. A retail star schema:

              dim_customer
                    |
dim_product — fact_sales — dim_date
                    |
                dim_store
| table | columns |
| --- | --- |
| fact_sales | sale_id, customer_id, product_id, date_id, store_id, revenue, quantity |
| dim_customer | customer_id, name, city, segment, country |
| dim_product | product_id, name, category, subcategory, brand |
| dim_date | date_id, date, month, quarter, year |
| dim_store | store_id, name, region, format |

Step-by-step.

  1. The central fact_sales carries the four FKs and two measures.
  2. Each dimension is flat — dim_product has category and brand directly on the row, not in a separate dim_category table.
  3. "Revenue by category by year" is one SELECT with two joins.
  4. The shape is symmetric — every dimension is reachable in one join from the fact.
  5. Columnar engines see one fact + N dim joins and execute them in parallel.

Worked-example solution. A canonical star-schema query:

SELECT p.category,
       d.year,
       SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date    d ON d.date_id    = f.date_id
GROUP BY p.category, d.year
ORDER BY d.year, p.category;

Rule of thumb: if you can't justify why a particular dimension must be normalised, leave it flat. Star is the default for a reason.

Snowflake schema — normalised dimensions, more joins, more storage discipline

The snowflake invariant: a snowflake schema (modelling pattern) normalises dimensions into sub-dimensions; dim_product.category_id references dim_category; queries need one more join per normalised level; useful when a hierarchical attribute has very high cardinality and changes independently. Reserve it for the rare cases when storage or update frequency genuinely matters.

  • Normalised dimensions — dim_product references dim_category which references dim_department.
  • More joins — fact_sales → dim_product → dim_category → dim_department.
  • Less redundancy — a category change updates one row in dim_category, not every row in dim_product.
  • More complex SQL — analysts have to remember the join path.
  • Slower queries — extra joins compound at scale.

Worked example. Same retail, snowflaked dimensions:

                  dim_customer
                       |
dim_brand → dim_product — fact_sales — dim_date
                       |                    |
                  dim_category         dim_quarter → dim_year
                       |
                dim_department
| query | star joins | snowflake joins |
| --- | --- | --- |
| revenue by category by year | 2 | 4 |
| revenue by department by quarter | 2 | 5 |
| top brands by city | 2 | 4 |

Step-by-step.

  1. The same fact_sales is now wrapped by normalised dimensions.
  2. dim_product has category_id, not category — to get the category name you join dim_category.
  3. "Revenue by category by year" becomes a four-table join instead of three.
  4. The schema saves space — there are only N distinct categories instead of M product rows × the category name.
  5. For most warehouses the storage savings are negligible and the join cost is real.

Worked-example solution. Snowflaked dim DDL:

CREATE TABLE dim_department (department_id NUMBER PRIMARY KEY, name TEXT);
CREATE TABLE dim_category   (category_id   NUMBER PRIMARY KEY, name TEXT, department_id NUMBER REFERENCES dim_department);
CREATE TABLE dim_product    (product_id    NUMBER PRIMARY KEY, name TEXT, category_id   NUMBER REFERENCES dim_category);
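
"Revenue by category by year" from step 3 then picks up the extra hop through dim_category. A sketch, reusing fact_sales and dim_date as defined earlier in this section:

SELECT c.name AS category,
       d.year,
       SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product  p ON p.product_id  = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
JOIN dim_date     d ON d.date_id     = f.date_id
GROUP BY c.name, d.year;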

Rule of thumb: normalise a dimension only when the hierarchical attribute (a) is gigantic, (b) changes independently of the parent, or (c) is shared across multiple dimensions.

When to pick which — a one-line decision per dimension

The decision invariant: for each dimension, ask "does this attribute change independently and at significant volume?" — if yes, snowflake it; if no, star it. Most attributes fail that test; most dimensions stay flat.

  • Star, default — flat, denormalised, fast queries.
  • Snowflake, exception — only when storage or independent update wins.
  • Mixed (galaxy) schemas — multiple facts sharing conformed dimensions.
  • One-big-table (OBT) — extreme denormalisation, one row per event with every attribute inline; used by some Looker / Power BI shops.
  • Hybrid — star for most dimensions, snowflake one or two large hierarchical ones.

Worked example. Per-dimension choice for a retail warehouse:

| dimension | choice | reason |
| --- | --- | --- |
| dim_customer | star (flat) | denormalised attributes change together |
| dim_product | star | brand / category small, change with product |
| dim_date | star | static, small, joined heavily |
| dim_geography | snowflake | city → state → country shared, very large, infrequent change |
| dim_employee | star | hierarchy small, joined infrequently |

Step-by-step.

  1. Walk each dimension and ask the question.
  2. For most retail dimensions, the answer is "keep it flat."
  3. dim_geography is the exception — country/state hierarchies repeat across millions of customer / store rows; normalising saves real space.
  4. Pick consistently and document the choice.
  5. The resulting schema is mostly star with one normalised dimension — a hybrid that maximises performance with controlled redundancy.

Worked-example solution. Hybrid schema:

-- star dimensions (flat)
CREATE TABLE dim_customer (customer_id NUMBER PRIMARY KEY, name TEXT, segment TEXT, geography_id NUMBER);
CREATE TABLE dim_product  (product_id  NUMBER PRIMARY KEY, name TEXT, category TEXT, brand TEXT);

-- snowflaked geography (the one exception)
CREATE TABLE dim_country   (country_id NUMBER PRIMARY KEY, name TEXT);
CREATE TABLE dim_state     (state_id   NUMBER PRIMARY KEY, name TEXT, country_id NUMBER REFERENCES dim_country);
CREATE TABLE dim_geography (geography_id NUMBER PRIMARY KEY, city TEXT, state_id NUMBER REFERENCES dim_state);

Rule of thumb: if you can't articulate the win from normalising, the default is star.

Common beginner mistakes

  • Defaulting to snowflake "for normalisation" — modern warehouses don't reward it.
  • Normalising dim_date — one of the cheapest, smallest, most-joined dimensions; flat is always right.
  • Mixing schema styles within one warehouse without documentation — analysts lose track of the join path.
  • Treating snowflake schema (the model) and Snowflake (the cloud warehouse) as the same thing — they are unrelated; the schema pattern long pre-dates the company.
  • Picking OBT (one-big-table) for a warehouse with many subject areas — works for narrow dashboards, kills cross-team analytics.

Data Warehouse Interview Question on Star vs Snowflake for a Retail Warehouse

A retailer has 50 million fact_sales rows, 10 dimensions ranging from dim_customer (5 M rows, mostly flat) to dim_geography (50 k rows, country/state/city hierarchy shared across customers and stores). Pick the schema shape per dimension and defend the overall choice.

Solution Using a Hybrid — Star for Most, Snowflake for Geography Only

Code solution.

-- 9 flat star dimensions + 1 snowflaked dim_geography (city → state → country)

-- star, flat
CREATE TABLE dim_customer (
    customer_id NUMBER PRIMARY KEY,
    name TEXT, segment TEXT, sign_up_date DATE,
    geography_id NUMBER NOT NULL
);
CREATE TABLE dim_product (
    product_id NUMBER PRIMARY KEY, name TEXT,
    category TEXT, subcategory TEXT, brand TEXT
);
CREATE TABLE dim_date (date_id NUMBER PRIMARY KEY, date DATE, month INT, year INT);

-- the one snowflaked dim — saves space because the hierarchy is shared
CREATE TABLE dim_country   (country_id NUMBER PRIMARY KEY, name TEXT);
CREATE TABLE dim_state     (state_id NUMBER PRIMARY KEY, name TEXT, country_id NUMBER);
CREATE TABLE dim_geography (geography_id NUMBER PRIMARY KEY, city TEXT, state_id NUMBER);

-- fact joins normally
CREATE TABLE fact_sales (
    sale_id NUMBER PRIMARY KEY,
    customer_id NUMBER, product_id NUMBER, date_id NUMBER,
    revenue NUMBER(14,2), quantity NUMBER
);

Step-by-step trace.

| step | dimension | choice | reason |
| --- | --- | --- | --- |
| 1 | dim_customer | star | flat; attributes change together |
| 2 | dim_product | star | flat; category cheap to denormalise |
| 3 | dim_date | star | static, tiny, joined everywhere |
| 4 | dim_geography | snowflake | hierarchy shared, large, independent change |
| 5 | dim_store, dim_promo, dim_payment | star | flat, small |
| 6 | overall shape | hybrid (mostly star + one snowflake) | balances perf and storage |

Output: the warehouse runs star-schema-fast for 95% of queries; the one snowflaked dimension saves disk on city/state/country redundancy without hurting most lookups; the schema documentation reads "star except for dim_geography."

Why this works — concept by concept:

  • Star for most dimensions — query simplicity and parallel join performance win.
  • Snowflake dim_geography only — hierarchical, shared, large; normalisation pays off here.
  • Conformed dimensions across the warehousedim_customer joins to every fact identically.
  • fact_sales clustered by date_id — every monthly / quarterly query prunes hard.
  • Surrogate keys on every dim — stable identifiers; SCD2-friendly going forward.
  • Cost — O(N log N) for the central fact scan; an extra O(K) join hop only for geography queries.
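
Geography rollups are the one query family that pays the extra hops. A sketch of "revenue by country" against the hybrid DDL above:

SELECT co.name AS country,
       SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_customer  cu ON cu.customer_id = f.customer_id
JOIN dim_geography g  ON g.geography_id = cu.geography_id
JOIN dim_state     s  ON s.state_id     = g.state_id
JOIN dim_country   co ON co.country_id  = s.country_id
GROUP BY co.name;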

Inline CTA: drill star-schema joins on the SQL practice page and the aggregation topic.



5. Grain, keys, and surrogate keys

The three foundations every fact and dimension stands on

Grain is "what does one row mean?", keys are how rows are uniquely identified, and surrogate keys are stable, system-generated identifiers that survive business-key changes. These three concepts are the foundation of every well-designed warehouse — and the three most-asked interview questions in data-engineering loops. Get them right and every downstream choice falls out; get them wrong and the schema is unfixable.

Pro tip: In every system-design round, the first sentence of your fact-table answer is "the grain is one row per X." The second sentence names the FK columns. The third names the measures. If you can't say grain in one phrase, the design isn't ready.

Grain — what one row represents

The grain invariant: the grain of a fact table is the answer to "what is the meaning of one row?" — it must be stated explicitly, in one phrase, before any column is chosen; mixing grains in one table is the most common modelling mistake and the source of every double-counting bug. Pick the finest grain that the source data supports — rollups are SQL, but you can never re-derive detail from a summary.

  • State it in one phrase — "one row per (order, product line)."
  • Pick the finest grain available — coarser views are aggregates; coarser data is irreversible.
  • Document the grain inline — table comment, dbt YAML, or schema notebook.
  • Never mix grains — a table with sometimes-order, sometimes-line rows is broken.
  • Grain drives partition key — usually the date column at the row's natural grain.

Worked example. Three grain choices for a sales fact:

| grain | rows | what each row means | rollups possible |
| --- | --- | --- | --- |
| one row per item sold | 50 M / month | finest; one product unit per row | per order, per day, per category |
| one row per order line | 10 M / month | aggregated to (order, product) | per order, per day, per category |
| one row per order | 2 M / month | aggregated by order | per day, per customer; not by product line |

Step-by-step.

  1. The source data has 50 M individual item-sale events per month.
  2. Option 1 (one row per item) preserves every detail; analysts can roll up however they want.
  3. Option 2 (one row per order line) groups items by (order, product) — slightly smaller, but you lose per-unit detail.
  4. Option 3 (one row per order) is too coarse — you cannot reconstruct "revenue by product" from it.
  5. Pick the finest grain (option 1 or 2) and write rollups as SQL.

Worked-example solution. Stating grain explicitly:

-- grain: one row per order line (one product per row, multiple rows per order)
CREATE TABLE fact_sales (
    sale_id      NUMBER PRIMARY KEY,
    order_id     NUMBER NOT NULL,           -- degenerate dimension
    product_id   NUMBER NOT NULL,
    customer_id  NUMBER NOT NULL,
    date_id      NUMBER NOT NULL,
    quantity     NUMBER NOT NULL,
    unit_price   NUMBER(14,2) NOT NULL,
    revenue      NUMBER(14,2) NOT NULL      -- quantity * unit_price
);
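
Because the grain is one row per order line, every coarser view in the table above is just a query. A sketch of two rollups against the DDL above:

-- order grain: one row per order
SELECT order_id, SUM(revenue) AS order_total
FROM fact_sales
GROUP BY order_id;

-- day grain: one row per day
SELECT date_id, SUM(revenue) AS daily_revenue
FROM fact_sales
GROUP BY date_id;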

Rule of thumb: if two analysts disagree on a number, check that they're aggregating to the same grain. Half the time the bug is exactly that.

Primary, foreign, and natural keys — the basics

The key-basics invariant: a primary key uniquely identifies a row, a foreign key links to a primary key in another table, a natural key is the business identifier (customer_email, order_number), and a surrogate key is a system-generated stable identifier. Warehouses use surrogate keys for stability; OLTP systems often use natural keys directly.

  • Primary key (PK) — one row, one identifier; uniqueness enforced.
  • Foreign key (FK) — references another table's PK; integrity check.
  • Natural key (NK) — business identifier (customer_email); can change.
  • Composite key — PK of multiple columns (e.g., (date_id, store_id) for daily-store snapshot).
  • Degenerate dimension — operational ID stored on the fact (order_number); no dim table needed.

Worked example. A retail warehouse's key structure:

| table | PK | FK to | natural key |
| --- | --- | --- | --- |
| dim_customer | customer_id (surrogate) | | customer_email |
| dim_product | product_id (surrogate) | | sku |
| fact_sales | sale_id | customer_id, product_id, date_id | order_number (degenerate) |

Step-by-step.

  1. dim_customer has a surrogate customer_id as PK and a natural customer_email.
  2. The customer's email might change ("alice@old.com" → "alice@new.com"); the surrogate ID doesn't.
  3. fact_sales joins to dim_customer on the surrogate, so historical sales remain attached to the same person.
  4. dim_product.sku is the natural key; product_id is the surrogate; same logic.
  5. fact_sales.order_number is a degenerate dimension — preserved on the fact for traceability but with no dim table because there are no useful attributes about an order beyond its line items.

Worked-example solution. Key declarations:

CREATE TABLE dim_customer (
    customer_id  NUMBER PRIMARY KEY,           -- surrogate
    email        TEXT UNIQUE,                   -- natural key
    name         TEXT
);

CREATE TABLE fact_sales (
    sale_id      NUMBER PRIMARY KEY,
    order_number TEXT NOT NULL,                 -- degenerate dimension
    customer_id  NUMBER NOT NULL REFERENCES dim_customer,
    product_id   NUMBER NOT NULL REFERENCES dim_product,
    revenue      NUMBER(14,2) NOT NULL
);

Rule of thumb: if a column is used to join and changes over time, you want a surrogate key. If it changes only theoretically, the natural key may be fine.

Surrogate keys — stable, system-generated, SCD-ready

The surrogate-key invariant: a surrogate key is a system-generated, stable identifier (typically a BIGINT sequence) attached to every dimension row; it is what fact tables join to; it survives business-key changes and is the only practical way to implement SCD2 without breaking referential integrity. Surrogate key in SQL is one of the most reliably-asked data-warehouse interview questions.

  • System-generated — GENERATED ALWAYS AS IDENTITY or BIGSERIAL.
  • Stable — never changes for the life of the row.
  • Fact join target — fact.customer_id references dim_customer.customer_id (the surrogate).
  • SCD2 enabler — multiple rows for the same person, each with a different surrogate.
  • Performance — small fixed-width integer; B-tree-friendly joins.

Worked example. SCD2 dimension with surrogate keys:

| customer_sk | customer_id (natural) | city | valid_from | valid_to | is_current |
| --- | --- | --- | --- | --- | --- |
| 1 | 42 | Hyderabad | 2025-01-01 | 2026-03-14 | FALSE |
| 2 | 42 | Bangalore | 2026-03-15 | NULL | TRUE |

Step-by-step.

  1. Customer 42 (natural key) has two surrogate keys: 1 for the Hyderabad period, 2 for the Bangalore period.
  2. Historical sales reference customer_sk = 1; new sales reference customer_sk = 2.
  3. "Revenue by city last quarter" joins on customer_sk and naturally splits the customer's revenue between the two cities by date.
  4. The natural key customer_id = 42 is preserved on the dim row for traceability.
  5. Without the surrogate, you'd be stuck either overwriting history (Type 1) or breaking the FK.

Worked-example solution. Surrogate-key SCD2 dim:

CREATE TABLE dim_customer (
    customer_sk  NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate
    customer_id  NUMBER NOT NULL,                                  -- natural / business key
    name         TEXT,
    city         TEXT,
    valid_from   DATE,
    valid_to     DATE,
    is_current   BOOLEAN
);

CREATE TABLE fact_sales (
    sale_id      NUMBER PRIMARY KEY,
    customer_sk  NUMBER NOT NULL REFERENCES dim_customer,          -- joins to surrogate
    product_sk   NUMBER NOT NULL,
    date_id      NUMBER NOT NULL,
    revenue      NUMBER(14,2)
);
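
Step 3 above ("revenue by city") then works with a plain surrogate-key join, because each fact row already points at the dimension row that was current when the sale happened. A sketch; a date filter via date_id would narrow it to a quarter:

SELECT c.city,
       SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_customer c ON c.customer_sk = f.customer_sk
GROUP BY c.city;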

Rule of thumb: every dimension gets a surrogate. The business may give you a natural key; the warehouse always generates its own.

Common beginner mistakes

  • Stating grain after picking columns — the grain drives the columns, not vice versa.
  • Using a natural key (email, SKU) as a join key in a fact — when the natural key changes, the fact silently drifts.
  • Treating customer_id and customer_sk as the same thing — they are not; one is business-stable, the other is warehouse-stable.
  • Forgetting the degenerate dimension on the fact — operational IDs (order_number) get lost without it.
  • Building a composite key where a surrogate would do — joins get harder, indexes get bigger.

Data Warehouse Interview Question on Grain and Keys for an E-Commerce Order Fact

The team is modelling an e-commerce orders fact. Source data has 200 orders/day, average 3 items per order, average price changes daily, and customer addresses change occasionally. Pick the grain, name the keys (PK, FKs, natural, surrogate, degenerate), and defend each choice.

Solution Using One Row per Order Line + Surrogate Keys + a Degenerate order_number

Code solution.

-- Grain: one row per (order, product line)
CREATE TABLE fact_order_lines (
    line_sk      NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate PK
    order_number TEXT   NOT NULL,                                   -- degenerate dim
    customer_sk  NUMBER NOT NULL REFERENCES dim_customer,           -- SCD2-aware FK
    product_sk   NUMBER NOT NULL REFERENCES dim_product,
    date_id      NUMBER NOT NULL REFERENCES dim_date,
    quantity     NUMBER NOT NULL,
    unit_price   NUMBER(14,4) NOT NULL,
    revenue      NUMBER(14,4) GENERATED ALWAYS AS (quantity * unit_price) STORED
);
CREATE INDEX ON fact_order_lines (date_id);

Step-by-step trace.

| design decision | choice | reason |
| --- | --- | --- |
| grain | one row per (order, product line) | finest available; rollups are SQL |
| PK | line_sk (surrogate) | stable, integer, indexable |
| customer FK | customer_sk (surrogate to SCD2 dim) | customer city changes; surrogate captures history |
| product FK | product_sk (surrogate) | price changes; surrogate keeps history |
| date FK | date_id | conformed across every fact |
| degenerate | order_number | preserves operational ID without a dim |
| measure | revenue generated from quantity * unit_price | one source of truth |

Output: the fact answers "revenue by product by day," "revenue by customer city by month" (using SCD2), and "average order size" — all from one well-shaped table. Historical accuracy is preserved because customer and product attributes are SCD2-tracked via the surrogate dimensions.

Why this works — concept by concept:

  • Grain stated explicitly — "one row per order line"; never violated.
  • Surrogate PK line_sk — small integer, stable across every join.
  • SCD2-aware FKs — historical city / price are attached to the correct dimension row.
  • Degenerate order_number — operational lookups still work without a dim_order.
  • Generated revenue — eliminates the "ETL computed qty * price but application computed something different" class of bugs.
  • Cost — O(rows) for the central fact; surrogate joins are O(log N) per dim with B-tree indexes.

Inline CTA: for end-to-end fact-and-dim design, see ETL System Design for Data Engineering Interviews.



6. Slowly Changing Dimensions (SCD)

Types 1, 2, and 3 — how dimensions handle attribute change

Dimensions change. A customer moves cities, a product gets re-categorised, an employee changes departments. Slowly Changing Dimensions (SCD) are the canonical patterns for handling that change in a warehouse — Type 1 (overwrite, lose history), Type 2 (new row, keep full history), Type 3 (extra column, keep one prior value). Type 2 is the most-asked in interviews because it preserves historical accuracy at the cost of more rows and a surrogate key.

SCD comparison diagram showing three side-by-side cards labeled 'SCD Type 1', 'SCD Type 2', and 'SCD Type 3', each with a small before/after table showing how a customer's city change from Hyderabad to Bangalore is handled — Type 1 overwrites, Type 2 inserts a new row with valid_from / valid_to / is_current columns, Type 3 adds previous_city / current_city columns — on a light PipeCode-branded card with purple headers and green / orange accent dots.

Pro tip: When asked "which SCD type do I use?", say: "Type 1 for attributes I never want to look at historically, Type 2 for anything that affects a report, Type 3 for the rare 'just show me the previous value' case." That answer covers 99% of real-world choices.

SCD Type 1 — overwrite in place, lose history

The Type-1 invariant: SCD Type 1 simply overwrites the dimension row when an attribute changes; the old value is lost; no history; cheapest and simplest to implement; the right choice for attributes you never query historically (typos, formatting normalisation). Use it sparingly and explicitly — every Type 1 attribute is a piece of history you're choosing to discard.

  • One row per business key — customer_id = 42 is exactly one row.
  • Overwrite on change — old value replaced; no audit trail in the dim.
  • Simplest ETL — UPDATE … SET … and you're done.
  • Right for — corrections, name-formatting fixes, low-value attributes.
  • Wrong for — anything that affects historical reports.

Worked example. Customer 42's name corrected from "Alce" to "Alice":

| before | after |
| --- | --- |
| customer_id=42, name="Alce" | customer_id=42, name="Alice" |

Step-by-step.

  1. The CSV import accidentally created customer_id=42, name="Alce".
  2. The data team notices the typo and runs an UPDATE.
  3. The dim row is overwritten; future queries see Alice.
  4. Historical sales joined to this customer now show Alice too — which is what we want for a typo fix.
  5. No new row; no history kept; no surrogate key needed.

Worked-example solution. Type 1 update:

UPDATE dim_customer
SET name = 'Alice'
WHERE customer_id = 42;

Rule of thumb: Type 1 is correct when historical reports should retroactively reflect the corrected value. Otherwise it's wrong.

SCD Type 2 — new row, keep full history

The Type-2 invariant: SCD Type 2 inserts a new dimension row when an attribute changes, closes the old row with valid_to and is_current = FALSE, and points future facts at the new row's surrogate key; full history is preserved. This is the most common SCD type in production and the most-asked in interviews.

  • Multiple rows per business key — each row covers one period.
  • valid_from / valid_to columns — date range during which the row was current.
  • is_current BOOLEAN — shortcut for "give me the current row."
  • New surrogate key per change — facts joined by surrogate stay attached to the correct period.
  • Historical accuracy — last year's revenue still rolls up to last year's city.

Worked example. Customer 42 moves from Hyderabad to Bangalore on 2026-03-15:

| customer_sk | customer_id | city | valid_from | valid_to | is_current |
| --- | --- | --- | --- | --- | --- |
| 1 | 42 | Hyderabad | 2025-01-01 | 2026-03-14 | FALSE |
| 2 | 42 | Bangalore | 2026-03-15 | NULL | TRUE |

Step-by-step.

  1. Customer 42 originally has one row: customer_sk=1, city=Hyderabad, is_current=TRUE.
  2. On 2026-03-15 the customer moves; the ETL detects the change.
  3. The old row is closed: valid_to = 2026-03-14, is_current = FALSE.
  4. A new row is inserted: customer_sk=2, city=Bangalore, valid_from = 2026-03-15, valid_to = NULL, is_current = TRUE.
  5. Future fact rows reference customer_sk = 2; historical facts reference customer_sk = 1 — each fact gets the right city for its time.

Worked-example solution. SCD2 update pattern:

-- close the old row
UPDATE dim_customer
SET valid_to   = DATE '2026-03-14',
    is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

-- insert the new current row
INSERT INTO dim_customer (customer_id, name, city, valid_from, valid_to, is_current)
VALUES (42, 'Alice', 'Bangalore', DATE '2026-03-15', NULL, TRUE);

Rule of thumb: if the attribute affects a historical report, it must be Type 2. The classic test: "would last year's revenue be wrong if I overwrote this?"
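
The "ETL detects the change" step is worth seeing as SQL: it is usually a comparison between the incoming extract and the current dim rows, and each hit triggers the close-and-insert pair above. A minimal sketch, assuming a hypothetical staging.customers_daily table holding today's source extract:

-- Which customers need a new SCD2 row today? Compare the extract to current dim rows.
-- (staging.customers_daily is an assumed staging table; only city is tracked here)
SELECT s.customer_id, d.city AS old_city, s.city AS new_city
FROM staging.customers_daily s
JOIN dim_customer d
  ON d.customer_id = s.customer_id
 AND d.is_current  = TRUE
WHERE s.city <> d.city;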

SCD Type 3 — extra column, one prior value

The Type-3 invariant: SCD Type 3 adds a previous_* column alongside the current_* column on the same row; one prior value is kept, no more; cheaper than Type 2 but loses everything beyond the most recent change. Used in special cases — e.g., territory reassignments where you want "current and last quarter's region" available without a join.

  • One row per business key — no row growth.
  • Both columns on the row — current_city + previous_city.
  • Loses older history — each new change overwrites the stored previous value, so anything more than one change back is gone.
  • Right for — "current vs immediately prior" comparison patterns.
  • Wrong for — anything that needs more than one period of history.

Worked example. Customer 42 moves once:

| customer_id | name | current_city | previous_city |
| --- | --- | --- | --- |
| 42 | Alice | Bangalore | Hyderabad |

Step-by-step.

  1. The dim originally has current_city = Hyderabad and previous_city = NULL.
  2. The customer moves; ETL detects.
  3. Single UPDATE: previous_city = current_city, current_city = "Bangalore".
  4. If the customer moves again to Chennai, previous_city becomes "Bangalore" — Hyderabad is lost forever.
  5. Reports can answer "compared to where they used to live" but not "where they lived three moves ago."

Worked-example solution. Type 3 update:

UPDATE dim_customer
SET previous_city = current_city,
    current_city  = 'Bangalore'
WHERE customer_id = 42;

Rule of thumb: Type 3 is rare. Use it only when the business explicitly says "I want current vs previous side-by-side" and never asks for deeper history.

Common beginner mistakes

  • Defaulting to Type 1 because "it's simple" — overwriting historically-meaningful attributes silently rewrites past reports.
  • Implementing Type 2 without a surrogate key — joins break the moment the natural key has multiple rows.
  • Forgetting to close the old row in Type 2 — both rows look "current"; queries return duplicates (a detection query follows this list).
  • Mixing SCD types within one dimension without documentation — analysts cannot predict whether history is preserved.
  • Using Type 3 for an attribute that changes many times — you keep "current and one prior," lose the rest, and miss the original analysis intent.
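
The "forgot to close the old row" failure mode above is cheap to catch with a data-quality check after each dim load. A sketch against the dim_customer table used throughout this section:

-- Data-quality check: business keys with more than one open (current) row
SELECT customer_id, COUNT(*) AS open_rows
FROM dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) > 1;   -- any result means a Type 2 close step was missed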

Data Warehouse Interview Question on Handling an Address Change Correctly

A dim_customer dimension has customer_id, name, city, email. Customers move cities occasionally; the marketing team wants quarterly revenue reports that attribute each sale to the city where the customer lived at the time of the sale. Pick the SCD type, write the update logic, and explain how the fact-side join works.

Solution Using SCD Type 2 + Surrogate Key + a Date-Range Join

Code solution.

CREATE TABLE dim_customer (
    customer_sk  NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id  NUMBER NOT NULL,
    name         TEXT,
    city         TEXT,
    email        TEXT,
    valid_from   DATE NOT NULL,
    valid_to     DATE,
    is_current   BOOLEAN NOT NULL
);

-- detect change, close old, insert new
UPDATE dim_customer
SET valid_to   = DATE '2026-03-14', is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

INSERT INTO dim_customer (customer_id, name, city, email, valid_from, valid_to, is_current)
VALUES (42, 'Alice', 'Bangalore', 'alice@x.com', DATE '2026-03-15', NULL, TRUE);

-- quarterly revenue by city — date-range join (this fact carries the natural key customer_id and a sale_date)
SELECT c.city, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_customer c
  ON c.customer_id = f.customer_id
 AND f.sale_date BETWEEN c.valid_from AND COALESCE(c.valid_to, DATE '9999-12-31')
WHERE f.sale_date >= DATE '2026-01-01' AND f.sale_date < DATE '2026-04-01'
GROUP BY c.city;

Step-by-step trace.

| step | action | result |
| --- | --- | --- |
| 1 | detect city change on 2026-03-15 | row count for customer 42 changes from 1 to 2 in dim |
| 2 | close old Hyderabad row | valid_to = 2026-03-14, is_current = FALSE |
| 3 | insert new Bangalore row | valid_from = 2026-03-15, valid_to = NULL, is_current = TRUE |
| 4 | run quarterly report | each sale joins to the dim row that was current on its sale date |
| 5 | revenue split correctly between Hyderabad and Bangalore | historical accuracy preserved |

Output: Q1 2026 revenue is split correctly — sales before March 15 attribute to Hyderabad, sales on or after attribute to Bangalore. The CEO's "revenue by city" report stays accurate even as customers move.

Why this works — concept by concept:

  • SCD Type 2 — full history; old rows live alongside new rows.
  • Surrogate key customer_sk — uniquely identifies each (customer, period); facts that store the surrogate join to the right period directly, and the date-range query above resolves the same period from the natural key.
  • valid_from / valid_to date range — defines which dim row was current at any sale date.
  • COALESCE(valid_to, '9999-12-31') — handles the open-ended current row.
  • is_current = TRUE for "current state" queries — shortcut for dashboards that always want the latest.
  • Cost — modest dim growth (one extra row per change); fact-side join cost identical.

Inline CTA: drill SCD2 patterns and dim modelling on the ETL practice page.



7. Partitioning, ETL/ELT, and the design process

How the warehouse actually gets built — and how data flows in

A warehouse is more than a schema — it is partitioned tables, ETL/ELT pipelines, and a repeatable design process. Partitioning (usually by date) is what turns multi-billion-row facts from "slow" into "sub-second." ETL/ELT is how source data gets into the schema you designed. And the design process — the Kimball six-step method — is how you make the schema choices in the first place. This section closes the loop from "I have a great schema in my head" to "the warehouse is in production."

Pro tip: When someone asks "design a warehouse for X," walk through the six Kimball steps in order — business process, grain, dimensions, facts, schema, optimisation. That ordering catches missing grain or missing dimensions before you write a line of DDL.

Partitioning — split big facts by date for prune-friendly queries

The partitioning invariant: partitioning splits a large fact table into smaller chunks (usually one per day or month) so that a query with a date predicate reads only the relevant partitions; this is how 5 B-row facts return in seconds. Every cloud warehouse (Snowflake, BigQuery, Redshift) supports partitioning natively.

  • Partition key — almost always the date column at the fact's natural grain.
  • Daily or monthly — daily for high-volume facts, monthly for low.
  • Partition pruning — the planner skips partitions whose stats prove they cannot match.
  • Loadable partition-by-partition — daily ETL can INSERT / MERGE only today's partition.
  • Partition-friendly predicates — WHERE date_col = '2026-05-10' prunes; WHERE DATE(ts) = '2026-05-10' may not.

Worked example. A 5 B-row fact_sales partitioned by date_id:

| query | partitions scanned | latency |
| --- | --- | --- |
| WHERE date_id = 20260510 | 1 of 1,825 | ~200 ms |
| WHERE date_id BETWEEN 20260501 AND 20260531 | 31 of 1,825 | ~1 s |
| WHERE date_id >= 20260101 | 130 of 1,825 | ~4 s |
| no date predicate (full scan) | 1,825 of 1,825 | ~60 s |

Step-by-step.

  1. The fact is partitioned daily by date_id; one micro-partition (or table partition) per day.
  2. A query with WHERE date_id = X scans exactly one partition — ~0.05% of the fact.
  3. A monthly query scans 30 partitions — ~1.6% of the fact.
  4. Without a date predicate, the warehouse must scan everything; that's almost always the wrong query.
  5. Partition pruning is automatic but requires that the predicate sits on the raw partition column, not wrapped in a function.

Worked-example solution. Partitioning a fact (Snowflake CLUSTER BY / BigQuery PARTITION BY):

-- Snowflake
CREATE TABLE fact_sales (
    sale_id NUMBER PRIMARY KEY,
    date_id NUMBER NOT NULL,
    customer_sk NUMBER, product_sk NUMBER,
    revenue NUMBER(14,2)
)
CLUSTER BY (date_id);

-- BigQuery
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_sk
AS SELECT * FROM staging_sales;

Rule of thumb: every fact with more than ~100 M rows must be partitioned. Skip it and every analytical query degrades.
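
The pruning caveat is easiest to remember as a pair of queries with the same intent but different shapes. A sketch against the BigQuery-style table above, where sale_date is the partition column; whether the second form prunes depends on the engine:

-- Prunes: the predicate sits directly on the partition column
SELECT SUM(revenue)
FROM fact_sales
WHERE sale_date >= DATE '2026-05-01' AND sale_date < DATE '2026-06-01';

-- May not prune: the partition column is wrapped in a function,
-- so the planner may be unable to map the filter onto partition boundaries
SELECT SUM(revenue)
FROM fact_sales
WHERE EXTRACT(YEAR FROM sale_date) = 2026 AND EXTRACT(MONTH FROM sale_date) = 5;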

ETL vs ELT — transform outside or inside the warehouse

The ETL/ELT invariant: ETL transforms data before loading (the older pattern, typically Spark or Python running outside the warehouse); ELT loads raw data and transforms it with SQL inside the warehouse (the modern pattern, typically dbt-managed SQL); modern columnar warehouses make ELT the better default in most cases. Both fit dimensional modelling — they differ only in where the transform happens.

  • ETL — Extract, Transform, Load; transform pre-warehouse.
  • ELT — Extract, Load, Transform; transform in-warehouse with SQL.
  • dbt — the de-facto SQL transformation framework for ELT.
  • Modern cloud warehouses — fast enough that ELT outperforms ETL for most workloads.
  • ETL tools — Informatica, Talend, Spark; legacy stronghold for highly-custom transforms.

Worked example. Same daily orders load, ETL vs ELT:

| step | ETL flavour | ELT flavour |
| --- | --- | --- |
| 1 extract | pull Postgres rows into Spark | dump Postgres rows to S3 |
| 2 transform | Spark dedup, type-cast, enrich | (later) |
| 3 load | write transformed rows to warehouse | COPY INTO raw rows to staging |
| 4 transform | (done) | dbt SQL builds star schema from staging |
| 5 publish | warehouse star schema ready | warehouse star schema ready |

Step-by-step.

  1. ETL: heavy work happens in Spark or Python before warehouse touches the data.
  2. ELT: raw rows land in the warehouse first; SQL transforms produce the model.
  3. ELT keeps the raw layer addressable — you can always re-derive the model.
  4. ELT uses the warehouse's compute (and bills you for it) instead of an external cluster.
  5. For most teams the simplification — "everything is SQL in one place" — outweighs the compute cost.

Worked-example solution. dbt-style ELT model (SQL only):

-- models/fact_orders.sql
WITH raw AS (
    SELECT * FROM staging.orders_raw WHERE load_date = CURRENT_DATE
),
deduped AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) AS rn
    FROM raw
)
SELECT
    order_id,
    customer_sk,
    product_sk,
    TO_NUMBER(TO_CHAR(placed_at, 'YYYYMMDD')) AS date_id,
    revenue
FROM deduped WHERE rn = 1;

Rule of thumb: default to ELT unless you have a specific reason (massive transform, regulatory pre-processing, latency-sensitive streaming) to do ETL.
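
For completeness, the load half of the ELT flavour (step 3 in the table above) is usually a single bulk copy into staging. A minimal Snowflake-style sketch, assuming a hypothetical external stage named @orders_stage:

-- ELT load step: land raw files in staging before any transformation
COPY INTO staging.orders_raw
FROM @orders_stage/orders/2026-05-10/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);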

The Kimball six-step design process

The design-process invariant: the canonical Kimball method walks any new subject area through six numbered steps — business process → grain → dimensions → facts → schema → optimisation — in that order; doing them out of order produces broken designs. Memorise the order; it works for every analytical domain.

  • Step 1 — Business process — name the operational activity ("sales", "support tickets").
  • Step 2 — Grain — say "one row per X" in one phrase.
  • Step 3 — Dimensions — list the "by" axes: customer, product, date, region.
  • Step 4 — Facts — list the numeric measures: revenue, quantity, duration.
  • Step 5 — Schema — draw the star (or hybrid); name the conformed dims.
  • Step 6 — Optimisation — partition, cluster, index, materialise.

Worked example. Designing an e-commerce orders warehouse:

| step | output |
| --- | --- |
| 1 business process | "online order placement and fulfilment" |
| 2 grain | "one row per order line" |
| 3 dimensions | dim_customer, dim_product, dim_date, dim_region, dim_payment |
| 4 facts | revenue, quantity, discount, tax |
| 5 schema | star with 5 dims, 1 fact, surrogate keys |
| 6 optimisation | partition by date_id, cluster by customer_sk, SCD2 on customer + product |

Step-by-step.

  1. The business process is "order placement and fulfilment"; that frames every choice that follows.
  2. Grain: one row per (order, product line) is the finest the source supports.
  3. Dimensions: who (customer), what (product), when (date), where (region), how (payment).
  4. Facts: revenue, quantity, discount, tax — additive numeric measures.
  5. Schema: star with surrogate keys; one conformed dim_customer shared with other facts.
  6. Optimisation: partition on date_id; cluster on customer_sk for customer-by-customer rollups.

Worked-example solution. End-to-end design output:

-- minimal six-step output
CREATE TABLE fact_order_lines (
    line_sk      NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_number TEXT NOT NULL,
    customer_sk  NUMBER NOT NULL,
    product_sk   NUMBER NOT NULL,
    date_id      NUMBER NOT NULL,
    region_sk    NUMBER NOT NULL,
    payment_sk   NUMBER NOT NULL,
    revenue      NUMBER(14,2),
    quantity     NUMBER,
    discount     NUMBER(14,2),
    tax          NUMBER(14,2)
)
CLUSTER BY (date_id, customer_sk);

Rule of thumb: every design conversation starts with step 1 and walks forward. If someone hands you DDL without a grain statement, your first question is "what does one row mean?"

Common beginner mistakes

  • Designing the schema before stating the grain — every column choice becomes a guess.
  • Building ETL when ELT would work — extra cluster, extra tool, extra ops cost.
  • Skipping partitioning on big facts — every query slows linearly with row count.
  • Picking partition keys that don't match the most common predicate — pruning never engages.
  • Treating the design as one-shot — every warehouse evolves; document the choices so the next iteration is informed.

Data Warehouse Interview Question on Designing an Online-Shopping Warehouse from Scratch

You are asked to design a warehouse for an online shopping app. The business wants daily revenue dashboards, monthly customer-segment reports, and real-time top-N best-selling products. Walk through the six-step Kimball process and produce the resulting schema.

Solution Using the Six-Step Kimball Process with a Star Schema

Code solution.

-- Step 1 (business process): online order placement
-- Step 2 (grain):           one row per order line
-- Step 3 (dimensions):      customer, product, date, region, payment
-- Step 4 (facts):           revenue, quantity, discount, tax
-- Step 5 (schema):          star with 5 dims + 1 fact
-- Step 6 (optimisation):    partition by date_id, cluster by customer_sk

CREATE TABLE dim_customer (customer_sk NUMBER PRIMARY KEY, customer_id NUMBER, name TEXT, segment TEXT, city TEXT, valid_from DATE, valid_to DATE, is_current BOOLEAN);
CREATE TABLE dim_product  (product_sk  NUMBER PRIMARY KEY, product_id  NUMBER, name TEXT, category TEXT, brand TEXT);
CREATE TABLE dim_date     (date_id NUMBER PRIMARY KEY, date DATE, month INT, quarter INT, year INT);
CREATE TABLE dim_region   (region_sk NUMBER PRIMARY KEY, region TEXT, country TEXT);
CREATE TABLE dim_payment  (payment_sk NUMBER PRIMARY KEY, method TEXT);

CREATE TABLE fact_order_lines (
    line_sk      NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_number TEXT NOT NULL,
    customer_sk  NUMBER NOT NULL REFERENCES dim_customer,
    product_sk   NUMBER NOT NULL REFERENCES dim_product,
    date_id      NUMBER NOT NULL REFERENCES dim_date,
    region_sk    NUMBER NOT NULL REFERENCES dim_region,
    payment_sk   NUMBER NOT NULL REFERENCES dim_payment,
    revenue      NUMBER(14,2),
    quantity     NUMBER,
    discount     NUMBER(14,2),
    tax          NUMBER(14,2)
)
CLUSTER BY (date_id, customer_sk);

Step-by-step trace.

| step | choice |
| --- | --- |
| 1 process | online order placement & fulfilment |
| 2 grain | one row per order line |
| 3 dimensions | customer (SCD2), product, date, region, payment |
| 4 facts | revenue, quantity, discount, tax |
| 5 schema | star with surrogate keys on every dim |
| 6 optimisation | partition by date_id, cluster by customer_sk |

Output: the resulting schema answers all three business questions — daily revenue (GROUP BY date_id), monthly customer-segment (GROUP BY month, segment joining dim_customer), and top-N best-sellers (ORDER BY SUM(revenue) DESC LIMIT N joining dim_product). Each query is a simple star-shaped join with date-aware partition pruning.
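
As one concrete instance, the top-N best-sellers question is a two-join star query. A sketch against the schema above; the "real-time" flavour comes from how often the fact is loaded, not from the query shape:

-- Top 10 best-selling products over the last 30 days
SELECT p.name, SUM(f.revenue) AS revenue
FROM fact_order_lines f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_date    d ON d.date_id    = f.date_id
WHERE d.date >= CURRENT_DATE - 30
GROUP BY p.name
ORDER BY revenue DESC
LIMIT 10;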

Why this works — concept by concept:

  • Step 1: business process — frames every choice; "order placement" not "orders table."
  • Step 2: explicit grain — "one row per order line" prevents double-counting bugs.
  • Step 3: conformed dimensions — same dim_customer reused by future facts.
  • Step 4: additive measures — revenue, quantity, discount, tax, all SUM-able.
  • Step 5: star schema — simple, fast, columnar-friendly.
  • Step 6: partition + cluster — daily reports prune by date; customer rollups prune by customer.
  • Cost — each business question runs in seconds because the schema and partitioning anticipate the question pattern.

Inline CTA: the full Kimball-to-warehouse design syllabus is in ETL System Design for Data Engineering Interviews.



Choosing a schema (checklist)

| If you are designing… | Pick… | Watch out for… |
| --- | --- | --- |
| A new analytical subject area | Kimball star schema | Skipping the grain statement |
| A fact table with finite lifecycle (order, application) | Accumulating snapshot | Open-ended workflows that never "complete" |
| A balance/level metric over time | Periodic snapshot | Summing balances across days |
| A dimension whose attributes change | SCD Type 2 + surrogate key | Forgetting to close the old row |
| A correction or typo fix | SCD Type 1 | Overwriting historically-important attributes |
| A very large hierarchical dimension | Snowflake (only this one) | Snowflaking every dimension |
| A 1 B+ row fact | Partition by date, cluster by access pattern | Predicates that wrap the partition column |

Pro tip: Reach for Kimball data warehouse principles by default. Inmon's normalised EDW pattern works for some enterprise contexts, but most modern teams ship faster with subject-area marts joined by conformed dimensions.


Frequently asked questions

What is a fact table?

A fact table stores measurable business events — orders, clicks, payments — with one row per event and numeric measures plus foreign keys to dimensions. Fact tables are usually the largest tables in a warehouse and the focus of every analytical query.

What is a dimension table?

A dimension table stores descriptive attributes that put facts into business context — customer name and city, product category, calendar date. Dimensions answer the "by" questions ("revenue by category") and are joined to facts by foreign keys.

What is a star schema?

A star schema has one fact table at the centre joined to N denormalised dimension tables; the shape looks like a star. It is the default analytical schema because joins are simple and columnar warehouses optimise it natively. The star schema vs snowflake schema trade-off favours star in nearly every modern warehouse.

What is grain in data warehouse design?

Grain is the meaning of one row in a fact table — "one row per order line," "one row per (day, product)," "one row per session." It must be stated explicitly before columns are chosen, and mixing grains in a single fact is the most common modelling bug.

What is a surrogate key?

A surrogate key is a system-generated stable identifier (typically a BIGINT sequence) attached to every dimension row. Facts join on the surrogate; the natural business key (customer_email) lives on the dim for traceability. Surrogate keys are required for SCD Type 2 because the natural key isn't unique anymore.

What is SCD Type 2?

SCD Type 2 inserts a new dimension row whenever an attribute changes — the old row is closed with valid_to and is_current = FALSE; the new row gets a fresh surrogate key. Historical accuracy is preserved: last year's revenue rolls up to last year's city, not today's.

What's the difference between a data warehouse, a data lake, a data mart, and a data lakehouse?

A data warehouse holds modelled analytical data (star schemas, conformed dimensions). A data lake holds raw files (Parquet / JSON / CSV) on object storage without modelled schemas. A data mart is a subject-area subset of a warehouse (e.g., mart_finance). A data lakehouse layers ACID table formats (Iceberg, Delta) on top of lake storage to give warehouse-style semantics on raw files. Pick by the workload and the team's needs.


Practice on PipeCode

PipeCode ships 450+ data engineering practice problems — SQL uses the PostgreSQL dialect, with editorials and topics aligned to the same patterns warehouse interviewers ask. Start from Explore practice →, open SQL practice →, filter by ETL → or aggregations →, and see plans → when you want the full library.
