Hyper data engineering interview questions are SQL end-to-end with a sharp retail-analytics edge: six SQL primitives (aggregation with ORDER BY ... LIMIT 1 for the most-expensive product, DATE_TRUNC('month', sale_date) with SUM for monthly revenue rollups, multi-table JOIN plus GROUP BY for state-level sales volume, ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) for the best-selling product per state, HAVING COUNT(DISTINCT state) = total_states for nationwide-popularity completeness checks, and LAG(units) OVER (PARTITION BY category ORDER BY week) for week-over-week category growth tracking). The framings are retail data—SKU pricing, monthly revenue, state-level rankings, top-sellers per state, nationwide-availability checks, and weekly category growth deltas.
This guide walks through the six topic clusters Hyper actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 6-problem Hyper set (3 easy, 2 medium, 1 hard)—a SQL-only hub that rewards aggregation discipline, partitioned ROW_NUMBER, and LAG-driven weekly-growth fluency. There is no Python in this loop; every minute of prep should go to SQL.
Top Hyper data engineering interview topics
From the Hyper data engineering practice set, the six numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–6) | Why it shows up at Hyper |
|---|---|---|
| 1 | SQL aggregation and ORDER BY for the most expensive product | Most Expensive Product Sold—MAX(price) versus ORDER BY price DESC LIMIT 1 and the row-vs-value distinction. |
| 2 | SQL date truncation and SUM for monthly revenue rollups | Total Revenue by Month Report—DATE_TRUNC('month', sale_date) plus SUM(amount) plus ORDER BY month. |
| 3 | SQL multi-table JOINs for state-level sales volume | State with Highest Sales Volume—JOIN customers to bring state into a sales-keyed query, then GROUP BY state + ORDER BY total DESC LIMIT 1. |
| 4 | SQL window functions for best-selling product per state | Best-Selling Product by State—ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) then WHERE rn = 1. |
| 5 | SQL JOINs with HAVING for nationwide-popularity products | Products with Nationwide Popularity—HAVING COUNT(DISTINCT state) = (SELECT COUNT(DISTINCT state) FROM customers) completeness check. |
| 6 | SQL window functions for weekly category growth tracking | Weekly Category Growth Tracker (HARD)—LAG(units) OVER (PARTITION BY category ORDER BY week) plus (curr − prev) * 100.0 / NULLIF(prev, 0). |
Retail-analytics framing rule: Hyper's prompts span pure retail SQL—product pricing, monthly revenue, state-level rankings, top-seller-per-state, nationwide-popularity completeness, weekly category growth. The interviewer is grading whether you map each business framing to the right SQL primitive: most-expensive → ORDER BY ... DESC LIMIT 1; monthly revenue → DATE_TRUNC('month', ...) + SUM; per-state totals → JOIN + GROUP BY + ORDER BY ... DESC LIMIT 1; per-state top-N → partitioned ROW_NUMBER; nationwide completeness → HAVING COUNT(DISTINCT state) = total; WoW growth → LAG over week-truncated partitions. State the mapping out loud.
1. SQL Aggregation and ORDER BY for the Most Expensive Product
Aggregation and ordered-row pickers in SQL for data engineering
"Find the most expensive product sold" is the canonical aggregation interview prompt for retail. The mental model: MAX(price) returns one number; ORDER BY price DESC LIMIT 1 returns one full row. Pick the form that matches what you actually need to emit. The MAX form is one aggregate; the ORDER BY form is one sort plus a LIMIT. Both are O(N) in row count but they pay different costs for different downstream needs.
Pro tip: Interviewers love asking "what if two products tie for the highest price?" The MAX(price) form silently collapses ties to one value; ORDER BY price DESC LIMIT 1 arbitrarily picks one row. State the tie-break explicitly: ORDER BY price DESC, name ASC LIMIT 1, or move to DENSE_RANK() OVER (ORDER BY price DESC) = 1 if the prompt wants all top-priced products.
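The tie behavior is easy to verify on a toy table. A minimal sketch using Python's sqlite3, with SQLite standing in for the interview engine; the table contents are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INT, name TEXT, price NUMERIC)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(4, "Monitor", 459.00), (6, "Tablet", 459.00), (1, "Notebook", 14.50)],
)

# MAX collapses the tie to a single value...
max_price = conn.execute("SELECT MAX(price) FROM products").fetchone()[0]
print(max_price)  # 459.0

# ...while the scalar-subquery form recovers BOTH tied rows.
tied = conn.execute(
    "SELECT name FROM products "
    "WHERE price = (SELECT MAX(price) FROM products) ORDER BY name"
).fetchall()
print(tied)  # [('Monitor',), ('Tablet',)]
```

The subquery form is the one to reach for when the prompt says "all products at the top price."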
MAX(price) returns the value, not the row
MAX(price) is a scalar aggregate—one column, one row, one value. The invariant: MAX answers "what is the highest price?" but cannot tell you which product had it without an additional self-join or subquery filter. If the question is "the highest price," MAX is correct in one line. If the question is "the product with the highest price," MAX alone is wrong.
- SELECT MAX(price) FROM products — one row, column max.
- SELECT * FROM products WHERE price = (SELECT MAX(price) FROM products) — full-row recovery via scalar subquery.
- Tied tops — the subquery form returns all tied rows; the MAX form silently collapses to one number.
- Empty table — MAX returns NULL; the subquery form returns zero rows.
Worked example. Five products. Use MAX to get the price; use the subquery form to get the row.
| product_id | name | price |
|---|---|---|
| 1 | Notebook | 14.50 |
| 2 | Laptop | 1299.00 |
| 3 | Mouse | 25.00 |
| 4 | Monitor | 459.00 |
| 5 | Keyboard | 89.00 |
Worked-example solution.
-- Just the value
SELECT MAX(price) AS max_price FROM products;
-- max_price = 1299.00
-- Full row, by scalar subquery
SELECT product_id, name, price
FROM products
WHERE price = (SELECT MAX(price) FROM products);
-- (2, Laptop, 1299.00)
Rule of thumb: if you only need a number, use MAX. If you need attributes alongside it (name, id, category), reach for the ORDER BY ... LIMIT or scalar-subquery form.
ORDER BY price DESC LIMIT 1 returns the full row
The ORDER BY ... LIMIT invariant: sort the table, then keep the top row. Most engines don't materialize the full sort—they use a top-N index pushdown when there's an index on the sort column, or a streaming top-K sort otherwise. Either way, the result shape is "one row with every selected column."
- SELECT * FROM products ORDER BY price DESC LIMIT 1 — full row of the most expensive product.
- SELECT name, price FROM products ORDER BY price DESC LIMIT 1 — pick specific columns.
- OFFSET 1 — second-most expensive; generalizes to N-th.
- Index helps — CREATE INDEX ON products(price DESC) makes the lookup near-O(log N).
Worked example. Same products table. Pull the top row, then the second-highest using OFFSET.
| query | result |
|---|---|
| ORDER BY price DESC LIMIT 1 | (2, Laptop, 1299.00) |
| ORDER BY price DESC LIMIT 1 OFFSET 1 | (4, Monitor, 459.00) |
| ORDER BY price DESC LIMIT 1 OFFSET 2 | (5, Keyboard, 89.00) |
Worked-example solution.
SELECT product_id, name, price
FROM products
ORDER BY price DESC
LIMIT 1;
-- (2, Laptop, 1299.00)
SELECT product_id, name, price
FROM products
ORDER BY price DESC
LIMIT 1 OFFSET 1;
-- (4, Monitor, 459.00)
Rule of thumb: ORDER BY ... LIMIT is the right primitive when the answer is "the row with the X-th biggest value of this column."
Tie-breaker: ORDER BY price DESC, name ASC LIMIT 1
The tie-break invariant: when two rows tie on the sort key, the result is non-deterministic unless you add a secondary key. Adding , name ASC (or any other unique-or-near-unique column) makes the result reproducible. Without a tie-break, your CI tests will flake when input rows ship in different orders.
- ORDER BY price DESC, name ASC LIMIT 1 — deterministic top-priced product.
- DENSE_RANK() OVER (ORDER BY price DESC) = 1 — return all tied top-priced products.
- ROW_NUMBER() OVER (ORDER BY price DESC) = 1 — return one tied product (engine picks).
- Always add a unique tie-break — even a synthetic id column works.
Worked example. Two products tie at 459.00.
| product_id | name | price |
|---|---|---|
| 4 | Monitor | 459.00 |
| 6 | Tablet | 459.00 |
| query | result |
|---|---|
| ORDER BY price DESC LIMIT 1 | non-deterministic (4 or 6) |
| ORDER BY price DESC, name ASC LIMIT 1 | (4, Monitor, 459.00) |
| DENSE_RANK() OVER (ORDER BY price DESC) = 1 | both 4 and 6 |
Worked-example solution.
-- Deterministic single row
SELECT product_id, name, price
FROM products
ORDER BY price DESC, name ASC
LIMIT 1;
-- All tied top-priced rows
SELECT product_id, name, price
FROM (
SELECT product_id, name, price,
DENSE_RANK() OVER (ORDER BY price DESC) AS dr
FROM products
) t
WHERE dr = 1;
Rule of thumb: every LIMIT 1 after ORDER BY should have a tie-break column listed second, or you'll regret it on every retest.
Common beginner mistakes
- Returning MAX(price) and the row's name in the same query without GROUP BY — a SQL error in strict mode.
- Trying WHERE price = MAX(price) — aggregates aren't allowed in WHERE; use a scalar subquery or HAVING.
- Forgetting LIMIT 1 — returns the entire sorted table.
- Sorting by a different column than the prompt asked — ORDER BY price defaults to ascending; the most expensive needs DESC.
- Skipping the tie-break and surprising the test harness with non-deterministic output.
SQL Interview Question on the Most Expensive Product
Table products(product_id INT, name TEXT, price NUMERIC). Return the single row for the most expensive product, with columns product_id, name, price. Break ties by alphabetical name ascending.
Solution Using ORDER BY price DESC, name ASC LIMIT 1
SELECT product_id, name, price
FROM products
ORDER BY price DESC, name ASC
LIMIT 1;
Step-by-step trace (input: 5 rows below):
| product_id | name | price |
|---|---|---|
| 1 | Notebook | 14.50 |
| 2 | Laptop | 1299.00 |
| 3 | Mouse | 25.00 |
| 4 | Monitor | 459.00 |
| 5 | Keyboard | 89.00 |
- Scan products — read all five rows into the sort operator.
- Sort by price DESC — order: Laptop 1299, Monitor 459, Keyboard 89, Mouse 25, Notebook 14.50.
- Apply secondary name ASC — no ties exist, so the secondary order is a no-op here. (If two rows tied at 459, ascending name would deterministically pick the alphabetically earlier one.)
- LIMIT 1 — keep only the first row from the sorted output.
- Project columns — emit product_id = 2, name = 'Laptop', price = 1299.00.
Output:
| product_id | name | price |
|---|---|---|
| 2 | Laptop | 1299.00 |
Why this works — concept by concept:
- ORDER BY price DESC — sorts by the target column descending; the highest price ends up first.
- Secondary name ASC tie-break — guarantees deterministic output when two products share the top price; without it, the result depends on storage order and can flake across runs.
- LIMIT 1 — keeps exactly one row, which is precisely what the prompt requires.
- Top-N pushdown — most engines don't materialize the full sort; they stream the top-K rows, so this is closer to O(N) than O(N log N) in practice.
- Whole-row recovery — unlike MAX(price), this form lets us project the product's name and id without a self-join or scalar subquery.
- Cost — one scan of products, partial top-K sort → O(N) time, O(1) extra memory beyond the streaming top-K buffer.
2. SQL Date Truncation and SUM for Monthly Revenue Rollups
Date truncation and aggregation for time-series rollups in SQL for data engineering
"Build a total-revenue-by-month report" is the canonical time-series rollup interview prompt. The mental model: DATE_TRUNC('month', sale_date) collapses every day in a month to the same YYYY-MM-01 value, then SUM(amount) GROUP BY <truncated date> rolls revenue up to a single row per month. The same shape powers daily / weekly / quarterly rollups—swap the truncation argument and the grouping key follows.
Pro tip: Always cast or pin the truncated value to DATE (not TIMESTAMP) before grouping; mixing time zones across a TIMESTAMP WITH TIME ZONE column and a DATE_TRUNC call will silently bucket rows wrong at month boundaries. DATE_TRUNC('month', sale_date)::date is the safe form on PostgreSQL.
DATE_TRUNC('month', sale_date) collapses days to month-start
DATE_TRUNC rounds a date or timestamp down to the nearest unit. The invariant: for any sale_date in March 2026, DATE_TRUNC('month', sale_date) returns 2026-03-01. Every March row maps to the same bucket key.
- DATE_TRUNC('month', d) — first day of the month containing d.
- DATE_TRUNC('week', d) — first day of the ISO week (Monday on PostgreSQL; engine-specific on others).
- DATE_TRUNC('day', d) — strip time-of-day; useful when sale_date is a TIMESTAMP.
- DATE_TRUNC('quarter', d) — first day of the quarter; pairs cleanly with quarterly revenue reports.
Worked example. Three sales rows in March + two in April. After DATE_TRUNC('month', sale_date):
| sale_id | sale_date | month_bucket |
|---|---|---|
| 1 | 2026-03-04 | 2026-03-01 |
| 2 | 2026-03-19 | 2026-03-01 |
| 3 | 2026-03-31 | 2026-03-01 |
| 4 | 2026-04-02 | 2026-04-01 |
| 5 | 2026-04-25 | 2026-04-01 |
Worked-example solution.
SELECT
sale_id,
sale_date,
DATE_TRUNC('month', sale_date)::date AS month_bucket
FROM sales
ORDER BY sale_date;
Rule of thumb: whenever the prompt says "by month / by week / by quarter," reach for DATE_TRUNC before you reach for EXTRACT(MONTH FROM ...)—DATE_TRUNC keeps the year, so March 2025 and March 2026 stay distinct.
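The year-preserving difference is easy to demonstrate. A minimal sketch using Python's sqlite3; SQLite has no DATE_TRUNC, so strftime('%Y-%m-01', ...) stands in for month truncation and strftime('%m', ...) for EXTRACT(MONTH). The data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount NUMERIC)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2025-03-10", 10), ("2026-03-04", 50), ("2026-03-19", 30)],
)

# Month truncation keeps the year: March 2025 and March 2026 stay distinct.
trunc = conn.execute(
    "SELECT strftime('%Y-%m-01', sale_date) AS month, SUM(amount) "
    "FROM sales GROUP BY 1 ORDER BY 1"
).fetchall()
print(trunc)  # [('2025-03-01', 10), ('2026-03-01', 80)]

# Month extraction drops the year: both Marches collapse into one bucket.
extracted = conn.execute(
    "SELECT strftime('%m', sale_date) AS m, SUM(amount) FROM sales GROUP BY 1"
).fetchall()
print(extracted)  # [('03', 90)]
```

On an engine with DATE_TRUNC the same contrast holds between DATE_TRUNC('month', ...) and EXTRACT(MONTH FROM ...).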
SUM(amount) over the month bucket
The aggregate invariant: SUM(amount) GROUP BY DATE_TRUNC('month', sale_date) produces one row per month with the total revenue. Every monthly aggregate report follows this shape; only the truncation unit and aggregate function change.
- SUM(amount) — totals the column.
- COUNT(*) — row count instead of total.
- AVG(amount) — average instead of total.
- COUNT(DISTINCT customer_id) — unique customers instead of revenue.
Worked example. Monthly revenue from the 5 rows above with amount values 50, 30, 20, 40, 60.
| month_bucket | rows | SUM(amount) |
|---|---|---|
| 2026-03-01 | 3 | 100.00 |
| 2026-04-01 | 2 | 100.00 |
Worked-example solution.
SELECT
DATE_TRUNC('month', sale_date)::date AS month,
SUM(amount) AS total_revenue
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month;
Rule of thumb: the GROUP BY expression should match the SELECT expression exactly; engines disagree on whether GROUP BY 1 (ordinal position) or the output alias is allowed, so repeating the expression is the portable, safe form.
ORDER BY month for time-sorted output
The output invariant: GROUP BY does not guarantee output order. For a time-series report, append ORDER BY month to surface chronological output. Without it, downstream charting tools see arbitrary order.
- ORDER BY month — ascending chronological.
- ORDER BY month DESC — most recent first; common for ops dashboards.
- ORDER BY month + densification — for "include zero-revenue months," left-join against a generated calendar table.
- No ORDER BY after GROUP BY — flaky output across engines and partition orderings.
Worked example. Add a January 2026 row with revenue 75. Output should now span Jan, Mar, Apr—but Feb is correctly absent (no sales).
| month | total_revenue |
|---|---|
| 2026-01-01 | 75.00 |
| 2026-03-01 | 100.00 |
| 2026-04-01 | 100.00 |
Worked-example solution.
SELECT
DATE_TRUNC('month', sale_date)::date AS month,
SUM(amount) AS total_revenue
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month ASC;
Rule of thumb: a time-series report without ORDER BY is a bug waiting for a code review; always sort the output explicitly.
Common beginner mistakes
- Using EXTRACT(MONTH FROM sale_date) — collapses every March across years into one bucket; you lose the year.
- Forgetting the ::date cast — on TIMESTAMP WITH TIME ZONE columns, the result drifts at month boundaries by your session's offset.
- Aliasing the truncated column as month and then GROUP BY month (engine-dependent — Postgres allows it, others don't); safer to repeat the expression.
- Skipping ORDER BY — non-deterministic chronological output.
- Densifying with LEFT JOIN generate_series(...) ON DATE_TRUNC = ... — correct if the prompt asks for zero-revenue months, but check first; the canonical answer is sparse.
SQL Interview Question on Total Revenue by Month
Table sales(sale_id INT, sale_date DATE, amount NUMERIC). Return one row per month with the total revenue. Columns: month (first day of the month), total_revenue. Sort ascending by month.
Solution Using DATE_TRUNC and SUM
SELECT
DATE_TRUNC('month', sale_date)::date AS month,
SUM(amount) AS total_revenue
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month ASC;
Step-by-step trace (input: 7 sales rows across 3 months):
| sale_id | sale_date | amount |
|---|---|---|
| 1 | 2026-01-15 | 75.00 |
| 2 | 2026-03-04 | 50.00 |
| 3 | 2026-03-19 | 30.00 |
| 4 | 2026-03-31 | 20.00 |
| 5 | 2026-04-02 | 40.00 |
| 6 | 2026-04-25 | 60.00 |
| 7 | 2026-04-30 | 25.00 |
- Scan sales — read all 7 rows.
- Compute DATE_TRUNC('month', sale_date) — January rows → 2026-01-01; March rows → 2026-03-01; April rows → 2026-04-01.
- GROUP BY DATE_TRUNC('month', sale_date) — three groups: Jan (1 row), Mar (3 rows), Apr (3 rows).
- SUM(amount) per group — Jan = 75; Mar = 50 + 30 + 20 = 100; Apr = 40 + 60 + 25 = 125.
- Project — month = 2026-01-01 / 2026-03-01 / 2026-04-01; total_revenue = 75 / 100 / 125.
- ORDER BY month ASC — sort the three rows ascending; Feb is correctly absent.
Output:
| month | total_revenue |
|---|---|
| 2026-01-01 | 75.00 |
| 2026-03-01 | 100.00 |
| 2026-04-01 | 125.00 |
Why this works — concept by concept:
- DATE_TRUNC('month', sale_date) — collapses every day in a calendar month to the same first-of-month key, giving a stable group key for the aggregate.
- ::date cast — strips any time / timezone component so the bucket key is a clean DATE value, avoiding boundary drift.
- SUM(amount) — sums revenue within each month bucket; the canonical aggregate for monetary rollups.
- GROUP BY matches the projected expression — safest across engines; some accept GROUP BY month (the alias) but not all.
- Sparse-by-default — months with no sales rows are absent from the output; correct unless the prompt asks for zero-revenue rows via a calendar join.
- ORDER BY month ASC — gives the consumer a chronological time-series; no client-side sort needed.
- Cost — one scan of sales, one hash aggregate → O(N) time, O(M) space where M is the number of distinct months.
3. SQL Multi-Table JOINs for State-Level Sales Volume
Multi-table JOINs and grouped aggregates in SQL for data engineering
"Find the state with the highest sales volume" is the canonical multi-table JOIN interview prompt. The mental model: the state column lives on customers; the amount column lives on sales; one is keyed by customer_id. You JOIN to bring state into a sales-keyed query, then GROUP BY state, sum amounts, and pick the top row. This is the bread-and-butter pattern for any cross-table rollup—orders by region, traffic by referrer, revenue by plan tier.
Pro tip: Always state the JOIN type out loud — INNER JOIN for "sales whose customer exists" (the canonical answer), LEFT JOIN if you're worried about orphaned customer_ids in sales. Hyper's prompts assume referential integrity, so INNER JOIN is the expected answer.
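The INNER-versus-LEFT distinction is cheap to verify. A minimal sketch using Python's sqlite3, with an illustrative orphaned customer_id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INT, name TEXT, state TEXT);
    CREATE TABLE sales (sale_id INT, customer_id INT, amount NUMERIC);
    INSERT INTO customers VALUES (100, 'Alice', 'CA');
    INSERT INTO sales VALUES (1, 100, 50), (2, 999, 80);  -- 999 is an orphan
""")

# INNER JOIN silently drops the orphaned sale...
inner = conn.execute(
    "SELECT s.sale_id, c.state FROM sales s "
    "INNER JOIN customers c ON c.id = s.customer_id"
).fetchall()
print(inner)  # [(1, 'CA')]

# ...LEFT JOIN keeps it, with a NULL state you can audit.
left = conn.execute(
    "SELECT s.sale_id, c.state FROM sales s "
    "LEFT JOIN customers c ON c.id = s.customer_id ORDER BY s.sale_id"
).fetchall()
print(left)  # [(1, 'CA'), (2, None)]
```

Under clean referential integrity the two forms return identical rows, which is why INNER JOIN is the expected answer here.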
Inner JOIN: bring state into a sales-keyed query
INNER JOIN is the structural primitive: for every sales row, find the matching customers row by id and emit a combined row with both sides' columns. The invariant: the join key must be on the matching column from each table (here, sales.customer_id = customers.id).
- INNER JOIN customers ON sales.customer_id = customers.id — combines rows that have a match on both sides.
- LEFT JOIN — keeps sales rows that have no matching customer (rare, but useful for audit).
- USING (customer_id) — terser when both columns share the same name.
- Filter columns from both tables — you can SELECT customers.state, sales.amount after the join.
Worked example. Three sales joined to two customers.
| sales: sale_id | customer_id | amount |
|---|---|---|
| 1 | 100 | 50 |
| 2 | 100 | 30 |
| 3 | 200 | 80 |
| customers: id | name | state |
|---|---|---|
| 100 | Alice | CA |
| 200 | Bob | TX |
After INNER JOIN:
| sale_id | customer_id | amount | name | state |
|---|---|---|---|---|
| 1 | 100 | 50 | Alice | CA |
| 2 | 100 | 30 | Alice | CA |
| 3 | 200 | 80 | Bob | TX |
Worked-example solution.
SELECT s.sale_id, s.amount, c.state
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id;
Rule of thumb: always alias your tables (s, c) — verbose sales.customer_id = customers.id gets unreadable in 4-table joins.
GROUP BY state + SUM(amount) for per-state totals
The aggregation invariant: after the JOIN, GROUP BY customers.state collapses every joined row into one row per state with SUM(amount). The state column drives the bucketing; amount feeds the aggregate.
- GROUP BY c.state — one row per state.
- SUM(s.amount) — total revenue per state.
- COUNT(*) — number of sales per state (a different metric).
- COUNT(DISTINCT s.customer_id) — distinct customers per state.
Worked example. Continuing the previous join, group by state.
| state | SUM(amount) |
|---|---|
| CA | 80 |
| TX | 80 |
Worked-example solution.
SELECT c.state, SUM(s.amount) AS total_revenue
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id
GROUP BY c.state
ORDER BY total_revenue DESC;
Rule of thumb: the GROUP BY key must be every column in the SELECT that is not behind an aggregate; otherwise strict-mode SQL will reject the query.
Top-state pick: ORDER BY total DESC LIMIT 1
The pick invariant: after GROUP BY, the row count is one per state; ORDER BY total DESC LIMIT 1 returns the top-revenue state. This is the same ORDER BY ... LIMIT 1 pattern from §1, just applied to an aggregated result instead of a raw table.
- ORDER BY total_revenue DESC — descending revenue.
- LIMIT 1 — top state only.
- Tie-break — ORDER BY total_revenue DESC, c.state ASC LIMIT 1 if two states tie.
- All-ties variant — WITH t AS (...) SELECT * FROM t WHERE total_revenue = (SELECT MAX(total_revenue) FROM t).
Worked example. Add a third customer in CA with amount = 25.
| state | SUM(amount) | rank |
|---|---|---|
| CA | 105 | 1 |
| TX | 80 | 2 |
After LIMIT 1: just (CA, 105).
Worked-example solution.
SELECT c.state, SUM(s.amount) AS total_revenue
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id
GROUP BY c.state
ORDER BY total_revenue DESC, c.state ASC
LIMIT 1;
Rule of thumb: if the prompt says "the state with…" (singular), LIMIT 1 is the contract; if it says "the top three states," use LIMIT 3 and consider switching to RANK().
Common beginner mistakes
- Forgetting the JOIN entirely and trying to GROUP BY state from sales — the state column doesn't exist on sales.
- Joining sales.customer_id = customers.customer_id when the customer table's PK is id — silent zero-row result.
- GROUP BY on customer_id instead of state — per-customer totals, not per-state.
- Skipping ORDER BY before LIMIT — non-deterministic top row.
- LEFT JOIN when the prompt assumes integrity — introduces NULL state rows that the aggregate then groups under one phantom bucket.
SQL Interview Question on State with Highest Sales Volume
Tables sales(sale_id INT, customer_id INT, amount NUMERIC) and customers(id INT, name TEXT, state TEXT). Return the single state with the highest total sales revenue and the revenue itself. Columns: state, total_revenue. Break ties alphabetically.
Solution Using INNER JOIN, GROUP BY, and LIMIT 1
SELECT
c.state,
SUM(s.amount) AS total_revenue
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id
GROUP BY c.state
ORDER BY total_revenue DESC, c.state ASC
LIMIT 1;
Step-by-step trace (input: 4 sales × 3 customers):
| sale_id | customer_id | amount |
|---|---|---|
| 1 | 100 | 50 |
| 2 | 100 | 30 |
| 3 | 200 | 80 |
| 4 | 300 | 25 |
| id | name | state |
|---|---|---|
| 100 | Alice | CA |
| 200 | Bob | TX |
| 300 | Carol | CA |
- Scan + JOIN — every sales row is matched to its customers row via customer_id = id. Result: 4 joined rows (3 from CA customers, 1 from TX).
- Project relevant columns — c.state, s.amount. The 4 joined rows: (CA, 50), (CA, 30), (TX, 80), (CA, 25).
- GROUP BY c.state — collapse to 2 groups: CA (3 rows), TX (1 row).
- SUM(s.amount) per group — CA = 50 + 30 + 25 = 105; TX = 80.
- ORDER BY total_revenue DESC, c.state ASC — primary descending revenue puts CA (105) before TX (80); the secondary tie-break is unused here but would protect against future ties.
- LIMIT 1 — keep CA.
- Project — emit (CA, 105.00).
Output:
| state | total_revenue |
|---|---|
| CA | 105.00 |
Why this works — concept by concept:
- INNER JOIN — pulls the state column out of customers and onto every matching sales row; the join is the only way to express "sales rolled up by a customer attribute" in one query.
- ON c.id = s.customer_id — the join key uses each table's actual primary / foreign key columns; mismatching them (e.g. s.id) silently returns wrong or zero rows.
- GROUP BY c.state — buckets every joined row by the state column; SUM(s.amount) then totals revenue within each bucket.
- SUM(s.amount) — the canonical aggregate for revenue rollups; counting rows (COUNT(*)) would answer a different question.
- ORDER BY total_revenue DESC — sorts the aggregate result descending so the top-revenue state ends up first.
- Secondary c.state ASC tie-break — guarantees deterministic output if two states tie at the top.
- LIMIT 1 — caps the result to one state, matching the prompt's singular contract.
- Cost — one scan of sales, one indexed lookup per row into customers (or one hash join), one hash aggregate → O(N + M) time where N = sales rows and M = customers rows.
4. SQL Window Functions for Best-Selling Product Per State
Partitioned ROW_NUMBER for top-N-per-group queries in SQL for data engineering
"Find the best-selling product in each state" is the canonical top-N-per-group window-function prompt. The mental model: partition the rows by state, rank within each partition by units_sold descending, then keep rows where the rank is 1. This is the structural pattern for any "top-K per group" question—best-seller per region, highest-paid employee per department, longest trip per driver. Replace LIMIT 1-after-ORDER BY with ROW_NUMBER() once you need the top per every group, not just one global top.
Pro tip: ROW_NUMBER(), RANK(), and DENSE_RANK() differ on ties. ROW_NUMBER breaks them arbitrarily (engine-dependent). RANK skips numbers after ties (1, 1, 3). DENSE_RANK doesn't skip (1, 1, 2). For "the best-seller per state," the prompt usually expects one row per state — ROW_NUMBER with a deterministic ORDER BY tie-breaker is the safe default.
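The three ranking functions are easy to compare side by side. A minimal sketch using Python's sqlite3 (SQLite 3.25+ supports window functions; the rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_by_state (state TEXT, product TEXT, units_sold INT)")
conn.executemany(
    "INSERT INTO sales_by_state VALUES (?, ?, ?)",
    [("NY", "P1", 60), ("NY", "P6", 60), ("NY", "P9", 40)],
)

rows = conn.execute("""
    SELECT product, units_sold,
           ROW_NUMBER() OVER (ORDER BY units_sold DESC) AS row_num,
           RANK()       OVER (ORDER BY units_sold DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY units_sold DESC) AS dense
    FROM sales_by_state
    ORDER BY product
""").fetchall()
for r in rows:
    print(r)
# P1 and P6 tie at 60: RANK gives both 1 and then skips to 3 for P9;
# DENSE_RANK gives both 1 and continues at 2; ROW_NUMBER hands out
# 1 and 2 in an order the engine picks (hence the tie-break advice).
```

Adding a secondary key inside the window's ORDER BY pins ROW_NUMBER's choice and makes the output deterministic.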
PARTITION BY state for per-state windows
The window-function invariant: PARTITION BY slices the row set into independent groups; the window function then runs separately within each slice. For best-seller-per-state, the partition key is state; rankings within California don't bleed into Texas.
- OVER (PARTITION BY state ...) — declares the partition key.
- No PARTITION BY — the window spans the whole table (one global ranking).
- Multiple partition keys — PARTITION BY region, state for hierarchical groups.
- PARTITION BY is independent of GROUP BY — windows don't collapse rows; they annotate them.
Worked example. Six sales-by-state-by-product rows, partitioned by state.
| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| CA | P2 | 95 |
| CA | P3 | 80 |
| TX | P2 | 200 |
| TX | P4 | 150 |
| TX | P5 | 110 |
The partition splits this into two independent groups: {CA: P1, P2, P3} and {TX: P2, P4, P5}.
Worked-example solution.
-- Just attach the partition key and inspect with COUNT
SELECT
state, product, units_sold,
COUNT(*) OVER (PARTITION BY state) AS state_rowcount
FROM sales_by_state;
-- CA rows show state_rowcount = 3; TX rows show state_rowcount = 3
Rule of thumb: whenever the question says "per X" (per state, per category, per user), reach for PARTITION BY X.
ROW_NUMBER() OVER (... ORDER BY units DESC) ranks within each state
The ranking invariant: ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) assigns 1 to the highest-selling product in each state, 2 to the second-highest, and so on. The ORDER BY inside the window is what defines "best."
- ROW_NUMBER() — one row gets each rank; ties broken arbitrarily (or by your secondary ORDER BY key).
- RANK() — ties share a rank; gaps follow.
- DENSE_RANK() — ties share a rank; no gap.
- Always add a tie-break inside ORDER BY — ORDER BY units_sold DESC, product ASC for determinism.
Worked example. Apply ROW_NUMBER() to the table above.
| state | product | units_sold | rn |
|---|---|---|---|
| CA | P1 | 120 | 1 |
| CA | P2 | 95 | 2 |
| CA | P3 | 80 | 3 |
| TX | P2 | 200 | 1 |
| TX | P4 | 150 | 2 |
| TX | P5 | 110 | 3 |
Worked-example solution.
SELECT
state, product, units_sold,
ROW_NUMBER() OVER (
PARTITION BY state
ORDER BY units_sold DESC, product ASC
) AS rn
FROM sales_by_state;
Rule of thumb: the function is ROW_NUMBER() (no argument); the magic happens entirely inside the OVER (...) clause.
Filter WHERE rn = 1 after the window
The filter invariant: window functions cannot be used in WHERE directly; wrap the windowed query in a CTE or subquery, then filter on the alias. This is a very common interview gotcha.
- CTE form — WITH t AS (SELECT ..., ROW_NUMBER() OVER (...) AS rn FROM ...) SELECT * FROM t WHERE rn = 1.
- Subquery form — SELECT * FROM (SELECT ..., ROW_NUMBER() OVER (...) AS rn FROM ...) sub WHERE rn = 1.
- QUALIFY — supported on Snowflake / BigQuery / DuckDB; lets you skip the wrapper. Postgres does not support QUALIFY.
- WHERE ROW_NUMBER() ... = 1 — illegal; engine error.
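Both halves of the gotcha can be demonstrated directly. A minimal sketch using Python's sqlite3: the WHERE form is rejected when the statement is prepared, while the subquery wrapper works (the rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_by_state (state TEXT, product TEXT, units_sold INT)")
conn.executemany(
    "INSERT INTO sales_by_state VALUES (?, ?, ?)",
    [("CA", "P1", 120), ("CA", "P2", 95), ("TX", "P2", 200)],
)

# Window functions are illegal in WHERE -- the engine rejects the query.
try:
    conn.execute(
        "SELECT state, product FROM sales_by_state "
        "WHERE ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) = 1"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True
print(rejected)  # True

# The subquery wrapper materializes rn as a plain column, so WHERE can see it.
top = conn.execute("""
    SELECT state, product FROM (
        SELECT state, product,
               ROW_NUMBER() OVER (
                   PARTITION BY state ORDER BY units_sold DESC, product ASC
               ) AS rn
        FROM sales_by_state
    )
    WHERE rn = 1
    ORDER BY state
""").fetchall()
print(top)  # [('CA', 'P1'), ('TX', 'P2')]
```

The exact error message varies by engine, but every mainstream engine rejects a window function placed directly in WHERE.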
Worked example. Filter the table above to keep only rn = 1.
| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| TX | P2 | 200 |
Worked-example solution.
WITH ranked AS (
SELECT
state, product, units_sold,
ROW_NUMBER() OVER (
PARTITION BY state
ORDER BY units_sold DESC, product ASC
) AS rn
FROM sales_by_state
)
SELECT state, product, units_sold
FROM ranked
WHERE rn = 1
ORDER BY state;
Rule of thumb: "wrap in CTE, filter on the alias" is the correct ritual for every ROW_NUMBER()-based top-N-per-group query.
Common beginner mistakes
- Trying WHERE ROW_NUMBER() OVER (...) = 1 — illegal; the engine errors.
- Forgetting the ORDER BY inside the window — ROW_NUMBER then assigns ranks in undefined order.
- Using RANK() and being confused when ties produce two rn = 1 rows for one state (correct! that's what RANK does — but ROW_NUMBER is what most prompts want).
- PARTITION BY state, product — collapses each (state, product) pair into a one-row partition; every rn = 1, no useful ranking.
- Skipping the tie-break inside the window's ORDER BY — non-deterministic best-seller output when units tie.
SQL Interview Question on the Best-Selling Product by State
Table sales_by_state(state TEXT, product TEXT, units_sold INT) with one row per (state, product) combination. Return the best-selling product in each state: the product with the highest units_sold per state. Break ties alphabetically by product. Sort output ascending by state.
Solution Using ROW_NUMBER() partitioned by state
WITH ranked AS (
SELECT
state,
product,
units_sold,
ROW_NUMBER() OVER (
PARTITION BY state
ORDER BY units_sold DESC, product ASC
) AS rn
FROM sales_by_state
)
SELECT state, product, units_sold
FROM ranked
WHERE rn = 1
ORDER BY state ASC;
Step-by-step trace (input: 7 rows, 3 states):
| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| CA | P2 | 95 |
| CA | P3 | 80 |
| NY | P1 | 60 |
| NY | P6 | 60 |
| TX | P2 | 200 |
| TX | P4 | 150 |
- Scan sales_by_state — read all 7 rows.
- Partition by state — three partitions: CA (3 rows), NY (2 rows), TX (2 rows).
- Apply ROW_NUMBER per partition — order by units_sold DESC, product ASC:
  - CA → P1 (120, rn=1), P2 (95, rn=2), P3 (80, rn=3).
  - NY → tie at 60 between P1 and P6; secondary product ASC gives P1 (rn=1), P6 (rn=2).
  - TX → P2 (200, rn=1), P4 (150, rn=2).
- Filter WHERE rn = 1 — keep one row per state: (CA, P1, 120), (NY, P1, 60), (TX, P2, 200).
- ORDER BY state ASC — alphabetic state order.
Output:
| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| NY | P1 | 60 |
| TX | P2 | 200 |
Why this works — concept by concept:
- PARTITION BY state — slices the row set into independent per-state windows so each state gets its own ranking; rankings never cross state boundaries.
- ORDER BY units_sold DESC, product ASC — primary descending units puts the bestseller at rank 1 within each partition; secondary ascending product is the tie-break (used by NY in the trace above).
- ROW_NUMBER() — assigns a unique 1, 2, 3, … to every row in the partition; combined with the tie-break it's deterministic.
- CTE wrapper — required because window functions cannot appear in WHERE; the CTE materializes the rank as a regular column we can then filter on.
- WHERE rn = 1 — keeps only the top-ranked row per partition; this is the moment "rank within a group" becomes "the group's best."
- ORDER BY state ASC — final sort over the three result rows; presents the answer in stable alphabetical order.
- Cost — one scan of sales_by_state, one window sort partitioned by state → O(N log N) worst-case (typically O(N) with a hash-partition operator), O(N) space for the temporary annotation column.
5. SQL JOINs with HAVING for Nationwide-Popularity Products
Completeness checks via HAVING and DISTINCT counts in SQL for data engineering
"Find products sold in every state" is the canonical completeness-check interview prompt. The mental model: a product is nationwide-popular iff it has been sold in **every state**. The shape: count distinct states where each product appears, then keep only products whose distinct-state count equals the total number of states. The structural primitive is HAVING COUNT(DISTINCT state) = (SELECT COUNT(DISTINCT state) FROM customers).
Pro tip: Do not hard-code the total-states number (e.g. `= 50`). Always pull it from a subquery so the query stays correct if the data set has fewer or more states (territories, partial coverage, future expansion).
Total-state subquery: SELECT COUNT(DISTINCT state) FROM customers
The subquery invariant: the total-states value must be computed from the same source the main query uses for state membership. If customers defines the set of states, the subquery must count distinct states from customers—not from a hard-coded list.
- `SELECT COUNT(DISTINCT state) FROM customers` — scalar subquery returning one number.
- `COUNT(DISTINCT state)` — counts unique non-NULL state values.
- Inline as a scalar — usable in `HAVING`, `WHERE`, or the `SELECT` projection.
- Materialize as a CTE — cleaner if the value is reused multiple times in the same query.
Worked example. A customers table spanning 3 states.
| id | state |
|---|---|
| 100 | CA |
| 200 | TX |
| 300 | NY |
| 400 | CA |
COUNT(DISTINCT state) returns 3.
Worked-example solution.
SELECT COUNT(DISTINCT state) AS total_states FROM customers;
-- 3
Rule of thumb: always pull "total of X" values from a subquery in the same statement; never hard-code totals that depend on data.
HAVING COUNT(DISTINCT state) = total_states
The completeness invariant: after GROUP BY product, the per-product COUNT(DISTINCT state) is the number of distinct states where that product appears; equality with the total tells us the product appears in every state.
- `HAVING` filters groups, not rows; `WHERE` filters rows pre-aggregate.
- `COUNT(DISTINCT state)` — the per-product distinct-state count.
- `= total_states` — strict equality; products missing even one state are filtered out.
- `>= total_states` — defensive; same answer in practice but documents the intent.
Worked example. Three products with sales in different state subsets.
| product | states sold in | distinct count | total | nationwide? |
|---|---|---|---|---|
| P1 | CA, TX, NY | 3 | 3 | ✅ |
| P2 | CA, TX | 2 | 3 | ❌ |
| P3 | CA | 1 | 3 | ❌ |
Worked-example solution.
WITH total_states AS (
SELECT COUNT(DISTINCT state) AS n FROM customers
)
SELECT s.product
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY s.product
HAVING COUNT(DISTINCT c.state) = (SELECT n FROM total_states);
Rule of thumb: HAVING is for predicates over aggregates; WHERE is for predicates over rows. The completeness check needs HAVING because it compares two aggregates.
JOIN sales → customers → group by product, filter
The composition invariant: JOIN to attach state to every sale, group by product, count distinct states per product, filter by completeness. Three steps in order; reordering breaks correctness.
- `JOIN customers ON ...` — brings `state` onto each sale.
- `GROUP BY s.product` — one bucket per product.
- `COUNT(DISTINCT c.state)` — distinct states per product.
- `HAVING ... = total_states` — keep only products sold in every state.
Worked example. Six sales rows, three products, three states.
| sale_id | customer_id | product | (joined state) |
|---|---|---|---|
| 1 | 100 | P1 | CA |
| 2 | 200 | P1 | TX |
| 3 | 300 | P1 | NY |
| 4 | 100 | P2 | CA |
| 5 | 200 | P2 | TX |
| 6 | 100 | P3 | CA |
After grouping by product: P1 sees {CA, TX, NY} → 3; P2 sees {CA, TX} → 2; P3 sees {CA} → 1. With total_states = 3, only P1 passes.
Worked-example solution.
SELECT s.product
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY s.product
HAVING COUNT(DISTINCT c.state) = (
SELECT COUNT(DISTINCT state) FROM customers
)
ORDER BY s.product;
Rule of thumb: the JOIN brings the dimension onto the fact; the GROUP BY collapses the fact; the HAVING filters the groups. Always in that order.
Common beginner mistakes
- Hard-coding `HAVING COUNT(DISTINCT state) = 50` — silently wrong if the data has 49 or 51 states.
- Using `WHERE COUNT(DISTINCT state) = ...` — aggregates aren't allowed in `WHERE`; engine error.
- Counting `COUNT(state)` instead of `COUNT(DISTINCT state)` — double-counts states with multiple sales.
- Forgetting the JOIN — `state` doesn't exist on `sales`.
- Grouping by `customer_id` instead of `product` — yields a per-customer state count, which is off-prompt.
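The second mistake is worth seeing concretely. A sketch against the same `sales`/`customers` schema used in this section: the `WHERE` form is rejected because `WHERE` runs before grouping, so no aggregate exists yet.

```sql
-- Error: aggregates cannot appear in WHERE (it is evaluated before GROUP BY).
-- SELECT s.product
-- FROM sales s JOIN customers c ON c.id = s.customer_id
-- WHERE COUNT(DISTINCT c.state) = 3        -- engine error
-- GROUP BY s.product;

-- Correct: the aggregate predicate belongs in HAVING, after grouping.
SELECT s.product
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY s.product
HAVING COUNT(DISTINCT c.state) = (SELECT COUNT(DISTINCT state) FROM customers);
```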
SQL Interview Question on Products with Nationwide Popularity
Tables sales(sale_id INT, customer_id INT, product TEXT) and customers(id INT, state TEXT). Return the products that have been sold in every state present in customers. Sort ascending by product.
Solution Using JOIN, GROUP BY, and HAVING with a scalar subquery
SELECT s.product
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY s.product
HAVING COUNT(DISTINCT c.state) = (
SELECT COUNT(DISTINCT state) FROM customers
)
ORDER BY s.product ASC;
Step-by-step trace (input: 8 sales × 4 customers across 3 states):
| sale_id | customer_id | product |
|---|---|---|
| 1 | 100 | P1 |
| 2 | 200 | P1 |
| 3 | 300 | P1 |
| 4 | 100 | P2 |
| 5 | 200 | P2 |
| 6 | 100 | P3 |
| 7 | 300 | P3 |
| 8 | 400 | P3 |
| id | state |
|---|---|
| 100 | CA |
| 200 | TX |
| 300 | NY |
| 400 | CA |
1. Scalar subquery — `(SELECT COUNT(DISTINCT state) FROM customers)` evaluates to 3.
2. JOIN — every sale gets its customer's state. P1 sees CA, TX, NY; P2 sees CA, TX; P3 sees CA, NY, CA.
3. `GROUP BY s.product` — three groups (P1, P2, P3).
4. Per-group `COUNT(DISTINCT c.state)` — P1 → 3, P2 → 2, P3 → 2 (CA, NY; the second CA is a duplicate).
5. `HAVING ... = 3` — keep only P1.
6. `ORDER BY s.product ASC` — single row, so alphabetic order is trivially satisfied.
Output:
| product |
|---|
| P1 |
Why this works — concept by concept:
- JOIN to attach `state` — `state` lives on `customers`; the join brings it onto every sale so `GROUP BY product` can see which states a product reached.
- `COUNT(DISTINCT c.state)` — counts unique state values per product; `DISTINCT` is the active ingredient (without it, multiple sales per state inflate the count).
- Scalar subquery for `total_states` — pulls the canonical state count from `customers` so the query auto-adapts as the data grows or shrinks.
- `HAVING` over aggregates — `HAVING` is the correct clause for filtering on aggregates; `WHERE` would be a syntax error.
- Equality check — strict `=` ensures we only keep products with full state coverage; `>=` would behave the same but is less idiomatic.
- `ORDER BY product ASC` — deterministic output even when multiple products qualify.
- Cost — one scan of `sales`, one scan of `customers`, one hash join, one hash aggregate, one scalar-subquery evaluation → `O(N + M)` time.
6. SQL Window Functions for Weekly Category Growth Tracking
LAG-driven window functions for week-over-week growth in SQL for data engineering
"Track week-over-week growth per category" is the Hard-tier interview prompt for retail analytics. The mental model: DATE_TRUNC('week', sale_date) collapses days to ISO-week starts; LAG(units) OVER (PARTITION BY category ORDER BY week) exposes last week's units alongside this week's; (curr − prev) * 100.0 / NULLIF(prev, 0) is the percentage formula. The same pattern handles month-over-month, day-over-day, hour-over-hour. Just swap the truncation unit.
Pro tip: Always wrap the divisor in `NULLIF(prev, 0)`. If a category has zero units in the prior week, the unguarded formula divides by zero (an engine error); `NULLIF(prev, 0)` turns the divisor into NULL, the whole expression short-circuits to NULL, and downstream consumers can treat it as "no comparable baseline." A missing prior week already yields NULL via `LAG`, which propagates safely even without the guard.
DATE_TRUNC('week', sale_date) for weekly bucketing
The week-bucket invariant: DATE_TRUNC('week', sale_date) returns the Monday of the ISO week containing sale_date (Postgres default). Same idea as DATE_TRUNC('month', ...) from §2; only the unit changes.
- `DATE_TRUNC('week', d)` — first day of the ISO week (engine-specific: Postgres = Monday; Snowflake = Sunday by default; check your engine).
- Cast to `DATE` — `DATE_TRUNC('week', sale_date)::date`.
- Group key — bucket all daily sales into the week's first day.
- Year wraparound — December 30 of one year often shares an ISO week with January 5 of the next; that's correct ISO behavior, not a bug.
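The wraparound bullet can be checked directly in Postgres. January 1, 2026 falls on a Thursday, so ISO week 1 of 2026 starts on Monday, December 29, 2025:

```sql
-- A January date truncates to a Monday in the previous calendar year.
SELECT DATE_TRUNC('week', DATE '2026-01-02')::date;  -- 2025-12-29
```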
Worked example. Five sale days collapsed to two ISO weeks.
| sale_date | week_bucket |
|---|---|
| 2026-04-06 (Mon) | 2026-04-06 |
| 2026-04-09 (Thu) | 2026-04-06 |
| 2026-04-12 (Sun) | 2026-04-06 |
| 2026-04-13 (Mon) | 2026-04-13 |
| 2026-04-16 (Thu) | 2026-04-13 |
Worked-example solution.
SELECT
sale_date,
DATE_TRUNC('week', sale_date)::date AS week_bucket
FROM sales
ORDER BY sale_date;
Rule of thumb: if the prompt says "week," confirm the engine's week-start convention before writing code. Postgres, Snowflake, and BigQuery disagree.
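As a hedged cheat-sheet of the engine differences (verify against your engine's documentation before the interview):

```sql
-- Postgres: ISO weeks, Monday start.
SELECT DATE_TRUNC('week', sale_date)::date AS week_bucket FROM sales;

-- BigQuery: the week-start day is explicit in the truncation unit.
SELECT DATE_TRUNC(sale_date, WEEK(MONDAY)) AS week_bucket FROM sales;

-- Snowflake: DATE_TRUNC('week', ...) honors the WEEK_START session
-- parameter, so confirm the session setting before relying on a start day.
SELECT DATE_TRUNC('week', sale_date) AS week_bucket FROM sales;
```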
LAG(units) OVER (PARTITION BY category ORDER BY week)
The lag invariant: LAG(col) returns the value of col from the previous row within the partition, ordered by the same key. Partition by category so previous-week lookups never cross categories; order by week so "previous" means "previous week chronologically."
- `PARTITION BY category` — independent windows per category.
- `ORDER BY week` — defines what "previous" means.
- `LAG(units)` — returns NULL for the first row in each partition (no prior row).
- `LAG(col, n, default)` — n-row lag with an optional default value when no prior row exists.
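The three-argument form is a sketch worth knowing, assuming a pre-aggregated rollup named `weekly` with one row per category-week. Note that a `0` default is usually the wrong choice for growth math: it turns "no baseline" into a divide-by-zero candidate instead of a NULL.

```sql
SELECT
  category, week, units,
  LAG(units, 1)    OVER (PARTITION BY category ORDER BY week) AS prev_units,    -- NULL on first row
  LAG(units, 1, 0) OVER (PARTITION BY category ORDER BY week) AS prev_or_zero,  -- 0 on first row
  LAG(units, 2)    OVER (PARTITION BY category ORDER BY week) AS two_weeks_ago  -- two-row lag
FROM weekly;
```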
Worked example. Two categories × three weeks each.
| category | week | units | lag(units) |
|---|---|---|---|
| electronics | 2026-04-06 | 100 | NULL |
| electronics | 2026-04-13 | 120 | 100 |
| electronics | 2026-04-20 | 150 | 120 |
| apparel | 2026-04-06 | 80 | NULL |
| apparel | 2026-04-13 | 85 | 80 |
| apparel | 2026-04-20 | 70 | 85 |
Worked-example solution.
WITH weekly AS (
SELECT
category,
DATE_TRUNC('week', sale_date)::date AS week,
SUM(units) AS units
FROM sales
GROUP BY category, DATE_TRUNC('week', sale_date)
)
SELECT
category, week, units,
LAG(units) OVER (PARTITION BY category ORDER BY week) AS prev_units
FROM weekly
ORDER BY category, week;
Rule of thumb: always pre-aggregate to the bucket grain (one row per category-week) before applying LAG; otherwise LAG jumps between intra-week rows of the same category.
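The failure mode looks like this sketch (daily-grain rows, no pre-aggregation): `LAG` hands back the previous sale row, not the previous week's total.

```sql
-- Anti-pattern: window applied at the daily grain.
SELECT
  category, sale_date, units,
  LAG(units) OVER (PARTITION BY category ORDER BY sale_date) AS prev_row_units
FROM sales;
-- prev_row_units is "the category's previous sale row" — two sales in the
-- same week compare against each other, which is not week-over-week growth.
```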
WoW formula: (curr - prev) * 100.0 / NULLIF(prev, 0)
The growth invariant: (curr - prev) * 100.0 / prev is the percentage change from the prior period to the current period. Multiply by 100.0 (not 100) to force float division; wrap the divisor in NULLIF(prev, 0) to avoid divide-by-zero on the first week per category.
- `(curr - prev) * 100.0` — the float multiplier makes the result a proper percentage.
- `NULLIF(prev, 0)` — returns NULL when `prev = 0`; the whole expression then returns NULL (no error).
- `ROUND(..., 2)` — clamp to 2 decimals for readable output.
- First-week NULL — the first week per category has no `prev`, so `LAG` returns NULL and the formula propagates NULL; correct, not a bug.
Worked example. Apply the formula to the table above.
| category | week | units | prev_units | growth_pct |
|---|---|---|---|---|
| electronics | 2026-04-06 | 100 | NULL | NULL |
| electronics | 2026-04-13 | 120 | 100 | 20.00 |
| electronics | 2026-04-20 | 150 | 120 | 25.00 |
| apparel | 2026-04-06 | 80 | NULL | NULL |
| apparel | 2026-04-13 | 85 | 80 | 6.25 |
| apparel | 2026-04-20 | 70 | 85 | -17.65 |
Worked-example solution.
WITH weekly AS (
SELECT
category,
DATE_TRUNC('week', sale_date)::date AS week,
SUM(units) AS units
FROM sales
GROUP BY category, DATE_TRUNC('week', sale_date)
),
with_prev AS (
SELECT
category, week, units,
LAG(units) OVER (PARTITION BY category ORDER BY week) AS prev_units
FROM weekly
)
SELECT
category, week, units,
ROUND(
(units - prev_units) * 100.0 / NULLIF(prev_units, 0),
2
) AS growth_pct
FROM with_prev
ORDER BY category, week;
Rule of thumb: growth percentages always need 100.0 (not 100), NULLIF(prev, 0), and ROUND(..., 2)—skipping any of the three breaks correctness or readability.
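The integer-division pitfall is easy to demonstrate with the apparel numbers from the trace (Postgres semantics, integer operands assumed):

```sql
SELECT (85 - 80) * 100 / 80;     -- 6     (integer division truncates 6.25)
SELECT (85 - 80) * 100.0 / 80;   -- 6.25  (float division keeps the fraction)
```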
Common beginner mistakes
- Forgetting `NULLIF(prev, 0)` — divide-by-zero on zero-baseline weeks (a NULL first-week baseline already propagates NULL safely).
- Using `100` instead of `100.0` — integer division silently truncates small percentage differences toward zero.
- Skipping the pre-aggregation CTE — `LAG` jumps over intra-week rows and the comparison becomes meaningless.
- `LAG` without `PARTITION BY category` — comparisons cross category boundaries.
- Returning the absolute delta instead of the percentage — a common reading mistake; the prompt usually says "growth %," not "delta."
SQL Interview Question on Weekly Category Growth Tracker
Table sales(sale_id INT, sale_date DATE, category TEXT, units INT). For each (category, week) pair, return the units sold that week and the week-over-week growth percentage versus the prior week (NULL if no prior week exists for that category). Columns: category, week, units, growth_pct. Sort ascending by category, then by week. Round growth to 2 decimals.
Solution Using DATE_TRUNC, LAG, and the (curr − prev) * 100.0 / NULLIF(prev, 0) formula
WITH weekly AS (
SELECT
category,
DATE_TRUNC('week', sale_date)::date AS week,
SUM(units) AS units
FROM sales
GROUP BY category, DATE_TRUNC('week', sale_date)
),
with_prev AS (
SELECT
category,
week,
units,
LAG(units) OVER (PARTITION BY category ORDER BY week) AS prev_units
FROM weekly
)
SELECT
category,
week,
units,
ROUND(
(units - prev_units) * 100.0 / NULLIF(prev_units, 0),
2
) AS growth_pct
FROM with_prev
ORDER BY category ASC, week ASC;
Step-by-step trace (input: 8 sales rows across 2 categories × 3 weeks):
| sale_id | sale_date | category | units |
|---|---|---|---|
| 1 | 2026-04-06 | electronics | 100 |
| 2 | 2026-04-13 | electronics | 120 |
| 3 | 2026-04-20 | electronics | 80 |
| 4 | 2026-04-20 | electronics | 70 |
| 5 | 2026-04-06 | apparel | 80 |
| 6 | 2026-04-13 | apparel | 85 |
| 7 | 2026-04-20 | apparel | 40 |
| 8 | 2026-04-20 | apparel | 30 |
1. Build `weekly` — group by `(category, DATE_TRUNC('week', sale_date))` and sum `units`. Result:
| category | week | units |
|----------|------|------:|
| electronics | 2026-04-06 | 100 |
| electronics | 2026-04-13 | 120 |
| electronics | 2026-04-20 | 150 |
| apparel | 2026-04-06 | 80 |
| apparel | 2026-04-13 | 85 |
| apparel | 2026-04-20 | 70 |
2. Build `with_prev` — apply `LAG(units) OVER (PARTITION BY category ORDER BY week)`:
| category | week | units | prev_units |
|----------|------|------:|-----------:|
| electronics | 2026-04-06 | 100 | NULL |
| electronics | 2026-04-13 | 120 | 100 |
| electronics | 2026-04-20 | 150 | 120 |
| apparel | 2026-04-06 | 80 | NULL |
| apparel | 2026-04-13 | 85 | 80 |
| apparel | 2026-04-20 | 70 | 85 |
3. Apply the growth formula — `(units - prev_units) * 100.0 / NULLIF(prev_units, 0)`, rounded to 2 decimals:
   - electronics 04-06: `(100 - NULL) ...` → NULL.
   - electronics 04-13: `(120 - 100) * 100.0 / 100 = 20.00`.
   - electronics 04-20: `(150 - 120) * 100.0 / 120 = 25.00`.
   - apparel 04-06: NULL.
   - apparel 04-13: `(85 - 80) * 100.0 / 80 = 6.25`.
   - apparel 04-20: `(70 - 85) * 100.0 / 85 = -17.65`.
4. `ORDER BY category ASC, week ASC` — alphabetic categories, then chronological weeks.
Output:
| category | week | units | growth_pct |
|---|---|---|---|
| apparel | 2026-04-06 | 80 | NULL |
| apparel | 2026-04-13 | 85 | 6.25 |
| apparel | 2026-04-20 | 70 | -17.65 |
| electronics | 2026-04-06 | 100 | NULL |
| electronics | 2026-04-13 | 120 | 20.00 |
| electronics | 2026-04-20 | 150 | 25.00 |
Why this works — concept by concept:
- `DATE_TRUNC('week', sale_date)` — collapses every day in an ISO week to that week's first day, giving a stable group key for weekly aggregation.
- Pre-aggregation CTE `weekly` — produces one row per `(category, week)` pair so `LAG` operates at the right grain (week, not day).
- `LAG(units) OVER (PARTITION BY category ORDER BY week)` — exposes the prior-week units for each category-week pair without crossing categories.
- `PARTITION BY category` — keeps each category's growth series isolated; growth never leaks between categories.
- `ORDER BY week` inside the window — defines "previous" as "previous week chronologically," not insertion order.
- `NULLIF(prev_units, 0)` — guards divide-by-zero on zero-baseline weeks; first-week rows (where `LAG` returns NULL) already propagate NULL safely through the expression.
- `* 100.0` float multiplier — forces float division so the percentage is a real decimal, not integer-truncated.
- `ROUND(..., 2)` — clamps the result to two decimal places for readable output.
- Cost — one scan of `sales`, one hash aggregate for the CTE, one window pass with a partition sort → `O(N log N)` worst case; typically `O(N)` with hash partitioning.
Tips to Crack Hyper Data Engineering Interviews
SQL is the entire loop—skip Python prep
Every problem in the curated Hyper SQL practice set is SQL. There is no Python in this loop. Spending interview-prep time on Python algorithms is a misallocation; spend it on DATE_TRUNC, LAG, partitioned ROW_NUMBER, and HAVING COUNT(DISTINCT) = patterns. Expect 4–6 SQL coding problems across the loop, all framed as retail analytics.
Drill the three window-function patterns Hyper actually asks
Two of the six Hyper problems are window functions: ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for top-N-per-group (#523), and LAG(...) over a partition for week-over-week growth (#525—the Hard-tier problem). The third pattern Hyper might combine into a follow-up is RANK() or DENSE_RANK() for "all top-tied rows." Drill the window-functions practice page until the CTE-wrapper-then-filter ritual is automatic.
Retail-analytics framing on every prompt
Frame your verbal walkthrough in retail vocabulary: SKUs, products, categories, states, monthly revenue, weekly growth, cohorts, sell-through. Interviewers test whether you can translate the abstract SQL primitive to the business question—matching the interviewer's mental model of what their team builds. The Hyper medium practice set leans on retail framings; mirror that vocabulary in your answer.
Completeness checks via HAVING are a Hyper signature
The "products sold in every state" pattern (#524) is unusual; most candidates haven't seen HAVING COUNT(DISTINCT state) = (SELECT COUNT(DISTINCT state) FROM customers) written down. Memorize it. Generalize: any "X covers all Ys" prompt becomes HAVING COUNT(DISTINCT y) = (SELECT COUNT(DISTINCT y) FROM y_table). Drill more variants on the HAVING-clause practice page.
Easy-tier discipline still matters
Three of the six Hyper problems are EASY-tier (#520, #521, #522). Easy at Hyper means: the interviewer expects clean, deterministic SQL with no off-by-ones, every ORDER BY followed by a tie-break key, every aggregate paired with the right GROUP BY. Stuttering through an EASY-tier problem signals worse than failing a HARD-tier one. The Hyper easy practice set is the right warmup.
Where to practice on PipeCode
Start with the Hyper practice page for the curated 6-problem set. After that, drill the matching topic pages: aggregation, joins, window-functions, HAVING-clause, date-functions. The interview courses page bundles the SQL course if you want a structured curriculum before company drills. For broad coverage, browse by topic.
Communication and approach under time pressure
For every prompt, narrate the structure before writing code: "I'll JOIN sales to customers to bring in state, group by product, count distinct states per product, then HAVING against the total state count." Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty input, ties, NULL columns, divide-by-zero in growth formulas, missing prior-period rows in LAG. The most common "almost passed" failure mode is a correct happy-path query that crashes on the empty input—a 30-second sweep prevents it.
Frequently Asked Questions
What is the Hyper data engineering interview process like?
Hyper's data engineering interview opens with a recruiter screen, then a technical phone screen with one SQL coding problem, then an onsite (or virtual onsite) of four to five rounds: two to three SQL coding rounds (aggregation, joins, window functions, HAVING completeness checks), one data-modeling discussion (star schemas, retail dimensional models), and one to two behavioral rounds. Hyper does not test Python in the data engineering loop; every coding round is SQL on retail-analytics framings.
What SQL topics does Hyper test for data engineers?
Hyper SQL interviews concentrate on six primitives that correspond directly to the curated 6-problem set: aggregation with ORDER BY ... LIMIT 1 for top-N rows, DATE_TRUNC('month' / 'week', ...) for date-bucketed rollups, multi-table INNER JOIN plus GROUP BY for cross-table aggregates, ROW_NUMBER() OVER (PARTITION BY ...) for top-N-per-group, HAVING COUNT(DISTINCT) = total for completeness checks, and LAG(...) over partitioned ordered windows for period-over-period growth. Memorize these six patterns and you will recognize most Hyper prompts within the first 60 seconds.
Does Hyper test Python in data engineering interviews?
The curated Hyper practice set is 6 SQL, 0 Python—Hyper's data engineering interview is SQL-only. If a Python question appears, treat it as a different role family (data scientist, analytics engineer with Python responsibilities) rather than the canonical data-engineering loop covered in this guide. Spending prep time on Python algorithms for a Hyper data-engineering interview is a misallocation; SQL fluency is the entire signal.
How hard are Hyper data engineering interview questions?
The Hyper set is 3 easy + 2 medium + 1 hard. The Hard-tier problem (#525 Weekly Category Growth Tracker) is LAG-driven WoW growth over a partitioned, week-truncated window with NULLIF divide-by-zero handling—make-or-break for senior roles. The two medium problems test partitioned ROW_NUMBER (#523) and HAVING COUNT(DISTINCT) = completeness checks (#524); both are pattern-recognition once you have seen them. The three EASY problems are aggregation classics where interviewers expect zero hesitation and clean tie-break handling.
Are window functions and weekly growth common in Hyper interviews?
Yes—two of six problems are window-function-driven, and the Hard-tier problem is explicitly week-over-week growth with LAG. If you cannot fluently write LAG(units) OVER (PARTITION BY category ORDER BY week) and the (curr - prev) * 100.0 / NULLIF(prev, 0) growth formula, you cannot pass the Hyper loop. Drill the window-functions practice page and the Hyper window-functions topic page until the patterns are muscle memory.
How many Hyper practice problems should I solve before the interview?
Solve all 6 problems on the Hyper practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional SQL problems spread across the matching topic pages: aggregation, joins, window-functions, HAVING-clause, date-functions, ranking, date-arithmetic. The Hyper SQL practice page and the Hyper medium practice page are the right surfaces for the final week of prep.


