Your OpenAI invoice says you spent $4,237 last month. It does not tell you that $3,100 came from one runaway summarization endpoint, $700 came from a customer paying $50/month, and $437 came from a feature nobody uses. If you want pricing, capacity, or roadmap decisions to be grounded in data, you need request-level cost attribution.
This guide shows how to implement OpenAI API cost attribution in production: tag every request, log token usage and computed cost, aggregate spend by feature/route/customer, set budget caps, and test the wrapper before shipping.
💡 Apidog gives you the request-level visibility and scenario testing you need to verify your cost-tracking wrapper works before it ships to production. Use Apidog to replay tagged requests, assert log shape, and validate that every call carries the metadata your warehouse expects.
TL;DR
Implement this pipeline:
- Wrap every OpenAI API call.
- Require metadata: `feature`, `route`, `customer_id`, and `environment`.
- Capture `response.usage`.
- Compute `cost_usd` at write time.
- Emit one structured log event per request.
- Aggregate by tag in your warehouse.
- Set OpenAI project/key budget caps.
- Alert on hourly spend anomalies.
- Validate the wrapper with Apidog scenario tests.
Introduction
You ship a new AI feature on Tuesday. By Friday, your CFO asks why the OpenAI line item jumped 40%. The OpenAI dashboard shows total spend and model usage, but not which feature, customer, or endpoint caused the spike.
That is the core problem: OpenAI billing is useful for invoices, not engineering attribution.
The fix is straightforward:
- Add metadata at the call site.
- Log every request as structured data.
- Compute cost from token usage.
- Store the event in your warehouse.
- Build dashboards and alerts from that table.
By the end of this guide, you will have:
- A cost-attribution event schema
- Python wrapper code
- SQL aggregation queries
- A verification workflow with Apidog
- A build-vs-buy tooling comparison
For pricing context, see the GPT-5.5 pricing breakdown. For a related billing-attribution problem, see GitHub Copilot usage billing for API teams. For API basics, see the official OpenAI API reference.
Why OpenAI’s billing dashboard is not enough
The OpenAI billing dashboard typically gives you:
- Daily spend
- Model breakdown
- Usage limits
That works for a simple setup. It breaks down when you have:
- Multiple AI features
- Multiple customers
- Multiple environments
- Multiple developers
- Background jobs
- Internal tools
What is missing
Total spend without context
The dashboard can tell you that you spent $312 yesterday. It cannot tell you whether that came from a customer hammering your support-chat endpoint or from a background job reprocessing your knowledge base.
No per-feature breakdown
OpenAI usage is grouped around account/project/model dimensions. It does not know your product concepts: `feature`, `route`, `customer_id`, or `environment`.
Reporting lag
Usage data may lag by tens of minutes or hours. That is too slow for runaway loops or hourly burn alerts.
No feature-level alerts
There is no native primitive for: “Page me if /api/v1/chat/answer exceeds $50/hour.”
No customer attribution
If you run B2B SaaS, you need to know which customer generated which spend. Without that, you cannot compute gross margin per customer.
Project keys help, but only partially
OpenAI project keys can separate workloads at a coarse level. They do not give you per-feature, per-route, or per-customer attribution. The OpenAI usage API returns aggregated data, not request-level product metadata.
The pattern is common enough that the Dev.to thread “OpenAI Tells You What You Spent. Not Where. So I Built a Dashboard” resonated with developers: you cannot manage what you cannot measure.
The cost-attribution data model
Treat every OpenAI request as a cost event. That event is the unit you query, alert on, and reconcile.
Use a schema like this:
| Column | Type | Example | Why it matters |
|---|---|---|---|
| `request_id` | uuid | `7a91...` | Idempotency, deduplication, retries |
| `timestamp` | timestamptz | `2026-05-06T14:23:01Z` | Time-series queries and anomaly detection |
| `feature` | text | `support-chat` | Product surface that triggered the call |
| `route` | text | `/api/v1/chat/answer` | HTTP route or background job ID |
| `customer_id` | text | `cust_4291` | Per-customer spend and gross margin |
| `environment` | text | `prod`, `staging`, `dev` | Separate production from internal usage |
| `model` | text | `gpt-5.5`, `gpt-5.4-mini` | Pricing differs per model |
| `prompt_tokens` | int | 15234 | Input token count |
| `completion_tokens` | int | 812 | Output token count |
| `reasoning_tokens` | int | 4500 | Reasoning tokens billed as output |
| `cached_tokens` | int | 12000 | Cached input tokens |
| `latency_ms` | int | 2341 | Cost/performance correlation |
| `cost_usd` | numeric(10,6) | 0.045672 | Cost computed at write time |
| `prompt_cache_key` | text | `system-v3` | Cache hit tracking |
| `error_code` | text | null, `429` | Retry and failure analysis |
Compute cost when you write the event, not later in a dashboard query. Pricing changes over time, so historical events should preserve the rate used at the time.
Example pricing function:
```python
PRICING = {  # USD per 1M tokens, as of May 2026
    "gpt-5.5": {"input": 5.00, "cached": 2.50, "output": 30.00},
    "gpt-5.5-pro": {"input": 30.00, "cached": 15.00, "output": 180.00},
    "gpt-5.4": {"input": 2.50, "cached": 1.25, "output": 15.00},
    "gpt-5.4-mini": {"input": 0.25, "cached": 0.125, "output": 2.00},
}

def compute_cost_usd(model, prompt_tokens, cached_tokens, completion_tokens, reasoning_tokens):
    rates = PRICING[model]
    uncached = max(0, prompt_tokens - cached_tokens)
    input_cost = (uncached * rates["input"]) / 1_000_000
    cache_cost = (cached_tokens * rates["cached"]) / 1_000_000
    output_cost = ((completion_tokens + reasoning_tokens) * rates["output"]) / 1_000_000
    return round(input_cost + cache_cost + output_cost, 6)
```
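As a quick sanity check, here is the function applied to a hypothetical request. The rate table is restated inline so the snippet runs standalone; the rates are this article's illustrative figures, not confirmed OpenAI pricing.

```python
# Illustrative rates from the table above (USD per 1M tokens).
PRICING = {
    "gpt-5.4": {"input": 2.50, "cached": 1.25, "output": 15.00},
}

def compute_cost_usd(model, prompt_tokens, cached_tokens,
                     completion_tokens, reasoning_tokens):
    rates = PRICING[model]
    uncached = max(0, prompt_tokens - cached_tokens)
    input_cost = (uncached * rates["input"]) / 1_000_000
    cache_cost = (cached_tokens * rates["cached"]) / 1_000_000
    # Reasoning tokens are billed at the output rate.
    output_cost = ((completion_tokens + reasoning_tokens) * rates["output"]) / 1_000_000
    return round(input_cost + cache_cost + output_cost, 6)

# 10,000 prompt tokens (4,000 of them cached), 500 completion, 1,200 reasoning:
cost = compute_cost_usd("gpt-5.4", 10_000, 4_000, 500, 1_200)
print(cost)  # 0.0455
```

The uncached input (6,000 tokens at $2.50/M), cached input (4,000 at $1.25/M), and output (1,700 at $15/M) contribute $0.015, $0.005, and $0.0255 respectively.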
Reasoning tokens are returned under `usage.completion_tokens_details.reasoning_tokens`.
They are billed at the output rate. If you omit them, you undercount cost for reasoning-heavy calls.
For more pricing details, see the GPT-5.5 pricing breakdown.
Wrap the OpenAI client
Every OpenAI call should go through one wrapper. The wrapper should:
- Require product metadata.
- Generate or receive a `request_id`.
- Call OpenAI.
- Capture token usage.
- Compute cost.
- Emit a structured event.
```python
import time
import uuid
import json
import logging

from openai import OpenAI

client = OpenAI()
logger = logging.getLogger("llm.cost")

def call_with_attribution(
    *,
    feature,
    route,
    customer_id,
    environment,
    model,
    messages,
    request_id=None,
    **openai_kwargs
):
    if not feature or not route or not customer_id or not environment:
        raise ValueError("feature, route, customer_id, and environment are required")

    request_id = request_id or str(uuid.uuid4())
    started = time.time()
    error_code = None
    response = None

    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            **openai_kwargs
        )
        return response
    except Exception as e:
        error_code = getattr(e, "code", None) or "unknown_error"
        raise
    finally:
        latency_ms = int((time.time() - started) * 1000)
        u = response.usage if response else None
        prompt_tokens = getattr(u, "prompt_tokens", 0) if u else 0
        completion_tokens = getattr(u, "completion_tokens", 0) if u else 0
        cached_tokens = (
            getattr(getattr(u, "prompt_tokens_details", None), "cached_tokens", 0)
            if u else 0
        ) or 0
        reasoning_tokens = (
            getattr(getattr(u, "completion_tokens_details", None), "reasoning_tokens", 0)
            if u else 0
        ) or 0
        # Guard against models missing from the rate table, so a pricing gap
        # cannot raise inside finally and mask the original exception.
        cost_usd = (
            compute_cost_usd(model, prompt_tokens, cached_tokens,
                             completion_tokens, reasoning_tokens)
            if model in PRICING else 0.0
        )
        logger.info(json.dumps({
            "event": "openai.request",
            "request_id": request_id,
            "feature": feature,
            "route": route,
            "customer_id": customer_id,
            "environment": environment,
            "model": model,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "reasoning_tokens": reasoning_tokens,
            "cached_tokens": cached_tokens,
            "latency_ms": latency_ms,
            "cost_usd": cost_usd,
            "error_code": error_code,
        }))
```
Usage example:
```python
response = call_with_attribution(
    feature="support-chat",
    route="/api/v1/chat/answer",
    customer_id="cust_4291",
    environment="prod",
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are a support assistant."},
        {"role": "user", "content": "How do I reset my password?"}
    ],
)
```
Ship these logs to your existing pipeline:
- Vector
- Fluent Bit
- Logstash
- OTLP collector
- Kafka
- Pub/Sub
- NATS
Then write them into your warehouse:
- BigQuery
- ClickHouse
- Snowflake
- Postgres
For Node.js, use the same shape: a wrapper function around the OpenAI SDK that accepts metadata, captures `response.usage`, computes cost, and writes a JSON event.
Wire up cost tracking and test it with Apidog
1. Replace direct OpenAI calls
Search your codebase for direct SDK calls:
```bash
grep -R "client.chat.completions.create" .
grep -R "OpenAI(" .
```
Replace every direct call with your attribution wrapper.
Do not default missing metadata to "unknown". Fail fast:
```python
if not feature:
    raise ValueError("feature is required")
```
Bad tags create silent attribution errors.
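One way to enforce this is a small guard that the wrapper calls before anything else. This is a sketch; `require_metadata` is a hypothetical helper name, not part of the OpenAI SDK.

```python
def require_metadata(**tags):
    """Raise if any attribution tag is missing or empty."""
    missing = sorted(k for k, v in tags.items() if not v)
    if missing:
        raise ValueError(f"missing attribution tags: {', '.join(missing)}")

# Passes silently when all four tags are present:
require_metadata(feature="support-chat", route="/api/v1/chat/answer",
                 customer_id="cust_4291", environment="prod")

# Fails fast instead of silently logging "unknown":
try:
    require_metadata(feature="support-chat", route="/api/v1/chat/answer",
                     customer_id=None, environment="prod")
except ValueError as e:
    print(e)  # missing attribution tags: customer_id
```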
2. Emit structured logs
Log one JSON event per request:
```json
{
  "event": "openai.request",
  "request_id": "7a91...",
  "feature": "support-chat",
  "route": "/api/v1/chat/answer",
  "customer_id": "cust_4291",
  "environment": "prod",
  "model": "gpt-5.5",
  "prompt_tokens": 15234,
  "completion_tokens": 812,
  "reasoning_tokens": 4500,
  "cached_tokens": 12000,
  "latency_ms": 2341,
  "cost_usd": 0.045672,
  "error_code": null
}
```
Keep these events clean. Do not mix them with debug logs.
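A cheap way to keep them clean is a schema check in your log pipeline or CI. Here is a minimal sketch, assuming events arrive as one JSON line each (`event_problems` is a hypothetical helper, not a library API):

```python
import json

# Required keys and types for one cost event (error_code may be null,
# so it is deliberately not listed here).
REQUIRED = {
    "event": str, "request_id": str, "feature": str, "route": str,
    "customer_id": str, "environment": str, "model": str,
    "prompt_tokens": int, "completion_tokens": int, "reasoning_tokens": int,
    "cached_tokens": int, "latency_ms": int, "cost_usd": (int, float),
}

def event_problems(line: str) -> list:
    """Return a list of schema problems for one JSON log line (empty = OK)."""
    evt = json.loads(line)
    problems = []
    for key, typ in REQUIRED.items():
        if key not in evt:
            problems.append(f"missing {key}")
        elif not isinstance(evt[key], typ):
            problems.append(f"wrong type for {key}")
    return problems
```

Reject or quarantine any line with a non-empty problem list before it reaches the warehouse.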
3. Aggregate spend in SQL
Once events are in your warehouse, start with feature-level spend:
```sql
SELECT
  feature,
  DATE_TRUNC(timestamp, DAY) AS day,
  COUNT(*) AS requests,
  SUM(cost_usd) AS spend_usd,
  SUM(prompt_tokens + completion_tokens + reasoning_tokens) AS tokens,
  AVG(latency_ms) AS avg_latency_ms,
  SUM(cached_tokens) / NULLIF(SUM(prompt_tokens), 0) AS cache_hit_rate
FROM openai_events
WHERE environment = 'prod'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY feature, day
ORDER BY day DESC, spend_usd DESC;
```
Then add customer-level spend:
```sql
SELECT
  customer_id,
  DATE_TRUNC(timestamp, MONTH) AS month,
  COUNT(*) AS requests,
  SUM(cost_usd) AS spend_usd
FROM openai_events
WHERE environment = 'prod'
GROUP BY customer_id, month
ORDER BY spend_usd DESC;
```
And route-level spend:
```sql
SELECT
  route,
  COUNT(*) AS requests,
  SUM(cost_usd) AS spend_usd,
  AVG(cost_usd) AS avg_cost_per_request
FROM openai_events
WHERE environment = 'prod'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY route
ORDER BY spend_usd DESC
LIMIT 20;
```
4. Build the dashboard
Create three operational views:
- Spend per feature over time
- Spend per customer over time
- Top routes by daily spend
Use whatever BI layer you already have:
- Grafana
- Metabase
- Looker
- Superset
- Mode
5. Test the wrapper with Apidog
Before shipping, verify that the wrapper logs the metadata you expect.
Use Apidog to create an end-to-end scenario:
- Send a request to your AI endpoint with a known `customer_id`.
- Verify the API response succeeds.
- Capture the side-channel log event through your logging endpoint, stdout collector, or OTLP/log pipeline.
- Assert the event contains `feature`, `route`, `customer_id`, `environment`, `model`, `prompt_tokens > 0`, and `cost_usd > 0`.
- Run the same scenario against staging and production using Apidog environments.
- Replay the request and verify retries do not double-count cost.
For broader testing workflows, see API testing tools for QA engineers. For contract-first coverage, see contract-first API development.
6. Set budget caps and alerts
Use OpenAI project keys to isolate risk:
- `prod-support-chat`
- `prod-summarization`
- `staging-all`
- `dev-all`
Set hard caps in the OpenAI dashboard so one runaway workload cannot drain the whole organization budget.
Then add warehouse-driven alerts. Example: page if any feature exceeds 3x its seven-day average hourly spend.
```sql
WITH hourly AS (
  SELECT
    feature,
    TIMESTAMP_TRUNC(timestamp, HOUR) AS hour,
    SUM(cost_usd) AS spend_usd
  FROM openai_events
  WHERE environment = 'prod'
    AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY)
  GROUP BY feature, hour
),
baseline AS (
  SELECT
    feature,
    AVG(spend_usd) AS avg_hourly_spend
  FROM hourly
  WHERE hour < TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 1 HOUR)
  GROUP BY feature
),
current_hour AS (
  SELECT
    feature,
    spend_usd
  FROM hourly
  WHERE hour = TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR)
)
SELECT
  c.feature,
  c.spend_usd,
  b.avg_hourly_spend
FROM current_hour c
JOIN baseline b USING (feature)
WHERE c.spend_usd > b.avg_hourly_spend * 3;
```
Send the result to:
- PagerDuty
- Opsgenie
- Slack
- Incident.io
Native caps protect you from catastrophic burn. Warehouse alerts catch slow drift earlier.
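Delivery can be as simple as posting each anomaly row to an incoming-webhook URL. A minimal sketch using only the standard library; the webhook URL and message format are assumptions, not a specific vendor's API:

```python
import json
import urllib.request

def build_alert(feature: str, spend_usd: float, avg_hourly_spend: float) -> dict:
    """Format one anomaly row from the SQL query above as a chat message."""
    ratio = spend_usd / avg_hourly_spend if avg_hourly_spend else float("inf")
    return {"text": (f"ALERT: {feature} spent ${spend_usd:.2f} this hour "
                     f"({ratio:.1f}x its 7-day hourly average of "
                     f"${avg_hourly_spend:.2f})")}

def send_alert(payload: dict, webhook_url: str) -> None:
    # Fire-and-forget POST to a Slack-style incoming webhook.
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example (webhook URL is a placeholder you would configure yourself):
# send_alert(build_alert("support-chat", 162.40, 41.10), WEBHOOK_URL)
```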
Advanced techniques
Prompt caching
GPT-5.5 charges less for cached input tokens. Structure prompts so stable content appears first:
```
[Stable system prompt]
[Stable policy/instructions]
[Stable examples]
[Per-request user data]
```
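The same layout in code, as a minimal sketch (the prompt strings are placeholders):

```python
# Stable content first, so the shared prefix stays byte-identical across
# requests and remains eligible for prompt caching.
STABLE_PREFIX = [
    {"role": "system", "content": "You are a support assistant."},     # stable system prompt
    {"role": "system", "content": "Policy: cite docs, never guess."},  # stable instructions
]

def build_messages(user_question: str) -> list:
    # Per-request data goes last, after the cacheable prefix.
    return STABLE_PREFIX + [{"role": "user", "content": user_question}]
```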
Track this per feature:
```sql
SELECT
  feature,
  SUM(cached_tokens) / NULLIF(SUM(prompt_tokens), 0) AS cache_hit_rate
FROM openai_events
WHERE environment = 'prod'
GROUP BY feature
ORDER BY cache_hit_rate ASC;
```
If a prompt change drops cache hit rate, your input cost can rise silently.
See the official OpenAI prompt caching docs for eligibility rules.
Batch API for offline workloads
Use the Batch API for workloads that do not need synchronous responses:
- Nightly summarization
- Evaluation runs
- Embedding backfills
- Document re-processing
Tag these events with a `batch_job_id` so you can attribute cost back to the source workload.
Reasoning effort tuning
Reasoning-heavy calls can multiply output tokens. Audit features that use higher reasoning effort:
- Can `medium` become `low`?
- Does quality remain acceptable?
- What is the cost delta?
Track cost and quality side by side before changing production defaults.
For more details, see how to use the GPT-5.5 API.
Context-window discipline
Long prompts are expensive. Prefer tight retrieval over stuffing large context windows.
Track prompt size by feature:
```sql
SELECT
  feature,
  AVG(prompt_tokens) AS avg_prompt_tokens,
  APPROX_QUANTILES(prompt_tokens, 100)[OFFSET(95)] AS p95_prompt_tokens
FROM openai_events
WHERE environment = 'prod'
GROUP BY feature
ORDER BY p95_prompt_tokens DESC;
```
If prompt size grows without a product reason, investigate.
Watch the 272K-token cliff
OpenAI applies higher pricing on GPT-5.5 requests above 272K tokens. Add a guardrail:
```python
if prompt_tokens > 250_000:
    logger.warning(json.dumps({
        "event": "openai.prompt_size_warning",
        "request_id": request_id,
        "feature": feature,
        "route": route,
        "customer_id": customer_id,
        "prompt_tokens": prompt_tokens,
    }))
```
For pricing details, see the GPT-5.5 pricing post.
Per-customer spend caps
For B2B SaaS, enforce spend limits before making the OpenAI call.
Example flow:
- Query current monthly spend for `customer_id`.
- Compare it to the customer’s quota.
- If under quota, call OpenAI.
- If over quota, return `429`.
Example response:
```json
{
  "error": "monthly_ai_quota_exceeded",
  "message": "Your monthly AI quota has been exceeded. Upgrade your plan or contact billing."
}
```
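The pre-call check itself can stay small. A sketch under two assumptions: the current spend comes from the customer-level SQL earlier, and the quota comes from your own plan table.

```python
QUOTA_EXCEEDED = {
    "error": "monthly_ai_quota_exceeded",
    "message": "Your monthly AI quota has been exceeded. "
               "Upgrade your plan or contact billing.",
}

def enforce_quota(spend_this_month_usd: float, quota_usd: float):
    """Return (status_code, body) to short-circuit, or None to proceed.

    Both inputs are assumed to be looked up elsewhere (warehouse query
    for spend, billing/plan table for the quota).
    """
    if spend_this_month_usd >= quota_usd:
        return 429, QUOTA_EXCEEDED
    return None

# enforce_quota(51.20, 50.00) -> (429, QUOTA_EXCEEDED)
# enforce_quota(12.75, 50.00) -> None, so the OpenAI call proceeds
```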
This turns AI from a margin risk into a controllable product cost.
Common mistakes
Avoid these:
- Counting reasoning tokens as input. They are output.
- Trusting the OpenAI dashboard for real-time alerts.
- Adding tags globally instead of at the call site.
- Forgetting background jobs and queue workers.
- Sampling logs. Log every request.
- Allowing `customer_id` to be null.
- Computing historical cost with today’s pricing.
- Retrying successful requests with a new `request_id`.
For background jobs, use synthetic routes:
- `cron:nightly-summarize`
- `queue:image-caption`
- `webhook:crm-sync`
For unknown internal usage, use explicit values:
```python
customer_id = "internal"
customer_id = "system"
```
Never use null as an attribution bucket.
Alternatives and tooling
You do not have to build all of this yourself.
| Approach | What it does well | What it costs | When to use |
|---|---|---|---|
| OpenAI usage API | Native, no setup, accurate to the cent | Free | One project, one feature, no per-customer attribution |
| Helicone | Drop-in proxy, dashboards, caching, per-user costs | Free tier; paid from $20/mo | You want a hosted dashboard quickly and accept a proxy |
| Langfuse | Open source, self-host or cloud, traces plus cost | Free self-hosted; cloud from $29/mo | You want traces and cost in one tool |
| LangSmith | LangChain integration, evals, cost tracking | Paid from $39/user/mo | You already use LangChain heavily |
| Custom warehouse | Full control, no proxy, custom dimensions | Engineering time | Large workloads, strict residency, custom attribution |
Tradeoffs:
- A proxy adds another hop in the critical path.
- A self-hosted observability stack gives control but adds ops work.
- A custom warehouse integrates well with your data stack but requires you to own queries and alerts.
- The native usage API is useful for reconciliation, not product-level attribution.
For more on hosted LLM cost monitoring, see Helicone’s guide on tracking LLM costs. For open-source cost tracking, see the Langfuse cost tracking docs.
If you operate at platform scale, these patterns also fit service-mesh and platform-engineering workflows. See API platforms for microservices architecture.
Real-world use cases
B2B SaaS with per-customer LLM spend
A sales-intelligence product spends $80,000/month on OpenAI. After adding per-customer attribution, the team learns that 12% of customers drive 71% of AI spend.
The company can then:
- Add tiered pricing
- Apply soft quotas to lower tiers
- Charge overages
- Improve gross margin per account
Internal developer tooling
An engineering org gives developers access to an internal GPT-5.5 assistant. By tagging requests with developer identity, platform engineering sees that three developers account for 50% of internal spend.
Two are running abandoned agent loops. Turning them off saves $1,800/month. The third is doing legitimate high-value work, so the team increases their quota.
AI feature forecasting
A product team wants to ship summarization. Historical events give them:
- Average input tokens per call
- Average output tokens per call
- Calls per active user
- Active user forecast
They estimate cost at $0.04 per active user per day, or about $1.20/month. Pricing can then set a $5/month feature price with visible unit economics.
Conclusion
OpenAI’s billing dashboard answers an accounting question. Request-level attribution answers the engineering and product question: where is the money going?
Implementation checklist:
- Tag every request with `feature`, `route`, `customer_id`, and `environment`.
- Compute cost at write time.
- Log every request as structured data.
- Store events in your warehouse.
- Build feature, route, and customer dashboards.
- Set OpenAI project/key caps.
- Add warehouse-driven anomaly alerts.
- Test the wrapper with Apidog.
- Audit reasoning effort, prompt size, and cache hit rate regularly.
Download Apidog and use it to verify your cost-attribution wrapper end to end. Drive AI endpoints with tagged requests, assert the log payload shape, and replay scenarios across environments before your warehouse depends on the data.
For related cost-management reading, see the GPT-5.5 pricing breakdown and GitHub Copilot usage billing for API teams.
FAQ
Do reasoning tokens count as input or output for billing?
Reasoning tokens are billed at the output rate. The OpenAI API returns them under `usage.completion_tokens_details.reasoning_tokens`. Add them to `completion_tokens` when computing cost. For per-effort pricing details, see the GPT-5.5 pricing breakdown.
How accurate is response.usage compared to the OpenAI dashboard?
Token counts in `response.usage` should match dashboard usage. Cost drift usually comes from stale pricing tables. Pin your rate table per model and update it when OpenAI changes pricing.
Can I do attribution with OpenAI project keys alone?
Only partially. Project keys give you one dimension of attribution. They do not give you per-feature, per-customer, or per-route visibility. Use project keys for isolation and budget caps; use application metadata for product attribution.
What about retries and rate-limit errors?
If a request fails before the model runs, there is no usage object and no cost to log.
If a request succeeds and your app retries it, you can double-count unless you reuse the same `request_id` and dedupe on write.
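One simple dedupe pattern: make `request_id` the primary key of the events table and use an idempotent insert. A sketch using SQLite as a stand-in for the warehouse (the equivalent in Postgres would be `INSERT ... ON CONFLICT DO NOTHING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for your warehouse table
conn.execute("""
    CREATE TABLE openai_events (
        request_id TEXT PRIMARY KEY,
        feature TEXT,
        cost_usd REAL
    )
""")

def write_event(request_id: str, feature: str, cost_usd: float) -> None:
    # PRIMARY KEY on request_id + INSERT OR IGNORE makes retried writes no-ops.
    conn.execute(
        "INSERT OR IGNORE INTO openai_events VALUES (?, ?, ?)",
        (request_id, feature, cost_usd),
    )

write_event("7a91", "support-chat", 0.045672)
write_event("7a91", "support-chat", 0.045672)  # retried write: deduplicated
count, spend = conn.execute(
    "SELECT COUNT(*), SUM(cost_usd) FROM openai_events").fetchone()
print(count, spend)  # 1 0.045672
```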
How fast does the OpenAI usage API return data?
The usage API can lag by tens of minutes. Use it for reconciliation. Use your own event stream and warehouse for alerts and kill switches.
Should I sample requests?
No. One JSON line per request is small, and sampling breaks customer and route attribution. Log every request.
Can this work for other LLM providers?
Yes. Add a `provider` column:

- `openai`
- `anthropic`
- `google`
- `deepseek`
Then maintain provider-specific pricing logic. The warehouse schema and dashboards can stay mostly the same.
For a comparison point, see DeepSeek V4 API pricing.
Does this work for embeddings and image generation?
Yes, but the cost math changes.
Add an `endpoint` column:

- `chat`
- `embeddings`
- `image`

Then branch cost computation by endpoint. Embeddings are usually billed per input token. Images are usually billed per image or by resolution.
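The branch can be a thin dispatcher in front of per-endpoint rate logic. A sketch with deliberately made-up rates; substitute your provider's real price sheet:

```python
# Hypothetical rates, for illustration only.
TOKEN_RATE_PER_M = {"embeddings": 0.10}   # USD per 1M input tokens
IMAGE_RATE = {"1024x1024": 0.04}          # USD per generated image

def compute_cost(endpoint: str, **kw) -> float:
    if endpoint == "chat":
        # Chat is handled by the token-based compute_cost_usd shown earlier.
        raise NotImplementedError("use compute_cost_usd for chat")
    if endpoint == "embeddings":
        # Embeddings: billed on input tokens only.
        return round(kw["input_tokens"] * TOKEN_RATE_PER_M["embeddings"] / 1_000_000, 6)
    if endpoint == "image":
        # Images: billed per image, at a rate that depends on resolution.
        return round(kw["n_images"] * IMAGE_RATE[kw["size"]], 6)
    raise ValueError(f"unknown endpoint: {endpoint}")

print(compute_cost("embeddings", input_tokens=500_000))     # 0.05
print(compute_cost("image", n_images=3, size="1024x1024"))  # 0.12
```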