Your OpenAI invoice says you spent $4,237 last month. It does not tell you that $3,100 came from one runaway summarization endpoint, $700 came from a customer paying $50/month, and $437 came from a feature nobody uses. If you want pricing, capacity, or roadmap decisions to be grounded in data, you need request-level cost attribution.
This guide shows how to implement OpenAI API cost attribution in production: tag every request, log token usage and computed cost, aggregate spend by feature/route/customer, set budget caps, and test the wrapper before shipping.
💡 Apidog gives you the request-level visibility and scenario testing you need to verify your cost-tracking wrapper works before it ships to production. Use Apidog to replay tagged requests, assert log shape, and validate that every call carries the metadata your warehouse expects.
TL;DR
Implement this pipeline:
- Wrap every OpenAI API call.
- Require metadata: `feature`, `route`, `customer_id`, and `environment`.
- Capture `response.usage`.
- Compute `cost_usd` at write time.
- Emit one structured log event per request.
- Aggregate by tag in your warehouse.
- Set OpenAI project/key budget caps.
- Alert on hourly spend anomalies.
- Validate the wrapper with Apidog scenario tests.
Introduction
You ship a new AI feature on Tuesday. By Friday, your CFO asks why the OpenAI line item jumped 40%. The OpenAI dashboard shows total spend and model usage, but not which feature, customer, or endpoint caused the spike.
That is the core problem: OpenAI billing is useful for invoices, not engineering attribution.
The fix is straightforward:
- Add metadata at the call site.
- Log every request as structured data.
- Compute cost from token usage.
- Store the event in your warehouse.
- Build dashboards and alerts from that table.
By the end of this guide, you will have:
- A cost-attribution event schema
- Python wrapper code
- SQL aggregation queries
- A verification workflow with Apidog
- A build-vs-buy tooling comparison
For pricing context, see the GPT-5.5 pricing breakdown. For a related billing-attribution problem, see GitHub Copilot usage billing for API teams. For API basics, see the official OpenAI API reference.
Why OpenAI’s billing dashboard is not enough
The OpenAI billing dashboard typically gives you:
- Daily spend
- Model breakdown
- Usage limits
That works for a simple setup. It breaks down when you have:
- Multiple AI features
- Multiple customers
- Multiple environments
- Multiple developers
- Background jobs
- Internal tools
What is missing
Total spend without context
The dashboard can tell you that you spent $312 yesterday. It cannot tell you whether that came from a customer hammering your support-chat endpoint or from a background job reprocessing your knowledge base.
No per-feature breakdown
OpenAI usage is grouped around account/project/model dimensions. It does not know your product concepts: `feature`, `route`, `customer_id`, or `environment`.
Reporting lag
Usage data may lag by tens of minutes or hours. That is too slow for runaway loops or hourly burn alerts.
No feature-level alerts
There is no native primitive for: “Page me if /api/v1/chat/answer exceeds $50/hour.”
No customer attribution
If you run B2B SaaS, you need to know which customer generated which spend. Without that, you cannot compute gross margin per customer.
Project keys help, but only partially
OpenAI project keys can separate workloads at a coarse level. They do not give you per-feature, per-route, or per-customer attribution. The OpenAI usage API returns aggregated data, not request-level product metadata.
The pattern is common enough that the Dev.to thread “OpenAI Tells You What You Spent. Not Where. So I Built a Dashboard” resonated with developers: you cannot manage what you cannot measure.
The cost-attribution data model
Treat every OpenAI request as a cost event. That event is the unit you query, alert on, and reconcile.
Use a schema like this:
| Column | Type | Example | Why it matters |
|---|---|---|---|
| `request_id` | uuid | `7a91...` | Idempotency, deduplication, retries |
| `timestamp` | timestamptz | `2026-05-06T14:23:01Z` | Time-series queries and anomaly detection |
| `feature` | text | `support-chat` | Product surface that triggered the call |
| `route` | text | `/api/v1/chat/answer` | HTTP route or background job ID |
| `customer_id` | text | `cust_4291` | Per-customer spend and gross margin |
| `environment` | text | `prod`, `staging`, `dev` | Separate production from internal usage |
| `model` | text | `gpt-5.5`, `gpt-5.4-mini` | Pricing differs per model |
| `prompt_tokens` | int | 15234 | Input token count |
| `completion_tokens` | int | 812 | Output token count |
| `reasoning_tokens` | int | 4500 | Reasoning tokens billed as output |
| `cached_tokens` | int | 12000 | Cached input tokens |
| `latency_ms` | int | 2341 | Cost/performance correlation |
| `cost_usd` | numeric(10,6) | 0.045672 | Cost computed at write time |
| `prompt_cache_key` | text | `system-v3` | Cache hit tracking |
| `error_code` | text | null, `429` | Retry and failure analysis |
Compute cost when you write the event, not later in a dashboard query. Pricing changes over time, so historical events should preserve the rate used at the time.
Example pricing function:
```python
PRICING = {  # USD per 1M tokens, as of May 2026
    "gpt-5.5": {"input": 5.00, "cached": 2.50, "output": 30.00},
    "gpt-5.5-pro": {"input": 30.00, "cached": 15.00, "output": 180.00},
    "gpt-5.4": {"input": 2.50, "cached": 1.25, "output": 15.00},
    "gpt-5.4-mini": {"input": 0.25, "cached": 0.125, "output": 2.00},
}

def compute_cost_usd(model, prompt_tokens, cached_tokens, completion_tokens, reasoning_tokens):
    rates = PRICING[model]
    uncached = max(0, prompt_tokens - cached_tokens)
    input_cost = (uncached * rates["input"]) / 1_000_000
    cache_cost = (cached_tokens * rates["cached"]) / 1_000_000
    output_cost = ((completion_tokens + reasoning_tokens) * rates["output"]) / 1_000_000
    return round(input_cost + cache_cost + output_cost, 6)
```
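As a quick sanity check, here is the function applied to a hypothetical request. The rate table is restated inline so the snippet runs standalone; the rates are this article's illustrative figures, not confirmed OpenAI pricing.

```python
# Illustrative rates from the table above (USD per 1M tokens).
PRICING = {
    "gpt-5.4": {"input": 2.50, "cached": 1.25, "output": 15.00},
}

def compute_cost_usd(model, prompt_tokens, cached_tokens,
                     completion_tokens, reasoning_tokens):
    rates = PRICING[model]
    uncached = max(0, prompt_tokens - cached_tokens)
    input_cost = (uncached * rates["input"]) / 1_000_000
    cache_cost = (cached_tokens * rates["cached"]) / 1_000_000
    # Reasoning tokens are billed at the output rate.
    output_cost = ((completion_tokens + reasoning_tokens) * rates["output"]) / 1_000_000
    return round(input_cost + cache_cost + output_cost, 6)

# 10,000 prompt tokens (4,000 of them cached), 500 completion, 1,200 reasoning:
cost = compute_cost_usd("gpt-5.4", 10_000, 4_000, 500, 1_200)
print(cost)  # 0.0455
```

The uncached input (6,000 tokens at $2.50/M), cached input (4,000 at $1.25/M), and output (1,700 at $15/M) contribute $0.015, $0.005, and $0.0255 respectively.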
Reasoning tokens are returned under `usage.completion_tokens_details.reasoning_tokens`.
They are billed at the output rate. If you omit them, you undercount cost for reasoning-heavy calls.
For more pricing details, see the GPT-5.5 pricing breakdown.
Wrap the OpenAI client
Every OpenAI call should go through one wrapper. The wrapper should:
- Require product metadata.
- Generate or receive a `request_id`.
- Call OpenAI.
- Capture token usage.
- Compute cost.
- Emit a structured event.
```python
import time
import uuid
import json
import logging

from openai import OpenAI

client = OpenAI()
logger = logging.getLogger("llm.cost")

def call_with_attribution(
    *,
    feature,
    route,
    customer_id,
    environment,
    model,
    messages,
    request_id=None,
    **openai_kwargs
):
    if not feature or not route or not customer_id or not environment:
        raise ValueError("feature, route, customer_id, and environment are required")

    request_id = request_id or str(uuid.uuid4())
    started = time.time()
    error_code = None
    response = None

    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            **openai_kwargs
        )
        return response
    except Exception as e:
        error_code = getattr(e, "code", None) or "unknown_error"
        raise
    finally:
        latency_ms = int((time.time() - started) * 1000)
        u = response.usage if response else None
        prompt_tokens = getattr(u, "prompt_tokens", 0) if u else 0
        completion_tokens = getattr(u, "completion_tokens", 0) if u else 0
        cached_tokens = (
            getattr(getattr(u, "prompt_tokens_details", None), "cached_tokens", 0)
            if u else 0
        ) or 0
        reasoning_tokens = (
            getattr(getattr(u, "completion_tokens_details", None), "reasoning_tokens", 0)
            if u else 0
        ) or 0
        # Guard against models missing from the rate table, so a pricing gap
        # cannot raise inside finally and mask the original exception.
        cost_usd = (
            compute_cost_usd(model, prompt_tokens, cached_tokens,
                             completion_tokens, reasoning_tokens)
            if model in PRICING else 0.0
        )
        logger.info(json.dumps({
            "event": "openai.request",
            "request_id": request_id,
            "feature": feature,
            "route": route,
            "customer_id": customer_id,
            "environment": environment,
            "model": model,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "reasoning_tokens": reasoning_tokens,
            "cached_tokens": cached_tokens,
            "latency_ms": latency_ms,
            "cost_usd": cost_usd,
            "error_code": error_code,
        }))
```
Usage example:
```python
response = call_with_attribution(
    feature="support-chat",
    route="/api/v1/chat/answer",
    customer_id="cust_4291",
    environment="prod",
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": "You are a support assistant."},
        {"role": "user", "content": "How do I reset my password?"}
    ],
)
```
Ship these logs to your existing pipeline:
- Vector
- Fluent Bit
- Logstash
- OTLP collector
- Kafka
- Pub/Sub
- NATS
Then write them into your warehouse:
- BigQuery
- ClickHouse
- Snowflake
- Postgres
For Node.js, use the same shape: a wrapper function around the OpenAI SDK that accepts metadata, captures `response.usage`, computes cost, and writes a JSON event.
Wire up cost tracking and test it with Apidog
1. Replace direct OpenAI calls
Search your codebase for direct SDK calls:
```bash
grep -R "client.chat.completions.create" .
grep -R "OpenAI(" .
```
Replace every direct call with your attribution wrapper.
Do not default missing metadata to "unknown". Fail fast:
```python
if not feature:
    raise ValueError("feature is required")
```
Bad tags create silent attribution errors.
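One way to enforce this is a small guard that the wrapper calls before anything else. This is a sketch; `require_metadata` is a hypothetical helper name, not part of the OpenAI SDK.

```python
def require_metadata(**tags):
    """Raise if any attribution tag is missing or empty."""
    missing = sorted(k for k, v in tags.items() if not v)
    if missing:
        raise ValueError(f"missing attribution tags: {', '.join(missing)}")

# Passes silently when all four tags are present:
require_metadata(feature="support-chat", route="/api/v1/chat/answer",
                 customer_id="cust_4291", environment="prod")

# Fails fast instead of silently logging "unknown":
try:
    require_metadata(feature="support-chat", route="/api/v1/chat/answer",
                     customer_id=None, environment="prod")
except ValueError as e:
    print(e)  # missing attribution tags: customer_id
```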
2. Emit structured logs
Log one JSON event per request:
```json
{
  "event": "openai.request",
  "request_id": "7a91...",
  "feature": "support-chat",
  "route": "/api/v1/chat/answer",
  "customer_id": "cust_4291",
  "environment": "prod",
  "model": "gpt-5.5",
  "prompt_tokens": 15234,
  "completion_tokens": 812,
  "reasoning_tokens": 4500,
  "cached_tokens": 12000,
  "latency_ms": 2341,
  "cost_usd": 0.045672,
  "error_code": null
}
```
Keep these events clean. Do not mix them with debug logs.
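A cheap way to keep them clean is a schema check in your log pipeline or CI. Here is a minimal sketch, assuming events arrive as one JSON line each (`event_problems` is a hypothetical helper, not a library API):

```python
import json

# Required keys and types for one cost event (error_code may be null,
# so it is deliberately not listed here).
REQUIRED = {
    "event": str, "request_id": str, "feature": str, "route": str,
    "customer_id": str, "environment": str, "model": str,
    "prompt_tokens": int, "completion_tokens": int, "reasoning_tokens": int,
    "cached_tokens": int, "latency_ms": int, "cost_usd": (int, float),
}

def event_problems(line: str) -> list:
    """Return a list of schema problems for one JSON log line (empty = OK)."""
    evt = json.loads(line)
    problems = []
    for key, typ in REQUIRED.items():
        if key not in evt:
            problems.append(f"missing {key}")
        elif not isinstance(evt[key], typ):
            problems.append(f"wrong type for {key}")
    return problems
```

Reject or quarantine any line with a non-empty problem list before it reaches the warehouse.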
3. Aggregate spend in SQL
Once events are in your warehouse, start with feature-level spend:
```sql
SELECT
  feature,
  DATE_TRUNC(timestamp, DAY) AS day,
  COUNT(*) AS requests,
  SUM(cost_usd) AS spend_usd,
  SUM(prompt_tokens + completion_tokens + reasoning_tokens) AS tokens,
  AVG(latency_ms) AS avg_latency_ms,
  SUM(cached_tokens) / NULLIF(SUM(prompt_tokens), 0) AS cache_hit_rate
FROM openai_events
WHERE environment = 'prod'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY feature, day
ORDER BY day DESC, spend_usd DESC;
```
Then add customer-level spend:
```sql
SELECT
  customer_id,
  DATE_TRUNC(timestamp, MONTH) AS month,
  COUNT(*) AS requests,
  SUM(cost_usd) AS spend_usd
FROM openai_events
WHERE environment = 'prod'
GROUP BY customer_id, month
ORDER BY spend_usd DESC;
```
And route-level spend:
```sql
SELECT
  route,
  COUNT(*) AS requests,
  SUM(cost_usd) AS spend_usd,
  AVG(cost_usd) AS avg_cost_per_request
FROM openai_events
WHERE environment = 'prod'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY route
ORDER BY spend_usd DESC
LIMIT 20;
```
4. Build the dashboard
Create three operational views:
- Spend per feature over time
- Spend per customer over time
- Top routes by daily spend
Use whatever BI layer you already have:
- Grafana
- Metabase
- Looker
- Superset
- Mode
5. Test the wrapper with Apidog
Before shipping, verify that the wrapper logs the metadata you expect.
Use Apidog to create an end-to-end scenario:
- Send a request to your AI endpoint with a known `customer_id`.
- Verify the API response succeeds.
- Capture the side-channel log event through your logging endpoint, stdout collector, or OTLP/log pipeline.
- Assert the event contains `feature`, `route`, `customer_id`, `environment`, `model`, `prompt_tokens > 0`, and `cost_usd > 0`.
- Run the same scenario against staging and production using Apidog environments.
- Replay the request and verify retries do not double-count cost.
For broader testing workflows, see API testing tools for QA engineers. For contract-first coverage, see contract-first API development.
6. Set budget caps and alerts
Use OpenAI project keys to isolate risk:
- `prod-support-chat`
- `prod-summarization`
- `staging-all`
- `dev-all`
Set hard caps in the OpenAI dashboard so one runaway workload cannot drain the whole organization budget.
Then add warehouse-driven alerts. Example: page if any feature exceeds 3x its seven-day average hourly spend.
```sql
WITH hourly AS (
  SELECT
    feature,
    TIMESTAMP_TRUNC(timestamp, HOUR) AS hour,
    SUM(cost_usd) AS spend_usd
  FROM openai_events
  WHERE environment = 'prod'
    AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY)
  GROUP BY feature, hour
),
baseline AS (
  SELECT
    feature,
    AVG(spend_usd) AS avg_hourly_spend
  FROM hourly
  WHERE hour < TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 1 HOUR)
  GROUP BY feature
),
current_hour AS (
  SELECT
    feature,
    spend_usd
  FROM hourly
  WHERE hour = TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR)
)
SELECT
  c.feature,
  c.spend_usd,
  b.avg_hourly_spend
FROM current_hour c
JOIN baseline b USING (feature)
WHERE c.spend_usd > b.avg_hourly_spend * 3;
```
Send the result to:
- PagerDuty
- Opsgenie
- Slack
- Incident.io
Native caps protect you from catastrophic burn. Warehouse alerts catch slow drift earlier.
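Delivery can be as simple as posting each anomaly row to an incoming-webhook URL. A minimal sketch using only the standard library; the webhook URL and message format are assumptions, not a specific vendor's API:

```python
import json
import urllib.request

def build_alert(feature: str, spend_usd: float, avg_hourly_spend: float) -> dict:
    """Format one anomaly row from the SQL query above as a chat message."""
    ratio = spend_usd / avg_hourly_spend if avg_hourly_spend else float("inf")
    return {"text": (f"ALERT: {feature} spent ${spend_usd:.2f} this hour "
                     f"({ratio:.1f}x its 7-day hourly average of "
                     f"${avg_hourly_spend:.2f})")}

def send_alert(payload: dict, webhook_url: str) -> None:
    # Fire-and-forget POST to a Slack-style incoming webhook.
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example (webhook URL is a placeholder you would configure yourself):
# send_alert(build_alert("support-chat", 162.40, 41.10), WEBHOOK_URL)
```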
Advanced techniques
Prompt caching
GPT-5.5 charges less for cached input tokens. Structure prompts so stable content appears first:
```
[Stable system prompt]
[Stable policy/instructions]
[Stable examples]
[Per-request user data]
```
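The same layout in code, as a minimal sketch (the prompt strings are placeholders):

```python
# Stable content first, so the shared prefix stays byte-identical across
# requests and remains eligible for prompt caching.
STABLE_PREFIX = [
    {"role": "system", "content": "You are a support assistant."},     # stable system prompt
    {"role": "system", "content": "Policy: cite docs, never guess."},  # stable instructions
]

def build_messages(user_question: str) -> list:
    # Per-request data goes last, after the cacheable prefix.
    return STABLE_PREFIX + [{"role": "user", "content": user_question}]
```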
Track this per feature:
```sql
SELECT
  feature,
  SUM(cached_tokens) / NULLIF(SUM(prompt_tokens), 0) AS cache_hit_rate
FROM openai_events
WHERE environment = 'prod'
GROUP BY feature
ORDER BY cache_hit_rate ASC;
```
If a prompt change drops cache hit rate, your input cost can rise silently.
See the official OpenAI prompt caching docs for eligibility rules.
Batch API for offline workloads
Use the Batch API for workloads that do not need synchronous responses:
- Nightly summarization
- Evaluation runs
- Embedding backfills
- Document re-processing
Tag these events with a `batch_job_id` so you can attribute cost back to the source workload.
Reasoning effort tuning
Reasoning-heavy calls can multiply output tokens. Audit features that use higher reasoning effort:
- Can `medium` become `low`?
- Does quality remain acceptable?
- What is the cost delta?
Track cost and quality side by side before changing production defaults.
For more details, see how to use the GPT-5.5 API.
Context-window discipline
Long prompts are expensive. Prefer tight retrieval over stuffing large context windows.
Track prompt size by feature:
```sql
SELECT
  feature,
  AVG(prompt_tokens) AS avg_prompt_tokens,
  APPROX_QUANTILES(prompt_tokens, 100)[OFFSET(95)] AS p95_prompt_tokens
FROM openai_events
WHERE environment = 'prod'
GROUP BY feature
ORDER BY p95_prompt_tokens DESC;
```
If prompt size grows without a product reason, investigate.
Watch the 272K-token cliff
OpenAI applies higher pricing on GPT-5.5 requests above 272K tokens. Add a guardrail:
```python
if prompt_tokens > 250_000:
    logger.warning(json.dumps({
        "event": "openai.prompt_size_warning",
        "request_id": request_id,
        "feature": feature,
        "route": route,
        "customer_id": customer_id,
        "prompt_tokens": prompt_tokens,
    }))
```
For pricing details, see the GPT-5.5 pricing post.
Per-customer spend caps
For B2B SaaS, enforce spend limits before making the OpenAI call.
Example flow:
- Query current monthly spend for `customer_id`.
- Compare it to the customer’s quota.
- If under quota, call OpenAI.
- If over quota, return `429`.
Example response:
```json
{
  "error": "monthly_ai_quota_exceeded",
  "message": "Your monthly AI quota has been exceeded. Upgrade your plan or contact billing."
}
```
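The pre-call check itself can stay small. A sketch under two assumptions: the current spend comes from the customer-level SQL earlier, and the quota comes from your own plan table.

```python
QUOTA_EXCEEDED = {
    "error": "monthly_ai_quota_exceeded",
    "message": "Your monthly AI quota has been exceeded. "
               "Upgrade your plan or contact billing.",
}

def enforce_quota(spend_this_month_usd: float, quota_usd: float):
    """Return (status_code, body) to short-circuit, or None to proceed.

    Both inputs are assumed to be looked up elsewhere (warehouse query
    for spend, billing/plan table for the quota).
    """
    if spend_this_month_usd >= quota_usd:
        return 429, QUOTA_EXCEEDED
    return None

# enforce_quota(51.20, 50.00) -> (429, QUOTA_EXCEEDED)
# enforce_quota(12.75, 50.00) -> None, so the OpenAI call proceeds
```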
This turns AI from a margin risk into a controllable product cost.
Common mistakes
Avoid these:
- Counting reasoning tokens as input. They are output.
- Trusting the OpenAI dashboard for real-time alerts.
- Adding tags globally instead of at the call site.
- Forgetting background jobs and queue workers.
- Sampling logs. Log every request.
- Allowing `customer_id` to be null.
- Computing historical cost with today’s pricing.
- Retrying successful requests with a new `request_id`.
For background jobs, use synthetic routes:
- `cron:nightly-summarize`
- `queue:image-caption`
- `webhook:crm-sync`
For unknown internal usage, use explicit values:
```python
customer_id = "internal"
customer_id = "system"
```
Never use null as an attribution bucket.
Alternatives and tooling
You do not have to build all of this yourself.
| Approach | What it does well | What it costs | When to use |
|---|---|---|---|
| OpenAI usage API | Native, no setup, accurate to the cent | Free | One project, one feature, no per-customer attribution |
| Helicone | Drop-in proxy, dashboards, caching, per-user costs | Free tier; paid from $20/mo | You want a hosted dashboard quickly and accept a proxy |
| Langfuse | Open source, self-host or cloud, traces plus cost | Free self-hosted; cloud from $29/mo | You want traces and cost in one tool |
| LangSmith | LangChain integration, evals, cost tracking | Paid from $39/user/mo | You already use LangChain heavily |
| Custom warehouse | Full control, no proxy, custom dimensions | Engineering time | Large workloads, strict residency, custom attribution |
Tradeoffs:
- A proxy adds another hop in the critical path.
- A self-hosted observability stack gives control but adds ops work.
- A custom warehouse integrates well with your data stack but requires you to own queries and alerts.
- The native usage API is useful for reconciliation, not product-level attribution.
For more on hosted LLM cost monitoring, see Helicone’s guide on tracking LLM costs. For open-source cost tracking, see the Langfuse cost tracking docs.
If you operate at platform scale, these patterns also fit service-mesh and platform-engineering workflows. See API platforms for microservices architecture.
Real-world use cases
B2B SaaS with per-customer LLM spend
A sales-intelligence product spends $80,000/month on OpenAI. After adding per-customer attribution, the team learns that 12% of customers drive 71% of AI spend.
The company can then:
- Add tiered pricing
- Apply soft quotas to lower tiers
- Charge overages
- Improve gross margin per account
Internal developer tooling
An engineering org gives developers access to an internal GPT-5.5 assistant. By tagging requests with developer identity, platform engineering sees that three developers account for 50% of internal spend.
Two are running abandoned agent loops. Turning them off saves $1,800/month. The third is doing legitimate high-value work, so the team increases their quota.
AI feature forecasting
A product team wants to ship summarization. Historical events give them:
- Average input tokens per call
- Average output tokens per call
- Calls per active user
- Active user forecast
They estimate cost at $0.04 per active user per day, or about $1.20/month. Pricing can then set a $5/month feature price with visible unit economics.
Conclusion
OpenAI’s billing dashboard answers an accounting question. Request-level attribution answers the engineering and product question: where is the money going?
Implementation checklist:
- Tag every request with `feature`, `route`, `customer_id`, and `environment`.
- Compute cost at write time.
- Log every request as structured data.
- Store events in your warehouse.
- Build feature, route, and customer dashboards.
- Set OpenAI project/key caps.
- Add warehouse-driven anomaly alerts.
- Test the wrapper with Apidog.
- Audit reasoning effort, prompt size, and cache hit rate regularly.
Download Apidog and use it to verify your cost-attribution wrapper end to end. Drive AI endpoints with tagged requests, assert the log payload shape, and replay scenarios across environments before your warehouse depends on the data.
For related cost-management reading, see the GPT-5.5 pricing breakdown and GitHub Copilot usage billing for API teams.
FAQ
Do reasoning tokens count as input or output for billing?
Reasoning tokens are billed at the output rate. The OpenAI API returns them under `usage.completion_tokens_details.reasoning_tokens`. Add them to `completion_tokens` when computing cost. For per-effort pricing details, see the GPT-5.5 pricing breakdown.
How accurate is response.usage compared to the OpenAI dashboard?
Token counts in `response.usage` should match dashboard usage. Cost drift usually comes from stale pricing tables. Pin your rate table per model and update it when OpenAI changes pricing.
Can I do attribution with OpenAI project keys alone?
Only partially. Project keys give you one dimension of attribution. They do not give you per-feature, per-customer, or per-route visibility. Use project keys for isolation and budget caps; use application metadata for product attribution.
What about retries and rate-limit errors?
If a request fails before the model runs, there is no usage object and no cost to log.
If a request succeeds and your app retries it, you can double-count unless you reuse the same `request_id` and dedupe on write.
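One simple dedupe pattern: make `request_id` the primary key of the events table and use an idempotent insert. A sketch using SQLite as a stand-in for the warehouse (the equivalent in Postgres would be `INSERT ... ON CONFLICT DO NOTHING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for your warehouse table
conn.execute("""
    CREATE TABLE openai_events (
        request_id TEXT PRIMARY KEY,
        feature TEXT,
        cost_usd REAL
    )
""")

def write_event(request_id: str, feature: str, cost_usd: float) -> None:
    # PRIMARY KEY on request_id + INSERT OR IGNORE makes retried writes no-ops.
    conn.execute(
        "INSERT OR IGNORE INTO openai_events VALUES (?, ?, ?)",
        (request_id, feature, cost_usd),
    )

write_event("7a91", "support-chat", 0.045672)
write_event("7a91", "support-chat", 0.045672)  # retried write: deduplicated
count, spend = conn.execute(
    "SELECT COUNT(*), SUM(cost_usd) FROM openai_events").fetchone()
print(count, spend)  # 1 0.045672
```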
How fast does the OpenAI usage API return data?
The usage API can lag by tens of minutes. Use it for reconciliation. Use your own event stream and warehouse for alerts and kill switches.
Should I sample requests?
No. One JSON line per request is small, and sampling breaks customer and route attribution. Log every request.
Can this work for other LLM providers?
Yes. Add a `provider` column:

- `openai`
- `anthropic`
- `google`
- `deepseek`
Then maintain provider-specific pricing logic. The warehouse schema and dashboards can stay mostly the same.
For a comparison point, see DeepSeek V4 API pricing.
Does this work for embeddings and image generation?
Yes, but the cost math changes.
Add an `endpoint` column:

- `chat`
- `embeddings`
- `image`

Then branch cost computation by endpoint. Embeddings are usually billed per input token. Images are usually billed per image or by resolution.
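The branch can be a thin dispatcher in front of per-endpoint rate logic. A sketch with deliberately made-up rates; substitute your provider's real price sheet:

```python
# Hypothetical rates, for illustration only.
TOKEN_RATE_PER_M = {"embeddings": 0.10}   # USD per 1M input tokens
IMAGE_RATE = {"1024x1024": 0.04}          # USD per generated image

def compute_cost(endpoint: str, **kw) -> float:
    if endpoint == "chat":
        # Chat is handled by the token-based compute_cost_usd shown earlier.
        raise NotImplementedError("use compute_cost_usd for chat")
    if endpoint == "embeddings":
        # Embeddings: billed on input tokens only.
        return round(kw["input_tokens"] * TOKEN_RATE_PER_M["embeddings"] / 1_000_000, 6)
    if endpoint == "image":
        # Images: billed per image, at a rate that depends on resolution.
        return round(kw["n_images"] * IMAGE_RATE[kw["size"]], 6)
    raise ValueError(f"unknown endpoint: {endpoint}")

print(compute_cost("embeddings", input_tokens=500_000))     # 0.05
print(compute_cost("image", n_images=3, size="1024x1024"))  # 0.12
```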