# Budgeting Guide
How to plan spend, set caps, and forecast runway for token-metered inference.
## The three numbers that matter

- Tokens per request = `prompt_tokens` + `completion_tokens` (returned in every response)
- Requests per day = typical + peak call volume
- Daily cap = hard stop on tokens/day per customer ID (HTTP 429 when exceeded)
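The first number above can be read straight off each response. A minimal sketch, assuming an OpenAI-style `usage` object in the response body (the field names are illustrative; check your API's actual schema):

```python
# Example response body with a usage block (shape assumed, not from this guide).
response = {
    "usage": {"prompt_tokens": 512, "completion_tokens": 288},
}

# Tokens per request = prompt_tokens + completion_tokens.
usage = response["usage"]
tokens_per_request = usage["prompt_tokens"] + usage["completion_tokens"]
print(tokens_per_request)  # 800
```

Logging this sum per call gives you the per-request average you need for the capacity math below.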
## Quick cost math
At $0.75 per 1,000 tokens:
| Tokens | Cost |
|---|---|
| 1,000 | $0.75 |
| 10,000 | $7.50 |
| 100,000 | $75.00 |
| 1,000,000 | $750.00 |
Finance framing: daily caps turn AI spend into a known maximum in dollars per day.
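The table above is just linear scaling of the per-1,000-token price, which can be written as a one-line helper:

```python
PRICE_PER_1K_TOKENS = 0.75  # dollars, per the rate quoted above

def cost_dollars(tokens: int) -> float:
    """Dollar cost of a token count at $0.75 per 1,000 tokens."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS

print(cost_dollars(100_000))  # 75.0
```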
## Set a cap from your budget

Pick your maximum dollars/day, then convert to tokens/day:

`tokens_per_day_cap = (dollars_per_day / 0.75) * 1000`
Example: $10/day → ~13,333 tokens/day.
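The same conversion as a small function, rounding down so the cap never exceeds the budget:

```python
PRICE_PER_1K_TOKENS = 0.75  # dollars, per the rate quoted above

def tokens_per_day_cap(dollars_per_day: float) -> int:
    """Convert a daily dollar budget into a daily token cap (rounded down)."""
    return int(dollars_per_day / PRICE_PER_1K_TOKENS * 1000)

print(tokens_per_day_cap(10.0))  # 13333
```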
## Plans, runway, and what “cap” protects
Bundles prepay your token usage. Caps are enforced at request time and reset daily (UTC).
When you hit the cap, additional requests return HTTP 429 until the next reset.
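Because caps reset daily at midnight UTC, a client that receives a 429 can compute how long the cap will remain in effect. A minimal sketch (the daily UTC reset is from this guide; the wait-until-reset strategy is an assumption, not a documented client behavior):

```python
from datetime import datetime, timedelta, timezone

def seconds_until_utc_reset(now=None):
    """Seconds until the next midnight UTC, when daily caps reset."""
    now = now or datetime.now(timezone.utc)
    next_reset = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_reset - now).total_seconds()

# On HTTP 429, a client could pause until the next reset (or fail fast):
# if resp.status_code == 429:
#     time.sleep(seconds_until_utc_reset())
```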
| | Solo | Team | Scale |
|---|---|---|---|
| Bundle | $50 | $150 | $300 |
| Tokens included | ~66,667 | ~200,000 | ~400,000 |
| Typical default cap | 2,000 / day | 7,000 / day | 15,000 / day |
| ~Requests/day example (assumes ~800 tokens/request) | ~2–3 | ~8–10 | ~18–20 |
| Max spend/day at cap ($0.75 / 1k tokens) | $1.50/day | $5.25/day | $11.25/day |
Runway estimate (days) ≈ tokens_in_bundle / daily_cap. Example: 200,000 / 7,000 ≈ 28.6 days at cap.
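The runway estimate above is a worst-case lower bound (it assumes the cap is hit every day):

```python
def runway_days(tokens_in_bundle: int, daily_cap: int) -> float:
    """Worst-case bundle runway, assuming the full cap is consumed every day."""
    return tokens_in_bundle / daily_cap

# Team plan example from above: 200,000 tokens at a 7,000/day cap.
print(round(runway_days(200_000, 7_000), 1))  # 28.6
```

Actual runway is longer whenever daily usage stays below the cap.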
## Latency note (Standard vs Premium)
Larger models are typically higher-latency and may consume more tokens per useful answer.
As a rule of thumb, qwen25-32b-awq runs slower than qwen25-14b-awq.
Actual latency depends on prompt length, max_tokens, concurrent load, and warm vs cold starts.
## Cap adjustments
Default caps start conservative to prevent accidental spend. Caps can be raised or lowered on request. Typical increases happen after a short clean-usage period and may depend on model tier (14B-only vs mixed).
To request an adjustment, include: desired cap, model(s), expected tokens/request, and expected requests/day.
Contact: [email protected]