There's no subscription required to use Synthetic. Instead, we charge based on usage: if you don't use the product, you don't get charged.
We keep popular open-source models always-on: there's no boot time, they're just ready to go. Always-on models are charged per-token.
Just like we read word by word, LLMs break text down into tokens, which can be whole words or parts of words. On average, two words are worth about three tokens.
Always-on models are very affordable, usually only costing fractions of a cent per conversation.
Not sure how many tokens your prompt takes? Try our interactive token calculator →
Here's the list of our always-on models:
Model | Context length | Input price (per million tokens) | Output price (per million tokens) |
---|---|---|---|
deepseek-ai/DeepSeek-R1 | 128k tokens | $0.55/mtok | $2.19/mtok |
deepseek-ai/DeepSeek-R1-0528 | 128k tokens | $3.00/mtok | $8.00/mtok |
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 128k tokens | $0.90/mtok | $0.90/mtok |
deepseek-ai/DeepSeek-V3 | 128k tokens | $1.25/mtok | $1.25/mtok |
deepseek-ai/DeepSeek-V3-0324 | 128k tokens | $1.20/mtok | $1.20/mtok |
google/gemma-2-27b-it | 8k tokens | $0.80/mtok | $0.80/mtok |
meta-llama/Llama-3.1-405B-Instruct | 128k tokens | $3.00/mtok | $3.00/mtok |
meta-llama/Llama-3.1-70B-Instruct | 128k tokens | $0.90/mtok | $0.90/mtok |
meta-llama/Llama-3.1-8B-Instruct | 128k tokens | $0.20/mtok | $0.20/mtok |
meta-llama/Llama-3.2-3B-Instruct | 128k tokens | $0.06/mtok | $0.06/mtok |
meta-llama/Llama-3.3-70B-Instruct | 128k tokens | $0.90/mtok | $0.90/mtok |
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 524k tokens | $0.22/mtok | $0.88/mtok |
meta-llama/Llama-4-Scout-17B-16E-Instruct | 328k tokens | $0.15/mtok | $0.60/mtok |
mistralai/Mistral-7B-Instruct-v0.3 | 32k tokens | $0.20/mtok | $0.20/mtok |
mistralai/Mixtral-8x22B-Instruct-v0.1 | 64k tokens | $1.20/mtok | $1.20/mtok |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 32k tokens | $0.60/mtok | $0.60/mtok |
moonshotai/Kimi-K2-Instruct | 128k tokens | $0.60/mtok | $2.50/mtok |
nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 32k tokens | $0.90/mtok | $0.90/mtok |
Qwen/Qwen2.5-72B-Instruct | 128k tokens | $0.90/mtok | $0.90/mtok |
Qwen/Qwen2.5-7B-Instruct | 32k tokens | $0.18/mtok | $0.18/mtok |
Qwen/Qwen2.5-Coder-32B-Instruct | 32k tokens | $0.80/mtok | $0.80/mtok |
Qwen/Qwen3-235B-A22B | 128k tokens | $0.20/mtok | $0.60/mtok |
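To see how per-token billing works out in practice, here's a rough Python sketch. The prices are the Llama-3.3-70B-Instruct rates from the table above, and the token counts are just an example conversation:

```python
# Rough illustration of per-token billing for an always-on model.
# Prices are in dollars per million tokens (Llama-3.3-70B-Instruct above).
INPUT_PRICE_PER_MTOK = 0.90
OUTPUT_PRICE_PER_MTOK = 0.90

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request, in dollars."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# A ~300-word prompt plus a ~200-word reply is roughly 750 tokens total,
# using the two-words-to-three-tokens rule of thumb above.
print(f"${request_cost(450, 300):.6f}")  # about $0.000675, well under a cent
```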
Low-rank adapters — called "LoRAs" — are small, efficient fine-tunes that run on top of existing models. They can modify a model to be much more effective at specific tasks.
LoRAs of the following models are always-on, and are charged per-token.
Base model | Input price (per million tokens) | Output price (per million tokens) |
---|---|---|
meta-llama/Llama-3.2-1B-Instruct | $0.06/mtok | $0.06/mtok |
meta-llama/Llama-3.2-3B-Instruct | $0.06/mtok | $0.06/mtok |
meta-llama/Meta-Llama-3.1-8B-Instruct | $0.20/mtok | $0.20/mtok |
meta-llama/Meta-Llama-3.1-70B-Instruct | $0.90/mtok | $0.90/mtok |
LoRA sizes are measured in "ranks," starting at rank-8; we keep LoRAs up to rank-64 always-on and run them in FP8 precision. The rank is set during the fine-tuning process: if you create your own LoRA, you can choose exactly the rank you want through the standard configuration options for your training framework.
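For example, if you train your LoRA with Hugging Face's peft library (one common framework; the base model and hyperparameters below are just placeholders), the rank is simply the `r` argument on the adapter config:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load an always-on base model from the table above.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# `r` is the LoRA rank; anything from 8 up to 64 stays within the
# always-on limits described above.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()
```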
For LoRAs of base models not listed in the table above, we support running them on-demand as long as vLLM supports them; however, since those base models aren't kept always-on, you'll be charged our standard on-demand pricing for the base model (with no additional charge for the LoRA).
We support launching all other LLMs on-demand on cloud GPUs. There's no configuration necessary: just enter the Hugging Face link for any model, and we'll automatically run it for you in our friendly chat UI or API.
You're only charged once the model is ready. We keep the model running for ten minutes after your last message, and you can stop it at any time.
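If you're calling the API rather than the chat UI, launching an on-demand model can look something like the sketch below. It assumes an OpenAI-compatible endpoint; the base URL, API key, and model-ID format are placeholders, so check the API docs for the real values:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                 # placeholder key
)

# Pass the Hugging Face repo ID of the model you want to launch.
response = client.chat.completions.create(
    model="some-org/some-model",  # placeholder Hugging Face repo ID
    messages=[{"role": "user", "content": "Hello from an on-demand model!"}],
)
print(response.choices[0].message.content)
```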
We'll automatically detect the number of GPUs you need to run the model. Here's our current GPU pricing:
GPU Type | Price |
---|---|
80GB | 3.5 cents/min, per GPU |
48GB | 1.5 cents/min, per GPU |
24GB | 1.2 cents/min, per GPU |
Our on-demand GPU rates are very competitive: for example, an 80GB GPU is ~2x cheaper on Synthetic than on competing services like Replicate or Modal Labs.
We automatically calculate the type and number of GPUs a model repository requires. We don't quantize on-demand models: they're launched in whatever precision the underlying repo uses; typically BF16, with the exception of Jamba-based models, which are launched in FP8. Quantizing below FP8 can significantly harm model performance.
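For a rough sense of how that translates into cost, BF16 weights take about two bytes per parameter; the sketch below is illustrative arithmetic only (the overhead factor is a guess, and GPU selection on the platform is automatic):

```python
import math

BYTES_PER_PARAM_BF16 = 2   # BF16 stores roughly two bytes per parameter
OVERHEAD = 1.2             # rough headroom for KV cache and activations

def gpus_needed(params_billion: float, gpu_memory_gb: int) -> int:
    """Estimate how many GPUs of a given size a BF16 model needs."""
    needed_gb = params_billion * BYTES_PER_PARAM_BF16 * OVERHEAD
    return max(1, math.ceil(needed_gb / gpu_memory_gb))

# Example: an ~8B-parameter model needs about 19 GB, so it fits on a
# single 24GB GPU. At 1.2 cents/min, a 30-minute session is ~36 cents.
print(gpus_needed(8, 24))        # -> 1
print(30 * 1 * 1.2, "cents")     # minutes * GPUs * cents per minute
```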
On-demand model context length is capped to a maximum of 32k tokens.