What are the best AI inference providers in 2026?

There is no single best one: the 34 providers tracked here split into five jobs. First-party labs (OpenAI, Anthropic, Google, xAI) win on frontier quality. Neutral platforms (DeepInfra, Together, Fireworks) host open-weight models cheapest. Custom-silicon clouds (Groq, Cerebras, SambaNova) win on speed. Hyperscalers (Azure, Bedrock, Snowflake, Databricks) win on enterprise governance. GPU clouds (GMI, CoreWeave) and compression specialists (CompactifAI) serve self-hosting and edge.

What is the cheapest way to run AI models?

For frontier-class quality, the cheapest per-token APIs are the China-built open-weight models: DeepSeek V4 Flash at roughly $0.14/$0.28 per million tokens, with MiniMax, Kimi and StepFun close behind. For open-weight hosting at scale, DeepInfra, Together and Fireworks converge near $0.05 to $1 per million tokens. Above roughly 10 to 50 million tokens per day of steady load per model, renting GPUs (GMI or Hyperbolic at about $2 to $3.20 per H100-hour) usually beats per-token pricing.

Which AI inference providers are the fastest?

The custom-silicon specialists. Cerebras runs open models at 1,800 to 3,000+ tokens per second on its wafer-scale chips, Groq at 500 to 1,000+ on its LPUs, and SambaNova is benchmarked among the fastest for the largest open models. The trade-off is that all three serve open models only: no proprietary frontier models and limited or no custom-model uploads.

What is an OpenAI-compatible API and why does it matter?

It means the provider exposes the same request and response format as OpenAI's API, so switching providers is usually just a base-URL and API-key change rather than a rewrite. 26 of the 34 providers here are OpenAI-compatible, which is why you can benchmark several on the same code and route to whichever wins on price, speed or quality.

How current are these prices?

They are representative snapshots as of June 28, 2026, each linked to the provider's official page. Per-token rates for open-weight hosts change often, sometimes weekly, and open-model hosts may serve at different precision (FP8 vs FP4), which changes both price and quality. Treat every figure as a starting point and confirm the live rate at the linked source before budgeting.

Directory· Updated June 28, 2026

AI inference providers

Where to actually run a model, mapped. 34 providers across five categories, from the frontier labs to the open-weight hosts, the speed specialists, the enterprise clouds and the GPU rental market, each with representative pricing and a link to its own page. For which model to run, see the model release tracker; for monthly plan prices, the coding plan comparison; and for free options, thefree AI models list.

What are the best AI inference providers?

The best AI inference provider depends on what you optimize for. For frontier quality, use OpenAI, Anthropic or Google. For cheap open-weight models, use DeepInfra, Together or Fireworks. For raw speed, use Groq, Cerebras or SambaNova. For enterprise governance, use Azure, Bedrock, Snowflake or Databricks; for GPU rental, GMI or CoreWeave.

Providers mapped

Across five categories, from frontier labs to GPU clouds

$0.6-$30

Flagship output, per Mtok

From Upstage (Solar Pro 3) to OpenAI (GPT-5.5)

26 of 34

OpenAI-compatible

Switching is usually a base-URL change, not a rewrite

Offer a free tier

Free credits or a standing zero-cost allowance to start

The flagship-tier price spread

Among the first-party labs alone, the flagship output rate spans more than 50x, fromUpstage (Solar Pro 3) to OpenAI (GPT-5.5). Across the whole market, counting cheap open-weight hosting and compressed models, per-token output runs from about $0.14 to $180 per million tokens: four orders of magnitude. The lesson is that the sticker rate is a starting point, not the bill.

Flagship model output price per million tokens, by lab
Item	Value
Upstage (Solar Pro 3)	$0.6
Inception (Mercury 2)	$0.75
MiniMax M2.7	$1.2
StepFun (Step3)	$1.42
xAI (Grok 4.3)	$2.5
DeepSeek V4 Pro	$3.48
Kimi K2.6	$4
Mistral Large	$6
Cohere (Command A)	$10
Google (Gemini 3.1 Pro)	$12
Anthropic (Claude Opus 4.8)	$25
OpenAI (GPT-5.5)	$30

Flagship model output price per million tokens, by first-party lab, cheapest first. Output tokens are where the cost concentrates: across this line, output runs 2 to 5x the input rate.Source: Provider official pricing pages, representative as of June 2026

The five kinds of inference provider

Every provider sits in one of five buckets, and the bucket usually decides the choice before the individual provider does. Each table below is grounded to the providers' own pages and dated June 28, 2026. To search and filter all 34 at once, use thedirectory tool below.

First-party model labs

The companies that train the models and serve them through their own API. Choose these for frontier quality and day-one access.

Provider	What it offers	Representative pricing	Known for
AnthropicUS	Frontier proprietary API: Claude Opus 4.8, Sonnet 5, Haiku 4.5, with Fable 5 and Mythos 5 above Opus. All at 1M-token context.	Opus 4.8 $5/$25, Sonnet 5 $3/$15 (intro $2/$10 through Aug 31 2026), Haiku 4.5 $1/$5 per Mtok. Batch 50% off; prompt caching up to 90% off.	Output costs 5x input across the line; pricing has held steady across generations. Run-rate revenue passed $30B in early 2026.
CohereUS	Command (generation: Command A/R+/R/R7B), Embed (vectors), Rerank (neural reranking). Strong on data sovereignty and on-prem.	Command R+ / Command A $2.50/$10, Command R $0.15/$0.60, Command R7B $0.0375/$0.15 per Mtok. Embed v3 $0.10/M.Free: Free trial key (1,000 calls/month, not for production).	Best-in-class Embed plus Rerank stack for retrieval; strong on data sovereignty, VPC and on-prem deployment.
DeepSeekChina · OpenAI-compatible	V4 Flash (cheapest frontier-class API) and V4 Pro. Both 1M ctx, 384K max output. Open weights with 10M+ downloads.	V4 Flash $0.14/$0.28, V4 Pro $1.74/$3.48 per Mtok. Cache hits at 1/10 standard input.	Roughly 90-95% cheaper than comparable Western models, with open weights. Owned by hedge fund High-Flyer.
GoogleUS · OpenAI-compatible	Gemini 3.x API (3.1 Pro, 3.5 Flash, 3.1 Flash-Lite). AI Studio is free for prototyping; Vertex AI adds enterprise SLAs and compliance.	Gemini 3.1 Pro $2/$12 (to 200K ctx), 3.5 Flash $1.50/$9, Flash-Lite $0.25/$1.50 per Mtok. 90% context-caching discount.Free: AI Studio free for prototyping; Flash retains a free tier.	Largest production context window (2M tokens) and the cheapest Tier-1 budget model (Flash-Lite). Pro models are paid-only as of April 1, 2026.
Inception (Inception Labs)US · OpenAI-compatible	Mercury 2 (reasoning dLLM, 128K ctx, >1,000 tok/s on Blackwell), Mercury Coder, Mercury Edit 2. Default in Continue and Zed.	Mercury 2 $0.25/$0.75 per Mtok (cached input $0.025/M).Free: 10M free tokens per new account.	First commercially available diffusion LLM, for ~5-10x faster, cheaper inference than one-token-at-a-time models.
Kimi (Moonshot AI)China · OpenAI-compatible	Kimi K2.6 (1T-param MoE/32B active, 256K ctx, multimodal), K2.5 (cheaper), K2.7-Code (coding). Agent Swarm up to 300 subagents.	K2.6 $0.95/$4.00, K2.5 $0.60/$3.00 per Mtok; cached input $0.10-0.16/M. Batch API 40% off.	Roughly 8-10x cheaper than Claude Opus at frontier-adjacent quality.
MiniMaxChina · OpenAI-compatible	M-series: M2.7, M3, plus legacy abab6.5 and MiniMax-01 (1M ctx). A faster highspeed variant at 2x.	M2.7 $0.30/$1.20 per Mtok (official), cache reads $0.06/M.	Frontier-class coding and agentic quality at ~5-10% of Claude Opus output pricing. M2.7 restricts commercial use (M2 was MIT).
MistralEU · OpenAI-compatible	Mistral Large 3, Medium 3.5, Small 3, Codestral (code), Ministral 3B/8B (edge), Pixtral (vision), OCR. Many under Apache 2.0.	Large 2 tier $2/$6, Small 3 $0.10/$0.30, Ministral 3B ~$0.04/$0.04 per Mtok.Free: Free experimentation tier via la Plateforme.	Among the cheapest flagship-tier output pricing, plus EU data residency and genuine open weights for self-hosting.
OpenAIUS · OpenAI-compatible	Frontier proprietary API (GPT-5.5, GPT-5.4/Mini/Nano, GPT-5.x Pro, GPT-5.3 Codex), plus Batch and Realtime APIs.	GPT-5.5 $5/$30, GPT-5.4 $2.50/$15, Nano $0.20/$1.25 per Mtok. Batch 50% off; cached input up to 90% off.Free: $5 in free credits for new accounts (expire in 3 months).	The default frontier benchmark. Note: winding down its fine-tuning platform (closed to new users as of May 2026).
Reka AIUS	Reka Core (top reasoning), Flash/Flash 3 (21B), Edge (7B vision-language), Spark. Deployable cloud, on-prem or on-device.	Edge $0.10/$0.10, Flash 3 $0.10/$0.20 per Mtok (via OpenRouter); Core is most expensive.	Flexible deployment down to edge and device; Reka Edge uses only 64 tokens per image tile for low-latency robotics and AR.
Sarvam (Sarvam AI)India · OpenAI-compatible	Sarvam-30B and Sarvam-105B (MoE, trained on Indian compute, 128K ctx, open-weight), Bulbul (TTS), Saaras (STT), translation (22 languages), OCR.	Chat completion ~Rs 4 input / Rs 16 output per Mtok (105B tier); STT Rs 45/hour. Free credits on signup.Free: Free credits on signup.	INR-denominated pricing avoids USD plus GST overhead; data hosted in India; best-in-class Indic-language and OCR performance.
StepFunChina · OpenAI-compatible	Step 3.7 Flash (196B/~11B active, 256K ctx, native image and video input), Step 3.5 Flash, Step3.	Step 3.7 Flash $0.20/$1.15, Step 3.5 Flash $0.09/$0.30, Step3 $0.57/$1.42 per Mtok.	Disproportionately strong agentic benchmark scores relative to its price tier.
UpstageKorea · OpenAI-compatible	Solar Pro 3 (102B total/12B active, 128K ctx, tuned for Korean/English/Japanese), Solar Pro 2, Document Parse/Extract.	Solar Pro 3 ~$0.15/$0.60 per Mtok (via OpenRouter); Document Parse ~$0.01/page. Prices exclude 10% VAT.	Best positioned for Korean-language and structured document and instruction-following tasks.
xAI (Grok)US · OpenAI-compatible	Frontier proprietary API: Grok 4.3 (flagship, ~1M ctx), Grok 4.20 (2M ctx long-context), Grok 4.1 Fast (cheap workhorse).	Grok 4.3 $1.25/$2.50, Grok 4.1 Fast $0.20/$0.50 per Mtok. Batch 50% off; cached input ~90% off.Free: Free developer credits via data-sharing program (~$150-175/mo reported).	Only frontier model with live grounding to X posts; aggressive pricing undercuts GPT-5.4. API is independent of X subscriptions.

Cloud hyperscaler marketplaces

Cloud platforms that resell many labs models behind enterprise governance, compliance and data residency. Expect a 10-40% premium over raw token cost.

Provider	What it offers	Representative pricing	Known for
Amazon BedrockUS	Foundation models from Anthropic, Meta, Mistral, Cohere, AI21, Amazon (Nova/Titan), Stability, and now OpenAI. AgentCore for agents.	Per-token, matching providers (Claude Sonnet 5 $3/$15, Nova Micro $0.035/$0.14 per Mtok). Batch 50% off; caching up to 90% off.	Deep AWS integration, FedRAMP and HIPAA compliance. Watch hidden costs (OpenSearch Serverless ~$345/mo for Knowledge Bases).
Databricks (Mosaic AI)US · OpenAI-compatible	Open foundation models (Llama, DBRX) plus external models (OpenAI, Anthropic, Cohere). Tight Unity Catalog governance.	Consumption via DBUs from ~$0.07/DBU; pay-per-token, provisioned throughput (Llama 3.3 70B from $6/hr per band), and batch.Free: 14-day free trial; Free Edition available.	Tight integration with Unity Catalog governance and data pipelines; OpenAI-compatible API.
Microsoft AzureUS · OpenAI-compatible	Hosts OpenAI (GPT-5 family, Sora, image), plus DeepSeek, Grok, Llama, Mistral, FLUX, and managed GPU compute.	Token pricing matches OpenAI direct (GPT-5 $1.25/$10 per Mtok). PTUs for sustained load from ~$2,448/mo.Free: $200 free credit for 30 days.	Enterprise governance, compliance and Azure integration. Real bills often run 15-40% above raw token cost (support, networking, search).
Snowflake (Cortex)US	Pre-integrated Arctic, Llama, Mistral, Reka, Google, plus OpenAI, Anthropic, DeepSeek. AISQL, Cortex Search, Analyst and Agents.	Consumption and credit-based, token-metered, roughly $0.12-5.10 per Mtok depending on model; warehouse compute billed separately.Free: No dedicated free tier (trial credits only).	AI runs where the data lives, no egress; Snowflake does not train on customer data.

Neutral open-weight inference platforms

Vendor-neutral clouds that host open-weight models per token, plus GPU rental and fine-tuning. The commodity layer for cheap open-model inference.

Provider	What it offers	Representative pricing	Known for
BasetenUS · OpenAI-compatible	Per-minute dedicated GPU billing (T4 ~$0.01/min up to B200 ~$0.166/min) with scale-to-zero, plus a per-token Model APIs catalog.	Model APIs median ~$0.60/$2.20 per Mtok; dedicated GPU T4 ~$0.01/min up to B200 ~$0.166/min.Free: New-account credits.	Multi-cloud routing across ~18-20 providers, 1B+ inference calls/day. Closed a $1.5B Series F at a $13B valuation (June 22, 2026).
DeepInfraUS · OpenAI-compatible	190+ models (Llama, Qwen, DeepSeek, GLM, Gemma, Mistral, Nemotron, Kimi) plus embeddings, TTS, image. Dedicated GPUs by the hour.	From ~$0.06/M for small models; DeepSeek V4 Flash $0.10/$0.20 per Mtok. ~5T tokens/week.Free: No standing free tier.	Among the cheapest serverless options, runs its own US data centers including Blackwell B200.
Fireworks AIUS · OpenAI-compatible	Serverless per-token (DeepSeek/Kimi/GLM/MiniMax catalog), on-demand GPUs, fine-tuning, reserved capacity. 30T+ tokens/day.	8B-class ~$0.20/M, 70B-class ~$0.90/M; H100/H200 $6/hr, B200 $9/hr. Batch 50% off. Often 20-40% below Together.Free: $1 free starter credit.	Day-zero model support and rapid growth: $315M annualized revenue as of February 2026; in talks at a $15B valuation in mid-2026.
FriendliAIUS · OpenAI-compatible	Serverless Endpoints (OpenAI-compatible), Dedicated Endpoints (per GPU-hour), Container (on-prem). DeepSeek, Qwen, Kimi, GLM, Llama, EXAONE.	Pay-per-token serverless plus per GPU-hour dedicated. Claims 50-90% cost savings vs vLLM.Free: $5 free credits.	Claims up to 3x faster than vLLM via custom kernels, speculative decoding, continuous batching. SOC 2 Type II plus HIPAA, 99.99% uptime SLA.
GMI (GMI Cloud)US · OpenAI-compatible	Bare-metal plus serverless plus dedicated clusters. Inference Engine (auto-scaling), Cluster Engine, and Model-as-a-Service.	On-demand GPU/hr: H100 from $2.00, H200 from $2.60, B200 from $4.00, GB200 from $8.00. Reserved cuts 30-50%.	Claims 40-70% savings vs hyperscalers; pure bare-metal rates with no forced CPU or networking upsell.
HyperbolicUS · OpenAI-compatible	25+ open models plus a GPU marketplace that aggregates idle GPUs. Pay by card or crypto. Serves Llama-3.1-405B-Base in BF16.	GPU/hr: RTX 4090 $0.50, A100 ~$1.60-1.80, H100 PCIe $3.00, H100 SXM $3.20. Llama 3.3 70B $0.40/M.Free: $1 promo credit (not for GPU rental).	Up to 75% savings vs hyperscalers by pooling idle GPUs; only platform serving Llama-3.1-405B-Base in BF16.
NebiusEU · OpenAI-compatible	Token Factory: 60+ open-source models in Fast/Base tiers, OpenAI-compatible. Also H100/H200/B200/B300 rental with per-second billing.	Cheapest ~$0.06-0.08/M input (Nemotron 3 Nano $0.08 blended) up to ~$1.93/M for DeepSeek V4 Pro. Reserve discounts up to 35%.	EU data residency, full-stack managed compute plus inference, and a ~$2B NVIDIA deal with early Rubin access.
ParasailUS · OpenAI-compatible	Serverless per-token (GLM-5.2, Kimi, DeepSeek V4, MiniMax, Qwen, gpt-oss, Llama), dedicated endpoints and batch. Deploy any HF model in ~5 lines.	~$0.09/M for MiMo-V2.5 up to ~$0.90/M for GLM-5.1. 500B tokens/day.	Lossless by default, no hidden quantization; pay-per-token with no GPU contracts. Founded by an ex-Groq exec; raised $32M Series A.
ReplicateUS	50,000+ community models plus ~100 curated Official Models (Claude, DeepSeek, FLUX, Veo, Kling). Deploy custom models via Cog.	Hardware-per-second (CPU $0.000025/s up to H100 ~$0.001525/s) and output-based (per token/image/video).	Largest open model catalog and easiest experimentation. Cold starts and unpredictable per-second billing are the main drawbacks.
SiliconFlowChina · OpenAI-compatible	200+ models across text/image/video/audio (DeepSeek, Qwen, GLM, Kimi, MiniMax, Step, FLUX). Backed by Alibaba Cloud.	Pay-as-you-go per-token (DeepSeek V4 Flash $0.14/$0.28 per Mtok); reserved GPU ~CNY 2.73/hr.Free: $1 credits on signup; some smaller models permanently free.	6M+ users, 100B+ daily tokens, fastest to handle DeepSeek traffic.
Together AIUS · OpenAI-compatible	Serverless per-token (DeepSeek, Llama, Qwen, Kimi, GLM, MiniMax, Mixtral), plus fine-tuning, dedicated endpoints and GPU clusters.	DeepSeek V3.1 $0.60/$1.70, GPT-OSS 20B $0.05/$0.20 per Mtok; H100 reserved ~$3.99/hr. $5 minimum credit.	Broad catalog, research-driven optimization (FlashAttention lineage), full-stack from serverless to clusters.

Custom-silicon speed specialists

Inference clouds built on bespoke chips (LPU, wafer-scale, RDU) that compete almost entirely on tokens per second. Open models only.

Provider	What it offers	Representative pricing	Known for
CerebrasUS · OpenAI-compatible	Llama, Qwen, DeepSeek distills, GPT-OSS. No proprietary frontier models and no custom-model uploads.	~$0.10-6 per Mtok depending on model ($0.35/M cheapest input). Pay-as-you-go and enterprise tiers.Free: 1M tokens/day, no credit card.	Fastest inference benchmarked by Artificial Analysis.
GroqUS · OpenAI-compatible	Runs Llama, Qwen, Kimi, GPT-OSS, DeepSeek distills and Whisper at 500-1,000+ tok/s. Catalog is open-source only.	Llama 3.1 8B $0.05/$0.08, Llama 3.3 70B $0.59/$0.79, Kimi K2 $1/$3 per Mtok. Batch and caching each cut 50%.Free: Free developer tier (no credit card).	Among the fastest inference available. NVIDIA agreed to pay ~$20B for a perpetual license to Groq's LPU patents (finalized December 24, 2025); GroqCloud continues operating.
SambaNovaUS · OpenAI-compatible	Fast serving of large open models (Llama, DeepSeek 671B, Qwen, MiniMax) with a three-tier memory architecture.	Pay-per-token; rates listed on the SambaCloud plans page.	Benchmarked by Artificial Analysis as among the fastest for large models (MiniMax M2.7 at 435 tok/s). Strong sovereign-AI and on-prem story.

GPU cloud and compression niche

Specialists at the edges: raw GPU rental at hyperscale, and model compression that shrinks open models for cheaper, faster, edge-ready inference.

Provider	What it offers	Representative pricing	Known for
CompactifAI (Multiverse Computing)Spain	Serves compressed Slim models via API and the AWS/Azure marketplaces. HyperNova 60B (from gpt-oss-120b), compressed Llama/DeepSeek/Mistral.	HyperNova 60B $0.04/$0.14 per Mtok; Llama 3.3 70B Slim ~$0.15/$0.31 per Mtok.	Claims compressed models beat their base models on speed and cost, runnable on edge devices down to Raspberry Pi. Raised a 189M euro Series B (June 12, 2025).
CoreWeaveUS	Rents NVIDIA A100, H100, H200, GB200/B200, GB300. Per-second billing, no egress fees; spot and reserved available.	8x H100 node ~$49.24/hr (~$6.16/GPU/hr); single GPUs from ~$1.19/hr (A100 PCIe) to $10.50/hr (B200 NVL).	Roughly 40-60% cheaper than hyperscalers for equivalent GPUs; customers include OpenAI, Mistral, Jane Street. Often 8-GPU minimums.

Search and filter all 34 providers

Filter by category, or search by name, what a provider offers, or what it is known for.

Search the directory

Filter all 34 providers by category, or search by name, what they offer, or what they are known for.

34 shown

Category Search

Provider	Category	What it offers	Representative pricing	Known for
Anthropic US	First-party model labs	Frontier proprietary API: Claude Opus 4.8, Sonnet 5, Haiku 4.5, with Fable 5 and Mythos 5 above Opus. All at 1M-token context.	Per token Opus 4.8 $5/$25, Sonnet 5 $3/$15 (intro $2/$10 through Aug 31 2026), Haiku 4.5 $1/$5 per Mtok. Batch 50% off; prompt caching up to 90% off.	Output costs 5x input across the line; pricing has held steady across generations. Run-rate revenue passed $30B in early 2026.
Cohere US	First-party model labs	Command (generation: Command A/R+/R/R7B), Embed (vectors), Rerank (neural reranking). Strong on data sovereignty and on-prem.	Per token Command R+ / Command A $2.50/$10, Command R $0.15/$0.60, Command R7B $0.0375/$0.15 per Mtok. Embed v3 $0.10/M.	Best-in-class Embed plus Rerank stack for retrieval; strong on data sovereignty, VPC and on-prem deployment.
DeepSeek China· OpenAI-compatible	First-party model labs	V4 Flash (cheapest frontier-class API) and V4 Pro. Both 1M ctx, 384K max output. Open weights with 10M+ downloads.	Per token V4 Flash $0.14/$0.28, V4 Pro $1.74/$3.48 per Mtok. Cache hits at 1/10 standard input.	Roughly 90-95% cheaper than comparable Western models, with open weights. Owned by hedge fund High-Flyer.
Google US· OpenAI-compatible	First-party model labs	Gemini 3.x API (3.1 Pro, 3.5 Flash, 3.1 Flash-Lite). AI Studio is free for prototyping; Vertex AI adds enterprise SLAs and compliance.	Per token Gemini 3.1 Pro $2/$12 (to 200K ctx), 3.5 Flash $1.50/$9, Flash-Lite $0.25/$1.50 per Mtok. 90% context-caching discount.	Largest production context window (2M tokens) and the cheapest Tier-1 budget model (Flash-Lite). Pro models are paid-only as of April 1, 2026.
Inception (Inception Labs) US· OpenAI-compatible	First-party model labs	Mercury 2 (reasoning dLLM, 128K ctx, >1,000 tok/s on Blackwell), Mercury Coder, Mercury Edit 2. Default in Continue and Zed.	Per token Mercury 2 $0.25/$0.75 per Mtok (cached input $0.025/M).	First commercially available diffusion LLM, for ~5-10x faster, cheaper inference than one-token-at-a-time models.
Kimi (Moonshot AI) China· OpenAI-compatible	First-party model labs	Kimi K2.6 (1T-param MoE/32B active, 256K ctx, multimodal), K2.5 (cheaper), K2.7-Code (coding). Agent Swarm up to 300 subagents.	Per token K2.6 $0.95/$4.00, K2.5 $0.60/$3.00 per Mtok; cached input $0.10-0.16/M. Batch API 40% off.	Roughly 8-10x cheaper than Claude Opus at frontier-adjacent quality.
MiniMax China· OpenAI-compatible	First-party model labs	M-series: M2.7, M3, plus legacy abab6.5 and MiniMax-01 (1M ctx). A faster highspeed variant at 2x.	Per token M2.7 $0.30/$1.20 per Mtok (official), cache reads $0.06/M.	Frontier-class coding and agentic quality at ~5-10% of Claude Opus output pricing. M2.7 restricts commercial use (M2 was MIT).
Mistral EU· OpenAI-compatible	First-party model labs	Mistral Large 3, Medium 3.5, Small 3, Codestral (code), Ministral 3B/8B (edge), Pixtral (vision), OCR. Many under Apache 2.0.	Per token Large 2 tier $2/$6, Small 3 $0.10/$0.30, Ministral 3B ~$0.04/$0.04 per Mtok.	Among the cheapest flagship-tier output pricing, plus EU data residency and genuine open weights for self-hosting.
OpenAI US· OpenAI-compatible	First-party model labs	Frontier proprietary API (GPT-5.5, GPT-5.4/Mini/Nano, GPT-5.x Pro, GPT-5.3 Codex), plus Batch and Realtime APIs.	Per token GPT-5.5 $5/$30, GPT-5.4 $2.50/$15, Nano $0.20/$1.25 per Mtok. Batch 50% off; cached input up to 90% off.	The default frontier benchmark. Note: winding down its fine-tuning platform (closed to new users as of May 2026).
Reka AI US	First-party model labs	Reka Core (top reasoning), Flash/Flash 3 (21B), Edge (7B vision-language), Spark. Deployable cloud, on-prem or on-device.	Per token Edge $0.10/$0.10, Flash 3 $0.10/$0.20 per Mtok (via OpenRouter); Core is most expensive.	Flexible deployment down to edge and device; Reka Edge uses only 64 tokens per image tile for low-latency robotics and AR.
Sarvam (Sarvam AI) India· OpenAI-compatible	First-party model labs	Sarvam-30B and Sarvam-105B (MoE, trained on Indian compute, 128K ctx, open-weight), Bulbul (TTS), Saaras (STT), translation (22 languages), OCR.	Per token Chat completion ~Rs 4 input / Rs 16 output per Mtok (105B tier); STT Rs 45/hour. Free credits on signup.	INR-denominated pricing avoids USD plus GST overhead; data hosted in India; best-in-class Indic-language and OCR performance.
StepFun China· OpenAI-compatible	First-party model labs	Step 3.7 Flash (196B/~11B active, 256K ctx, native image and video input), Step 3.5 Flash, Step3.	Per token Step 3.7 Flash $0.20/$1.15, Step 3.5 Flash $0.09/$0.30, Step3 $0.57/$1.42 per Mtok.	Disproportionately strong agentic benchmark scores relative to its price tier.
Upstage Korea· OpenAI-compatible	First-party model labs	Solar Pro 3 (102B total/12B active, 128K ctx, tuned for Korean/English/Japanese), Solar Pro 2, Document Parse/Extract.	Per token Solar Pro 3 ~$0.15/$0.60 per Mtok (via OpenRouter); Document Parse ~$0.01/page. Prices exclude 10% VAT.	Best positioned for Korean-language and structured document and instruction-following tasks.
xAI (Grok) US· OpenAI-compatible	First-party model labs	Frontier proprietary API: Grok 4.3 (flagship, ~1M ctx), Grok 4.20 (2M ctx long-context), Grok 4.1 Fast (cheap workhorse).	Per token Grok 4.3 $1.25/$2.50, Grok 4.1 Fast $0.20/$0.50 per Mtok. Batch 50% off; cached input ~90% off.	Only frontier model with live grounding to X posts; aggressive pricing undercuts GPT-5.4. API is independent of X subscriptions.
Amazon Bedrock US	Cloud hyperscaler marketplaces	Foundation models from Anthropic, Meta, Mistral, Cohere, AI21, Amazon (Nova/Titan), Stability, and now OpenAI. AgentCore for agents.	Mixed Per-token, matching providers (Claude Sonnet 5 $3/$15, Nova Micro $0.035/$0.14 per Mtok). Batch 50% off; caching up to 90% off.	Deep AWS integration, FedRAMP and HIPAA compliance. Watch hidden costs (OpenSearch Serverless ~$345/mo for Knowledge Bases).
Databricks (Mosaic AI) US· OpenAI-compatible	Cloud hyperscaler marketplaces	Open foundation models (Llama, DBRX) plus external models (OpenAI, Anthropic, Cohere). Tight Unity Catalog governance.	Consumption Consumption via DBUs from ~$0.07/DBU; pay-per-token, provisioned throughput (Llama 3.3 70B from $6/hr per band), and batch.	Tight integration with Unity Catalog governance and data pipelines; OpenAI-compatible API.
Microsoft Azure US· OpenAI-compatible	Cloud hyperscaler marketplaces	Hosts OpenAI (GPT-5 family, Sora, image), plus DeepSeek, Grok, Llama, Mistral, FLUX, and managed GPU compute.	Mixed Token pricing matches OpenAI direct (GPT-5 $1.25/$10 per Mtok). PTUs for sustained load from ~$2,448/mo.	Enterprise governance, compliance and Azure integration. Real bills often run 15-40% above raw token cost (support, networking, search).
Snowflake (Cortex) US	Cloud hyperscaler marketplaces	Pre-integrated Arctic, Llama, Mistral, Reka, Google, plus OpenAI, Anthropic, DeepSeek. AISQL, Cortex Search, Analyst and Agents.	Consumption Consumption and credit-based, token-metered, roughly $0.12-5.10 per Mtok depending on model; warehouse compute billed separately.	AI runs where the data lives, no egress; Snowflake does not train on customer data.
Baseten US· OpenAI-compatible	Neutral open-weight inference platforms	Per-minute dedicated GPU billing (T4 ~$0.01/min up to B200 ~$0.166/min) with scale-to-zero, plus a per-token Model APIs catalog.	Mixed Model APIs median ~$0.60/$2.20 per Mtok; dedicated GPU T4 ~$0.01/min up to B200 ~$0.166/min.	Multi-cloud routing across ~18-20 providers, 1B+ inference calls/day. Closed a $1.5B Series F at a $13B valuation (June 22, 2026).
DeepInfra US· OpenAI-compatible	Neutral open-weight inference platforms	190+ models (Llama, Qwen, DeepSeek, GLM, Gemma, Mistral, Nemotron, Kimi) plus embeddings, TTS, image. Dedicated GPUs by the hour.	Mixed From ~$0.06/M for small models; DeepSeek V4 Flash $0.10/$0.20 per Mtok. ~5T tokens/week.	Among the cheapest serverless options, runs its own US data centers including Blackwell B200.
Fireworks AI US· OpenAI-compatible	Neutral open-weight inference platforms	Serverless per-token (DeepSeek/Kimi/GLM/MiniMax catalog), on-demand GPUs, fine-tuning, reserved capacity. 30T+ tokens/day.	Mixed 8B-class ~$0.20/M, 70B-class ~$0.90/M; H100/H200 $6/hr, B200 $9/hr. Batch 50% off. Often 20-40% below Together.	Day-zero model support and rapid growth: $315M annualized revenue as of February 2026; in talks at a $15B valuation in mid-2026.
FriendliAI US· OpenAI-compatible	Neutral open-weight inference platforms	Serverless Endpoints (OpenAI-compatible), Dedicated Endpoints (per GPU-hour), Container (on-prem). DeepSeek, Qwen, Kimi, GLM, Llama, EXAONE.	Mixed Pay-per-token serverless plus per GPU-hour dedicated. Claims 50-90% cost savings vs vLLM.	Claims up to 3x faster than vLLM via custom kernels, speculative decoding, continuous batching. SOC 2 Type II plus HIPAA, 99.99% uptime SLA.
GMI (GMI Cloud) US· OpenAI-compatible	Neutral open-weight inference platforms	Bare-metal plus serverless plus dedicated clusters. Inference Engine (auto-scaling), Cluster Engine, and Model-as-a-Service.	Per GPU-hour On-demand GPU/hr: H100 from $2.00, H200 from $2.60, B200 from $4.00, GB200 from $8.00. Reserved cuts 30-50%.	Claims 40-70% savings vs hyperscalers; pure bare-metal rates with no forced CPU or networking upsell.
Hyperbolic US· OpenAI-compatible	Neutral open-weight inference platforms	25+ open models plus a GPU marketplace that aggregates idle GPUs. Pay by card or crypto. Serves Llama-3.1-405B-Base in BF16.	Mixed GPU/hr: RTX 4090 $0.50, A100 ~$1.60-1.80, H100 PCIe $3.00, H100 SXM $3.20. Llama 3.3 70B $0.40/M.	Up to 75% savings vs hyperscalers by pooling idle GPUs; only platform serving Llama-3.1-405B-Base in BF16.
Nebius EU· OpenAI-compatible	Neutral open-weight inference platforms	Token Factory: 60+ open-source models in Fast/Base tiers, OpenAI-compatible. Also H100/H200/B200/B300 rental with per-second billing.	Mixed Cheapest ~$0.06-0.08/M input (Nemotron 3 Nano $0.08 blended) up to ~$1.93/M for DeepSeek V4 Pro. Reserve discounts up to 35%.	EU data residency, full-stack managed compute plus inference, and a ~$2B NVIDIA deal with early Rubin access.
Parasail US· OpenAI-compatible	Neutral open-weight inference platforms	Serverless per-token (GLM-5.2, Kimi, DeepSeek V4, MiniMax, Qwen, gpt-oss, Llama), dedicated endpoints and batch. Deploy any HF model in ~5 lines.	Mixed ~$0.09/M for MiMo-V2.5 up to ~$0.90/M for GLM-5.1. 500B tokens/day.	Lossless by default, no hidden quantization; pay-per-token with no GPU contracts. Founded by an ex-Groq exec; raised $32M Series A.
Replicate US	Neutral open-weight inference platforms	50,000+ community models plus ~100 curated Official Models (Claude, DeepSeek, FLUX, Veo, Kling). Deploy custom models via Cog.	Mixed Hardware-per-second (CPU $0.000025/s up to H100 ~$0.001525/s) and output-based (per token/image/video).	Largest open model catalog and easiest experimentation. Cold starts and unpredictable per-second billing are the main drawbacks.
SiliconFlow China· OpenAI-compatible	Neutral open-weight inference platforms	200+ models across text/image/video/audio (DeepSeek, Qwen, GLM, Kimi, MiniMax, Step, FLUX). Backed by Alibaba Cloud.	Mixed Pay-as-you-go per-token (DeepSeek V4 Flash $0.14/$0.28 per Mtok); reserved GPU ~CNY 2.73/hr.	6M+ users, 100B+ daily tokens, fastest to handle DeepSeek traffic.
Together AI US· OpenAI-compatible	Neutral open-weight inference platforms	Serverless per-token (DeepSeek, Llama, Qwen, Kimi, GLM, MiniMax, Mixtral), plus fine-tuning, dedicated endpoints and GPU clusters.	Mixed DeepSeek V3.1 $0.60/$1.70, GPT-OSS 20B $0.05/$0.20 per Mtok; H100 reserved ~$3.99/hr. $5 minimum credit.	Broad catalog, research-driven optimization (FlashAttention lineage), full-stack from serverless to clusters.
Cerebras US· OpenAI-compatible	Custom-silicon speed specialists	Llama, Qwen, DeepSeek distills, GPT-OSS. No proprietary frontier models and no custom-model uploads.	Per token ~$0.10-6 per Mtok depending on model ($0.35/M cheapest input). Pay-as-you-go and enterprise tiers.	Fastest inference benchmarked by Artificial Analysis.
Groq US· OpenAI-compatible	Custom-silicon speed specialists	Runs Llama, Qwen, Kimi, GPT-OSS, DeepSeek distills and Whisper at 500-1,000+ tok/s. Catalog is open-source only.	Per token Llama 3.1 8B $0.05/$0.08, Llama 3.3 70B $0.59/$0.79, Kimi K2 $1/$3 per Mtok. Batch and caching each cut 50%.	Among the fastest inference available. NVIDIA agreed to pay ~$20B for a perpetual license to Groq's LPU patents (finalized December 24, 2025); GroqCloud continues operating.
SambaNova US· OpenAI-compatible	Custom-silicon speed specialists	Fast serving of large open models (Llama, DeepSeek 671B, Qwen, MiniMax) with a three-tier memory architecture.	Per token Pay-per-token; rates listed on the SambaCloud plans page.	Benchmarked by Artificial Analysis as among the fastest for large models (MiniMax M2.7 at 435 tok/s). Strong sovereign-AI and on-prem story.
CompactifAI (Multiverse Computing) Spain	GPU cloud and compression niche	Serves compressed Slim models via API and the AWS/Azure marketplaces. HyperNova 60B (from gpt-oss-120b), compressed Llama/DeepSeek/Mistral.	Per token HyperNova 60B $0.04/$0.14 per Mtok; Llama 3.3 70B Slim ~$0.15/$0.31 per Mtok.	Claims compressed models beat their base models on speed and cost, runnable on edge devices down to Raspberry Pi. Raised a 189M euro Series B (June 12, 2025).
CoreWeave US	GPU cloud and compression niche	Rents NVIDIA A100, H100, H200, GB200/B200, GB300. Per-second billing, no egress fees; spot and reserved available.	Per GPU-hour 8x H100 node ~$49.24/hr (~$6.16/GPU/hr); single GPUs from ~$1.19/hr (A100 PCIe) to $10.50/hr (B200 NVL).	Roughly 40-60% cheaper than hyperscalers for equivalent GPUs; customers include OpenAI, Mistral, Jane Street. Often 8-GPU minimums.

Prices are representative June 2026 snapshots, not live quotes: per-token and per-GPU-hour rates move often, and open-model hosts may serve at different precision (FP8 vs FP4), which affects both price and quality. Provider links open each company's own site; confirm the current rate at the source before you budget.

How to choose an inference provider

Pick on the one axis that dominates your bill or your risk, not on the headline rate alone.

Frontier quality: default to OpenAI, Anthropic or Google. If output-token volume dominates your spend, Google Gemini and xAI Grok are materially cheaper at comparable quality.
Cheap open-weight at scale: start with DeepInfra, Together or Fireworks, all OpenAI-compatible and in the $0.05 to $1 per million token range. Verify the quantization (FP8 vs FP4) before committing, because it affects quality.
Latency-critical apps (voice, agents, real-time): use Groq or Cerebras for open models, SambaNova for the largest. The trade-off is no proprietary frontier models.
Enterprise and regulated workloads: use Azure, Bedrock, Snowflake Cortex or Databricks for governance and data residency, and accept a 10 to 40% premium. For sovereign needs, consider Nebius (EU), Sarvam (India) or Upstage (Korea).
GPU rental and self-hosting: GMI and Hyperbolic have the cheapest transparent H100 rates (about $2 to $3.20 per hour); CoreWeave for hyperscale clusters; Baseten or Replicate for managed serving without owning infrastructure.
Specialized needs: Cohere for retrieval (RAG) pipelines; Inception or CompactifAI when raw speed and cost per token are paramount; DeepSeek, Kimi, MiniMax or StepFun for frontier-adjacent quality at a fraction of Western pricing.

The crossover from serverless per-token to renting dedicated GPUs typically lands around 10 to 50 million tokens per day of steady load per model. Below that, pay-per-token wins; above it, price out reserved GPUs. To turn any per-token rate into what a real task costs, use thecost-per-task calculator, or rank models by value on the value leaderboard.

Why this is a snapshot, not a live feed

Inference pricing is the most volatile layer of the AI stack. Open-model hosts re-price weekly, frontier labs cut rates without notice, and several providers serve the same model at different precision, so two identical headline prices do not guarantee identical output. This directory is therefore a representative map of the market as of June 28, 2026, built to show the shape of the choice and ground each provider to its own page, not to quote a rate you can hold anyone to. Always confirm at the source before you commit spend.

Frequently asked questions

What are the best AI inference providers in 2026?: There is no single best one: the 34 providers tracked here split into five jobs. First-party labs (OpenAI, Anthropic, Google, xAI) win on frontier quality. Neutral platforms (DeepInfra, Together, Fireworks) host open-weight models cheapest. Custom-silicon clouds (Groq, Cerebras, SambaNova) win on speed. Hyperscalers (Azure, Bedrock, Snowflake, Databricks) win on enterprise governance. GPU clouds (GMI, CoreWeave) and compression specialists (CompactifAI) serve self-hosting and edge.
What is the cheapest way to run AI models?: For frontier-class quality, the cheapest per-token APIs are the China-built open-weight models: DeepSeek V4 Flash at roughly $0.14/$0.28 per million tokens, with MiniMax, Kimi and StepFun close behind. For open-weight hosting at scale, DeepInfra, Together and Fireworks converge near $0.05 to $1 per million tokens. Above roughly 10 to 50 million tokens per day of steady load per model, renting GPUs (GMI or Hyperbolic at about $2 to $3.20 per H100-hour) usually beats per-token pricing.
Which AI inference providers are the fastest?: The custom-silicon specialists. Cerebras runs open models at 1,800 to 3,000+ tokens per second on its wafer-scale chips, Groq at 500 to 1,000+ on its LPUs, and SambaNova is benchmarked among the fastest for the largest open models. The trade-off is that all three serve open models only: no proprietary frontier models and limited or no custom-model uploads.
What is an OpenAI-compatible API and why does it matter?: It means the provider exposes the same request and response format as OpenAI's API, so switching providers is usually just a base-URL and API-key change rather than a rewrite. 26 of the 34 providers here are OpenAI-compatible, which is why you can benchmark several on the same code and route to whichever wins on price, speed or quality.
How current are these prices?: They are representative snapshots as of June 28, 2026, each linked to the provider's official page. Per-token rates for open-weight hosts change often, sometimes weekly, and open-model hosts may serve at different precision (FP8 vs FP4), which changes both price and quality. Treat every figure as a starting point and confirm the live rate at the linked source before budgeting.

Sources

Each provider profile is grounded to that company's own home, pricing or documentation page, verified June 28, 2026. Prices are representative snapshots from those pages and from independent benchmarking by Artificial Analysis where noted:

Anthropic (2026). Official site and pricing. claude.com/pricing
Cohere (2026). Official site and pricing. cohere.com/pricing
DeepSeek (2026). Official site and pricing. api-docs.deepseek.com/quick_start/pricing
Google (2026). Official site and pricing. ai.google.dev/gemini-api/docs/pricing
Inception (Inception Labs) (2026). Official site and pricing. inceptionlabs.ai/models
Kimi (Moonshot AI) (2026). Official site and pricing. platform.moonshot.ai/docs/pricing
MiniMax (2026). Official site and pricing. platform.minimax.io/docs/guides/pricing-paygo
Mistral (2026). Official site and pricing. mistral.ai/pricing
OpenAI (2026). Official site and pricing. developers.openai.com/api/docs/pricing
Reka AI (2026). Official site and pricing. docs.reka.ai/pricing
Sarvam (Sarvam AI) (2026). Official site and pricing. sarvam.ai/api-pricing
StepFun (2026). Official site and pricing. platform.stepfun.ai/docs/en/guides/pricing/details
Upstage (2026). Official site and pricing. upstage.ai/pricing/api
xAI (Grok) (2026). Official site and pricing. x.ai/api
Amazon Bedrock (2026). Official site and pricing. aws.amazon.com/bedrock/pricing
Databricks (Mosaic AI) (2026). Official site and pricing. databricks.com/product/pricing/foundation-model-serving
Microsoft Azure (2026). Official site and pricing. azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/
Snowflake (Cortex) (2026). Official site and pricing. docs.snowflake.com/en/user-guide/snowflake-cortex
Baseten (2026). Official site and pricing. baseten.co/pricing
DeepInfra (2026). Official site and pricing. deepinfra.com/pricing
Fireworks AI (2026). Official site and pricing. fireworks.ai/pricing
FriendliAI (2026). Official site and pricing. friendli.ai/pricing
GMI (GMI Cloud) (2026). Official site and pricing. gmicloud.ai/en/pricing
Hyperbolic (2026). Official site and pricing. docs.hyperbolic.xyz/docs/hyperbolic-pricing
Nebius (2026). Official site and pricing. nebius.com/token-factory/prices
Parasail (2026). Official site and pricing. saas.parasail.io/pricing
Replicate (2026). Official site and pricing. replicate.com/pricing
SiliconFlow (2026). Official site and pricing. siliconflow.com/pricing
Together AI (2026). Official site and pricing. together.ai/pricing
Cerebras (2026). Official site and pricing. cerebras.ai/pricing
Groq (2026). Official site and pricing. groq.com/pricing
SambaNova (2026). Official site and pricing. cloud.sambanova.ai/plans
CompactifAI (Multiverse Computing) (2026). Official site and pricing. multiversecomputing.com/compactifai/api
CoreWeave (2026). Official site and pricing. coreweave.com/pricing

Machine-readable data: /ai-inference-providers.json. Funding figures (Fireworks, Baseten, Sarvam, Groq, CompactifAI) are from reporting by Bloomberg, Sacra and company announcements as cited in each profile.

← All tools & trackers