Skip to content
Capital & Compute
Directory· Updated June 28, 2026

AI inference providers

Where to actually run a model, mapped. 34 providers across five categories, from the frontier labs to the open-weight hosts, the speed specialists, the enterprise clouds and the GPU rental market, each with representative pricing and a link to its own page. For which model to run, see the model release tracker; for monthly plan prices, the coding plan comparison; and for free options, thefree AI models list.

What are the best AI inference providers?

The best AI inference provider depends on what you optimize for. For frontier quality, use OpenAI, Anthropic or Google. For cheap open-weight models, use DeepInfra, Together or Fireworks. For raw speed, use Groq, Cerebras or SambaNova. For enterprise governance, use Azure, Bedrock, Snowflake or Databricks; for GPU rental, GMI or CoreWeave.

34
Providers mapped
Across five categories, from frontier labs to GPU clouds
$0.6-$30
Flagship output, per Mtok
From Upstage (Solar Pro 3) to OpenAI (GPT-5.5)
26 of 34
OpenAI-compatible
Switching is usually a base-URL change, not a rewrite
18
Offer a free tier
Free credits or a standing zero-cost allowance to start

The flagship-tier price spread

Among the first-party labs alone, the flagship output rate spans more than 50x, fromUpstage (Solar Pro 3) to OpenAI (GPT-5.5). Across the whole market, counting cheap open-weight hosting and compressed models, per-token output runs from about $0.14 to $180 per million tokens: four orders of magnitude. The lesson is that the sticker rate is a starting point, not the bill.

Flagship model output price per million tokens, by labA horizontal bar chart of flagship output prices per million tokens. Upstage Solar Pro 3 is cheapest at $0.60, rising through Inception Mercury 2 at $0.75, MiniMax M2.7 at $1.20, StepFun Step3 at $1.42, xAI Grok 4.3 at $2.50, DeepSeek V4 Pro at $3.48, Kimi K2.6 at $4.00, Mistral Large at $6.00, Cohere Command A at $10, Google Gemini 3.1 Pro at $12, Anthropic Claude Opus 4.8 at $25, up to OpenAI GPT-5.5 at $30.$0$10$20$30Upstage (Solar Pro 3)$0.6Inception (Mercury 2)$0.75MiniMax M2.7$1.2StepFun (Step3)$1.42xAI (Grok 4.3)$2.5DeepSeek V4 Pro$3.48Kimi K2.6$4Mistral Large$6Cohere (Command A)$10Google (Gemini 3.1 Pro)$12Anthropic (Claude Opus 4.8)$25OpenAI (GPT-5.5)$30
Flagship model output price per million tokens, by lab
ItemValue
Upstage (Solar Pro 3)$0.6
Inception (Mercury 2)$0.75
MiniMax M2.7$1.2
StepFun (Step3)$1.42
xAI (Grok 4.3)$2.5
DeepSeek V4 Pro$3.48
Kimi K2.6$4
Mistral Large$6
Cohere (Command A)$10
Google (Gemini 3.1 Pro)$12
Anthropic (Claude Opus 4.8)$25
OpenAI (GPT-5.5)$30
Flagship model output price per million tokens, by first-party lab, cheapest first. Output tokens are where the cost concentrates: across this line, output runs 2 to 5x the input rate.Source: Provider official pricing pages, representative as of June 2026

The five kinds of inference provider

Every provider sits in one of five buckets, and the bucket usually decides the choice before the individual provider does. Each table below is grounded to the providers' own pages and dated June 28, 2026. To search and filter all 34 at once, use thedirectory tool below.

First-party model labs

The companies that train the models and serve them through their own API. Choose these for frontier quality and day-one access.

ProviderWhat it offersRepresentative pricingKnown for
AnthropicUSFrontier proprietary API: Claude Opus 4.8, Sonnet 5, Haiku 4.5, with Fable 5 and Mythos 5 above Opus. All at 1M-token context.Opus 4.8 $5/$25, Sonnet 5 $3/$15 (intro $2/$10 through Aug 31 2026), Haiku 4.5 $1/$5 per Mtok. Batch 50% off; prompt caching up to 90% off.Output costs 5x input across the line; pricing has held steady across generations. Run-rate revenue passed $30B in early 2026.
CohereUSCommand (generation: Command A/R+/R/R7B), Embed (vectors), Rerank (neural reranking). Strong on data sovereignty and on-prem.Command R+ / Command A $2.50/$10, Command R $0.15/$0.60, Command R7B $0.0375/$0.15 per Mtok. Embed v3 $0.10/M.Free: Free trial key (1,000 calls/month, not for production).Best-in-class Embed plus Rerank stack for retrieval; strong on data sovereignty, VPC and on-prem deployment.
DeepSeekChina · OpenAI-compatibleV4 Flash (cheapest frontier-class API) and V4 Pro. Both 1M ctx, 384K max output. Open weights with 10M+ downloads.V4 Flash $0.14/$0.28, V4 Pro $1.74/$3.48 per Mtok. Cache hits at 1/10 standard input.Roughly 90-95% cheaper than comparable Western models, with open weights. Owned by hedge fund High-Flyer.
GoogleUS · OpenAI-compatibleGemini 3.x API (3.1 Pro, 3.5 Flash, 3.1 Flash-Lite). AI Studio is free for prototyping; Vertex AI adds enterprise SLAs and compliance.Gemini 3.1 Pro $2/$12 (to 200K ctx), 3.5 Flash $1.50/$9, Flash-Lite $0.25/$1.50 per Mtok. 90% context-caching discount.Free: AI Studio free for prototyping; Flash retains a free tier.Largest production context window (2M tokens) and the cheapest Tier-1 budget model (Flash-Lite). Pro models are paid-only as of April 1, 2026.
Inception (Inception Labs)US · OpenAI-compatibleMercury 2 (reasoning dLLM, 128K ctx, >1,000 tok/s on Blackwell), Mercury Coder, Mercury Edit 2. Default in Continue and Zed.Mercury 2 $0.25/$0.75 per Mtok (cached input $0.025/M).Free: 10M free tokens per new account.First commercially available diffusion LLM, for ~5-10x faster, cheaper inference than one-token-at-a-time models.
Kimi (Moonshot AI)China · OpenAI-compatibleKimi K2.6 (1T-param MoE/32B active, 256K ctx, multimodal), K2.5 (cheaper), K2.7-Code (coding). Agent Swarm up to 300 subagents.K2.6 $0.95/$4.00, K2.5 $0.60/$3.00 per Mtok; cached input $0.10-0.16/M. Batch API 40% off.Roughly 8-10x cheaper than Claude Opus at frontier-adjacent quality.
MiniMaxChina · OpenAI-compatibleM-series: M2.7, M3, plus legacy abab6.5 and MiniMax-01 (1M ctx). A faster highspeed variant at 2x.M2.7 $0.30/$1.20 per Mtok (official), cache reads $0.06/M.Frontier-class coding and agentic quality at ~5-10% of Claude Opus output pricing. M2.7 restricts commercial use (M2 was MIT).
MistralEU · OpenAI-compatibleMistral Large 3, Medium 3.5, Small 3, Codestral (code), Ministral 3B/8B (edge), Pixtral (vision), OCR. Many under Apache 2.0.Large 2 tier $2/$6, Small 3 $0.10/$0.30, Ministral 3B ~$0.04/$0.04 per Mtok.Free: Free experimentation tier via la Plateforme.Among the cheapest flagship-tier output pricing, plus EU data residency and genuine open weights for self-hosting.
OpenAIUS · OpenAI-compatibleFrontier proprietary API (GPT-5.5, GPT-5.4/Mini/Nano, GPT-5.x Pro, GPT-5.3 Codex), plus Batch and Realtime APIs.GPT-5.5 $5/$30, GPT-5.4 $2.50/$15, Nano $0.20/$1.25 per Mtok. Batch 50% off; cached input up to 90% off.Free: $5 in free credits for new accounts (expire in 3 months).The default frontier benchmark. Note: winding down its fine-tuning platform (closed to new users as of May 2026).
Reka AIUSReka Core (top reasoning), Flash/Flash 3 (21B), Edge (7B vision-language), Spark. Deployable cloud, on-prem or on-device.Edge $0.10/$0.10, Flash 3 $0.10/$0.20 per Mtok (via OpenRouter); Core is most expensive.Flexible deployment down to edge and device; Reka Edge uses only 64 tokens per image tile for low-latency robotics and AR.
Sarvam (Sarvam AI)India · OpenAI-compatibleSarvam-30B and Sarvam-105B (MoE, trained on Indian compute, 128K ctx, open-weight), Bulbul (TTS), Saaras (STT), translation (22 languages), OCR.Chat completion ~Rs 4 input / Rs 16 output per Mtok (105B tier); STT Rs 45/hour. Free credits on signup.Free: Free credits on signup.INR-denominated pricing avoids USD plus GST overhead; data hosted in India; best-in-class Indic-language and OCR performance.
StepFunChina · OpenAI-compatibleStep 3.7 Flash (196B/~11B active, 256K ctx, native image and video input), Step 3.5 Flash, Step3.Step 3.7 Flash $0.20/$1.15, Step 3.5 Flash $0.09/$0.30, Step3 $0.57/$1.42 per Mtok.Disproportionately strong agentic benchmark scores relative to its price tier.
UpstageKorea · OpenAI-compatibleSolar Pro 3 (102B total/12B active, 128K ctx, tuned for Korean/English/Japanese), Solar Pro 2, Document Parse/Extract.Solar Pro 3 ~$0.15/$0.60 per Mtok (via OpenRouter); Document Parse ~$0.01/page. Prices exclude 10% VAT.Best positioned for Korean-language and structured document and instruction-following tasks.
xAI (Grok)US · OpenAI-compatibleFrontier proprietary API: Grok 4.3 (flagship, ~1M ctx), Grok 4.20 (2M ctx long-context), Grok 4.1 Fast (cheap workhorse).Grok 4.3 $1.25/$2.50, Grok 4.1 Fast $0.20/$0.50 per Mtok. Batch 50% off; cached input ~90% off.Free: Free developer credits via data-sharing program (~$150-175/mo reported).Only frontier model with live grounding to X posts; aggressive pricing undercuts GPT-5.4. API is independent of X subscriptions.

Cloud hyperscaler marketplaces

Cloud platforms that resell many labs models behind enterprise governance, compliance and data residency. Expect a 10-40% premium over raw token cost.

ProviderWhat it offersRepresentative pricingKnown for
Amazon BedrockUSFoundation models from Anthropic, Meta, Mistral, Cohere, AI21, Amazon (Nova/Titan), Stability, and now OpenAI. AgentCore for agents.Per-token, matching providers (Claude Sonnet 5 $3/$15, Nova Micro $0.035/$0.14 per Mtok). Batch 50% off; caching up to 90% off.Deep AWS integration, FedRAMP and HIPAA compliance. Watch hidden costs (OpenSearch Serverless ~$345/mo for Knowledge Bases).
Databricks (Mosaic AI)US · OpenAI-compatibleOpen foundation models (Llama, DBRX) plus external models (OpenAI, Anthropic, Cohere). Tight Unity Catalog governance.Consumption via DBUs from ~$0.07/DBU; pay-per-token, provisioned throughput (Llama 3.3 70B from $6/hr per band), and batch.Free: 14-day free trial; Free Edition available.Tight integration with Unity Catalog governance and data pipelines; OpenAI-compatible API.
Microsoft AzureUS · OpenAI-compatibleHosts OpenAI (GPT-5 family, Sora, image), plus DeepSeek, Grok, Llama, Mistral, FLUX, and managed GPU compute.Token pricing matches OpenAI direct (GPT-5 $1.25/$10 per Mtok). PTUs for sustained load from ~$2,448/mo.Free: $200 free credit for 30 days.Enterprise governance, compliance and Azure integration. Real bills often run 15-40% above raw token cost (support, networking, search).
Snowflake (Cortex)USPre-integrated Arctic, Llama, Mistral, Reka, Google, plus OpenAI, Anthropic, DeepSeek. AISQL, Cortex Search, Analyst and Agents.Consumption and credit-based, token-metered, roughly $0.12-5.10 per Mtok depending on model; warehouse compute billed separately.Free: No dedicated free tier (trial credits only).AI runs where the data lives, no egress; Snowflake does not train on customer data.

Neutral open-weight inference platforms

Vendor-neutral clouds that host open-weight models per token, plus GPU rental and fine-tuning. The commodity layer for cheap open-model inference.

ProviderWhat it offersRepresentative pricingKnown for
BasetenUS · OpenAI-compatiblePer-minute dedicated GPU billing (T4 ~$0.01/min up to B200 ~$0.166/min) with scale-to-zero, plus a per-token Model APIs catalog.Model APIs median ~$0.60/$2.20 per Mtok; dedicated GPU T4 ~$0.01/min up to B200 ~$0.166/min.Free: New-account credits.Multi-cloud routing across ~18-20 providers, 1B+ inference calls/day. Closed a $1.5B Series F at a $13B valuation (June 22, 2026).
DeepInfraUS · OpenAI-compatible190+ models (Llama, Qwen, DeepSeek, GLM, Gemma, Mistral, Nemotron, Kimi) plus embeddings, TTS, image. Dedicated GPUs by the hour.From ~$0.06/M for small models; DeepSeek V4 Flash $0.10/$0.20 per Mtok. ~5T tokens/week.Free: No standing free tier.Among the cheapest serverless options, runs its own US data centers including Blackwell B200.
Fireworks AIUS · OpenAI-compatibleServerless per-token (DeepSeek/Kimi/GLM/MiniMax catalog), on-demand GPUs, fine-tuning, reserved capacity. 30T+ tokens/day.8B-class ~$0.20/M, 70B-class ~$0.90/M; H100/H200 $6/hr, B200 $9/hr. Batch 50% off. Often 20-40% below Together.Free: $1 free starter credit.Day-zero model support and rapid growth: $315M annualized revenue as of February 2026; in talks at a $15B valuation in mid-2026.
FriendliAIUS · OpenAI-compatibleServerless Endpoints (OpenAI-compatible), Dedicated Endpoints (per GPU-hour), Container (on-prem). DeepSeek, Qwen, Kimi, GLM, Llama, EXAONE.Pay-per-token serverless plus per GPU-hour dedicated. Claims 50-90% cost savings vs vLLM.Free: $5 free credits.Claims up to 3x faster than vLLM via custom kernels, speculative decoding, continuous batching. SOC 2 Type II plus HIPAA, 99.99% uptime SLA.
GMI (GMI Cloud)US · OpenAI-compatibleBare-metal plus serverless plus dedicated clusters. Inference Engine (auto-scaling), Cluster Engine, and Model-as-a-Service.On-demand GPU/hr: H100 from $2.00, H200 from $2.60, B200 from $4.00, GB200 from $8.00. Reserved cuts 30-50%.Claims 40-70% savings vs hyperscalers; pure bare-metal rates with no forced CPU or networking upsell.
HyperbolicUS · OpenAI-compatible25+ open models plus a GPU marketplace that aggregates idle GPUs. Pay by card or crypto. Serves Llama-3.1-405B-Base in BF16.GPU/hr: RTX 4090 $0.50, A100 ~$1.60-1.80, H100 PCIe $3.00, H100 SXM $3.20. Llama 3.3 70B $0.40/M.Free: $1 promo credit (not for GPU rental).Up to 75% savings vs hyperscalers by pooling idle GPUs; only platform serving Llama-3.1-405B-Base in BF16.
NebiusEU · OpenAI-compatibleToken Factory: 60+ open-source models in Fast/Base tiers, OpenAI-compatible. Also H100/H200/B200/B300 rental with per-second billing.Cheapest ~$0.06-0.08/M input (Nemotron 3 Nano $0.08 blended) up to ~$1.93/M for DeepSeek V4 Pro. Reserve discounts up to 35%.EU data residency, full-stack managed compute plus inference, and a ~$2B NVIDIA deal with early Rubin access.
ParasailUS · OpenAI-compatibleServerless per-token (GLM-5.2, Kimi, DeepSeek V4, MiniMax, Qwen, gpt-oss, Llama), dedicated endpoints and batch. Deploy any HF model in ~5 lines.~$0.09/M for MiMo-V2.5 up to ~$0.90/M for GLM-5.1. 500B tokens/day.Lossless by default, no hidden quantization; pay-per-token with no GPU contracts. Founded by an ex-Groq exec; raised $32M Series A.
ReplicateUS50,000+ community models plus ~100 curated Official Models (Claude, DeepSeek, FLUX, Veo, Kling). Deploy custom models via Cog.Hardware-per-second (CPU $0.000025/s up to H100 ~$0.001525/s) and output-based (per token/image/video).Largest open model catalog and easiest experimentation. Cold starts and unpredictable per-second billing are the main drawbacks.
SiliconFlowChina · OpenAI-compatible200+ models across text/image/video/audio (DeepSeek, Qwen, GLM, Kimi, MiniMax, Step, FLUX). Backed by Alibaba Cloud.Pay-as-you-go per-token (DeepSeek V4 Flash $0.14/$0.28 per Mtok); reserved GPU ~CNY 2.73/hr.Free: $1 credits on signup; some smaller models permanently free.6M+ users, 100B+ daily tokens, fastest to handle DeepSeek traffic.
Together AIUS · OpenAI-compatibleServerless per-token (DeepSeek, Llama, Qwen, Kimi, GLM, MiniMax, Mixtral), plus fine-tuning, dedicated endpoints and GPU clusters.DeepSeek V3.1 $0.60/$1.70, GPT-OSS 20B $0.05/$0.20 per Mtok; H100 reserved ~$3.99/hr. $5 minimum credit.Broad catalog, research-driven optimization (FlashAttention lineage), full-stack from serverless to clusters.

Custom-silicon speed specialists

Inference clouds built on bespoke chips (LPU, wafer-scale, RDU) that compete almost entirely on tokens per second. Open models only.

ProviderWhat it offersRepresentative pricingKnown for
CerebrasUS · OpenAI-compatibleLlama, Qwen, DeepSeek distills, GPT-OSS. No proprietary frontier models and no custom-model uploads.~$0.10-6 per Mtok depending on model ($0.35/M cheapest input). Pay-as-you-go and enterprise tiers.Free: 1M tokens/day, no credit card.Fastest inference benchmarked by Artificial Analysis.
GroqUS · OpenAI-compatibleRuns Llama, Qwen, Kimi, GPT-OSS, DeepSeek distills and Whisper at 500-1,000+ tok/s. Catalog is open-source only.Llama 3.1 8B $0.05/$0.08, Llama 3.3 70B $0.59/$0.79, Kimi K2 $1/$3 per Mtok. Batch and caching each cut 50%.Free: Free developer tier (no credit card).Among the fastest inference available. NVIDIA agreed to pay ~$20B for a perpetual license to Groq's LPU patents (finalized December 24, 2025); GroqCloud continues operating.
SambaNovaUS · OpenAI-compatibleFast serving of large open models (Llama, DeepSeek 671B, Qwen, MiniMax) with a three-tier memory architecture.Pay-per-token; rates listed on the SambaCloud plans page.Benchmarked by Artificial Analysis as among the fastest for large models (MiniMax M2.7 at 435 tok/s). Strong sovereign-AI and on-prem story.

GPU cloud and compression niche

Specialists at the edges: raw GPU rental at hyperscale, and model compression that shrinks open models for cheaper, faster, edge-ready inference.

ProviderWhat it offersRepresentative pricingKnown for
CompactifAI (Multiverse Computing)SpainServes compressed Slim models via API and the AWS/Azure marketplaces. HyperNova 60B (from gpt-oss-120b), compressed Llama/DeepSeek/Mistral.HyperNova 60B $0.04/$0.14 per Mtok; Llama 3.3 70B Slim ~$0.15/$0.31 per Mtok.Claims compressed models beat their base models on speed and cost, runnable on edge devices down to Raspberry Pi. Raised a 189M euro Series B (June 12, 2025).
CoreWeaveUSRents NVIDIA A100, H100, H200, GB200/B200, GB300. Per-second billing, no egress fees; spot and reserved available.8x H100 node ~$49.24/hr (~$6.16/GPU/hr); single GPUs from ~$1.19/hr (A100 PCIe) to $10.50/hr (B200 NVL).Roughly 40-60% cheaper than hyperscalers for equivalent GPUs; customers include OpenAI, Mistral, Jane Street. Often 8-GPU minimums.

Search and filter all 34 providers

Filter by category, or search by name, what a provider offers, or what it is known for.

Search the directory

Filter all 34 providers by category, or search by name, what they offer, or what they are known for.

34 shown
ProviderCategoryWhat it offersRepresentative pricingKnown for
Anthropic USFirst-party model labsFrontier proprietary API: Claude Opus 4.8, Sonnet 5, Haiku 4.5, with Fable 5 and Mythos 5 above Opus. All at 1M-token context.Per token Opus 4.8 $5/$25, Sonnet 5 $3/$15 (intro $2/$10 through Aug 31 2026), Haiku 4.5 $1/$5 per Mtok. Batch 50% off; prompt caching up to 90% off.Output costs 5x input across the line; pricing has held steady across generations. Run-rate revenue passed $30B in early 2026.
Cohere USFirst-party model labsCommand (generation: Command A/R+/R/R7B), Embed (vectors), Rerank (neural reranking). Strong on data sovereignty and on-prem.Per token Command R+ / Command A $2.50/$10, Command R $0.15/$0.60, Command R7B $0.0375/$0.15 per Mtok. Embed v3 $0.10/M.Best-in-class Embed plus Rerank stack for retrieval; strong on data sovereignty, VPC and on-prem deployment.
DeepSeek China· OpenAI-compatibleFirst-party model labsV4 Flash (cheapest frontier-class API) and V4 Pro. Both 1M ctx, 384K max output. Open weights with 10M+ downloads.Per token V4 Flash $0.14/$0.28, V4 Pro $1.74/$3.48 per Mtok. Cache hits at 1/10 standard input.Roughly 90-95% cheaper than comparable Western models, with open weights. Owned by hedge fund High-Flyer.
Google US· OpenAI-compatibleFirst-party model labsGemini 3.x API (3.1 Pro, 3.5 Flash, 3.1 Flash-Lite). AI Studio is free for prototyping; Vertex AI adds enterprise SLAs and compliance.Per token Gemini 3.1 Pro $2/$12 (to 200K ctx), 3.5 Flash $1.50/$9, Flash-Lite $0.25/$1.50 per Mtok. 90% context-caching discount.Largest production context window (2M tokens) and the cheapest Tier-1 budget model (Flash-Lite). Pro models are paid-only as of April 1, 2026.
Inception (Inception Labs) US· OpenAI-compatibleFirst-party model labsMercury 2 (reasoning dLLM, 128K ctx, >1,000 tok/s on Blackwell), Mercury Coder, Mercury Edit 2. Default in Continue and Zed.Per token Mercury 2 $0.25/$0.75 per Mtok (cached input $0.025/M).First commercially available diffusion LLM, for ~5-10x faster, cheaper inference than one-token-at-a-time models.
Kimi (Moonshot AI) China· OpenAI-compatibleFirst-party model labsKimi K2.6 (1T-param MoE/32B active, 256K ctx, multimodal), K2.5 (cheaper), K2.7-Code (coding). Agent Swarm up to 300 subagents.Per token K2.6 $0.95/$4.00, K2.5 $0.60/$3.00 per Mtok; cached input $0.10-0.16/M. Batch API 40% off.Roughly 8-10x cheaper than Claude Opus at frontier-adjacent quality.
MiniMax China· OpenAI-compatibleFirst-party model labsM-series: M2.7, M3, plus legacy abab6.5 and MiniMax-01 (1M ctx). A faster highspeed variant at 2x.Per token M2.7 $0.30/$1.20 per Mtok (official), cache reads $0.06/M.Frontier-class coding and agentic quality at ~5-10% of Claude Opus output pricing. M2.7 restricts commercial use (M2 was MIT).
Mistral EU· OpenAI-compatibleFirst-party model labsMistral Large 3, Medium 3.5, Small 3, Codestral (code), Ministral 3B/8B (edge), Pixtral (vision), OCR. Many under Apache 2.0.Per token Large 2 tier $2/$6, Small 3 $0.10/$0.30, Ministral 3B ~$0.04/$0.04 per Mtok.Among the cheapest flagship-tier output pricing, plus EU data residency and genuine open weights for self-hosting.
OpenAI US· OpenAI-compatibleFirst-party model labsFrontier proprietary API (GPT-5.5, GPT-5.4/Mini/Nano, GPT-5.x Pro, GPT-5.3 Codex), plus Batch and Realtime APIs.Per token GPT-5.5 $5/$30, GPT-5.4 $2.50/$15, Nano $0.20/$1.25 per Mtok. Batch 50% off; cached input up to 90% off.The default frontier benchmark. Note: winding down its fine-tuning platform (closed to new users as of May 2026).
Reka AI USFirst-party model labsReka Core (top reasoning), Flash/Flash 3 (21B), Edge (7B vision-language), Spark. Deployable cloud, on-prem or on-device.Per token Edge $0.10/$0.10, Flash 3 $0.10/$0.20 per Mtok (via OpenRouter); Core is most expensive.Flexible deployment down to edge and device; Reka Edge uses only 64 tokens per image tile for low-latency robotics and AR.
Sarvam (Sarvam AI) India· OpenAI-compatibleFirst-party model labsSarvam-30B and Sarvam-105B (MoE, trained on Indian compute, 128K ctx, open-weight), Bulbul (TTS), Saaras (STT), translation (22 languages), OCR.Per token Chat completion ~Rs 4 input / Rs 16 output per Mtok (105B tier); STT Rs 45/hour. Free credits on signup.INR-denominated pricing avoids USD plus GST overhead; data hosted in India; best-in-class Indic-language and OCR performance.
StepFun China· OpenAI-compatibleFirst-party model labsStep 3.7 Flash (196B/~11B active, 256K ctx, native image and video input), Step 3.5 Flash, Step3.Per token Step 3.7 Flash $0.20/$1.15, Step 3.5 Flash $0.09/$0.30, Step3 $0.57/$1.42 per Mtok.Disproportionately strong agentic benchmark scores relative to its price tier.
Upstage Korea· OpenAI-compatibleFirst-party model labsSolar Pro 3 (102B total/12B active, 128K ctx, tuned for Korean/English/Japanese), Solar Pro 2, Document Parse/Extract.Per token Solar Pro 3 ~$0.15/$0.60 per Mtok (via OpenRouter); Document Parse ~$0.01/page. Prices exclude 10% VAT.Best positioned for Korean-language and structured document and instruction-following tasks.
xAI (Grok) US· OpenAI-compatibleFirst-party model labsFrontier proprietary API: Grok 4.3 (flagship, ~1M ctx), Grok 4.20 (2M ctx long-context), Grok 4.1 Fast (cheap workhorse).Per token Grok 4.3 $1.25/$2.50, Grok 4.1 Fast $0.20/$0.50 per Mtok. Batch 50% off; cached input ~90% off.Only frontier model with live grounding to X posts; aggressive pricing undercuts GPT-5.4. API is independent of X subscriptions.
Amazon Bedrock USCloud hyperscaler marketplacesFoundation models from Anthropic, Meta, Mistral, Cohere, AI21, Amazon (Nova/Titan), Stability, and now OpenAI. AgentCore for agents.Mixed Per-token, matching providers (Claude Sonnet 5 $3/$15, Nova Micro $0.035/$0.14 per Mtok). Batch 50% off; caching up to 90% off.Deep AWS integration, FedRAMP and HIPAA compliance. Watch hidden costs (OpenSearch Serverless ~$345/mo for Knowledge Bases).
Databricks (Mosaic AI) US· OpenAI-compatibleCloud hyperscaler marketplacesOpen foundation models (Llama, DBRX) plus external models (OpenAI, Anthropic, Cohere). Tight Unity Catalog governance.Consumption Consumption via DBUs from ~$0.07/DBU; pay-per-token, provisioned throughput (Llama 3.3 70B from $6/hr per band), and batch.Tight integration with Unity Catalog governance and data pipelines; OpenAI-compatible API.
Microsoft Azure US· OpenAI-compatibleCloud hyperscaler marketplacesHosts OpenAI (GPT-5 family, Sora, image), plus DeepSeek, Grok, Llama, Mistral, FLUX, and managed GPU compute.Mixed Token pricing matches OpenAI direct (GPT-5 $1.25/$10 per Mtok). PTUs for sustained load from ~$2,448/mo.Enterprise governance, compliance and Azure integration. Real bills often run 15-40% above raw token cost (support, networking, search).
Snowflake (Cortex) USCloud hyperscaler marketplacesPre-integrated Arctic, Llama, Mistral, Reka, Google, plus OpenAI, Anthropic, DeepSeek. AISQL, Cortex Search, Analyst and Agents.Consumption Consumption and credit-based, token-metered, roughly $0.12-5.10 per Mtok depending on model; warehouse compute billed separately.AI runs where the data lives, no egress; Snowflake does not train on customer data.
Baseten US· OpenAI-compatibleNeutral open-weight inference platformsPer-minute dedicated GPU billing (T4 ~$0.01/min up to B200 ~$0.166/min) with scale-to-zero, plus a per-token Model APIs catalog.Mixed Model APIs median ~$0.60/$2.20 per Mtok; dedicated GPU T4 ~$0.01/min up to B200 ~$0.166/min.Multi-cloud routing across ~18-20 providers, 1B+ inference calls/day. Closed a $1.5B Series F at a $13B valuation (June 22, 2026).
DeepInfra US· OpenAI-compatibleNeutral open-weight inference platforms190+ models (Llama, Qwen, DeepSeek, GLM, Gemma, Mistral, Nemotron, Kimi) plus embeddings, TTS, image. Dedicated GPUs by the hour.Mixed From ~$0.06/M for small models; DeepSeek V4 Flash $0.10/$0.20 per Mtok. ~5T tokens/week.Among the cheapest serverless options, runs its own US data centers including Blackwell B200.
Fireworks AI US· OpenAI-compatibleNeutral open-weight inference platformsServerless per-token (DeepSeek/Kimi/GLM/MiniMax catalog), on-demand GPUs, fine-tuning, reserved capacity. 30T+ tokens/day.Mixed 8B-class ~$0.20/M, 70B-class ~$0.90/M; H100/H200 $6/hr, B200 $9/hr. Batch 50% off. Often 20-40% below Together.Day-zero model support and rapid growth: $315M annualized revenue as of February 2026; in talks at a $15B valuation in mid-2026.
FriendliAI US· OpenAI-compatibleNeutral open-weight inference platformsServerless Endpoints (OpenAI-compatible), Dedicated Endpoints (per GPU-hour), Container (on-prem). DeepSeek, Qwen, Kimi, GLM, Llama, EXAONE.Mixed Pay-per-token serverless plus per GPU-hour dedicated. Claims 50-90% cost savings vs vLLM.Claims up to 3x faster than vLLM via custom kernels, speculative decoding, continuous batching. SOC 2 Type II plus HIPAA, 99.99% uptime SLA.
GMI (GMI Cloud) US· OpenAI-compatibleNeutral open-weight inference platformsBare-metal plus serverless plus dedicated clusters. Inference Engine (auto-scaling), Cluster Engine, and Model-as-a-Service.Per GPU-hour On-demand GPU/hr: H100 from $2.00, H200 from $2.60, B200 from $4.00, GB200 from $8.00. Reserved cuts 30-50%.Claims 40-70% savings vs hyperscalers; pure bare-metal rates with no forced CPU or networking upsell.
Hyperbolic US· OpenAI-compatibleNeutral open-weight inference platforms25+ open models plus a GPU marketplace that aggregates idle GPUs. Pay by card or crypto. Serves Llama-3.1-405B-Base in BF16.Mixed GPU/hr: RTX 4090 $0.50, A100 ~$1.60-1.80, H100 PCIe $3.00, H100 SXM $3.20. Llama 3.3 70B $0.40/M.Up to 75% savings vs hyperscalers by pooling idle GPUs; only platform serving Llama-3.1-405B-Base in BF16.
Nebius EU· OpenAI-compatibleNeutral open-weight inference platformsToken Factory: 60+ open-source models in Fast/Base tiers, OpenAI-compatible. Also H100/H200/B200/B300 rental with per-second billing.Mixed Cheapest ~$0.06-0.08/M input (Nemotron 3 Nano $0.08 blended) up to ~$1.93/M for DeepSeek V4 Pro. Reserve discounts up to 35%.EU data residency, full-stack managed compute plus inference, and a ~$2B NVIDIA deal with early Rubin access.
Parasail US· OpenAI-compatibleNeutral open-weight inference platformsServerless per-token (GLM-5.2, Kimi, DeepSeek V4, MiniMax, Qwen, gpt-oss, Llama), dedicated endpoints and batch. Deploy any HF model in ~5 lines.Mixed ~$0.09/M for MiMo-V2.5 up to ~$0.90/M for GLM-5.1. 500B tokens/day.Lossless by default, no hidden quantization; pay-per-token with no GPU contracts. Founded by an ex-Groq exec; raised $32M Series A.
Replicate USNeutral open-weight inference platforms50,000+ community models plus ~100 curated Official Models (Claude, DeepSeek, FLUX, Veo, Kling). Deploy custom models via Cog.Mixed Hardware-per-second (CPU $0.000025/s up to H100 ~$0.001525/s) and output-based (per token/image/video).Largest open model catalog and easiest experimentation. Cold starts and unpredictable per-second billing are the main drawbacks.
SiliconFlow China· OpenAI-compatibleNeutral open-weight inference platforms200+ models across text/image/video/audio (DeepSeek, Qwen, GLM, Kimi, MiniMax, Step, FLUX). Backed by Alibaba Cloud.Mixed Pay-as-you-go per-token (DeepSeek V4 Flash $0.14/$0.28 per Mtok); reserved GPU ~CNY 2.73/hr.6M+ users, 100B+ daily tokens, fastest to handle DeepSeek traffic.
Together AI US· OpenAI-compatibleNeutral open-weight inference platformsServerless per-token (DeepSeek, Llama, Qwen, Kimi, GLM, MiniMax, Mixtral), plus fine-tuning, dedicated endpoints and GPU clusters.Mixed DeepSeek V3.1 $0.60/$1.70, GPT-OSS 20B $0.05/$0.20 per Mtok; H100 reserved ~$3.99/hr. $5 minimum credit.Broad catalog, research-driven optimization (FlashAttention lineage), full-stack from serverless to clusters.
Cerebras US· OpenAI-compatibleCustom-silicon speed specialistsLlama, Qwen, DeepSeek distills, GPT-OSS. No proprietary frontier models and no custom-model uploads.Per token ~$0.10-6 per Mtok depending on model ($0.35/M cheapest input). Pay-as-you-go and enterprise tiers.Fastest inference benchmarked by Artificial Analysis.
Groq US· OpenAI-compatibleCustom-silicon speed specialistsRuns Llama, Qwen, Kimi, GPT-OSS, DeepSeek distills and Whisper at 500-1,000+ tok/s. Catalog is open-source only.Per token Llama 3.1 8B $0.05/$0.08, Llama 3.3 70B $0.59/$0.79, Kimi K2 $1/$3 per Mtok. Batch and caching each cut 50%.Among the fastest inference available. NVIDIA agreed to pay ~$20B for a perpetual license to Groq's LPU patents (finalized December 24, 2025); GroqCloud continues operating.
SambaNova US· OpenAI-compatibleCustom-silicon speed specialistsFast serving of large open models (Llama, DeepSeek 671B, Qwen, MiniMax) with a three-tier memory architecture.Per token Pay-per-token; rates listed on the SambaCloud plans page.Benchmarked by Artificial Analysis as among the fastest for large models (MiniMax M2.7 at 435 tok/s). Strong sovereign-AI and on-prem story.
CompactifAI (Multiverse Computing) SpainGPU cloud and compression nicheServes compressed Slim models via API and the AWS/Azure marketplaces. HyperNova 60B (from gpt-oss-120b), compressed Llama/DeepSeek/Mistral.Per token HyperNova 60B $0.04/$0.14 per Mtok; Llama 3.3 70B Slim ~$0.15/$0.31 per Mtok.Claims compressed models beat their base models on speed and cost, runnable on edge devices down to Raspberry Pi. Raised a 189M euro Series B (June 12, 2025).
CoreWeave USGPU cloud and compression nicheRents NVIDIA A100, H100, H200, GB200/B200, GB300. Per-second billing, no egress fees; spot and reserved available.Per GPU-hour 8x H100 node ~$49.24/hr (~$6.16/GPU/hr); single GPUs from ~$1.19/hr (A100 PCIe) to $10.50/hr (B200 NVL).Roughly 40-60% cheaper than hyperscalers for equivalent GPUs; customers include OpenAI, Mistral, Jane Street. Often 8-GPU minimums.

Prices are representative June 2026 snapshots, not live quotes: per-token and per-GPU-hour rates move often, and open-model hosts may serve at different precision (FP8 vs FP4), which affects both price and quality. Provider links open each company's own site; confirm the current rate at the source before you budget.

How to choose an inference provider

Pick on the one axis that dominates your bill or your risk, not on the headline rate alone.

  • Frontier quality: default to OpenAI, Anthropic or Google. If output-token volume dominates your spend, Google Gemini and xAI Grok are materially cheaper at comparable quality.
  • Cheap open-weight at scale: start with DeepInfra, Together or Fireworks, all OpenAI-compatible and in the $0.05 to $1 per million token range. Verify the quantization (FP8 vs FP4) before committing, because it affects quality.
  • Latency-critical apps (voice, agents, real-time): use Groq or Cerebras for open models, SambaNova for the largest. The trade-off is no proprietary frontier models.
  • Enterprise and regulated workloads: use Azure, Bedrock, Snowflake Cortex or Databricks for governance and data residency, and accept a 10 to 40% premium. For sovereign needs, consider Nebius (EU), Sarvam (India) or Upstage (Korea).
  • GPU rental and self-hosting: GMI and Hyperbolic have the cheapest transparent H100 rates (about $2 to $3.20 per hour); CoreWeave for hyperscale clusters; Baseten or Replicate for managed serving without owning infrastructure.
  • Specialized needs: Cohere for retrieval (RAG) pipelines; Inception or CompactifAI when raw speed and cost per token are paramount; DeepSeek, Kimi, MiniMax or StepFun for frontier-adjacent quality at a fraction of Western pricing.

The crossover from serverless per-token to renting dedicated GPUs typically lands around 10 to 50 million tokens per day of steady load per model. Below that, pay-per-token wins; above it, price out reserved GPUs. To turn any per-token rate into what a real task costs, use thecost-per-task calculator, or rank models by value on the value leaderboard.

Why this is a snapshot, not a live feed

Inference pricing is the most volatile layer of the AI stack. Open-model hosts re-price weekly, frontier labs cut rates without notice, and several providers serve the same model at different precision, so two identical headline prices do not guarantee identical output. This directory is therefore a representative map of the market as of June 28, 2026, built to show the shape of the choice and ground each provider to its own page, not to quote a rate you can hold anyone to. Always confirm at the source before you commit spend.

Frequently asked questions

What are the best AI inference providers in 2026?
There is no single best one: the 34 providers tracked here split into five jobs. First-party labs (OpenAI, Anthropic, Google, xAI) win on frontier quality. Neutral platforms (DeepInfra, Together, Fireworks) host open-weight models cheapest. Custom-silicon clouds (Groq, Cerebras, SambaNova) win on speed. Hyperscalers (Azure, Bedrock, Snowflake, Databricks) win on enterprise governance. GPU clouds (GMI, CoreWeave) and compression specialists (CompactifAI) serve self-hosting and edge.
What is the cheapest way to run AI models?
For frontier-class quality, the cheapest per-token APIs are the China-built open-weight models: DeepSeek V4 Flash at roughly $0.14/$0.28 per million tokens, with MiniMax, Kimi and StepFun close behind. For open-weight hosting at scale, DeepInfra, Together and Fireworks converge near $0.05 to $1 per million tokens. Above roughly 10 to 50 million tokens per day of steady load per model, renting GPUs (GMI or Hyperbolic at about $2 to $3.20 per H100-hour) usually beats per-token pricing.
Which AI inference providers are the fastest?
The custom-silicon specialists. Cerebras runs open models at 1,800 to 3,000+ tokens per second on its wafer-scale chips, Groq at 500 to 1,000+ on its LPUs, and SambaNova is benchmarked among the fastest for the largest open models. The trade-off is that all three serve open models only: no proprietary frontier models and limited or no custom-model uploads.
What is an OpenAI-compatible API and why does it matter?
It means the provider exposes the same request and response format as OpenAI's API, so switching providers is usually just a base-URL and API-key change rather than a rewrite. 26 of the 34 providers here are OpenAI-compatible, which is why you can benchmark several on the same code and route to whichever wins on price, speed or quality.
How current are these prices?
They are representative snapshots as of June 28, 2026, each linked to the provider's official page. Per-token rates for open-weight hosts change often, sometimes weekly, and open-model hosts may serve at different precision (FP8 vs FP4), which changes both price and quality. Treat every figure as a starting point and confirm the live rate at the linked source before budgeting.

Sources

Each provider profile is grounded to that company's own home, pricing or documentation page, verified June 28, 2026. Prices are representative snapshots from those pages and from independent benchmarking by Artificial Analysis where noted:

Machine-readable data: /ai-inference-providers.json. Funding figures (Fireworks, Baseten, Sarvam, Groq, CompactifAI) are from reporting by Bloomberg, Sacra and company announcements as cited in each profile.

← All tools & trackers