Step 3.7 Flash

efficiency

StepFun · Released on 2026-05-28

StepFun's latest high-efficiency multimodal Mixture-of-Experts model, released May 28, 2026. It pairs a 196B-parameter language backbone with a vision encoder for native image and video understanding, activating roughly 11B parameters per token. With a 256K context window and selectable reasoning levels, it targets coding, agentic workflows, structured outputs and long-context productivity at a fraction of frontier pricing.

Overall Score: 79

Voice of the community

Sample size: 6

Step 3.7 Flash pairs a 196B MoE language backbone with a vision encoder for native image and video understanding, activating ~11B parameters per token.

StepFun (launch) · 2026-05-28

Enterprise-ready multimodal model combining perception, search and multi-step reasoning, runnable on NVIDIA GPUs via TensorRT-LLM.

NVIDIA Technical Blog · 2026-05-28

Core Specs

Context Window: 256K
Max Output: 66K
Reasoning · Open Source · text · image · video
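
The two limits above interact when sizing a request. The sketch below shows one way to work out how much prompt fits once room is reserved for the reply. It rests on two assumptions the listing does not confirm: that "256K" and "66K" mean 256,000 and 66,000 tokens, and that generated tokens count against the same context window as the prompt. The function name is hypothetical.

```python
# Assumed exact values; the listing only says "256K" and "66K".
CONTEXT_WINDOW_TOKENS = 256_000
MAX_OUTPUT_TOKENS = 66_000


def max_prompt_tokens(reserved_output_tokens: int) -> int:
    """Return the prompt budget left after reserving space for the reply.

    Assumes output tokens share the context window with the prompt.
    """
    if not 0 <= reserved_output_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(
            f"reserved output must be between 0 and {MAX_OUTPUT_TOKENS} tokens"
        )
    return CONTEXT_WINDOW_TOKENS - reserved_output_tokens


if __name__ == "__main__":
    # Reserving the full output allowance leaves the smallest prompt budget.
    print(max_prompt_tokens(MAX_OUTPUT_TOKENS))
```

If the provider counts the two limits separately, the subtraction does not apply and the full window is available to the prompt.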

Pros & Cons

Sentiment: 50% positive · 50% neutral · 0% negative

Pros

  • Very low pricing ($0.20/$1.15 per MTok) for a 196B multimodal MoE
  • Native image and video input (256K context)
  • Selectable reasoning levels (high/medium/low) for speed/cost control
  • Open weights, deployable via vLLM / SGLang / TensorRT-LLM
  • Tuned for coding and agentic workflows

Cons

  • No independent benchmarks yet; capability claims are unverified
  • Efficiency tier, not positioned to beat frontier flagships on hard reasoning
  • Limited English-language community track record at launch

Pricing

Input (per 1M tokens): $0.20
Output (per 1M tokens): $1.15
Updated on 2026-05-29
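
As a rough illustration of what these rates mean per request, the sketch below multiplies token counts by the listed prices. The function name is hypothetical, and the sketch assumes flat per-token billing with no caching discounts or minimum charges. Whether reasoning tokens are billed as output is not stated in the listing.

```python
# Listed rates in USD per one million tokens (as of 2026-05-29).
INPUT_USD_PER_MTOK = 0.20
OUTPUT_USD_PER_MTOK = 1.15


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request under flat per-token pricing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # A 100,000-token prompt with a 10,000-token reply.
    print(f"${estimate_cost_usd(100_000, 10_000):.4f}")
```

For that example the input side contributes $0.02 and the output side $0.0115, so output dominates the bill only when replies are long relative to the prompt.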

Get Started

1. Visit the provider's website
2. Create an account
3. Start using the model

Benchmarks

Note: No independent third-party benchmarks have been published yet (released 2026-05-28). Scenario scores below are preliminary positioning estimates for an efficiency-tier multimodal model, pending community data.

Reliability

Incidents (30d): 0