Step 3.7 Flash
efficiency • StepFun • Released on 2026-05-28
StepFun's latest high-efficiency multimodal Mixture-of-Experts model, released May 28, 2026. It pairs a 196B-parameter language backbone with a vision encoder for native image and video understanding, activating roughly 11B parameters per token. With a 256K context window and selectable reasoning levels, it targets coding, agentic workflows, structured outputs and long-context productivity at a fraction of frontier pricing.
79
Overall Score
Voice of the community
“Step 3.7 Flash pairs a 196B MoE language backbone with a vision encoder for native image and video understanding, activating ~11B parameters per token.”
“Enterprise-ready multimodal model combining perception, search and multi-step reasoning, runnable on NVIDIA GPUs via TensorRT-LLM.”
Core Specs
256K
Context Window
66K
Max Output
Reasoning • Open Source • text • image • video
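As a rough illustration of how the two limits above interact, the sketch below computes how many prompt tokens remain once room is reserved for the reply. It assumes that "256K" and "66K" mean 256,000 and 66,000 tokens and that output tokens count against the same context window; neither is confirmed on this page, and `max_prompt_tokens` is a hypothetical helper, not part of any StepFun SDK.

```python
# Assumption: "256K" and "66K" are read as round thousands of tokens.
CONTEXT_WINDOW_TOKENS = 256_000
MAX_OUTPUT_TOKENS = 66_000


def max_prompt_tokens(reserved_output: int) -> int:
    """Return the prompt budget left after reserving output tokens.

    Assumption: prompt and output share one context window.
    """
    if not 0 <= reserved_output <= MAX_OUTPUT_TOKENS:
        raise ValueError("reserved_output must be between 0 and MAX_OUTPUT_TOKENS")
    return CONTEXT_WINDOW_TOKENS - reserved_output


if __name__ == "__main__":
    # Reserve the full output allowance and see what is left for the prompt.
    print(max_prompt_tokens(MAX_OUTPUT_TOKENS))
```

If the provider counts the two limits separately, the full window would be available to the prompt and this subtraction would not apply.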
Scenario Scores
Pros & Cons
Sentiment: 50% positive • 50% neutral • 0% negative
Pros
- Very low pricing ($0.20/$1.15 per MTok) for a 196B multimodal MoE
- Native image and video input (256K context)
- Selectable reasoning levels (high/medium/low) for speed/cost control
- Open-weights, deployable via vLLM / SGLang / TensorRT-LLM
- Tuned for coding and agentic workflows
Cons
- No independent benchmarks yet; capability claims unverified
- Efficiency tier, not positioned to beat frontier flagships on hard reasoning
- Limited English-language community track record at launch
Pricing
Input (per 1M tokens): $0.20
Output (per 1M tokens): $1.15
Updated on 2026-05-29
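To make the per-million-token prices concrete, here is a minimal sketch that estimates the cost of a single request from the listed rates. The helper name `estimate_cost_usd` and the example token counts are illustrative assumptions; actual billing (for instance, how reasoning tokens or image and video inputs are metered) is not described on this page and may differ.

```python
# Listed prices, in USD per 1M tokens (from the Pricing section).
INPUT_USD_PER_MTOK = 0.20
OUTPUT_USD_PER_MTOK = 1.15


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 100,000-token prompt with a 4,000-token reply.
    print(f"${estimate_cost_usd(100_000, 4_000):.4f}")  # prints $0.0246
```

At these rates a long prompt with a short reply is dominated by input cost, while generation-heavy workloads are dominated by the higher output rate.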
Get Started
1. Visit the provider's website
2. Create an account
3. Start using the model
Benchmarks
Note: No independent third-party benchmarks published yet (released 2026-05-28). scenarioScores below are preliminary positioning estimates for an efficiency-tier multimodal model, pending community data.
Reliability
Incidents (30d): 0