DeepSeek-V3 Blog 7: Faster Inference with Shared Experts — The Key to Efficient AI Generation

Author: AIFlection

Introduction: Making AI Models Faster Without Losing Accuracy

Training large language models (LLMs) is just one side of the equation. Deploying them for real-world use presents an entirely different challenge. While DeepSeek-V3’s 671 billion parameters offer unparalleled reasoning capabilities, the true engineering marvel lies in how efficiently it serves responses in real time.

One of the biggest bottlenecks in model inference is the cost of activating and computing over a massive number of parameters. If every input token had to be processed by all of the model's parameters, response times would be unbearably slow. Instead, DeepSeek-V3 speeds up inference with a combination of shared and routed experts, ensuring that only a small subset of parameters is active for each token.

This innovative architecture allows large-scale models to operate at a speed comparable to much smaller models, making real-time AI a practical reality. Let’s dive deep into the mechanisms that make this possible.

The Challenge: Massive Models vs. Real-Time Responses

Imagine a university with thousands of professors across different subjects. If a student needs help with a calculus question, they don’t need every professor to weigh in. Instead, they should consult a small group of the most relevant experts who can provide the best answers efficiently.

Traditional dense transformers activate all parameters for every token, similar to forcing every professor to contribute to answering a single question. This results in huge computational waste and makes inference extremely slow.

To address this, DeepSeek-V3 introduces shared experts: a small set of experts that every token passes through, so their general-purpose computation can run as a single dense pass over the whole batch instead of being dispatched token by token. This significantly reduces inference overhead while maintaining quality.

How Shared Experts Improve Efficiency

In DeepSeek-V3, the feed-forward network (FFN) in most transformer layers is replaced by a Mixture-of-Experts (MoE) layer containing many specialized subnetworks (experts). However, instead of assigning every token to a completely distinct set of experts, a few experts are designated as shared experts — meaning every token passes through them in addition to its routed experts.

Mathematical Breakdown
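
As a sketch of the formulation (notation adapted from the DeepSeek-V3 technical report): let u_t be the input of the MoE layer for token t, with N_s shared experts, N_r routed experts, and gating values g_{i,t} that are non-zero only for the K_r routed experts selected for that token. The layer output is then:

```latex
h'_t = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(u_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}^{(r)}_i(u_t)
```

Every token pays for all N_s shared experts but for only K_r of the N_r routed experts, so the number of active parameters per token stays small even though the total parameter count is enormous.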

How Does This Help Inference?

The presence of shared experts brings two major advantages:

  1. Reduced Memory and Compute Overhead — Because every token uses the shared experts, their weights stay resident on the GPU and their forward pass runs as one dense operation over the whole batch rather than going through per-token routing and dispatch.
  2. Better Generalization — These experts act as a knowledge base that helps maintain overall coherence across different queries, complementing the routed experts, which handle more specialized tasks.

In simple terms, shared experts let the common part of the computation be done once per batch and reused when each token's output is assembled, rather than re-dispatched for every token, cutting down inference costs.
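
A minimal PyTorch-style sketch of this layout (illustrative only: the class names, layer sizes, and the simple softmax top-k gate are assumptions made for clarity, not DeepSeek-V3's actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    """A small feed-forward expert: up-projection, activation, down-projection."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class SharedPlusRoutedMoE(nn.Module):
    """Illustrative MoE layer: shared experts see every token,
    routed experts see only the tokens the gate assigns to them."""

    def __init__(self, d_model=512, d_hidden=1024, n_shared=1, n_routed=8, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList([FFNExpert(d_model, d_hidden) for _ in range(n_shared)])
        self.routed = nn.ModuleList([FFNExpert(d_model, d_hidden) for _ in range(n_routed)])
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, u: torch.Tensor) -> torch.Tensor:  # u: (num_tokens, d_model)
        # Shared experts: a single dense pass over every token in the batch.
        out = u + sum(expert(u) for expert in self.shared)

        # Routed experts: each token picks its top-k experts via the gate.
        scores = F.softmax(self.gate(u), dim=-1)             # (num_tokens, n_routed)
        weights, indices = scores.topk(self.top_k, dim=-1)   # (num_tokens, top_k)
        for slot in range(self.top_k):
            idx, w = indices[:, slot], weights[:, slot:slot + 1]
            for e, expert in enumerate(self.routed):
                mask = idx == e                               # tokens sent to expert e
                if mask.any():
                    out[mask] = out[mask] + w[mask] * expert(u[mask])
        return out


# Usage: 16 tokens of width 512; only the routed path is token-dependent.
tokens = torch.randn(16, 512)
layer = SharedPlusRoutedMoE()
print(layer(tokens).shape)  # torch.Size([16, 512])
```

In this sketch the shared experts run as ordinary dense matrix multiplications over every token in the batch, while only the routed loop involves data-dependent selection, which is where the per-token dispatch cost concentrates.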

Balancing Specialization and Generalization

To see why this is powerful, imagine a medical diagnosis AI.

  • Routed Experts = Specialized doctors handling rare diseases.
  • Shared Experts = General practitioners who handle common symptoms and first-line diagnoses.

When a patient (token) enters, they always start with a general assessment (shared expert), then get referred to specialists (routed experts) based on symptoms.

DeepSeek-V3 applies the same principle. Generalized computations (like basic linguistic transformations, common phrases, and structural patterns) are handled by shared experts, while token-specific nuances (such as complex reasoning and topic-specific details) are handled by routed experts.

This hybrid approach ensures that:

  1. Common processing happens efficiently.
  2. Rare and difficult queries still get specialized attention.
  3. Inference speed remains optimal, avoiding unnecessary computation.

Reducing Inference Bottlenecks Through Caching

One of the most significant optimizations enabled by shared experts is that their output for a batch can be computed in a single pass and reused when each token's final result is assembled.

How Does Caching Work in Shared Experts?

Instead of re-running the shared computation separately for every token, DeepSeek-V3:

  1. Computes the shared experts' outputs for the entire batch in one dense pass.
  2. Keeps these results in GPU memory.
  3. Reuses them when combining each token's output with the contributions of its routed experts.

This means redundant calculations are eliminated, freeing up GPU resources for routed experts, which require per-token specialization.

Mathematical Representation
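
To make the batching argument concrete (a batch-level view added here for illustration, reusing the symbols from the earlier formula): stack the inputs u_1, ..., u_T of a batch into a matrix U. The shared contribution can then be computed once for the whole matrix, while the routed contribution must still be assembled per token:

```latex
S = \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(U)
\qquad \text{(computed once for the whole batch)}

h'_t = u_t + S_t + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}^{(r)}_i(u_t)
```

Here S_t denotes row t of S. The shared term is an ordinary batched matrix multiplication that GPUs execute very efficiently; only the routed term requires the gather-and-scatter dispatch associated with expert routing.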

Challenges and Trade-offs in Shared Expert Design

While shared experts significantly improve efficiency, they also introduce a few challenges:

  1. Over-reliance on Shared Computation — If shared experts dominate too much, the model may lose specialization, leading to generic responses.
  2. Complexity in Routing — The model must balance how much of the work is carried by the always-active shared experts versus how much is delegated to the routed experts, and the routing step itself adds overhead.
  3. Cache Consistency — Any reused intermediate results must correspond exactly to the current batch; stale or mismatched activations would corrupt the outputs, so reuse has to be managed carefully.

DeepSeek-V3 addresses these by carefully tuning the balance between shared and routed experts: each MoE layer combines one shared expert with 256 routed experts, of which only 8 are activated per token, so every token gets a mix of general and specialized computation.

Conclusion: Shared Experts as a Game-Changer for Real-Time AI

DeepSeek-V3’s shared expert design is a breakthrough in efficient AI inference. Instead of activating the entire model per token, it ensures that only the necessary parts of the model run, dramatically improving response time, computational efficiency, and scalability.

By combining:

  • Shared experts for common computations
  • Routed experts for token-specific specialization
  • Caching strategies to eliminate redundant computation

DeepSeek-V3 achieves near-instantaneous AI responses while retaining high model accuracy.

This innovation is a stepping stone towards building AI models that scale effectively while remaining computationally feasible.

In the next blog, we will explore DeepSeek-V3’s memory-efficient attention mechanism (MLA), which further reduces inference costs without degrading model accuracy.

All Blog Links:
1. Blog-1: DeepSeek-V3 Blog 1: Redefining AI Efficiency — An Exclusive Series

2. Blog-2: DeepSeek-V3 Blog 2: The Smartest Use of Mixture of Experts (MoE) in AI Yet

3. Blog-3: DeepSeek-V3 Blog 3: Smarter Memory Optimization — The Key to Training a 671B Model on Just 2048 GPUs

4. Blog-4: DeepSeek-V3 Blog 4: Faster, Smarter, and More Efficient — How Multi-Token Prediction and Multi-Head Latent Attention Redefine LLM Inference

5. Blog-5: DeepSeek-V3 Blog 5: Low-Precision Training — The FP8 Revolution in Large-Scale AI

6. Blog-6: DeepSeek-V3 Blog 6: Hardware-Level Optimizations — Engineering AI for Peak Efficiency

7. Blog-7: DeepSeek-V3 Blog 7: Faster Inference with Shared Experts — The Key to Efficient AI Generation

8. Blog-8: DeepSeek-V3 Blog 8: The Final Piece — Bringing It All Together
