DeepSeek-V3, an ultra-large open source AI, outperforms Llama and Qwen at launch

Chinese AI startup DeepSeek, known for challenging AI leaders with its innovative open source technologies, today released a new ultra-large model: DeepSeek-V3.

Available through Hugging Face under the company’s licensing agreement, the new model features 671B parameters but uses a mixture-of-experts architecture, activating only a subset of those parameters for any given task so it can handle it precisely and efficiently. According to benchmarks shared by DeepSeek, the offering is already topping the charts, outperforming leading open source models including Meta’s Llama 3.1-405B and nearly matching the performance of closed models from Anthropic and OpenAI.

The release represents another important step toward closing the gap between closed and open source AI. Ultimately, DeepSeek, which began as an offshoot of the Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way for artificial general intelligence (AGI), in which models can understand or learn any intellectual task that a human can.

What does DeepSeek-V3 bring?

The new ultra-large model keeps the same basic architecture as its predecessor, DeepSeek-V2, revolving around multi-head latent attention (MLA) and DeepSeekMoE. This approach ensures efficient training and inference: specialized and shared “experts” (individual, smaller neural networks within the larger model) activate 37B of the 671B parameters for each token.
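
To make the sparse-activation idea concrete, here is a minimal, purely illustrative top-k mixture-of-experts layer in PyTorch. It is a toy sketch under simplified assumptions, not DeepSeek’s actual implementation; the dimensions, router and expert design are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy top-k mixture-of-experts layer (illustrative only, not DeepSeek's code).

    Each token is routed to `top_k` of `num_experts` small feed-forward experts,
    so only a fraction of the layer's parameters is active for any given token.
    """
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # routing logits per token
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)      # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)  # keep the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Only top_k / num_experts of the expert parameters touch each token -- the same
# principle behind activating 37B of DeepSeek-V3's 671B parameters per token.
tokens = torch.randn(10, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([10, 64])
```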

While the core architecture ensures robust performance for DeepSeek-V3, the company also introduced two innovations to raise the bar further.

The first is an auxiliary-loss-free load-balancing strategy, which dynamically monitors and adjusts the load on the experts so they are utilized in a balanced way without hurting overall model performance. The second is multi-token prediction (MTP), which allows the model to predict multiple future tokens at the same time. This innovation not only increases training efficiency but also lets the model generate 60 tokens per second, three times faster than before.
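
As a rough sketch of the multi-token prediction idea (an illustration under simplified assumptions, not the exact formulation in DeepSeek’s paper), extra prediction heads can be trained on the same hidden states to predict tokens further ahead, so each position supervises more than one future token:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified illustration of multi-token prediction (MTP), not DeepSeek's exact design:
# alongside the usual next-token head, a second head predicts the token two positions
# ahead, so one forward pass supervises (and can later help draft) multiple tokens.
vocab, d_model, seq_len = 100, 32, 16
hidden = torch.randn(1, seq_len, d_model)        # stand-in for transformer hidden states
targets = torch.randint(0, vocab, (1, seq_len))  # stand-in target token ids

head_next = nn.Linear(d_model, vocab)    # predicts the token at position t + 1
head_ahead = nn.Linear(d_model, vocab)   # predicts the token at position t + 2

logits_1 = head_next(hidden[:, :-1])     # aligned with targets shifted by one
logits_2 = head_ahead(hidden[:, :-2])    # aligned with targets shifted by two

loss = (F.cross_entropy(logits_1.reshape(-1, vocab), targets[:, 1:].reshape(-1)) +
        F.cross_entropy(logits_2.reshape(-1, vocab), targets[:, 2:].reshape(-1)))
print(loss.item())
```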

“During pre-training, we trained DeepSeek-V3 on 14.8T high-quality and diverse tokens… Next, we performed a two-stage context length extension for DeepSeek-V3,” the company wrote in a technical paper detailing the new model. “In the first stage, the maximum context length is extended to 32K, and in the second stage it is further extended to 128K. We then conducted post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), on the base model of DeepSeek-V3 to align it with human preferences and further exploit its potential. In the post-training phase, we distill the reasoning capabilities from the DeepSeek-R1 series of models, paying careful attention to the balance between model accuracy and generation length.”

Notably, DeepSeek used several hardware and algorithmic optimizations during training, including an FP8 mixed-precision training framework and the DualPipe algorithm for pipeline parallelism, to cut the cost of the process.

In total, the entire DeepSeek-V3 training run is claimed to have taken approximately 2,788,000 H800 GPU hours, or roughly $5.57 million assuming a rental rate of $2 per GPU hour. This is far less than the hundreds of millions of dollars typically spent on pre-training large language models.
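
The dollar figure follows directly from the reported GPU-hour count and the assumed rental rate, as the quick check below shows:

```python
# Reproducing the article's cost estimate: reported GPU hours x assumed hourly rate.
h800_gpu_hours = 2_788_000
rate_usd_per_gpu_hour = 2.00  # the rental rate assumed above

total_cost = h800_gpu_hours * rate_usd_per_gpu_hour
print(f"${total_cost:,.0f}")  # $5,576,000 -- roughly the $5.57 million cited above
```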

It is estimated that Llama-3.1 was trained with an investment of over $500 million.

Strongest open source model currently available

Despite the cost-effective training, DeepSeek-V3 has become the strongest open source model on the market.

The company ran several benchmarks to compare the AI’s performance and found that it convincingly outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even beats closed-source GPT-4o on most benchmarks, with the exception of the English-focused SimpleQA and FRAMES, where the OpenAI model was ahead with scores of 38.2 and 80.5 (versus 24.9 and 73.3, respectively).

DeepSeek-V3’s performance stood out in particular on the Chinese-language and math-focused benchmarks, where it beat all competitors. It earned a score of 90.2 on the MATH-500 test, with Qwen’s score of 80 the next best.

The only model that managed to challenge DeepSeek-V3 was Anthropic’s Claude 3.5 Sonnet, which posted higher scores on MMLU-Pro, IF-Eval, GPQA-Diamond, SWE-bench Verified and Aider-Edit.

The work shows that open source models are closing in on their closed source counterparts, promising nearly equivalent performance across a range of tasks. The development of such systems is extremely positive for the industry, as it reduces the chance of any single major AI player dominating the game. It also gives companies multiple options to choose from and work with when orchestrating their stacks.

Currently, the code for DeepSeek-V3 is available via GitHub under an MIT license, while the model itself is provided under the company’s model license. Companies can also test the new model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use. DeepSeek is offering the API at the same price as DeepSeek-V2 until February 8. After that, input tokens will be charged at $0.27 per million ($0.07 per million for tokens with a cache hit) and output tokens at $1.10 per million.
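
As a rough illustration of what the post-promotion rates quoted above mean per request, the snippet below turns them into an approximate per-call cost. The prices are simply those reported in this article, the helper function is hypothetical, and current rates should be checked against DeepSeek’s own pricing page.

```python
# Hypothetical helper estimating request cost from the per-million-token rates
# quoted in this article; verify current prices with DeepSeek before relying on them.
PRICE_INPUT = 0.27 / 1_000_000         # USD per input token (cache miss)
PRICE_INPUT_CACHED = 0.07 / 1_000_000  # USD per input token (cache hit)
PRICE_OUTPUT = 1.10 / 1_000_000        # USD per output token

def request_cost(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Approximate USD cost of one API call under the quoted rates."""
    return (input_tokens * PRICE_INPUT
            + cached_input_tokens * PRICE_INPUT_CACHED
            + output_tokens * PRICE_OUTPUT)

# Example: a 4,000-token prompt with a 1,000-token completion.
print(f"${request_cost(4_000, 1_000):.4f}")  # about $0.0022
```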
