Launched as part of an alpha test, DeepSeek Chat taps 7B and 67B-parameter DeepSeek LLMs, trained on a dataset of 2 trillion tokens in English and Chinese. According to benchmarks, both models deliver strong performance across a range of evaluations, including coding and mathematics, and match (and sometimes even outperform) Meta's famous Llama 2-70B.
The news marks the entry of another Chinese player into the AI race, following the recent releases from Qwen, 01.AI and Baidu. DeepSeek said it has open-sourced the models – both base and instruction-tuned versions – to foster further research within both academic and commercial communities.
The company, which says it was founded a few months ago to "unravel the mystery of AGI with curiosity," also permits commercial usage under certain terms.
What do we know about DeepSeek Chat and LLMs?
DeepSeek Chat is accessible via a web interface (like ChatGPT), where users can sign in and interact with the model for a range of tasks. Only the 67B version is available through this interface.
According to the company, both of its models have been built on the same auto-regressive transformer decoder architecture as Llama, but their inference approach differs. The smaller model uses multi-head attention (MHA), where each query head attends with its own key and value heads in parallel, while the larger model uses grouped-query attention (GQA), in which groups of query heads share key-value heads to reduce inference cost.
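To make the MHA/GQA distinction concrete, here is a minimal NumPy sketch of grouped-query attention. The shapes, head counts, and function name are illustrative assumptions, not DeepSeek's actual configuration; setting the number of KV heads equal to the number of query heads recovers standard multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    """Toy GQA: n_heads query heads share n_kv_heads key/value heads.
    q: (seq, n_heads * d_head); k, v: (seq, n_kv_heads * d_head).
    With n_kv_heads == n_heads this is ordinary multi-head attention."""
    seq = q.shape[0]
    d_head = q.shape[1] // n_heads
    group = n_heads // n_kv_heads          # query heads per shared KV head

    qh = q.reshape(seq, n_heads, d_head).transpose(1, 0, 2)     # (H, S, D)
    kh = k.reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)  # (Hkv, S, D)
    vh = v.reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)

    # Broadcast each KV head to the `group` query heads that share it.
    kh = np.repeat(kh, group, axis=0)      # (H, S, D)
    vh = np.repeat(vh, group, axis=0)

    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, S, S)
    out = softmax(scores) @ vh                             # (H, S, D)
    return out.transpose(1, 0, 2).reshape(seq, -1)         # (S, H * D)
```

The practical payoff is memory: during inference the KV cache only stores `n_kv_heads` key/value tensors per layer instead of `n_heads`, which is why GQA is attractive for serving a large model like the 67B variant.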
“The 7B model’s training involved a batch size of 2304 and a learning rate of 4.2e-4 and the 67B model was trained with a batch size of 4608 and a learning rate of 3.2e-4. We employ a multi-step learning rate schedule in our training process. The learning rate begins with 2000 warmup steps, and then it is stepped to 31.6% of the maximum at 1.6 trillion tokens and 10% of the maximum at 1.8 trillion tokens,” it wrote on the models’ GitHub page.
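The multi-step schedule quoted above can be sketched as a small function. The warmup shape (linear) and the function name are assumptions; the breakpoints, 2000 warmup steps, 31.6% of the maximum at 1.6 trillion tokens, and 10% at 1.8 trillion tokens, come from the quote.

```python
def deepseek_lr(step, tokens_seen, max_lr,
                warmup_steps=2000,
                first_drop=1.6e12,    # 1.6 trillion tokens
                second_drop=1.8e12):  # 1.8 trillion tokens
    """Multi-step LR schedule as described for the DeepSeek models:
    warmup over 2000 steps, then step down to 31.6% of max_lr after
    1.6T tokens and to 10% of max_lr after 1.8T tokens."""
    if step < warmup_steps:
        # Linear warmup (the warmup curve itself is an assumption).
        return max_lr * (step + 1) / warmup_steps
    if tokens_seen >= second_drop:
        return 0.10 * max_lr
    if tokens_seen >= first_drop:
        return 0.316 * max_lr
    return max_lr
```

For the 67B model, for example, `max_lr` would be 3.2e-4; 31.6% is roughly 1/sqrt(10), so the two drops compound to the final 10% of the peak rate.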
When put to the test, DeepSeek LLM 67B Base demonstrated superior general capabilities, outperforming Llama 2-70B Base in areas such as reasoning, coding, math, and Chinese comprehension. In fact, the only benchmark where Llama did slightly better was 5-shot TriviaQA (79.5 vs. 78.9).
The chat version of the model, fine-tuned on additional instruction data, also performed exceptionally well on previously unseen tests.
For instance, on HumanEval pass@1 for coding, it scored 73.78, while on GSM8K 0-shot for mathematics, it scored 84.1, sitting right behind GPT-4 and Anthropic’s Claude 2.
That said, despite the impressive performance seen in the benchmarks, it seems the DeepSeek model does suffer from some level of censorship. In a post on X, a user pointed out that the answers from the assistant were automatically redacted when the original question was about China. Instead, the model displayed a message saying the content was “withdrawn” for security reasons. It is not immediately clear if the base model also contains such filters.
LLMs of all sizes
The launch of DeepSeek LLMs marks another notable move from China in the AI space and expands the country’s offerings to cover all popular model sizes – serving a broad spectrum of end users.
More interestingly, some of these models, Yi 34B among them, have performed even better than their larger counterparts.
If a small model matches or outperforms a bigger one, as Yi 34B did against Llama-2-70B and Falcon-180B, businesses can drive significant efficiencies: they can save compute resources while targeting downstream use cases with the same level of effectiveness.
Just a week ago, Microsoft also shared its work in the same area with the release of Orca 2 models, which performed better than models five to ten times their size, including Llama-2-Chat-70B.