mohsen1 4 days ago

> For extreme performance, we discover and use an out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better.

  • k_sze 4 days ago

    Practically speaking, is it possible for NVIDIA to "pull the rug" later, intentionally or otherwise, by subtly changing the behaviour of this out-of-doc instruction on new architectures?

    • ammo1662 4 days ago

      They could. That's why there is a switch to disable it.

      > If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue.

pama 4 days ago

I feel like a kid in a candy shop. Some of these tricks would take way too long to reverse engineer correctly based on the papers. I hope that the releases this week start a renaissance of the use of MoE as baseline academic models.

  • antirez 4 days ago

    From this point of view I don't understand what's happening between actual SOTA model practice and academic models. The former are at this point all MoEs, starting with GPT-4. But the open models, aside from DeepSeek V3 and Mixtral, are almost always dense models.

    • woctordho 4 days ago

      MoEs require less computation and more memory, so they're harder to set up in small labs.

    • kristianp 4 days ago

      I assumed GPT-4o wasn't MoE, being a smaller version of GPT-4, but I've never heard either way.

ofou 4 days ago

You gotta love these guys, they're really pushing the open source frontier for all of us, thanks for sharing

  • grg0 4 days ago

    Open AI™ (with a space)

    • InkCanon 4 days ago

      There's hilariously nothing open about OpenAI, and that was the plan from the start. Per the email from Ilya Sutskever, OpenAI was always going to keep all its research and code as proprietary information; "open" supposedly meant the benefits would be shared. So they basically just became a SaaS with a free tier, like most of them. Musk was right when he called them out for fishing for money as if they were a non-profit while always planning to become a company.

      • danans 4 days ago

        > Musk was right when he called them out for fishing for money as if they were a non profit, but always had plans to become a company

        I believe that he was right, because he of all people should recognize when someone is working from his own playbook of lies and misrepresentation.

        Musk is pretty obviously upset because he got outfoxed and cut out of OpenAI, not because of some supposed ideal he holds about safe use of gen AI models.

    • hackit2 4 days ago

      Kind of ironic that DeepSeek is more Open than ChatGPT

      • gostsamo 4 days ago

        They do it for their own reasons, but OpenAI are straight up liars: they are neither open, nor do they give a fuck about humanity.

        • WiSaGaN 4 days ago

          It would be hilarious if this scenario played out.

          OpenAI starts as a nonprofit, aiming to benefit all humanity. Eventually, they discover a path to AGI and engage in intense internal debates: Should they abandon their original mission and chase profit, knowing it could bring generational wealth? They ultimately decide, "To hell with humanity—let’s go for the money."

          As they pivot to prioritizing profit, DeepSeek emerges. Staying true to OpenAI’s original vision, DeepSeek open-sources everything, benefiting humanity and earning global admiration. Unintentionally, this move tanks OpenAI’s valuation. In the end, OpenAI fails to become the hero or secure the massive profits they chased. Instead, they leave behind a legacy rebranded as "ClosedAI".

          • ghfhghg 4 days ago

            Admittedly I'm a sideline observer, but it feels like the first half of your scenario is already happening (sans the AGI).

          • yieldcrv 4 days ago

            "I don't want to live in a world where someone else is making the world a better place better than we are"

            - Silicon Valley Season 2

        • chefandy 4 days ago

          OpenAyyyyI swear babe I’m gonna open it up any day. Yeah for that grated good or whatever it is you keep yappin about.

        • amelius 4 days ago

          Well, they do give us a great free tool to use, but that's where it ends, and it probably has some agenda behind it.

      • ur-whale 4 days ago

        > Kind of ironic that DeepSeek is more Open than ChatGPT

        Not ironic at all.

        You've simply been lied to by OpenAI.

        Nothing ironic about being naive.

      • azinman2 4 days ago

        Now. It’s amazing to me that everyone is like "fuck OpenAI, DeepSeek is the savior," when OpenAI’s papers and code jump-started an AI revolution just a few years ago. Let’s wait the same number of years and see what DeepSeek does.

        • gertop 4 days ago

          I thought the papers that jump started the revolution came from Google?

          • larodi 4 days ago

            Indeed. And the papers were about doing better translation of character sequences; essentially the tech emerged as a linguistics improvement for language. Then someone realised the parrot learns enough ZIP and JPEG alongside and can spit back hazy memories of it all.

            The one still super useful thing OpenAI ever released must’ve been Whisper. But they could’ve been much more open, for sure.

          • jeffreygoesto 4 days ago

            Hinton. And if you asked him, he'd probably say Schmidhuber.

    • echelon 4 days ago

      I hope you're reading this Sam Altman:

      Make Open AI open.

      Or else you'll lose to the ecosystem.

      • ta988 4 days ago

        Too late. There is no more innovation from OpenAI; all the people who were the drivers left for Anthropic and others. They had some of the biggest funding, had the lead... and yet they lost it.

      • ur-whale 4 days ago

        > I hope you're reading this Sam Altman

        I hope he's not.

        All he deserves at this point is to go down as hard as possible.

      • alpb 4 days ago

        That’s an impossible ask. Sam is the pinnacle of the capitalist ruling class; he’s a pure businessman. He has no interest in giving anything away for free unless there’s a business plan. He doesn’t care about humanity. He’ll pretend to change the world and tell you that they’re inventing AGI, Q*, Strawberry, or whatever they’re branding it, but the reality is he knows it’s all over, and unless there’s a major breakthrough this company will be in major financial trouble. Sorry for the rant, but he doesn’t deserve much respect for turning all this science into grift. He’s actually the person the old OpenAI board warned everyone about.

        • anticensor 4 days ago

          Their state-of-the-art speech to text model, Whisper, is available as open weights for free.

          • echelon 4 days ago

            Strategically, they know that needs to run at the edge, and they want users to send them requests to their API without incurring latency or bad user experience.

            That is still a fair point, though, and it should be commended. And that hasn't been their only contribution, either.

            • anticensor 4 days ago

              They could've made it a trusted-computing-only model, distributed with proprietary encryption and unlocked with an expensive licence key, if they wanted.

  • blackeyeblitzar 4 days ago

    Not really open source. For a truly open source model, check out OLMo 2 from AI2:

    https://allenai.org/blog/olmo2

    They literally share everything you need to recreate their model, including the data itself. This is what they say on that link above:

    > Because fully open science requires more than just open weights, we are excited to share a new round of OLMo updates, including weights, data, code, recipes, intermediate checkpoints, and instruction-tuned models, with the broader language modeling community!

breadwinner 4 days ago

Zuckerberg should stop claiming Meta is open sourcing AI (they are even running TV ads) when they are only releasing the weights, and not the code. Only DeepSeek is real OSS AI.

  • lithiumii 4 days ago

    Well, technically even DeepSeek is not as OSS as OLMo or Open Euro, because they didn't open the data.

    • echelon 4 days ago

      We're 2/3rds of the way there.

      We need:

      1. Open datasets for pretrains, including the tooling used to label and maintain

      2. Open model, training, and inference code. Ideally with the research paper that guides the understanding of the approach and results. (Typically we have the latter, but I've seen some cases where that's omitted.)

      3. Open pretrained foundation model weights, fine tunes, etc.

      Open AI = Data + Code + Paper + Weights

    • buyucu 4 days ago

      Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant. It is also why people train a model and generate data with it, rather than sharing the original datasets.

      These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.

      • johnla 4 days ago

        Sounds like a job for AI.

      • sdesol 4 days ago

        I understand the reasoning, and I hope there is legislation in the future that basically says "If you can't produce the data, you can't charge more than this for it." LLM producers would then have to treat their product as a commodity, priced based on the compute resources plus some overhead.

    • tway223 4 days ago

      For understandable reasons

    • chvid 4 days ago

      It is pirated material, or material that breaks various terms of service. As I understand it, it is the stuff you can see in Anna's Archive, plus a bunch of "artificial" training data from queries to OpenAI's ChatGPT and other LLMs.

  • blackeyeblitzar 4 days ago

    DeepSeek is definitely not real OSS. To be open source, you need to use a real open source license (like the ones OSI lists), and you need to share all pre- and post-training code, any code related to tuning, any evaluation code, everything related to safety/censorship/etc., and probably the full training data as well. Otherwise you can't reproduce their weights. Sharing weights is like sharing a compiled program.

    As far as I know the only true open source model that is competitive is the OLMo 2 model from AI2:

    https://allenai.org/blog/olmo2

    They even released an app recently, which is also open source, that does on-device inference:

    https://allenai.org/blog/olmoe-app

    They also have this other model called Tülu 3, which outperforms DeepSeek V3:

    https://allenai.org/blog/tulu-3-405B

    • startupsfail 4 days ago

      Yes, releasing training source code is like releasing the source code of the compiler used to compile and link the binary.

      Let's say you took GCC, modified its sources, compiled your code with it, and released your binaries along with the modified GCC source code. If you then claimed that your software is open source, well, it wouldn't be.

      Releasing training data is extremely hard, as licensing and redistribution rights for that data are difficult to tackle. And it is not clear, what exactly are the benefits in releasing it.

  • duchenne 4 days ago

    Come on... Meta has been refining PyTorch for nearly a decade. It basically contains all that you need to train LLMs, including the latest technologies. What more do you need? The part of the code that is specific to Meta's infrastructure?

  • prjkt 4 days ago

    Does PyTorch count?

    • ein0p 4 days ago

      PyTorch had the "first thing that didn't suck" advantage, and now it has such a dominant market share that it prevents better alternatives from emerging. Where it sucks (e.g. on macOS) there are popular alternatives. But it's hard to be enthusiastic about a DL framework in 2025 which does not have native high-performance quantization support, for example, or one where FSDP is crudely bolted onto the side. They say "usability above all else", but I consider such things to be major usability deficiencies which need to be addressed. Because PyTorch does not have to fight for market share, it'll be years before we see anything usable there.

      • numba888 a day ago

        I wonder how Meta trains its models. On vanilla PyTorch, or do they actually have closed tools and frameworks?

  • echelon 4 days ago

    Open Weights = Binary Blob

    It's a return to the FREEWARE / SHAREWARE model.

    This is the language we need to use for "open" weights.

helloericsf 5 days ago

- Efficient and optimized all-to-all communication
- Both intranode and internode support with NVLink and RDMA
- High-throughput kernels for training and inference prefilling
- Low-latency kernels for inference decoding
- Native FP8 dispatch support
- Flexible GPU resource control for computation-communication overlapping

X: https://x.com/deepseek_ai/status/1894211757604049133

ur-whale 4 days ago

The incentive behind the work of DeepSeek might very well be wrong (something along the lines of a state-sponsored attempt at shrinking the US first mover advantage in AI to nil) but the net result for everyone on the planet is simply fantastic.

So even in the worst case (doing this for the wrong reasons): thank you DeepSeek, you are actually doing what OpenAI lied through their teeth to the whole world about doing for years.

You rock.

  • danans 4 days ago

    > The incentive behind the work of DeepSeek might very well be wrong (something along the lines of a state-sponsored attempt at shrinking the US first mover advantage in AI to nil)

    In the space of international relations, right and wrong don't apply nearly as much. Is open sourcing this any more "wrong" than the export ban on high end Nvidia GPUs?

    The open sourcing by DeepSeek (presumably with CCP consent) just happens to be good for both the CCP and the broader open source AI community at the same time, but don't take it as some kind of principled stance by them.

    Finding ways to take away other countries' competitive advantages is a major activity of all governments, large and small.

    • jimmydoe 4 days ago

      It seems the CCP is less hate-worthy than they were two months ago. Comparing fake democracy with real authoritarianism is kinda funny.

rvz 4 days ago

Round 2 of open source releases from an actual "Open AI™" company and licensed under MIT.

Once again, DeepSeek is more open than the $157B+ one that is claiming to be "Open".

Almost no one is talking about Meta's Llama, and everyone should expect them to release Llama 4 with reasoning.

The objective is to not be squeezed in the middle of the race to zero.

yieldcrv 4 days ago

so while the US is chasing GPU receipts in Singapore just to ensure DeepSeek was using H800s only, the rest of the world can run these optimizations on the full H100s?

while we also pretend that H100s were difficult to get or access because of the US sanctions and their hubris to believe their edicts blanket the globe?

am I understanding this correctly?

wbsun 2 days ago

This feels like the 80s/90s when people hacking assembly or finding undocumented instructions to squeeze CPU for performance. Until one day either the compiler will be highly optimized enough or the GPU will be so powerful that such tricks won’t make much difference anymore, like CPUs nowadays :D

deyiao 4 days ago

Is the PTX that everyone was looking forward to included this time?

  • find0x90 4 days ago

    Yes, there's some in the csrc/kernels directory. Search for 'asm' to find uses of it.

  • swyx 4 days ago

    > the PTX that everyone was looking forward to

    explanation for the rest of us why this is so important?

    • ta988 4 days ago

      Parallel Thread Execution. Think of them as opcodes for Nvidia GPUs. They are a bit more complex than your traditional CPU opcodes (the lowest level of abstraction accessible to users), as you can specify cache parameters, memory barriers, etc.

      There are documented combinations of parameters for those instructions, but if you fuzz (search for new combinations, randomly or systematically, hoping some will work the way you want), you can find new ones with unexpected effects or with advantages (in various ways, like not polluting caches, or speed...)

      That is the case, for example, with ld.global.nc.L1::no_allocate.L2::256B, which DeepSeek uses: it provides significant acceleration while being reliable (although it doesn't work on all architectures, so they have a way to disable it).
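
      To make the mechanics concrete, here is a minimal, hypothetical sketch of how such a modifier-laden PTX load can be embedded in a CUDA kernel via inline asm. This is not DeepSeek's actual code: the helper name and the guard macro are made up for illustration; the PTX string is the one quoted above, which requires Hopper (sm_90) and relies on behavior outside the documented contract, so the documented `__ldg` intrinsic serves as the portable fallback.

      ```cuda
      // Hypothetical helper (name and DISABLE_AGGRESSIVE_PTX_INSTRS guard are
      // illustrative). On Hopper, emit the out-of-doc load with
      // L1::no_allocate / L2::256B cache hints; elsewhere, fall back to the
      // documented non-coherent read-only load __ldg.
      __device__ __forceinline__ int ld_nc_no_allocate(const int* ptr) {
      #if !defined(DISABLE_AGGRESSIVE_PTX_INSTRS) && (__CUDA_ARCH__ >= 900)
          int v;
          asm volatile("ld.global.nc.L1::no_allocate.L2::256B.b32 %0, [%1];"
                       : "=r"(v) : "l"(ptr));
          return v;
      #else
          return __ldg(ptr);
      #endif
      }

      __global__ void copy_kernel(const int* __restrict__ in, int* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = ld_nc_no_allocate(in + i);
      }
      ```

      Compiling with `-DDISABLE_AGGRESSIVE_PTX_INSTRS` selects the portable path, which is the kind of escape hatch the repo's setup.py switch provides.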

      • rfoo 4 days ago

        Gonna check what SASS it gets translated to and whether it makes any sense.

        I wonder if they had a SASS assembler for Hopper (either by reverse engineering nvdisasm, or by fuzzing instructions + nvdisasm + staring hard) and don't want to say it outright :p

        • saagarjha 4 days ago

          You'd be looking at ptxas here. FWIW, it looks like it generates LDG.E.NA.LTC256B.U8.CONSTANT on my machine.

      • saagarjha 4 days ago

        CPUs have instructions with similar semantics.

    • find0x90 4 days ago

      Much of the hype around DeepSeek is due to their extraordinarily low training and inference costs. They achieved this by optimizing their training code, apparently using PTX in addition to CUDA. PTX is kind of an intermediate assembly language for NVIDIA GPUs and people are eager to see how it was used.

Bimos 4 days ago

The PTX instructions they talked about in the tech report should be pointing to the code here?

  • zardinality 4 days ago

    "For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."

    • magicalhippo 4 days ago

      So non-coherent refers to bypassing cache coherency, i.e. not caring about what other units might have written to that address? And the L1/L2 modifiers are there to avoid L1 thrashing, keeping the value in L2 only?

      Or did I get that wrong?

      • ta988 4 days ago

        My understanding of the L2 part is that it asks for a 256B prefetch (only available on some platforms, it seems). But they use vectors of at most 4 32-bit signed ints, so I'm not sure why only 256 would work, or whether the fact that it also fetches the next 128 helps.

      • saagarjha 4 days ago

        Yeah that's about right

kennyloginz 4 days ago

Spring showers bring May flowers!

deyiao 4 days ago

Now it includes the highly anticipated PTX! Of course, I don’t understand it, but I’ve already clicked the star and even the fork button, which basically means I’ve mastered it, right? I feel incredibly powerful right now...