htrp 3 hours ago

LMSYS had a package (FlexGen) that did a lot of similar work (offloading from GPU to RAM to disk).

Not sure if it's still being maintained.
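The GPU-to-RAM-to-disk idea is basically a placement problem. Here's a minimal sketch, assuming a simple greedy rule that puts each layer in the fastest tier that still has room. This is not FlexGen's actual policy (its own placement search is more elaborate); `Tier`, `place_layers`, and the sizes below are hypothetical, just for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    capacity_gb: float
    used_gb: float = 0.0


def place_layers(layer_sizes_gb, tiers):
    """Greedily assign each layer to the fastest tier with room left.

    Hypothetical helper: `tiers` must be ordered fastest first
    (e.g. GPU, RAM, disk). Returns one tier name per layer.
    """
    placement = []
    for size in layer_sizes_gb:
        for tier in tiers:
            if tier.used_gb + size <= tier.capacity_gb:
                tier.used_gb += size
                placement.append(tier.name)
                break
        else:
            # No tier had room for this layer.
            raise MemoryError(f"no tier can hold a {size} GB layer")
    return placement


if __name__ == "__main__":
    # Made-up sizes: ten 10 GB layers, 24 GB GPU, 64 GB RAM, 1 TB disk.
    tiers = [Tier("gpu", 24), Tier("ram", 64), Tier("disk", 1000)]
    print(place_layers([10.0] * 10, tiers))
    # ['gpu', 'gpu', 'ram', 'ram', 'ram', 'ram', 'ram', 'ram', 'disk', 'disk']
```

Layers that land on "disk" are the ones that have to be read back in every time they are needed, which is why this kind of setup is slow but lets a model far larger than VRAM run at all.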

buyucu 4 hours ago

I applaud how hardcore this is: streaming the model weights from disk and keeping only the KV cache in CPU RAM.

  • Oarch 3 hours ago

    Can someone ELI5 please?

    • buyucu 3 hours ago

      DeepSeek is huge, with 671B parameters. They keep it on disk and load it piece by piece into RAM. The innovation is that they evict everything except the KV cache from RAM.
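      To make the ELI5 concrete, here's a toy sketch of the idea: each layer's weights are read from disk only when that layer runs and then dropped, while the KV cache stays in RAM and grows by one entry per token. This is not the project's code. It assumes a tiny single-head attention model, uses hypothetical helpers (`save_toy_weights`, `load_layer`, `decode_step`, `generate`), and uses JSON files as a stand-in for the real weight format.

```python
import json
import math
import os
import random
import tempfile

DIM = 4       # toy hidden size (hypothetical; real models are far larger)
N_LAYERS = 3  # toy layer count


def matvec(w, x):
    # w is a DIM x DIM list of rows; returns w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def save_toy_weights(path_dir, seed=0):
    """Write one file per layer so each layer can be loaded on its own."""
    rng = random.Random(seed)
    for layer in range(N_LAYERS):
        weights = {
            name: [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
            for name in ("wq", "wk", "wv")
        }
        with open(os.path.join(path_dir, f"layer{layer}.json"), "w") as f:
            json.dump(weights, f)


def load_layer(path_dir, layer):
    with open(os.path.join(path_dir, f"layer{layer}.json")) as f:
        return json.load(f)


def decode_step(x, kv_cache, path_dir):
    """Run one token through all layers, streaming weights from disk."""
    for layer in range(N_LAYERS):
        w = load_layer(path_dir, layer)  # only this layer's weights are in memory
        q, k, v = matvec(w["wq"], x), matvec(w["wk"], x), matvec(w["wv"], x)
        kv_cache[layer].append((k, v))   # the cache persists across tokens
        scores = [sum(qi * ki for qi, ki in zip(q, ck)) / math.sqrt(DIM)
                  for ck, _ in kv_cache[layer]]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(exps)
        attn = [sum(e / total * cv[d] for e, (_, cv) in zip(exps, kv_cache[layer]))
                for d in range(DIM)]
        x = [xi + ai for xi, ai in zip(x, attn)]  # residual connection
        del w  # drop the weights; only the KV cache stays resident
    return x


def generate(n_tokens, seed=0):
    with tempfile.TemporaryDirectory() as d:
        save_toy_weights(d, seed)
        kv_cache = [[] for _ in range(N_LAYERS)]
        x = [1.0] + [0.0] * (DIM - 1)
        for _ in range(n_tokens):
            # Toy simplification: feed the output back in instead of
            # embedding a sampled token.
            x = decode_step(x, kv_cache, d)
    return x, kv_cache


if __name__ == "__main__":
    x, cache = generate(5)
    print([len(c) for c in cache])  # [5, 5, 5]
```

      The trade-off is that every generated token re-reads all the weights from disk, so throughput is bounded by disk bandwidth, while RAM only has to hold the (much smaller) KV cache.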