This is the point of it:
https://github.com/ggerganov/llama.cpp/pull/11016#issuecomme...
I’m one of the maintainers of Ollama.
It’s amazing to see others build on top of open-source projects. Forks like RamaLama are exactly what open source is all about. Developers with different design philosophies can still collaborate in the open for everyone’s benefit.
Some folks on the Ollama team have contributed directly to the OCI spec, so naturally we started with tools we know best. But we made a conscious decision to deviate because AI models are massive in size - on the order of gigabytes - and we needed performance optimizations that the existing approaches didn’t offer.
We have not forked llama.cpp. Ollama is a project written in Go, so naturally we built our own server-side serving in server.go. Now we are beginning to hit performance, reliability, and model support problems. This is why we have begun transitioning to Ollama's new engine, which will utilize multiple engine designs. Ollama is now naturally responsible for portability between the different engines.
I did see the complaint about Ollama not using Jinja templates. Ollama is written in Go. I'm listening, but it seems to me that it makes perfect sense to support Go templates.
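To give a flavour of what that looks like, here is a minimal, purely illustrative sketch (not our actual template code; the Message type and the special tokens are invented for the example) of rendering a chat prompt with Go's text/template, which fills roughly the role a Jinja chat template plays elsewhere:

```go
package main

import (
	"os"
	"text/template"
)

// Message is a made-up type for this illustration only.
type Message struct {
	Role    string
	Content string
}

func main() {
	// A hypothetical chat template in Go template syntax; the <|...|> tokens
	// are invented and do not correspond to any particular model.
	const tmpl = `{{- range .Messages }}<|{{ .Role }}|>
{{ .Content }}
{{ end -}}
<|assistant|>
`
	t := template.Must(template.New("chat").Parse(tmpl))
	_ = t.Execute(os.Stdout, map[string]any{
		"Messages": []Message{
			{Role: "system", Content: "You are a helpful assistant."},
			{Role: "user", Content: "Hello!"},
		},
	})
}
```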
We are only a couple of people, and building in the open. If this sounds like vendor lock-in, I'm not sure what vendor lock-in is?
You can check the source code: https://github.com/ollama/ollama
These comments would carry more merit if they weren’t coming from the very person who closed this pull request: https://github.com/jmorganca/ollama/pull/395
Those rejected README changes only served to provide greater transparency to would-be users, and here we are a year and a half later with woefully inadequate movement on that front.
I am very glad folks are working on alternatives.
As an outsider (not an OSS maintainer, but a contributor), the decision not to merge was, imo, understandable: the maintainer had a strategy and the PR didn't fit it. They gave reasons why - really nicely - and even made a call to action for PRs placing architecture docs elsewhere. Your response was tonally disparaging, and the subsequent pile-on was counterproductive. All due respect to your experience as a maintainer; in that role, can you imagine seeing a contribution you are not interested in and declining it, or forgetting to engage, or being too busy to, imagining that it might get dropped or improved while you are busy with your own priorities?
Putting myself in your shoes, I can see why you might be annoyed at being ignored. It suggests this change is really important to you, and so my question would be: why didn't you follow the maintainer's advice and add architecture docs?
These comments seem reasonable to me. Could you clarify the Ollama maintainers' POV wrt. the recent discussion of Ollama Vulkan support at https://news.ycombinator.com/item?id=42886680 ? Many people seem to be upset that this PR seems to have gotten zero acknowledgment from the Ollama folks, even with so many users being quite interested in it for obvious reasons. (To be clear, I'm not sure that the PR is in a mergeable state as-is, so I would disagree with many of those comments. But this is just my personal POV - and with no statement on the matter from the Ollama maintainers, users will be confused.)
EDIT: I'm seeing a newly added comment in the Vulkan PR GitHub thread, at https://github.com/ollama/ollama/pull/5059#issuecomment-2628... . Quite overdue, but welcome nonetheless!
Since you are one of the maintainers of Ollama, maybe you can help me answer a related question. It is great that the software itself is open source, but hosting the models must cost a fortune. I know this is funded by VC money, yet nowhere on the Ollama website or repository is there any mention of this. Why is that?
There isn't an about section, a tiny snippet in a FAQ somewhere, nothing.
We partner with Cloudflare R2 to minimize the cost of hosting. Check out their pricing.
The website is so minimal right now because we have been focused on the GitHub repo.
I see! Now I understand why I need to create those useless `Modelfile` files...
I'm glad there is a more open source alternative to Ollama now.
I don't get it. The 'Modelfile' files are used to save and restore chat history, as well as to set custom system prompts and lots of other stuff that would require custom coding with most other local AI frameworks. Llama.cpp certainly doesn't offer anything like that out of the box. Those sorts of complaints seem pointless to me.
Have you tried a recent build of llama-server (from llama.cpp)? The web interface does remember my chats, and obviously lets me change all the settings.
Interested to know why one is “more open source” than the other.
I wish this were on the readme. Or if it already is, I wish it were significantly higher up.
Thanks for this context, I will give RamaLama a try!
This looks great!
While we're at it, is there already some kind of standardized local storage location/scheme for LLM models? If not, this project could potentially be a great place to set an example that others can follow, if they want. I've been playing with different runtimes (Ollama, vLLM) the last days, and I really would have appreciated better interoperability in terms of shared model storage, instead of everybody defaulting to downloading everything all over again.
The llama.cpp tools and examples download the models by default to an OS-specific cache folder [0]. We try to follow the HF standard (as discussed in the linked thread), though the layout of the llama.cpp cache is not the same atm. Not sure about the plans for RamaLama, but it might be something worth considering.
[0] https://github.com/ggerganov/llama.cpp/issues/7252
I think it would be the most important thing to consider, because the biggest thing the predecessor to RamaLama provided was a way to download a model (and run it).
If there were a contract for how models are laid out on disk, then downloading, managing, and tracking model weights could be handled by a different tool or subsystem.
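Something as small as an agreed on-disk layout plus a tiny shared interface would do. A purely hypothetical sketch of what such a contract might look like (none of this is a real RamaLama or Ollama API):

```go
// Package modelstore sketches a hypothetical contract for local model storage.
// It only illustrates the idea that downloading and tracking weights could
// live behind one shared interface and directory layout.
package modelstore

// ModelStore is the imagined contract a runtime would program against.
type ModelStore interface {
	// Pull fetches a model by reference (e.g. "hf://org/repo" or "ollama://name")
	// and returns the local path to its weights.
	Pull(ref string) (path string, err error)
	// List returns the references of all locally available models.
	List() ([]string, error)
	// Remove deletes a model from the shared store.
	Remove(ref string) error
}
```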
RamaLama uses an OCI container-like store (at least from the UX perspective it feels like that) for all models; it's protocol-agnostic and supports OCI artifacts, Hugging Face, Ollama, etc.
I just started to play with Ollama and RamaLama on Linux. The models are quite a few gigabytes each; it's not pretty to keep N copies.
ollama stores things under ~/.ollama/models/blobs/ named sha256-whatevershaisit
ramalama stores things under ~/.local/share/ramalama/repos/ollama/blobs/ named sha256:whatevershaisit
Note the ":" in ramalama names instead of the "-" .. that may not fly under windows.
if one crosslinks ramalama things over to ollama with that slight rename, ollama will remove them as they are not pulled via itself - no metadata on them.
i guess vllm etc everybody-else has yet-another schema and/or metadata.
BTW, Arch-Linux-wise there is currently llm-manager (https://github.com/xyproto/llm-manager), but it's made dependent on some of the Ollama packages and can't be installed just by itself (without forcing it).
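To see how much is actually duplicated between the two stores, here's a rough Go sketch (using the two blob paths mentioned above; purely illustrative, adjust for your own setup) that lists digests present in both:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// collect returns the digests found in a blob directory, normalizing the
// "sha256-..." and "sha256:..." file-name styles to a bare hex digest.
func collect(dir string) map[string]bool {
	digests := map[string]bool{}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return digests // directory missing: treat as empty
	}
	for _, e := range entries {
		name := e.Name()
		name = strings.TrimPrefix(name, "sha256-")
		name = strings.TrimPrefix(name, "sha256:")
		digests[name] = true
	}
	return digests
}

func main() {
	home, _ := os.UserHomeDir()
	ollama := collect(filepath.Join(home, ".ollama", "models", "blobs"))
	ramalama := collect(filepath.Join(home, ".local", "share", "ramalama", "repos", "ollama", "blobs"))
	for d := range ollama {
		if ramalama[d] {
			fmt.Println("duplicated blob:", d)
		}
	}
}
```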
To make AI really boring, all these projects need to be more approachable to non-tech-savvy people, e.g. some minimal GUI for searching, listing, deleting, and installing AI models. I wish this or Ollama could work more as an invisible AI model dependency manager. Right now, every app that wants STT like Whisper bundles its own copy of the model. Users waste storage and have to wait for big downloads all over again. We had similar problems with static libraries and then moved to dynamic linking.
I wish an app could declare a model as a dependency and, on install, download it only if that model is not already available locally. It could also check whether Ollama is installed and only bootstrap its own copy if nothing exists on the drive, maybe with a nice interface for the user to confirm the download and some nice onboarding.
One of my primary goals for RamaLama was to allow users to move AI models into containers so they can be stored in OCI registries. I believe there is going to be a proliferation of "private" models, and eventually "private" RAG data. (I am working heavily on RAG support in RamaLama now.)
Once you have private models and RAG data, I believe you will want to run them on edge devices and in Kubernetes clusters. Getting the AI models and data into OCI content would allow us to take advantage of content signing, trust, and mirroring, and make running AI in production easier.
It would also allow users to block access to outside "untrusted" AI models stored on the internet, and allow companies to use only "trusted" AI.
Since companies already have OCI registries, it makes sense to store your AI models and content in the same location.
Bottom line: we want to take advantage of the infrastructure created by Podman, Docker, and Kubernetes.
122 points in 2 hours, yet this is currently #38 and not on the front page.
Strange. At the same time, I see numerous items on the front page posted 2 or more hours ago with fewer points.
I'm willing to take a reputation hit on this meta post. I wonder why this got demoted so quickly from the front page despite people clearly voting on it. I wonder if it has anything to do with being backed by YC.
I sincerely hope it's just my misunderstanding of the HN algorithm, though.
Can confirm it doesn't. Many Ollama posts get pushed off the front page too, despite having hundreds of points. Over time I came to understand it. If they did this for YC companies, it would ruin the trust of HN, of YC, and, probably most important to YC companies, the reputation of the startup itself.
I assume this is what happens when many HN users just flag every AI- and LLM-related post out of sheer frustration with the reality distortion field around this particular topic.
> Running in containers eliminates the need for users to configure the host system for AI.
When is that a problem?
Based on the linked issue in eigenvalue's comment[1], this seems like a very good thing. It sounds like ollama is up to no good and this is a good drop-in replacement. What is the deeper problem being solved here though, about configuring the host? I've not run into any such issue.
1. https://news.ycombinator.com/item?id=42888129
So because you have never hit the issue, no one else has?
... orrrrr I have never hit the issue, so that's why I'm asking.
Calm down. It's Friday, time to relax, my friend. ;)
It seemed an awful lot like you were feigning confusion, with a lack of empathy for why someone would want to use a container to get repeatable environments.
https://mannerofspeaking.org/2013/03/23/rhetorical-devices-a...
I know you are just asking questions.
What benefit does Ollama (or RamaLama) offer over just plain llama.cpp or llamafile? The only thing I understand is that there is automatic downloading of models behind the scenes, but a big reason for me to want to use local models at all is that I want to know exactly which files I use and keep them sorted and backed up properly, so a tool that automatically downloads models and dumps them in some cache directory just sounds annoying.
IIRC it makes things a little easier, e.g. you don't need to specify a CLI flag to set how many layers to offload to the GPU, and it provides an API that other programs on your system can use (e.g. openwebui).
It's been a while since I used llama.cpp directly, and I don't know whether I'm correct about its current scope.
RamaLama stands on the shoulders of giants by building upon llama.cpp (and other projects like minja, podman, vllm, etc.), and we've been contributing back: Sergio Lopez, Michael Engel, and I are all contributing to llama.cpp (just three examples of RamaLama people off the top of my head).
We write the higher-level abstractions in Python 3 (with no dependencies on Python libs outside of the standard library), because the heavy lifting is already done in C++. Python is also a nice, community-friendly language; many people know how to write it.
Does this provide an Ollama-compatible API endpoint? I've got at least one other project running that only supports Ollama's API or OpenAI's hosted solution (i.e. the API endpoint isn't configurable to use llama.cpp and friends).
We need to stop chasing compatible API endpoints and work towards an AI standard. I wrote about it here https://news.ycombinator.com/item?id=42887610
I agree with what you wrote. The whole situation reminds me of the old "Standards" XKCD, to an extent. In the short term something like LiteLLM, which I just discovered doing more research on the whole topic, can at least hide some of the underlying complexity.
That being said, considering what you've done with Open Home and Home Assistant (which has run my home for years, thank you!), perhaps there is some hope of an open standard in the near future.
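In the meantime, things do converge a bit in practice: Ollama and llama.cpp's llama-server both expose an OpenAI-style /v1/chat/completions route, and I'd assume a RamaLama-served llama.cpp does too, though that's an assumption worth verifying. A minimal Go sketch of a client against such an endpoint (model name and port are placeholders):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ollama listens on :11434 by default; llama-server on :8080. Adjust to taste.
	url := "http://localhost:11434/v1/chat/completions"
	body, _ := json.Marshal(map[string]any{
		"model": "llama3.2", // placeholder: whatever model the server has pulled/loaded
		"messages": []map[string]string{
			{"role": "user", "content": "Say hello in one word."},
		},
	})
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```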
Great, finally an alternative to ollama's convenience.
It sounds like this project isn't addressing the user convenience aspect of ollama, but rather the developer convenience.
Hopefully both will be easy for users to play around with, but with RamaLama it should be easier to get your PR merged as a developer and to swap out different registries. Vendor lock-in is rarely a good thing in the world of open source.
That was kinda funny :)
So it's a replacement for Ollama?
The killer features of Ollama for me right now are the nice library of quantized models and the ability to automatically start and stop serving models in response to incoming requests and timeouts. The first seems to be solved by reusing the Ollama models, but from my cursory look I can't see whether the latter is possible.
ramalama can just pull (almost) any arbitrary model off huggingface and run it ... you're not limited to just what ollama has repackaged into their non-standard format
Ollama has the ability to pull models off Hugging Face as well:
https://huggingface.co/docs/hub/en/ollama
I am doing a short talk on this tomorrow at FOSDEM:
https://fosdem.org/2025/schedule/event/fosdem-2025-4486-rama...
I'm using openwebui; can this replace ollama in my setup?
It seems that all instructions are based on Mac/Linux? Can someone confirm this works smoothly on Windows?
The Windows solution is WSL2.
Is this useful? Can someone help me see the value add here?
It was mentioned in the other thread on the front page [1]
[1] https://news.ycombinator.com/item?id=42886680
Well, if you aren’t that great with Docker but you want to try out a variety of LLMs under Docker, how much would this help you? How much trouble is it to enable an LLM to reach outside of a container to make use of your GPU? How much does this tool help with that?
I think the post arose from:
https://news.ycombinator.com/item?id=42886768
ramalama can just pull (almost) any arbitrary model off huggingface and run it ... you're not limited to just what ollama has repackaged into their non-standard format