For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.

But recently I tried running smaller models, like llama3.2 3B with an 8-bit quant and qwen2.5-coder 1.5B at full 16-bit floating point, and those performed really well too on my 6 GB VRAM GPU (GTX 1060).

So now I am wondering: should I pull strong quants of big models, or light quants / raw 16-bit fp versions of smaller models?
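The way I've been thinking about it is rough napkin math: the weights take roughly parameters × bits-per-weight / 8 bytes, plus some headroom for context/KV cache. Here's a quick sketch of that math; the ~20% overhead factor is just my guess, not a measured number:

```python
# Rough weight-memory estimate: params * bits_per_weight / 8 bytes.
# The ~20% overhead for KV cache / activations is a guess, not a measurement.

def weight_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    bytes_for_weights = params_billions * 1e9 * bits_per_weight / 8
    return bytes_for_weights * overhead / 1e9

candidates = {
    "7B @ 4-bit":  (7.0, 4),
    "3B @ 8-bit":  (3.0, 8),
    "1.5B @ fp16": (1.5, 16),
    "7B @ fp16":   (7.0, 16),  # clearly too big for 6 GB of VRAM
}

for name, (params, bits) in candidates.items():
    print(f"{name:>12}: ~{weight_gb(params, bits):.1f} GB")
```

By that math a 4-bit 7B, an 8-bit 3B and an fp16 1.5B all land in roughly the same 3.5-4 GB range, which would explain why all of them feel fine on a 6 GB card.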

What are your experiences with strong quants? I saw a video by that Technovangelist guy on YouTube, and he said that sometimes even 2-bit quants can be perfectly fine.

  • Smorty [she/her] (OP)

    Ollama does indeed have the ability to split a model between VRAM and system RAM, but I always assumed it wouldn’t make sense, since it would massively slow down generation.
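    (If anyone wants to poke at that split directly: Ollama builds on llama.cpp, and llama-cpp-python exposes the layer-offload knob as n_gpu_layers. This is just a sketch; the model path and layer count are placeholders, not something I've benchmarked:)

    ```python
    # Sketch with llama-cpp-python: n_gpu_layers controls how many transformer
    # layers live in VRAM, the rest stay in system RAM.
    # The model path is a placeholder, not a real file on my machine.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./qwen2.5-coder-1.5b-instruct-fp16.gguf",  # placeholder path
        n_gpu_layers=20,  # e.g. keep ~20 layers in 6 GB VRAM, spill the rest to RAM
        n_ctx=4096,
    )

    out = llm("Write a Python function that reverses a string.", max_tokens=128)
    print(out["choices"][0]["text"])
    ```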

    I think Ollama already uses GGUF, since that is how you import a model from HF into Ollama anyway: you have to point it at the *.GGUF file.
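    (For reference, my usual HF → Ollama flow looks roughly like this; the repo id and filename are just examples and might not match the exact files in that repo:)

    ```python
    # Grab a GGUF from Hugging Face, then point an Ollama Modelfile at it.
    # repo_id / filename are examples -- check the actual file names in the repo.
    from huggingface_hub import hf_hub_download

    gguf_path = hf_hub_download(
        repo_id="Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF",   # example repo
        filename="qwen2.5-coder-1.5b-instruct-fp16.gguf",  # example file name
    )

    # Minimal Modelfile; afterwards run:  ollama create my-coder -f Modelfile
    with open("Modelfile", "w") as f:
        f.write(f"FROM {gguf_path}\n")
    ```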

    As someone with experience in GLSL shader development, I know very well that communication between the GPU and CPU is slow, and sending data from the GPU back to the CPU is a pretty heavy task, so I just assumed it wouldn’t make any sense. I will try a full 7B model (fp16) now, using my 32 GB of normal RAM, to check the speed. I’ll edit this comment once I’m done and share the results.
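    (For comparing numbers later: I'll probably measure speed with the ollama Python client, assuming the response exposes the same eval_count / eval_duration fields as the REST API, with durations in nanoseconds. The model tag is just an example:)

    ```python
    # Rough tokens/sec check via Ollama's eval_count / eval_duration (nanoseconds).
    # Assumes the Python client returns the same fields as the /api/generate endpoint.
    import ollama

    resp = ollama.generate(
        model="qwen2.5-coder:1.5b",  # example tag -- use whatever model is pulled locally
        prompt="Explain GGUF quantisation in two sentences.",
    )

    tokens = resp["eval_count"]
    seconds = resp["eval_duration"] / 1e9
    print(f"{tokens} tokens in {seconds:.1f} s -> {tokens / seconds:.1f} tok/s")
    ```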