For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.

But recently I tried running smaller models like llama3.2 3B with an 8-bit quant and qwen2.5-1.5B-coder at full 16-bit floating point, and those performed super well too on my 6GB VRAM GPU (GTX 1060).

So now I am wondering: should I pull strong quants of big models, or light quants / raw 16-bit FP versions of smaller models?
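Here's the napkin math I've been using to guess what fits in my 6GB: parameters × bits per weight ÷ 8 for the weights, plus some extra for KV cache and context. The ~20% overhead factor below is just my guess, not something I measured:

```python
# Very rough VRAM estimate: weight bytes plus a guessed ~20% overhead
# for KV cache, activations and context. Real usage depends a lot on
# context length, so treat these numbers as ballpark only.

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for name, params, bits in [
    ("7B @ Q4", 7, 4),
    ("3B @ Q8", 3, 8),
    ("1.5B @ FP16", 1.5, 16),
]:
    print(f"{name}: ~{estimate_vram_gb(params, bits):.1f} GB")
```

By that estimate all three land around 3.5–4.5 GB, which lines up with them all running fine on the GTX 1060.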

What are your experiences with strong quants? I saw a video by that Technovangelist guy on YouTube and he said that sometimes even 2-bit quants can be perfectly fine.

  • Smorty [she/her] OP · 16 hours ago

    Pulled a 7B Q4 model just now and woah, yeah, they really are a lot better. I guess the smaller models really are just for devices with less than 1 GB of RAM to spare… like my phone, which runs Llama3.2 3B just fine…