For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.
But recently I tried running smaller models, llama3.2 3B with an 8-bit quant and qwen2.5-1.5B-coder at full 16-bit floating point (so not quantised at all), and those performed really well too on my 6 GB VRAM GPU (GTX 1060).
So now I am wondering: should I pull strong quants (4-bit or lower) of big models, or light quants / raw 16-bit FP versions of smaller models?
What are your experiences with strong quants? I saw a video by that technovangelist guy on YouTube, and he said that sometimes even 2-bit quants can be perfectly fine.
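For a rough sense of why all three setups fit on a 6 GB card, here is a back-of-the-envelope sketch in Python. It only multiplies parameter count by bits per weight. The function name `weight_gib` is made up for illustration, the parameter counts are the nominal ones from the model names, and the bit widths are idealised: real quant formats usually store a bit more than their nominal width per weight, and the context (KV cache) needs VRAM on top of this.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: every weight is stored at exactly `bits_per_weight` bits.
    This ignores quantisation metadata, the KV cache and runtime overhead.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal sizes of the three setups mentioned above (not measured values).
setups = {
    "7B @ 4-bit": (7.0, 4),
    "3B @ 8-bit": (3.0, 8),
    "1.5B @ 16-bit": (1.5, 16),
}

for name, (params, bits) in setups.items():
    print(f"{name}: ~{weight_gib(params, bits):.1f} GiB of weights")
```

By this estimate all three land at roughly 3 GiB of weights, so on memory alone they are about equally expensive. The estimate says nothing about output quality, which is the part that still has to be tested.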
Oooh, a Windows-only feature, now I see why I haven’t heard of this yet. Well, too bad I guess. It’s time to switch to AMD for me anyway…
Oh, that part is. But the splitting tech is built into llama.cpp.