For about half a year I stuck with 7B models at a heavy 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.
But recently I tried running smaller models, Llama 3.2 3B at 8-bit and qwen2.5-1.5B-coder at full 16-bit floating point (unquantised), and those performed really well too on my 6 GB VRAM GPU (GTX 1060).
So now I am wondering: should I pull heavily quantised versions of big models, or lightly quantised/raw FP16 versions of smaller models?
What are your experiences with heavy quants? I saw a video by that technovangelist guy on YouTube, and he said that sometimes even 2-bit quants can be perfectly fine.
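For a rough sense of how these options compare on memory, here is a back-of-the-envelope sketch. It assumes weight size is roughly parameters × bits per weight / 8, uses the nominal parameter counts from the model names, and ignores the KV cache, runtime overhead, and the extra bits real quant formats spend on scales. The helper name `weight_gib` is made up for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: size = parameters * bits / 8. Real quantised files are
    somewhat larger, and inference needs extra memory for context.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal sizes for the three setups mentioned above (assumed, not measured).
candidates = [
    ("7B at 4-bit", 7.0, 4),
    ("3B at 8-bit", 3.0, 8),
    ("1.5B at 16-bit", 1.5, 16),
]

for name, params_billion, bits in candidates:
    print(f"{name}: ~{weight_gib(params_billion, bits):.2f} GiB of weights")
```

By this estimate the 3B at 8-bit and the 1.5B at 16-bit take about the same space, and the 7B at 4-bit is only a bit bigger, so on a 6 GB card the choice comes down to output quality rather than whether the weights fit.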
Pulled a 7B Q4 model just now and whoa, yeah, they really are a lot better. I guess the smaller models really are just for devices with less than 1 GB of RAM to spare… Like my phone, which runs Llama 3.2 3B just fine…