For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I'd had very bad experiences with an old qwen 0.5B model.

But recently I tried running smaller models, like llama3.2 3B with an 8-bit quant and qwen2.5-1.5B-coder at full 16-bit floating point, and those also performed really well on my 6GB VRAM GPU (GTX 1060).

So now I am wondering: Should I pull strongly quantised big models, or lightly quantised / raw fp16 versions of smaller models?

What are your experiences with strong quants? I saw a video by Technovangelist on YouTube where he said that sometimes even 2-bit quants can be perfectly fine.
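
For anyone who wants to poke at this themselves, here is a rough sketch of how I'd compare variants side by side with the ollama Python client (pip install ollama). The tag names are my guesses from the ollama library page and may not match exactly, so check them before pulling:

```python
# Pull a few quant variants and run the same prompt through each one.
# The tags below are assumptions - look them up on the ollama library page first.
import ollama

CANDIDATES = [
    "llama3.2:3b-instruct-q8_0",         # small model, light quant
    "qwen2.5-coder:1.5b-instruct-fp16",  # tiny model, no quant at all
    "qwen2.5:7b-instruct-q4_K_M",        # bigger model, strong quant
]

PROMPT = "Write a Python function that reverses the words in a sentence."

for tag in CANDIDATES:
    ollama.pull(tag)  # downloads the model if it isn't cached locally yet
    reply = ollama.chat(model=tag, messages=[{"role": "user", "content": PROMPT}])
    print(f"\n===== {tag} =====")
    print(reply["message"]["content"])
```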

  • Smorty [she/her] OP · 18 hours ago

    Yeaaa, those models are just too large for most people… You'd need about 56GB of VRAM to run an 8-bit quant, and most people don't have even a quarter of that.
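
    As a rough rule of thumb, the weights alone take about (parameters in billions) × (bits per weight) / 8 gigabytes, plus some headroom for the KV cache and runtime. A back-of-the-envelope sketch - the bits-per-weight and 20% overhead figures here are loose assumptions:

    ```python
    # Weights-only memory estimate in GB: params (billions) * bits-per-weight / 8,
    # times a loose 20% overhead for KV cache / activations / runtime.
    def est_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
        return params_b * bits_per_weight / 8 * overhead

    examples = [
        ("llama3.2 3B, q8_0",        3.2,  8.5),  # q8_0 stores ~8.5 bits per weight
        ("qwen2.5-coder 1.5B, fp16", 1.5, 16.0),
        ("7B-class model, q4_K_M",   7.6,  4.8),  # ~4.8 effective bits per weight
    ]
    for name, params_b, bits in examples:
        print(f"{name:28s} ~{est_gb(params_b, bits):4.1f} GB")
    ```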

    Also, what specifically do you mean by alignment? Are you talking about finetuning or instruction alignment?

      • Smorty [she/her] OP · 15 hours ago

        Another user, @SGforce@lemmy.ca, commented that there's a way to split it between GPU and CPU. Are you talking about that NVIDIA-only, Windows-only thing that only works with the proprietary driver? If so, I'm really not gonna use that…
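
        If you instead mean llama.cpp-style layer offloading (which is what ollama uses under the hood anyway), that works on CUDA, ROCm, Vulkan and Metal backends, so it's not tied to NVIDIA or Windows. Something like this with llama-cpp-python, where the model path and layer count are just placeholders:

        ```python
        # GPU/CPU split via llama.cpp layer offloading: n_gpu_layers transformer
        # layers live in VRAM, the rest stay in system RAM and run on the CPU.
        from llama_cpp import Llama

        llm = Llama(
            model_path="models/some-7b-q4_k_m.gguf",  # placeholder: any local GGUF
            n_gpu_layers=20,  # raise until VRAM is nearly full; -1 offloads everything
            n_ctx=4096,       # context window - the KV cache also costs VRAM
        )

        out = llm("Explain quantisation in one sentence.", max_tokens=64)
        print(out["choices"][0]["text"])
        ```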

        Have you tried some of the abliterated models? They work really nicely even for the spiciest of topics. They literally can’t refuse your instruction, so they just go ahead and do what you want. But maybe even these models are too narrow for your specific application…