LLM ASICs on USB sticks?

makeasnek@lemmy.ml · 11 months ago

kakes@sh.itjust.works · 11 months ago

Never really occurred to me before how huge a 10x savings would be in terms of parameters on consumer hardware.

Like, obviously 10x is a lot, but with the way things are going, it wouldn’t surprise me to see that kind of leap in the next year or two tbh.

Fisch@discuss.tchncs.de · 11 months ago

That would actually be insane. Right now, I still need my GPU and about 8-10 gigs of VRAM to run a 7B model tho, so idk how that’s supposed to work on a phone. Still, being able to run a model that’s as good as a 70B model but with the speed and memory usage of a 7B model would be huge.
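For a rough sense of where those VRAM numbers come from, here is a back-of-the-envelope sketch. It counts only the weights (parameters times bits per weight); the KV cache, activations, and runtime overhead come on top, so real usage is higher. The function name and the list of bit widths are just illustrative assumptions, not taken from any particular runtime.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Ignores KV cache, activations, and framework overhead, so treat the
    result as a lower bound rather than a real VRAM requirement.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Compare a 7B and a 70B model at a few common weight precisions.
    for params in (7, 70):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(params, bits)
            print(f"{params}B @ {bits}-bit: ~{gb:.1f} GB for weights")
```

By this estimate a 7B model at 8 bits per weight needs about 7 GB for the weights alone, which lines up with the 8-10 GB mentioned above once overhead is added. A 70B model at the same precision would need about ten times that.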

Smorty [she/her] · edit-2 6 months ago

I’m even more excited for running 8B models at the speed of 1B! Laughably fast ok-quality generations in JSON format would be crazy useful.

Also yeah, that 7B on mobile was not the best example. Again, probably 1B to 3B is the sweet spot for mobile (I’m running Qwen2.5 0.5B on my phone and it works real well for simple JSON).

EDIT: And imagine the context lengths we would be able to run on our GPUs at home! What a time to be alive.

Fisch@discuss.tchncs.de · 6 months ago

Being able to run 7B quality models on your phone would be wild. It would also make it possible to run those models on my server (which is just a mini pc), so I could connect it to my Home Assistant voice assistant, which would be really cool.

Smorty [she/her] · 6 months ago

Something similar to this already kinda exists on HF with the 1.58 bit quantisation, which seems to get very similar performance to the original Llama 3 8B model. That’s essentially a two bit quantisation with reasonable performance!
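The "1.58" comes from log2(3): each weight takes one of three values (-1, 0, +1). A minimal sketch of the idea, assuming a simple absmean scheme (scale by the mean absolute weight, round, clip to [-1, 1]). The function name is made up for illustration, and the actual models on HF are, as far as I know, trained or fine-tuned with this constraint rather than just rounded after the fact, so don't read this as their real pipeline.

```python
import math


def ternary_quantize(weights):
    """Map floats to {-1, 0, +1} plus one shared scale (absmean sketch).

    Hypothetical helper for illustration only, not a library API.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


if __name__ == "__main__":
    q, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
    print(q, scale)
    # Three possible values per weight carry log2(3) bits of information.
    print(f"bits per weight: {math.log2(3):.2f}")
```

Each weight is then approximated as `q[i] * scale`, so a matrix multiply mostly turns into additions and subtractions.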

Fisch@discuss.tchncs.de · 6 months ago

That’s really interesting, gonna try out how well it runs

Boomkop3@reddthat.com · 6 months ago

Slowly, is how

Chrobin@discuss.tchncs.de · 11 months ago

I have never worked on machine learning, what does the B stand for? Billion? Bytes?

Fisch@discuss.tchncs.de · 11 months ago

I think it’s how many billion parameters the model has

Chrobin@discuss.tchncs.de · 11 months ago

Thanks!