Lol. Lmao even. "DeepSeek R1 reproduced for $30: Berkeley researchers replicate DeepSeek R1 for $30—casting doubt on H100 claims and controversy"

Snot Flickerman · 4 months ago


reallykindasorta@slrpnk.net · edit-2 4 months ago

Non-techie requesting a layman’s explanation if anyone has time!

After reading a couple of “what makes Nvidia’s H100 chips so special” articles, I’m gathering that they were supposed to have significantly more computational capability than their competitors (which I’m taking to mean more computations per second). So the question with DeepSeek and similar is something like ‘how are they able to get the same results with fewer computations?’, and the speculated answer is more efficient code/instructions for the AI model, so it can reach the same conclusions with fewer computations overall, potentially reducing the need for special jacked-up CPUs to run it?

justOnePersistentKbinPlease@fedia.io · 4 months ago

From a technical POV, from having read into it a little:

DeepSeek devs worked in a very low level language called Assembly. This language is unlike relatively newer languages like C in that it provides no guardrails at all and is basically CPU instructions in extreme shorthand. An “if” statement would be something like BEQ 1000, where it goes to a specific memory location (in this case address 1000) if two CPU registers are equal.

The advantage of using it is that it is considerably faster than C. However, it also means that the code is mostly locked to that specific hardware. If you add more memory or change CPUs you have to refactor. This is one of the reasons the language was largely replaced with C and other languages.

Edit: to expound on this: “modern” languages are even slower, but more flexible in terms of hardware. This would be languages like Python, Java and C#

V0ldek@awful.systems · edit-2 4 months ago

This is a really weird comment. Assembly is not faster than C, that’s a nonsensical statement, C compiles down to assembly. LLVM’s optimizations will most likely outperform or directly match whatever hand-crafted assembly you write. Why would BEQ 1000 be “considerably faster” than if (x == y) goto L_1000;? This collapses even further if you consider any application larger than a few hundred lines of code, any sensible compiler is going to beat you on optimizations if you try to write hand-crafted assembly. Try loading up assembly code and manually performing intraprocedural optimizations, lol, there’s a reason every compiled language goes through an intermediate representation.

Saying that C# is slower than C is also nonsensical, especially now that C# has built-in PGO it’s very likely it could outperform an application written in C. C#'s JIT compiler is not somehow slower because it’s flexible in terms of hardware, if anything that’s what makes it fast. For example you can write a vectorized loop that will be JIT-compiled to the ideal fastest instruction set available on the CPU running the program, whereas in C or assembly you’d have to manually write a version for each. There’s no reason to think that manual implementation would be faster than what the JIT comes up with at runtime, though, especially with PGO.

It’s kinda like you’re saying that a V12 engine is faster than a Ferrari and that they are both faster than a spaceship because the spaceship doesn’t have wheels.

I know you’re trying to explain this to a non-technical person but what you said is so terribly misleading I cannot see educational value in it.

froztbyte@awful.systems · 4 months ago

and one doesn’t program GPUs with assembly (in the sense as it’s used with CPUs)

iltg@sh.itjust.works · 4 months ago

your statement is so extreme it gets nonsensical too.

compilers will usually produce more highly optimized asm than you would writing it yourself, but there is usually room to improve. it’s not impossible that the deepseek team obtained some performance gains by hand-writing some hot sections directly in assembly. llvm must “play it safe” because it doesn’t know your use case; you do, and can avoid all safety checks (stack canaries, overflow checks) or cleanups (eg, make memory arenas rather than realloc). you can tell LLVM not to do those, but that may then apply to the whole binary, which may not be desirable

claiming c# gets faster than C because of jit is just ridiculous: you need to compile just in time! the runtime cost of jitting + the resulting code would be faster than something plainly compiled? even if c# could obtain the same optimization levels (and it can’t: oop and .net runtime) you still pay the jit cost, which plainly compiled code doesn’t pay. also what are you on about with PGO, as if this buzzword suddenly makes everything as fast as C?? the example they give is “devirtualization” of interfaces. seems like C just doesn’t have interfaces and can just do direct calls? how would optimizing up to C level make it faster than C?

you just come off as a bit entitled and captured in MS bullshit claims

bitofhope@awful.systems · 4 months ago

GPU programs (specifically CUDA, although other vendors’ stacks are similar) combine code for the host system in a conventional programming language (typically C++), and code for the GPU written in CUDA language. Even if the C++ code for the host system can be optimized with hand written assembly, it’s not going to lead to significant gains when the performance bottleneck is on the GPU side.

The CUDA compiler translates the high level CUDA code into something called PTX, machine code for a “virtual ISA” which is then translated by the GPU driver into native machine language for the proprietary instruction set of the GPU. This seems to be somewhat comparable to a compiler intermediate representation, such as LLVM. It’s plausible that hand written PTX assembly/IR language could have been used to optimize parts of the program, but that would be somewhat unusual.

For another layer of assembly/machine languages, technically they could have reverse engineered the actual native ISA of the GPU core and written machine code for it, bypassing the compiler in the driver. This is also quite unlikely as it would practically mean writing their own driver for latest-gen Nvidia cards that vastly outperforms the official one and that would be at least as big of a news story as Yet Another Slightly Better Chatbot.

While JIT and runtimes do have an overhead compared to direct native machine code, that overhead is relatively small, approximately constant, and easily amortized if the JIT is able to optimize a tight loop. For car analogy enjoyers, imagine a racecar that takes ten seconds to start moving from the starting line in exchange for completing a lap one second faster. If the race is more than ten laps long, the tradeoff is worth it, and even more so the longer the race. Ahead of time optimizations can do the same thing at the cost of portability, but unless you’re running Gentoo, most of the C programs on your computer are likely compiled for the lowest common denominator of x86/AMD64/ARMwhatever instruction sets your OS happens to support.

If the overhead of a JIT and runtime are significant in the overall performance of the program, it’s probably a small program to begin with. No shame to small programs, but unless you’re running it very frequently, it’s unlikely to matter if the execution takes five or fifty milliseconds.
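
To put numbers on the racecar analogy, here's a minimal Python sketch of the break-even arithmetic. The 60 s and 59 s lap times are made up for illustration; only the ten-second delay at the start and the one-second-per-lap saving come from the analogy above. It says nothing about how any particular JIT actually behaves.

```python
import math


def break_even_laps(startup_cost_s, saving_per_lap_s):
    """Smallest whole number of laps at which the slow starter has caught up."""
    if saving_per_lap_s <= 0:
        raise ValueError("the optimized version must be faster per lap")
    return math.ceil(startup_cost_s / saving_per_lap_s)


def total_time(laps, lap_time_s, startup_cost_s=0):
    """Total race time: one-off startup cost plus a fixed time per lap."""
    return startup_cost_s + laps * lap_time_s


# Hypothetical lap times: the "AOT" car needs 60 s per lap and starts at once;
# the "JIT" car waits 10 s at the line and then needs 59 s per lap.
for laps in (5, 10, 11, 100):
    aot = total_time(laps, 60)
    jit = total_time(laps, 59, startup_cost_s=10)
    print(f"{laps:>3} laps: aot={aot} s, jit={jit} s")
```

With these numbers the two cars tie at ten laps and the slow starter is ahead from lap eleven onward, which is the "more than ten laps" condition in the analogy.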

froztbyte@awful.systems · 4 months ago

For another layer of assembly/machine languages, technically they could have reverse engineered the actual native ISA of the GPU core and written machine code for it, bypassing the compiler in the driver. This is also quite unlikely as it would practically mean writing their own driver for latest-gen Nvidia cards that vastly outperforms the official one

yeah, and it’d be a pretty fucking immense undertaking, as it’d be the driver and the application code and everything else (scheduling, etc etc). again, it’s not impossible, and there’s been significant headway across multiple parts of industry to make doing this kind of thing more achievable… but it’s also an extremely niche, extremely focused, hard-to-port thing, and I suspect that if they actually did do this it’d be something they’d be shouting about loudly in every possible PR outlet

a look at every other high-optimisation field, from the mechanical sympathy lot stemming from HFT etc all the way through to where that’s gotten to in modern usage of FPGAs in high-perf runtime envs also gives a good backgrounder in the kind of effort cost involved for this shit, and thus gives me some extra reasons to doubt claims kicking around (along with the fact that everyone seems to just be making shit up)

skillissuer@discuss.tchncs.de · 4 months ago

yeah, would you look at this https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseek-might-not-be-as-disruptive-as-claimed-firm-reportedly-has-50-000-nvidia-gpus-and-spent-usd1-6-billion-on-buildouts

However, industry analyst firm SemiAnalysis reports that the company behind DeepSeek incurred $1.6 billion in hardware costs and has a fleet of 50,000 Nvidia Hopper GPUs

froztbyte@awful.systems · 4 months ago

yep, a completely normal amount of non-specialist hardware that basically everyone has in their back shed. you just don’t turn it on all the time because the neighbours keep complaining about the fan noise. practically anyone could do this!

froztbyte@awful.systems · 4 months ago

for the love of god read the sidebar

justOnePersistentKbinPlease@fedia.io · 4 months ago

I have hand-crafted assembly instructions and have made them faster than the same C code.

Particular to if statements, C will do things push and pull values from the stack which takes a small but occasionally noticeable amount of cycles.

khalid_salad@awful.systems · edit-2 4 months ago

"python, what are you doing?"

idk, I’m written in C, it does things push and pull values from the stack, have you tried assembly, it’s faster

froztbyte@awful.systems · 4 months ago

til if I wanted the program to go faster I should’ve just been asking it to switch its runtime

khalid_salad@awful.systems · 4 months ago

if you use TeX as much as i do, you learn that “begging the program to behave differently” is pretty viable

self@awful.systems · 4 months ago

Particular to if statements, C will do things push and pull values from the stack which takes a small but occasionally noticeable amount of cycles.

holy fuck. llvm in shambles

bitofhope@awful.systems · edit-2 4 months ago

Meanwhile I’m reverse engineering some very much not performance sensitive video game binary patcher program some guy made a decade ago and Ghidra interprets a string splitting function as a no-op because MSVC decided calling conventions are a spook and made up a new one at link time. And it was right to do that.

EDIT: Also me looking for audio data from another old video game, patiently waiting for my program to take about half an hour on my laptop every time I run it. Then I remember to add --release to cargo run and while the compilation takes three seconds longer, the runtime shrinks to about ten seconds. I wonder if the above guy ever tried adding -O2 to his CFLAGS?

froztbyte@awful.systems · 4 months ago

for anyone reading this comment hoping for an actual eli5, the “technical POV” here is nonsense bullshit. you don’t program GPUs with assembly.

the rest of the comment is the poster filling in bad comparisons with worse details

Pup Biru@aussie.zone · 4 months ago

literally looks like LLM-generated generic slop: confidently incorrect without even a shred of thought

justOnePersistentKbinPlease@fedia.io · 4 months ago

For anyone reading this comment, that person doesn’t know anything about assembly or C.

froztbyte@awful.systems · edit-2 4 months ago

yep, clueless. can’t tell a register apart from a soprano. and allocs? the memory’s right there in the machine, it has it already! why does it need an alloc!

fuckin’ dipshit

next time you want to do a stupid driveby, pick somewhere else

o7___o7@awful.systems · 4 months ago

Sufficiently advanced skiddies are indistinguishable from malloc

David Gerard@awful.systems · 4 months ago

this user is just too smart for the average awful systems poster to deal with, and has been sent on his way to a more intellectual lemmy

self@awful.systems · 4 months ago

you know I was having a slow day yesterday cause I only just caught on: you think we program GPUs in plain fucking C? absolute dipshit no notes

froztbyte@awful.systems · 4 months ago

the wildest bit is that one could literally just … go do the thing. like you could grab the sdk and run through the tutorial and actually have babby’s first gpu program in not too long at all[0], with all the lovely little bits of knowledge that entails

but nah, easier to just make some nonsense up out of thirdhand conversations misheard out of a gamer discord talking about a news post of a journalist misunderstanding a PR statement, and then confidently spout that synthesis

[0] - I’m eliding “make the cuda toolchain run” for argument of simplicity. could just rent a box that has it, for instance

fartsparkles@lemmy.world · 4 months ago

I’m sure that non techie person understood every word of this.

blakestacey@awful.systems · 4 months ago

And I’m sure that your snide remark will both tell them what to simplify and explain how to do so.

Enjoy your free trip to the egress.

msage@programming.dev · 4 months ago

Putting Python, the slowest popular language, alongside Java and C# really irks me bad.

The real benefit of R1 is Mixture of Experts - the model is separated into smaller sections, that are trained and used independently, meaning you don’t need the entire model to be active all the time, just parts of it.

Meaning it uses fewer resources during training and general usage. For example, instead of using 670 billion parameters all the time, it can use 30 billion for a specific question, and you can get away with using 2% of the hardware used by the competition.
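
For anyone who wants to see the routing idea in code, here's a toy top-k Mixture of Experts sketch in plain Python. Everything in it is an assumption made for illustration: the four "experts" are one-line functions, the router scores are hard-coded rather than learned, and `route_top_k`, `moe_forward`, and `active_fraction` are hypothetical names. It is not DeepSeek's implementation.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Keep the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


def active_fraction(active_params, total_params):
    """Share of the parameters that take part in one forward pass."""
    return active_params / total_params


# Four stand-in experts; a real expert would be a neural network block.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, 0.3, 1.5]  # hard-coded here, learned in a real model

print(route_top_k(router_scores, k=2))            # experts 1 and 3 are selected
print(moe_forward(1.0, experts, router_scores))   # only two experts are evaluated
print(f"{active_fraction(30e9, 670e9):.1%}")      # figures quoted above
```

Taking the figures quoted above at face value, 30 billion out of 670 billion is roughly 4.5% of the parameters doing work for a given input. The sketch only models compute per input; it makes no claim about how much hardware is needed to hold the full model.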

UndercoverUlrikHD@programming.dev · 4 months ago

Putting Python, the slowest popular language, alongside Java and C# really irks me bad.

I wouldn’t call python the slowest language when the context is machine learning. It’s essentially C.

msage@programming.dev · 4 months ago

Python is still the slowest, it just utilizes libraries written in C for this specific math.

UndercoverUlrikHD@programming.dev · 4 months ago

And that maths happens to be 99% of the workload
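
A small sketch of the "slow language, fast libraries" point, assuming CPython: the same dot product written as an interpreted loop and as a chain of built-ins (`sum`, `map`, `operator.mul`) that are implemented in C. `time_both` is a hypothetical helper, and the timings it returns depend entirely on the machine, so no numbers are claimed here. Real ML stacks push far more of the work into native code than this does.

```python
import operator
import timeit


def dot_interpreted(a, b):
    """Dot product where every multiply and add goes through the interpreter loop."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_c_builtins(a, b):
    """Same result, but the loop itself runs inside CPython's C-implemented built-ins."""
    return sum(map(operator.mul, a, b))


def time_both(n=100_000, repeats=20):
    """Return (interpreted_seconds, builtin_seconds) for n-element integer vectors."""
    a = list(range(n))
    b = list(range(n))
    slow = timeit.timeit(lambda: dot_interpreted(a, b), number=repeats)
    fast = timeit.timeit(lambda: dot_c_builtins(a, b), number=repeats)
    return slow, fast


small = list(range(1000))
print(dot_interpreted(small, small), dot_c_builtins(small, small))
```

Calling `time_both()` shows how the two compare on a given machine; the language doing the orchestration stays the same either way.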

justOnePersistentKbinPlease@fedia.io · 4 months ago

I used them as they are well known modern languages that the average person might have heard about.

mountainriver@awful.systems · 4 months ago

Good question!

The guesses and rumours that you have got as replies make me lean towards “apparently no one knows”.

And because it’s slop machines (also referred to as “AI”), there is always a high probability of some sort of scam.

froztbyte@awful.systems · edit-2 4 months ago

pretty much my take as well. I haven’t seen any actual information from a primary source, just lots of hearsay and “what we think happened” analyst shit (e.g. that analyst group in the twitter screenshot has names but no citation/links)

and doubly yep on the “everyone could just be lying” bit

fallowseed@lemmy.world · edit-2 4 months ago

i read that the chinese made alterations to the cards, as well-- they dismantled them to access the chips themselves and were able to do more precise micromanagement that cuda doesn’t support, for instance… basically they took the training wheels off and used a more fine-tuned and hands-on approach that gave them some serious advantages

froztbyte@awful.systems · 4 months ago

got a source for that?

fallowseed@lemmy.world · 4 months ago

just something i read, this isn’t the original source i read, but a quick search gave me: https://www.xatakaon.com/robotics-and-ai/the-secret-to-deepseeks-extreme-efficiency-is-out-it-bypasses-nvidias-cuda-standard

froztbyte@awful.systems · edit-2 4 months ago

okay so that post’s core supposition (“using ptx instead of cuda”) is just ~~fucking wrong~~ fucking weird and I’m not going to spend time on it, but it links to this tweet which has this:

DeepSeek customized parts of the GPU’s core computational units, called SMs (Streaming Multiprocessors), to suit their needs. Out of 132 SMs, they allocated 20 exclusively for server-to-server communication tasks instead of computational tasks

this still reads more like simply tuning allocation than outright scheduler and execution control (which your post alluded to)

[x] doubt

e: original wording because cuda still uses ptx anyway, whereas this post looks like it’s saying “they steered ptx directly”. at first I read the tweet more like “asm vs python” but it doesn’t appear to be what that part meant to convey. still doubting the core hypothesis tho

froztbyte@awful.systems · edit-2 4 months ago

sidebar: I definitely wouldn’t be surprised if it comes to this overall being a case of “a shop optimised by tuning, and then it suddenly turns out the entire industry has never tried to tune a thing ever”

because why try hard when the money taps are open and flowing free? velocity over everything! this is the bayfucker way.

skillissuer@discuss.tchncs.de · 4 months ago

ah yes the ultimate american NOBUS - we can throw money at the problem until it disappears

froztbyte@awful.systems · 4 months ago

it might disappear under the gigantic heap of money but gosh darn it we can KEEP HEAPING

froztbyte@awful.systems · 4 months ago

I do sorta get the idea that this is (one of the reasons) exactly why 'ole felon is trying to get his hand on all the funding faucets

fallowseed@lemmy.world · 4 months ago

well you’re always free to doubt and do your own research-- as i mentioned- it is something i read and between believing what the US tech bros are saying when all their money and hegemony is on the line vs what the chinese have given up for free-use, i am going to go out on a limb and trust the chinese. you’re free to make your own decisions in this regard and kudos for having your own mind.

froztbyte@awful.systems · 4 months ago

mine isn’t a “USA v China: Jelly Wrestling Deluxe” comment and you’re not really understanding the point

fallowseed@lemmy.world · 4 months ago

what is your point? i thought i was giving an “explain like i’m 5” answer to a guy asking for one… you came along asking me to show sources… now this?

froztbyte@awful.systems · 4 months ago

the point is that your eli5 is unfounded rumour hearsay bullshit (and thus it’s entirely pointless to spread it), then when giving you a relatively gentle indication of that you decided to cosplay an ostrich

pro-tip: if it ain’t something you actually understand something about, probably best to avoid uncritically amplifying shit about it

manicdave@feddit.uk · edit-2 4 months ago

The article sort of demonstrates it. Instead of needing inordinate amounts of data and memory to increase its chance of one-shotting the countdown game, it only needs to know enough to prove itself wrong and roll the dice again.
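
A toy illustration of "prove itself wrong and roll the dice again", assuming a simplified countdown game: every number must be used exactly once, operators are applied left to right, and only `+`, `-` and `*` are allowed. The proposer here is a random guesser, not a language model, and the function names and the `[3, 5, 7]` to 22 puzzle are made up for the example. The point is just that checking a guess is cheap even when producing a good one is not.

```python
import random

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def propose(numbers, rng):
    """Guess: shuffle the numbers and pick a random operator between each pair."""
    order = list(numbers)
    rng.shuffle(order)
    ops = [rng.choice("+-*") for _ in order[1:]]
    return order, ops


def evaluate(order, ops):
    """Verifier: fold the guess left to right and return the value it produces."""
    value = order[0]
    for op, n in zip(ops, order[1:]):
        value = OPS[op](value, n)
    return value


def solve(numbers, target, max_tries=10_000, seed=0):
    """Keep guessing until the verifier accepts a guess or the budget runs out."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        order, ops = propose(numbers, rng)
        if evaluate(order, ops) == target:  # a wrong guess is detected immediately
            return order, ops, attempt
    return None


result = solve([3, 5, 7], target=22)
if result is not None:
    order, ops, attempt = result
    print(f"found after {attempt} tries: numbers {order}, operators {ops}")
```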