Researchers have found that large language models (LLMs) tend to parrot buggy code when tasked with completing flawed snippets.
That is to say, when shown a snippet of shoddy code and asked to fill in the blanks, AI models are just as likely to repeat the mistake as to fix it.
fancy autocomplete autocompletes whatever it’s given. tech bros: *surprised Pikachu*
It’s that time again… for LLMentalist.
Seriously, it should be linked to every mention of LLM anywhere.
I guess that’s one advantage of stack overflow, sometimes you need a guy to tell you the entire basis of your question is dumb and wrong.
o7
Thank you for your service, toxic ass Stack Overflow commenters who are often wrong themselves and are then corrected by other, more toxic commenters.
I don’t see why anyone would expect anything else out of a “what is the most likely way to continue this” algorithm.
This is what I was thinking: if you give the code to a person and ask them to finish it, they would do the same.
If you instead ask the LLM to give some insights about the code, it might tell you what’s wrong with it.
It doesn’t help that the AI also has no ability to go backwards or edit code, it can only append. The best it can do is write it all out again with changes made, but even then, the chance of it losing the plot while doing that is pretty high.
Yeah, that’s what the canvas feature is for with ChatGPT. And you guessed it, it’s behind a paywall. :)
To be fair, if you give me a shit code base and expect me to add features with no time to fix the existing ones, I will also just add more shit on the pile. Because obviously that’s how you want your codebase to look.
And if you do that without saying you want to refactor, I likely won’t stand up for you on the next round of layoffs. If I wanted to make the codebase worse, I’d use AI.
I’ve been in this scenario and I didn’t wait for layoffs. I left and applied my skills where shit code is not tolerated, and quality is rewarded.
But in this hypothetical, we didn’t get this shit code because management encouraged the right behavior and gave people time to make it right. They’re going to keep the yes men and fire the “unproductive” ones (and I know full well that adding to the pile is not, in the long run, productive, but what does the management overseeing this mess think?)
Fair.
That said, we have a lot of awful code at my org, yet we also have time to fix it. Most of the crap came from the “move fast and break things” period, but now we have the room to push back a bit.
There’s obviously a balance, and as a lead, I’m looking for my devs to push back and make the case for why we need the extra time. If you convince me, I’ll back you up and push for it, and we’ll probably get the go-ahead. I’m not going to approve everything though because we can’t fix everything at once. But if you ignore the problems and trudge along anyway, I’ll be disappointed.
I also do that.
Can you select all crosswalks and press ok:
Let’s train an LLM exclusively on the Windows XP source code and contemporary Microsoft apps
Wasn’t XP more reliable than the average Windows version?
It had a lot of vulnerabilities iirc
Especially before SP2.
UGH, this triggered my PTSD
AutoComplete 2, Wasted Electricity Boogaloo!
If you ask the LLM for code it will often give you buggy code, but if you run it, get an error, and then tell the AI what error you had, it will often fix the error, so that is cool.
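A minimal sketch of that paste-the-error-back-in loop, assuming a hypothetical `ask_llm()` helper that wraps whatever chat API you’re using (nothing here is a real provider’s client):

```python
import subprocess
import sys
import tempfile


def ask_llm(prompt: str) -> str:
    # Hypothetical helper -- swap in the actual client call for your provider.
    raise NotImplementedError("plug your chat API client in here")


def run_snippet(code: str) -> str | None:
    """Run the code in a subprocess; return stderr if it crashed, else None."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return result.stderr if result.returncode != 0 else None


def generate_with_retries(task: str, max_rounds: int = 3) -> str:
    code = ask_llm(f"Write Python code that does the following:\n{task}")
    for _ in range(max_rounds):
        error = run_snippet(code)
        if error is None:
            return code  # it ran without crashing, which is not the same as correct
        # Feed the traceback back and ask for a fix -- the step where the
        # "same code, same bug" loop described in the replies tends to start.
        code = ask_llm(f"This code:\n{code}\nfails with:\n{error}\nPlease fix it.")
    return code
```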
Won’t always work though…
In my experience it will write three paragraphs about the mistake, what went wrong and how to fix it. Only to then output the exact same code, or very close to it, with the same bug. And once you get into that infinite loop, it’s basically impossible to get out of it.
And once you get into that infinite loop, it’s basically impossible to get out of it.
The easiest way I found to get out of that loop is to get mad at the AI so it hangs up on you.
I ran into that problem too, but you can at least occasionally tell it to “change the code further” and it can work.
Often YOU have to try and fix its code though; it’s far from perfect…
I’ve only used an LLM (you can guess which one) once to write code. Mostly because I didn’t feel like writing down some numbers and making a little drawing for myself to solve the problem.
And because a friend insisted that it writes code just fine.
But it didn’t. It confidently didn’t. Instead, it made up something weird and kept telling me that it had now “fixed” the problem, when in reality it was trying random fixes that were related to the error message but had nothing to do with the actual core problem. It just guessed and prayed.
In the end, I solved the problem in 10 minutes with a small scribble and a pen. And most of the time was spent drawing small boxes, because my solution relied on a coordinate system I needed to visualize.
And because a friend insisted that it writes code just fine.
It’s so weird, I feel like I’m being gaslit from all over the place. People talking about “vibe coding” to generate thousands of lines of code without ever having to actually read any of it and swearing it can work fine.
I’ve repeatedly given LLMs a shot, and the experience is always very similar. If I don’t know how to do something, neither does it, but it will spit out code confidently, hallucinating function names or REST URLs as needed to fit whatever narrative would have been convenient. If I can’t spot the logic issue in some code that isn’t acting correctly, it will also fail to generate useful text describing the problem.
If the query is within reach of copy/pasting the top Stack Overflow answer, then it can generate the code. The way LLMs integrate with IDEs makes that workflow easier than pulling in Stack Overflow answers, but you need to be vigilant because it’s impossible to tell a viable result from junk; both are presented with equal confidence and certainty. It can also do a better job than traditional code analysis of spotting issues like typos in string keys, and by extension errors in less structured languages like JavaScript and Python (where the “everything is a hash/dictionary” design prevails).
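To illustrate the string-key point with a contrived example (all the names here are made up): a typo in a dict key is syntactically fine, so a linter won’t flag it, but the lookup silently misses.

```python
config = {
    "connection_timeout": 30,
    "retry_count": 5,
}

# Typo: "conection_timeout". Because .get() falls back to the default
# instead of raising a KeyError, the bug only shows up as odd runtime
# behaviour -- exactly the kind of thing static tooling misses.
timeout = config.get("conection_timeout", 60)
print(timeout)  # prints 60, not the 30 that was configured
```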
So far I can’t say I’ve seen improvements. I see how it could be considered valuable, but the babysitting it requires has cost me more annoyance than the theoretical time saved. Maybe for more boilerplate tasks, but generally speaking those are already wrapped by libraries, and when I do have to write a significant volume of code, it’s because there’s no library; and if there’s no library, it’s niche enough that the LLMs can’t generate it either.
I think the most credible time save was a report of refreshing an old codebase that used a lot of deprecated functions, changing most of the calls to the new methods without explicit human intervention. Better than tools like ‘2to3’ for Python, but still not magical either.
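For context, that kind of mechanical rewrite looks roughly like this, shown on two real standard-library deprecations (an illustrative sketch, not the codebase from that report):

```python
import datetime
import logging

# Before: deprecated spellings.
logging.warn("starting up")                  # deprecated alias of warning()
stamp = datetime.datetime.utcnow()           # deprecated since Python 3.12

# After: what a migration pass would rewrite them to.
logging.warning("starting up")
stamp = datetime.datetime.now(datetime.timezone.utc)
```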
Paper + pencil are still a programmer’s best friend. YMMV when it comes to graphics tablets, but nothing short of software that precisely emulates paper + pencil gives you that raw braindump interface; when you have to think about how to squeeze things into syntax, you have a bottleneck that chokes everything.
What a waste of time. Both the article and the researchers.
Literally by the time their research was published, it was using irrelevant models, on top of the fact that, yeah, that’s how LLMs work. That would be obvious from five minutes of using them.