The research from Purdue University, first spotted by news outlet Futurism, was presented earlier this month at the Computer-Human Interaction Conference in Hawaii and looked at 517 programming questions on Stack Overflow that were then fed to ChatGPT.

“Our analysis shows that 52% of ChatGPT answers contain incorrect information and 77% are verbose,” the new study explained. “Nonetheless, our user study participants still preferred ChatGPT answers 35% of the time due to their comprehensiveness and well-articulated language style.”

Disturbingly, programmers in the study didn’t always catch the mistakes being produced by the AI chatbot.

“However, they also overlooked the misinformation in the ChatGPT answers 39% of the time,” according to the study. “This implies the need to counter misinformation in ChatGPT answers to programming questions and raise awareness of the risks associated with seemingly correct answers.”

  • locuester@lemmy.zip
    link
    fedilink
    English
    arrow-up
    8
    ·
    edit-2
    8 months ago

    If you’re skilled enough to know it’s wrong, then you should do it yourself, if you’re not, then you shouldn’t be using it.

    Oh I strongly disagree. I’ve been building software for 30 years. I use copilot in vscode and it writes so much of the tedious code and comments for me. Really saves me a lot of time, allowing me to spend more time on the complicated bits.

    • madsen@lemmy.world
      link
      fedilink
      English
      arrow-up
      7
      ·
      edit-2
      8 months ago

      I’m closing in on 30 years too, started just around '95, and I have yet to see an LLM spit out anything useful that I would actually feel comfortable committing to a project. Usually you end up having to spend as much time—if not more—double-checking and correcting the LLM’s output as you would writing the code yourself. (Full disclosure: I haven’t tried Copilot, so it’s possible that it’s different from Bard/Gemini, ChatGPT and what-have-you, but I’d be surprised if it was that different.)

      Here’s a good example of how an LLM doesn’t really understand code in context and thus finds a “bug” that’s literally mitigated in the line before the one where it spots the potential bug: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/ (see “Exhibit B”, which links to: https://hackerone.com/reports/2298307, which is the actual HackerOne report).

      LLMs don’t understand code. It’s literally your “helpful”, non-programmer friend—on stereoids—cobbling together bits and pieces from searches on SO, Reddit, DevShed, etc. and hoping the answer will make you impressed with him. Reading the study from TFA (https://dl.acm.org/doi/pdf/10.1145/3613904.3642596, §§5.1-5.2 in particular) only cements this position further for me.

      And that’s not even touching upon the other issues (like copyright, licensing, etc.) with LLM-generated code that led to NetBSD simply forbidding it in their commit guidelines: https://mastodon.sdf.org/@netbsd/112446618914747900

      Edit: Spelling

      • locuester@lemmy.zip
        link
        fedilink
        English
        arrow-up
        4
        ·
        edit-2
        8 months ago

        I’m very familiar with what LLMs do.

        You’re misunderstanding what copilot does. It just completes a line or section of code. It doesn’t answer questions - it just continues a pattern. Sometimes quite intelligently.

        Shoot me a message on discord and I’ll do a screenshare for you. #locuester

        It has improved my quality and speed significantly. More so than any other feature since intellisense was introduced (which many back then also frowned upon).

        • madsen@lemmy.world
          link
          fedilink
          English
          arrow-up
          1
          ·
          8 months ago

          Fair enough, and thanks for the offer. I found a demo on YouTube. It does indeed look a lot more reasonable than having an LLM actually write the code.

          I’m one of the people that don’t use IntelliSense, so it’s probably not for me, but I can definitely see why people find that particular implementation useful. Thanks for catching and correcting my misunderstanding. :)