ChatGPT use declines as users complain about ‘dumber’ answers, and the reason might be AI’s biggest threat for the future::AI for the smart guy?

  • daisy lazarus@lemmy.world
    English
    arrow-up
    63
    ·
    1 year ago

    Nonsense. Fewer people are using it because there are viable alternatives and the broader novelty has worn off.

    I use it every day in my job and the quality of answers only drops off when prompts are poorly crafted.

    By and large, the average user doesn’t understand the fundamentals of prompt engineering.

    The suggestion that “answers are increasingly dumber” is embarrassing.

    • Zeth0s@lemmy.world
      English
      arrow-up
      59
      ·
      1 year ago

      Unfortunately I don’t agree with you. Several things have changed over time:

      • For chatgpt 3.5 they moved to a “lighter” and faster (distilled) version, gpt-3.5-turbo. Distillation came with a performance price, particularly on advanced and less common cases.
      • Newer chatgpt-4 versions have likely been “lightened” for performance reasons.
      • Context has been halved for chatgpt-4 on the webui, meaning that the model forgets more easily and can use only half as much information to create text.
      • Heavy controls have been implemented against jailbreaking and hallucinations, which results in models that are less prone to follow complex instructions (limiting prompt engineering) and that prefer giving simplified answers to risking wrong ones (overall decreasing the chance of getting high-quality answers).

      All these changes have made working with gpt less pleasant, and more difficult for very advanced and specialized cases, particularly with gpt-4, which at the beginning was particularly good.
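
      To make the context point concrete, here is a minimal sketch of how a smaller window drops older messages. It is an assumption-heavy illustration, not how ChatGPT actually works: `count_tokens` and `fit_to_context` are hypothetical helpers, one word is treated as one token, and the truncation policy (keep the newest messages that fit) is a guess.

```python
# Illustrative sketch only: real tokenizers and the actual ChatGPT truncation
# policy are not public. Here one whitespace-separated word counts as one token.


def count_tokens(text: str) -> int:
    """Hypothetical token counter: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose total token count fits max_tokens."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # everything older than this point is "forgotten"
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


# Hypothetical conversation: 6 messages of 1500 "tokens" each.
conversation = [f"msg{i} " + "word " * 1499 for i in range(6)]

kept_8k = fit_to_context(conversation, max_tokens=8000)
kept_4k = fit_to_context(conversation, max_tokens=4000)
print(len(kept_8k), len(kept_4k))
```

      With these made-up numbers, the 8k window keeps five of the six messages and the 4k window keeps only two.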

      • Gutless2615@ttrpg.network
        English
        arrow-up
        3
        ·
        1 year ago

        None of these points are true though. Context has been extended in the webui, markedly. 3.5 turbo is only that, 3.5 but faster. Gpt-4 is a marked improvement on 3.5 and I definitely haven’t seen any conclusive evidence it’s been nerfed in my daily use. Prompts have needed, and still need, to be carefully crafted for best results, but the results have been steadily improving, not degrading, over time.

        • Zeth0s@lemmy.world
          English
          arrow-up
          13
          ·
          1 year ago

          All of these points are true though. The chatgpt-4 max token count from the webui is now half of what it was when gpt-4 launched. It used to be >8k, it is now >4k. The max number of tokens for the api hasn’t changed for gpt-4, while it was greatly increased for gpt-3.5-turbo. The article, however, is talking about the chatgpt service, used via the webui.

          The gpt-3.5-turbo models are different from those used in the past. You can literally read it in the documentation: https://platform.openai.com/docs/models/gpt-3-5

          Prompt engineering has been limited, as demonstrated by the fact that most jailbreaking techniques don’t work anymore. The way to prevent jailbreaking is exactly to limit the ability of users to instruct the model.

          • Gutless2615@ttrpg.network
            English
            arrow-up
            1
            ·
            1 year ago

            Source on the halved token limit for gpt-4 in the webui? Because that has not been my experience at all. There are now 16k and 32k models for 3.5-turbo, but there’s no evidence 3.5-turbo is nerfed at all from 3.5 and it absolutely outperforms 3. Yes, you can see that they offer different snapshots of models, but that doesn’t indicate at all that there’s been any reduction in their ability. “Breaking” jailbreaking isn’t a bug, and it certainly hasn’t been demonstrated that the model is less capable.

            • Zeth0s@lemmy.world
              English
              arrow-up
              2
              ·
              edit-2
              1 year ago

              Unless they reverted the change recently (or are using some regional A/B testing), you can test for yourself the max number of tokens of gpt-4 from the webui, which is now ~4k. It used to be ~8k.

              What you are talking about are the APIs, which are different and are not what the news is discussing. They are even different models, in the sense that depending on the size of the context you get different results because of the attention mechanism. Unfortunately there is no official benchmark from openai comparing gpt-3.5-turbo models with different context sizes, but I would not trust them much anyway. They are very defensive about their data, and push out mainly marketing stuff. I would wait for a 3rd party to do the benchmark.

              “Breaking” jailbreaking is not a bug, but it limits the ability to instruct the model, i.e. prompt engineering, because it is literally meant to limit prompt engineering. That is the whole idea behind it.

              Edit: Here is a link to a guide that also gives the ~4k limit for gpt-4: https://the-decoder.com/chatgpt-guide-prompt-strategies/

      • mikkL@lemmy.world
        English
        arrow-up
        2
        ·
        1 year ago

        This was really enlightening. Do you have some articles that elaborate? ☺️

        • Zeth0s@lemmy.world
          English
          arrow-up
          13
          ·
          edit-2
          1 year ago

          Regarding 3.5 turbo, you can check the documentation: the old 3.5 models are marked as “legacy”. Regarding the max number of tokens of gpt-4, you can try it yourself. It used to be >8k, it is now >4k from the webui.

          There is a talk from the openai cio (if I recall correctly) where he describes how reinforcement learning from human feedback (rlhf) actually decreased the performance of the models when it comes to programming. I cannot find it now, but it is around on YouTube.

          The additional safeguards against jailbreaking are what OpenAI has been focusing on for the past months, with heavy use of rlhf. You can google official statements regarding the “safety” of the model. I have a bunch of standard pre-prompts that I have been using to initialize my chats since the beginning, and over time you could see the model following the instructions less strictly.

          The problem with openai is that they never released the exact number of parameters they are using, or detailed benchmarks. And the benchmarks you find online refer to the APIs, which behave differently from the chat webui (for instance you have a longer context, you set the temperature and system prompt, they are probably even different models, who knows… All is closed).

          Measuring the performance of llms is pretty tricky, as minimal changes can have big effects (see https://huggingface.co/blog/evaluating-mmlu-leaderboard), and unfortunately I haven’t found good resources to properly track chatgpt performance (from the webui) over time, across iterations.
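
          As a toy illustration of why small evaluation details matter, the sketch below scores the same answers with two different rules. The outputs, the gold labels and both scoring functions are invented for the example; they are not taken from any real benchmark or from the linked post.

```python
import re

# Hypothetical raw model outputs and gold answers for a multiple-choice task.
outputs = ["B", "The answer is C", "a", "D.", "I think B is correct"]
gold = ["B", "C", "A", "D", "A"]


def strict_match(output: str, answer: str) -> bool:
    """Count an answer only if the output is exactly the gold letter."""
    return output == answer


def lenient_match(output: str, answer: str) -> bool:
    """Count an answer if the first standalone letter A-D equals the gold letter."""
    found = re.search(r"\b([A-D])\b", output.upper())
    return found is not None and found.group(1) == answer


def accuracy(match_fn) -> float:
    """Fraction of outputs that match_fn accepts as correct."""
    hits = sum(match_fn(o, g) for o, g in zip(outputs, gold))
    return hits / len(gold)


strict_acc = accuracy(strict_match)
lenient_acc = accuracy(lenient_match)
print(f"strict={strict_acc:.1f} lenient={lenient_acc:.1f}")
```

          On this toy data the same five outputs score 0.2 under the strict rule and 0.8 under the lenient one, without the "model" changing at all.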

    • YeastForTheYeastGod@sh.itjust.works
      English
      arrow-up
      23
      ·
      1 year ago

      I was skeptical at first but I’ve seen enough evidence now. There are definitely times when it’s dumb as a brick, whether the filters just get in the way too much, or whether they’ve implemented other changes idk. I’d really love the unchained version.

      • Kelly@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        dumb as a brick

        On 23rd of March 2023 I asked a family member to give me a prompt and they asked “what day is 19th of April?”.

        It answered “The 19th of April falls on a Tuesday.”, which was true last year but completely misleading if I thought we were talking about the coming month.

        Was it wrong or just unclear? Either way it wasn’t helpful.
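
        For reference, the weekday for both years can be checked with the Python standard library. `weekday_name` is just a helper written for this example.

```python
from datetime import date

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_name(year: int, month: int, day: int) -> str:
    """Return the English weekday name for a calendar date."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return DAY_NAMES[date(year, month, day).weekday()]


for year in (2022, 2023):
    print(year, weekday_name(year, 4, 19))
```

        So the answer matched 19 April 2022, while 19 April 2023 fell on a Wednesday.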

    • ghostwolf@lemmy.fakeplastictrees.ee
      English
      arrow-up
      3
      ·
      1 year ago

      I use it every day in my job and the quality of answers only drops off when prompts are poorly crafted.

      Same. It saves me a lot of time both at work and when I’m working on my personal projects. But you need to ask proper questions to get proper answers.