Reddit usernames like ‘SolidGoldMagikarp’ are somehow causing the chatbot to give bizarre responses.

    • icerunner@kbin.social
      6 upvotes · 1 year ago

      I believe that if this sort of generative AI is going to be trustworthy in the future, we need some sort of external verification system so we can make our own trust judgements based on the data used to train the system. For example, if a system's training data includes 4chan, I’m going to trust it less than one trained without that source.

      I don’t think big business yet realises how important the training data is but, as soon as they do, they will want the AI companies to provide guarantees about the sanity and appropriateness of the training data.

      • FaceDeer@kbin.social
        2 upvotes · 1 year ago

        Whether 4chan is a good data source or not depends on what you intend to use the AI for. If you want to have it interact with users on a web forum or similar context then using 4chan data would likely be very useful indeed.

        Bear in mind that as long as it’s properly labelled then “bad” data is still useful as an example of bad data. A common example is with image AIs, where people can give negative prompts like “ugly” and “blurry” to tell the AI to make images that are not like that.
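        As a rough illustration of the negative-prompt idea: in Stable Diffusion-style samplers, as far as I understand them, the negative prompt stands in for the empty "unconditional" prompt in classifier-free guidance, so each step is pushed toward the prompt and away from the negative prompt. The sketch below uses toy two-element vectors in place of real noise predictions; the function name and the numbers are made up for illustration.

```python
def guided_noise(cond, negative, scale=7.5):
    """Classifier-free guidance on toy vectors.

    cond     -- noise prediction conditioned on the prompt
    negative -- noise prediction conditioned on the negative prompt
                (assumption: it replaces the usual empty-prompt prediction)
    scale    -- guidance scale; 1.0 returns `cond` unchanged
    """
    # Start from the negative prediction and step toward the prompt
    # prediction, overshooting when scale > 1.
    return [n + scale * (c - n) for c, n in zip(cond, negative)]


# Toy stand-ins for real model outputs.
prompt_pred = [1.0, 0.0]    # e.g. "a photo of a cat"
negative_pred = [0.0, 1.0]  # e.g. "ugly, blurry"

result = guided_noise(prompt_pred, negative_pred, scale=2.0)
```

        With a scale above 1 the result overshoots the prompt prediction in the direction opposite to the negative one, which is why labelled "bad" examples still carry useful signal.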

      • pgm_01@kbin.social
        2 upvotes · 1 year ago

        That’s how human intelligence works. We assign a value to the source of the information. The fact that the AIs seem to be trained without that explains why they “lie” so much. They simply reconstruct patterns without weighting any of them by how reliable the source is.

        For example, take the claim “President Biden will launch a ground invasion of Russia.” If the New York Times, BBC, and CNN are all reporting it, we would give that claim a higher likelihood of being true than if it was found on random blogs. However, if the random blogs reporting it belonged to reputable reporters or bloggers on military and international affairs, we would still assign it a higher likelihood of being correct than if it came from Bob’s Bigfoot and Alien Sightings Index.
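        A minimal sketch of that kind of source weighting, assuming each source has a made-up reliability score and that sources report independently (a big simplification). Each report multiplies the odds of the claim by r / (1 - r). The scores, the prior, and the function name are all hypothetical.

```python
# Hypothetical reliability scores in (0, 1); these numbers are invented
# purely for illustration.
SOURCE_RELIABILITY = {
    "New York Times": 0.95,
    "BBC": 0.95,
    "CNN": 0.95,
    "military affairs blog A": 0.80,
    "military affairs blog B": 0.80,
    "Bob's Bigfoot and Alien Sightings Index": 0.20,
}


def claim_probability(sources, prior=0.01, default_reliability=0.5):
    """Estimate how likely a claim is, given which sources report it.

    Assumes independent sources. A source with reliability 0.5 is
    treated as uninformative and leaves the estimate unchanged.
    """
    odds = prior / (1.0 - prior)
    for name in sources:
        r = SOURCE_RELIABILITY.get(name, default_reliability)
        odds *= r / (1.0 - r)  # likelihood ratio contributed by this report
    return odds / (1.0 + odds)


major_outlets = claim_probability(["New York Times", "BBC", "CNN"])
expert_blogs = claim_probability(
    ["military affairs blog A", "military affairs blog B"]
)
bobs_index = claim_probability(["Bob's Bigfoot and Alien Sightings Index"])
```

        With these invented numbers the three major outlets push the claim well above the expert blogs, and Bob's index actually lowers it below the prior.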

        Without the ability to check the accuracy of the source data, any generative AI could be corrupted. If you fed an art AI photos of the Statue of Liberty but kept telling it that they showed the Eiffel Tower, then when asked to draw the Eiffel Tower it would spit out the Statue of Liberty. Right now, without the ability to assess the accuracy of a response, the chat-based AIs are garbage for most of the use cases companies are deploying them in.

    • hiyaaaaa23@kbin.social
      2 upvotes · 1 year ago

      No. There’s a Computerphile video on it, but IIRC r/counting was accidentally left in the training data set for part of the training process.