• 6 Posts
  • 95 Comments
Joined 2 years ago
Cake day: July 13, 2023


  • I was writing some math code, and, not being an idiot, I’m using an open source math library for something called “QR decomposition”. It’s efficient, it supports sparse matrices (matrices where many entries are 0), etc.

    Just out of curiosity, I checked where some idiot vibecoder would end up. The AI simply plagiarizes from shit sample snippets which exist purely to teach people what QR decomposition is. The result is actually unusable, due to being numerically unstable.
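    To illustrate the instability (my own sketch, not the snippet the AI produced): classical Gram-Schmidt, the textbook-snippet approach, loses orthogonality on ill-conditioned input, while a library QR (LAPACK’s Householder, via numpy) does not:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: the kind of snippet that exists purely to
    teach what QR decomposition is. Numerically unstable in floating point."""
    A = np.asarray(A, dtype=float)
    Q = np.zeros_like(A)
    for j in range(A.shape[1]):
        v = A[:, j].copy()
        for i in range(j):
            v -= (Q[:, i] @ A[:, j]) * Q[:, i]  # project out earlier columns
        Q[:, j] = v / np.linalg.norm(v)
    return Q

# Build an ill-conditioned 50x50 matrix (condition number ~1e12)
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = U @ np.diag(np.logspace(0, -12, 50)) @ V.T

Q_naive = gram_schmidt_qr(A)
Q_lib, _ = np.linalg.qr(A)  # Householder QR via LAPACK

# How far Q is from orthogonal: ||Q^T Q - I||
err_naive = np.linalg.norm(Q_naive.T @ Q_naive - np.eye(50))
err_lib = np.linalg.norm(Q_lib.T @ Q_lib - np.eye(50))
print(err_naive, err_lib)  # naive error is many orders of magnitude larger
```

    (The matrix construction and sizes here are my own choices; any ill-conditioned input shows the same effect.)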

    Who in the fuck even needs this shit to be plagiarized, anyway?

    It can’t plagiarize a production quality implementation, because you can count those on the fingers of one hand; they’re complex as fuck, and you can’t just blend a few together to pretend you didn’t plagiarize.

    The answer is, people who are peddling the AI. They are the ones who ordered plagiarism with extra plagiarism on top. These are not coding tools, these are demos to convince the investors to buy the actual product, which is the company’s stock. There’s a little bit of tool functionality (you can ask them to refactor the code), but it’s just you misusing a demo to try to get some value out of it.

    And to that end, the demos take every opportunity to plagiarize something, and to talk about how the “AI” wrote the code from scratch based on its supposed understanding of fairly advanced math.

    And in coding, it is counterproductive to plagiarize. Many open source libraries can be used in commercial projects. You get upstream fixes for free. You don’t end up with bugs, or worse yet security exploits, that may have been fixed since the training cut-off date.

    No one in their fucking right mind would willingly want their product to contain copy-pasted snippets from stale open source libraries, passed through some sort of variable-renaming copyright laundering machine.

    Except of course the business idiots who are in charge of software at major companies, who don’t understand software. Who just failed upwards.

    They look at plagiarized lines and count them as improved productivity.




  • If it was a basement dweller with a chatbot that could be mistaken for a criminal co-conspirator, he would’ve gotten arrested and his computer seized as evidence, and then it would be a crapshoot whether he could even convince a jury that it was an accident. Especially if he was getting paid for his chatbot. Now, I’m not saying that this is right, just stating how it is for normal human beings.

    It may not be explicitly illegal for a computer to do something, but you are liable for what your shit does. You can’t just make a robot lawnmower and run over a neighbor’s kid. If you are using random numbers to steer your lawnmower… yeah.

    But because it’s OpenAI with 300 billion dollar “valuation”, absolutely nothing can happen whatsoever.




  • I appreciate the sentiment but I also hate the whole “AI is a power loom for coding”.

    The power loom for coding is called “git clone”.

    What “AI” (LLM) tools provide is just English as a programming language, with the plagiarized sum total of all open source as the standard library. English is a shit programming language. LLMs are shit at compiling it. Open source is awesome. Plagiarized open source is “meh” - you cannot apply upstream patches.




  • One thing that I couldn’t easily figure out is what the constant factor is. If the constant factor is significantly worse than Strassen’s, then it would be much slower than Strassen except for very large matrices.

    Let’s say the constant factor is k.

    N should be large enough that N^((log(49)-log(48))/log(4)) > k. Let’s say the difference in exponents is x; then

    N^x > k

    log(N)*x > log(k)

    N > exp(log(k)/x)

    N > k^(1/x)

    So let’s say x is 0.01487367169; then we’re talking [constant factor]^67 for how big the matrix has to be?

    So, a 2^67-sized matrix (2^134 entries in it) if Google’s constant factor is 2x Strassen’s.

    That doesn’t even sound right, but I double-checked: (k^67)^0.01487367169 is approximately k.
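    Sanity-checking that arithmetic in code (the constant-factor ratio k is a free parameter, since nobody has published it):

```python
import math

# Exponent gap between recursing with 49 multiplications per 4x4 block
# (Strassen applied twice) and 48 (the new scheme): both run in O(N^(log_4 m)).
x = (math.log(49) - math.log(48)) / math.log(4)

# The 48-multiplication scheme only wins once N**x > k, i.e. N > k**(1/x),
# where k is its constant-factor penalty relative to Strassen.
def break_even_size(k):
    return k ** (1 / x)

print(round(x, 8))       # 0.01487367, the exponent difference
print(round(1 / x, 1))   # 67.2, so the break-even is roughly N > k^67
print(round(math.log2(break_even_size(2)), 1))  # 67.2: N ~ 2^67 when k = 2
```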

    edit: I’m not sure what the crossover points would be if you use Google’s, then Strassen’s, then O( n^3 )

    Also, Strassen’s algorithm works on reals (and of course, on complex numbers), while the new “improvement” reduces by 1 the number of real multiplications required for a product of two 4x4 complex-valued matrices.



  • It’s not about moats, it’s about the open source community (whose code was trained on) coming out with pitchforks. It has nothing to do with moats.

    You are way overselling coding agents.

    Re-creating some open source project with a similar function is literally the only way a coding agent can pretend to be a programmer.

    I tried the latest models for code, and they are in fact capable of shitting out a thousand lines of working code at a time, which obviously can only be obtained via plagiarism, since they are also incapable of writing the most trivial code for a novel situation. And the neat thing about plagiarism is that once you start, you can keep going, since there’s more compatible code where it came from.


  • Yeah, I’m thinking this one may be special-cased; perhaps they wrote a generator of river crossing puzzles with a corresponding conversion to “is_valid_state” or some such. I should see if I can get it to write something really ridiculous into “is_valid_state”.

    Other thing is that in real life it’s like “I need to move 12 golf carts, one has a low battery, I probably can’t tow more than 3 uphill, I can ask Bob to help but he will be grumpy…”, just a tremendous amount of information (most of it irrelevant) with a tremendous number of possible moves (most of them possible to eliminate by actual thinking).
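    For reference, here’s roughly what I’d expect such a special-cased validity check to look like for the classic wolf/goat/cabbage puzzle (my own sketch; the function name and the bank-number representation are made up):

```python
# Hypothetical "is_valid_state" for the wolf/goat/cabbage river crossing.
# State: which bank (0 or 1) each of farmer, wolf, goat, cabbage is on.
def is_valid_state(farmer, wolf, goat, cabbage):
    # Wolf eats goat, and goat eats cabbage, if left together unsupervised.
    if wolf == goat and farmer != goat:
        return False
    if goat == cabbage and farmer != goat:
        return False
    return True

print(is_valid_state(0, 1, 0, 1))  # True: farmer with goat, wolf with cabbage
print(is_valid_state(0, 1, 1, 0))  # False: wolf and goat alone on far bank
```

    The point is how little code this takes, which is exactly why a puzzle generator with canned validity checks would be cheap to special-case.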



  • Pre-LLM, I had to sit through one or two annual videos to the effect of “don’t cut and paste from open source; better yet, don’t even look at GPL’d code you aren’t working on”, and had to do a click test with questions like “is it ok if you rename all the variables, yes/no”. Oh, and I had to run a scanning tool as part of the release process.

    I don’t think it’s the FSD they would worry about, but the GPL, especially v3. Nobody gives a shit if it steals some leetcode snippet, or cuts and pastes some calls to a stupid API.

    But if you have a “coding agent” just replicating GPL code wholesale, thousands and thousands of lines, it would be very obvious. And not all companies ship shitcode. Apple is a premium product, and ages-old patched CVEs from open source cropping up in there wouldn’t exactly be premium.





  • Other funny thing: it only became a fully automatic plagiarism machine when it claimed that it wrote the code (referring to itself by name which is a dead giveaway that the system prompt makes it do that).

    I wonder if code is where they will ultimately get nailed to the wall for willful copyright infringement. Code is too brittle for their standard approach: “we sort of blurred a lot of works together so it’s ours now, transformative use, fuck you, prove that you don’t just blur other people’s work together, huh?”.

    But also, for a piece of code, you can very easily test whether two pieces of code have the same “meaning” - implement a parser that converts code to an expression graph, then compare the graphs. Then again, the same trick makes it far easier to output code that is functionally identical to the code being plagiarized but looks very different.
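    A crude sketch of the idea, using Python’s ast module instead of a proper expression-graph parser (all names here are my own): once identifiers are canonicalized, variable renaming alone no longer hides the copy.

```python
import ast

class Normalize(ast.NodeTransformer):
    """Rename every variable/function/argument name to a canonical
    placeholder, so two snippets that differ only in identifier names
    produce identical trees."""
    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

def fingerprint(src):
    """Canonical dump of the normalized AST (line numbers excluded)."""
    return ast.dump(Normalize().visit(ast.parse(src)))

a = "def add(x, y):\n    return x + y"
b = "def sum2(alpha, beta):\n    return alpha + beta"
print(fingerprint(a) == fingerprint(b))  # True: same code, renamed
```

    A real expression-graph comparison would also normalize operand order, control-flow shape, etc.; this only catches the variable-renaming layer of the laundering.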

    But also I estimate approximately 0% probability that the assholes working on that wouldn’t have banter between themselves about copyright laundering.

    edit: Another thing is that since it can have no conception of its own of what “correct” behavior is for a piece of code being plagiarized, it would also plagiarize all the security exploits.

    This hasn’t been a big problem for the industry before, because only short snippets were being cut and pasted (how to make some stupid API call, etc.), but with generative AI, whole implementations are going to get plagiarized wholesale.

    Unlike any other kind of work, code comes with its own built-in, essentially irremovable “watermark” in the form of security exploits. In several thousand lines of code, there would be enough “watermark” for identification.