See linked posting. I’ve commented there with a link to a CLI tool in Python that allows downloading of IA collections. I’ve submitted a patch to enable specifying start and end points so that it’s easier to resume downloading a huge collection, or to allow multiple people to split up the work.

https://archive.org/details/georgeblood

https://archive.org/details/78rpm_bowling_green

F*ck the RIAA and absurdly long copyright.


EDIT: There is more than one collection of 78s on IA, so I updated the title.


The issue with these collections is that they’re absolutely HUGE. And yes, IA offers torrents for them, but as a separate torrent for every. single. album. And the torrents have all data in them – FLAC, fixed-rate MP3, VBR MP3, PDF liner notes, etc. etc… There may be some extremely hardcore data-hoarders out there who want everything, but IMHO, since these are scratchy old 78 records, FLAC is overkill just to save the audio in a listenable format. The George Blood collection, just the VBR MP3s, is looking to be about 6TB. With ALL data it might be over 40TB! I can’t afford that many hard drives :)


So, my approach at the moment is to save just the VBR MP3s (they seem to be done at up to 320kbps VBR) and the JPEG album cover. If I have a chance and any storage left afterwards, I can make a separate pass to get the album liner PDFs…


Tool used: https://github.com/jjjake/internetarchive


Patch to allow setting start and end item indices for downloads: https://github.com/jjjake/internetarchive/pull/605


Example usage to grab just the VBR MP3 and record label JPG for each (note the --start-idx and --end-idx arguments):

ia download --start-idx=4001 --end-idx=8000 -a -i --format="VBR MP3" --format="JPEG" --search collection:georgeblood

I’m going to concentrate on the George Blood collection for now… I’m starting at item 1. It would be great if others started at index 50,000, 100,000, 150,000, … and others started at the end and worked backwards in similarly-sized chunks, so that every item is sure to be grabbed by someone.
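
For anyone who’d rather drive this from Python than the CLI, here’s a rough sketch of the same idea using the internetarchive library’s search_items() and download() calls. Treat it as an illustration, not the patched tool itself: the start/end behaviour is approximated client-side with itertools.islice, and the index constants are only examples.

    #!/usr/bin/env python3
    # Sketch: download only the VBR MP3s and JPEGs for a slice of the
    # georgeblood collection. Assumes `pip install internetarchive` and
    # that credentials have been set up with `ia configure`.
    from itertools import islice
    from internetarchive import search_items, download

    START_IDX = 4001   # 1-based, inclusive
    END_IDX = 8000     # inclusive

    results = search_items('collection:georgeblood')
    for result in islice(results, START_IDX - 1, END_IDX):
        identifier = result['identifier']
        print(f'Downloading {identifier} ...')
        download(
            identifier,
            formats=['VBR MP3', 'JPEG'],  # same filters as the CLI example above
            ignore_existing=True,         # makes interrupted runs cheap to resume
            retries=5,
        )

Because the slicing happens client-side, the loop still pages through (and discards) the search results before START_IDX; that’s only metadata, so it’s relatively cheap even for large offsets.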

  • blindsight@beehaw.org · 105 points · 1 year ago

    Copyright has completely jumped the shark. There’s absolutely no balance left between rights holders’ interests and the public benefit of the public domain.

    30 years ought to be enough time for anyone to extract any reasonable value from an IP. If you haven’t made your profit in 30 years, then let the public benefit from it.

    Or at least let preservationists (data hoarders, let’s be honest) keep our cultural history alive and accessible for future generations.

    • keeb420@kbin.social · 16 points · 1 year ago

      Aside from maybe some very limited production runs, no 78s have been produced in a long time. All of that work should be in the public domain.

    • Grimpen@lemmy.ca · 15 points · 1 year ago

      Or a renewal step. If it’s not worth renewing, let it into the public domain.

      This is why It’s A Wonderful Life became a Christmas classic. Because it was in the public domain, it was used as late-night filler.

      The MPAA and RIAA miss the point. If It’s A Wonderful Life had still been under copyright, it wouldn’t have become a classic.

      It’s like the concept of abandonware. If video games had a large copyright clearing house like the MPAA or RIAA, abandonware wouldn’t work, and abandoned media would simply disappear. Heck, non-abandoned media disappears too, because profits don’t reward preservation.

      • GnuLinuxDude@lemmy.ml · 3 points · 1 year ago

        Ok, but then how will my kids’ record company benefit in perpetuity?

        In all seriousness, I think copyright law is the best example of how captured our government is by large corporate interests.

  • rich@feddit.uk · 64 points · 1 year ago

    You don’t need to censor yourself…

    Fuck the RIAA, bunch of absolute cunts.

    • Arghblarg@lemmy.ca (OP) · 31 points · 1 year ago · edited

      Yeah, you’re right, fuck ’em.

      FYI I’m currently on 4001-8000 of the ‘Great 78 Collection’. Looks like I’ll need about 6TB to get it all, yikes! (Just the VBR MP3 files, not the FLACs. Holy Hell.)

      collection:georgeblood

      https://archive.org/details/georgeblood

      If everyone took blocks of it, say 4,000 items each, we could eventually create torrents for each block or something, so it can all be reassembled if/when the IA has to take it down.

        • Arghblarg@lemmy.ca (OP) · 5 points · 1 year ago

          I wish the IA would offer a torrent of the overall collection, but it’s over 400k separate torrents, one for each album. And they contain FLACs, fixed- and VBR MP3s, PDF jacket notes, JPGs … it’s just too much for one person (I am OK with buying an 8TB drive or two, but not a dozen!).

          I’m trying to at least grab the VBR MP3s (these are old, scratchy records after all… I don’t know how much more FLAC would really preserve). Maybe if I can get most of those, I’ll do a second pass for the album cover JPGs, then the liner PDFs… depending on if/how long the collection stays up.

        • Arghblarg@lemmy.ca (OP) · 4 points · 1 year ago

          Normally I would just fetch the torrent, yes, but this particular collection is huge – over 400k separate items (each of which is its own torrent on IA). Is there a way to get an aggregate but filtered torrent with just, say, the album JPG and VBR MP3 files for each? I don’t think I can afford the entire collection, as each item also has the FLACs.

      • 0x0F · 1 point · 1 year ago

        Which block are you on now?

        • Arghblarg@lemmy.ca (OP) · 3 points · 1 year ago

          Around 5500… gonna take a while. My ISP says there’s no monthly cap, but I wonder if I really should download this much…

  • Haui@discuss.tchncs.de · 39 points · 1 year ago

    Probably stating the obvious, but “are in no threat of being deleted” is an absolute joke.

    A company holding the IP can just make it unavailable tomorrow. A big chunk of us are here because Reddit is somehow allowed to delete our posts; the law is idiotic. At least European users are allowed to get their own data back, but the collaborative work of thousands of people is still threatened by those laws.

    The concept of IP needs to be reformed.

    • Arghblarg@lemmy.ca (OP) · 20 points · 1 year ago

      Yeah. And whenever anyone says “Oh the music companies would never let these old recordings die, it’s their bread and butter!” I give them this story.

      We cannot trust our cultural heritage to any one entity.

    • As concrete examples, try to get a copy of Disney’s 1946 movie “Song of the South.” It’s been removed from circulation because of its whitewashed presentation of “happy slaves.” Similarly, six of Dr. Seuss’s books, including “And to Think That I Saw It on Mulberry Street,” were withdrawn because of racial imagery (that book had a “Chinaman” drawn in WWII stereotype style: rice hat, slanted eyes, buck teeth).

      There’s media you simply can’t get anymore.

  • maudefi@lemm.ee · 28 points · 1 year ago

    Cool tool! Please consider leaving GitHub for any of the numerous FOSS options.

    • Arghblarg@lemmy.ca (OP) · 10 points · 1 year ago

      Oh, it’s not my project – I’ve already moved my own projects off there, yeah.

      • maudefi@lemm.ee · 7 points · 1 year ago

        That’s awesome! Really encouraging seeing projects and devs migrate away from closed-source and proprietary systems and features. 💪

        • Arghblarg@lemmy.ca (OP) · 4 points · 1 year ago · edited

          SourceHut and self-hosted Gogs or Forgejo are some good candidates. Gitea is popular, but there’s apparently been some drama about it going commercial without proper buy-in from its contributors. (The code lineage is, AFAIK, Gogs → Gitea → Forgejo.)


          All the above solutions also make it super easy to mirror a GitHub project, just in case it goes away :) Doing so has saved my arse more than a few times when GitHub took a repo down for stupid reasons.


          Mandatory plug for !selfhosted@lemmy.world :)

          GitLab seems too heavyweight to me. I use Gogs myself on my home server. There are no code-review tools via PRs à la GitHub/GitLab, but I don’t need those in my web frontend.

  • Cyb3rManiak@kbin.social · 13 points · 1 year ago

    Instructions unclear. Linked posting explains nothing. Will assume this is about 78 missing dragonballs and move on.

    Jokes aside, we must preserve the 78 collection. What if an alien signal reaches Earth in the future and no one can understand it because all 78s are extinct? We don’t have Starfleet to go back in time and grab a 78 from past San Francisco to save the future!

    • Arghblarg@lemmy.ca (OP) · 2 points · 1 year ago

      Aha! Well, coincidentally, a few weeks ago I found out about another IA download tool, for getting books that are hidden behind the borrow wall.

      DeGourou

      NOTE: DeGourou is incompatible with the tool mentioned in my post here (their Python library dependencies conflict), so install it under a different account if you want to use both tools often. (Maybe someone more fluent in Python can figure out why installing one breaks the other?)

      At the moment DeGourou seems to only download individual books. It would be great if it could be made to iterate over entire collections as well…
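
      If anyone wants to prototype that, one minimal approach would be to use the internetarchive library purely for enumeration and feed the resulting list to DeGourou. This is only a sketch: I don’t know DeGourou’s actual interface, 'SOME_BOOK_COLLECTION' is a hypothetical placeholder, and per the note above the two tools apparently can’t share one Python environment, so the list-building and the downloading would live in separate installs.

          # Sketch only: dump a collection's identifiers to a text file using the
          # internetarchive library. A separate DeGourou install (kept in its own
          # environment/account because of the dependency conflict noted above)
          # could then be fed one identifier at a time from this list.
          from internetarchive import search_items

          with open('identifiers.txt', 'w') as out:
              # 'SOME_BOOK_COLLECTION' is a hypothetical placeholder collection name.
              for result in search_items('collection:SOME_BOOK_COLLECTION'):
                  out.write(result['identifier'] + '\n')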