So it’s been a few days. Where are we now?

I also thought given the technical inclination of a lot of our users that you all might be somewhat interested in the what, how and why of our decisions here, so I’ve included a bit of the more techy side of things in my update.

Bandwidth

So one of the big issues we had was the heavy bandwidth caused by a massive amount of downloaded content (not in terms of storage space, but multiple people downloading the same content).

In terms of bandwidth, we were seeing the top 10 single images resulting in around 600GB+ of downloads in a 24 hour period.
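
For a sense of scale, that volume can be converted to an average sustained rate. This is a rough back-of-the-envelope sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and traffic spread evenly over the day, which real traffic won't be:

```python
def average_mbps(gigabytes: float, hours: float) -> float:
    """Average throughput in megabits per second for a given transfer volume."""
    bits = gigabytes * 1e9 * 8      # decimal GB -> bits
    seconds = hours * 3600
    return bits / seconds / 1e6     # bits per second -> Mbps


rate = average_mbps(600, 24)
print(f"{rate:.1f} Mbps average")   # 55.6 Mbps average
```

So those ten images alone work out to roughly 56 Mbps around the clock, before counting any other traffic or peaks.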

This has been resolved by setting up a frontline caching server at pictrs.blahaj.zone. It sits on a small, unlimited 400Mbps connection and runs a tiny Caddy cache that reverse proxies to the actual lemmy server and caches the images locally in a file store on its 10TB drive. The nginx in front of lemmy 301 redirects internet-facing static image requests to the new caching server.
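
The two moving parts here (a redirect on the front server, and a cache that only contacts the origin on a miss) can be illustrated with a toy sketch. This is not the actual Caddy or nginx configuration: the `/pictrs/image/` prefix, the `FileCache` class and the `fake_origin` function are assumptions made up for the example, and a real cache would also honour expiry and cache headers.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Optional

CACHE_HOST = "pictrs.blahaj.zone"  # caching server named in the post
IMAGE_PREFIX = "/pictrs/image/"    # assumed URL prefix for static images


def redirect_target(path: str) -> Optional[str]:
    """Front server: return the 301 Location for image requests, else None."""
    if path.startswith(IMAGE_PREFIX):
        return f"https://{CACHE_HOST}{path}"
    return None


class FileCache:
    """Cache server: fetch from the origin on a miss, serve from disk after."""

    def __init__(self, root: Path, fetch_origin: Callable[[str], bytes]) -> None:
        self.root = root
        self.fetch_origin = fetch_origin
        self.origin_requests = 0

    def get(self, path: str) -> bytes:
        cached = self.root / hashlib.sha256(path.encode()).hexdigest()
        if cached.exists():
            return cached.read_bytes()
        self.origin_requests += 1
        body = self.fetch_origin(path)
        cached.write_bytes(body)
        return body


def fake_origin(path: str) -> bytes:
    # Hypothetical stand-in for the real lemmy/pict-rs backend.
    return b"image bytes for " + path.encode()


with tempfile.TemporaryDirectory() as tmp:
    cache = FileCache(Path(tmp), fake_origin)
    for _ in range(1000):
        cache.get("/pictrs/image/example.png")
    print(cache.origin_requests)  # 1: only the first request reached the origin
```

The point of the design is visible in that last line: a thousand downloads of the same image cost the expensive origin a single transfer, and everything else is served from the cheap connection.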

This one step alone is saving over $1,500/month.

Alternate hosting

The second step is to get away from RDS and our current fixed instance hosting to a stand-alone and self-healing infrastructure. This is what I’ve been doing over the last few days: setting up the new servers and configuring the new cluster.

We could be doing this cheaper with a lower cost hosting provider and a less resilient configuration, but I’m pretty risk averse and I’m comfortable that this will be a safe configuration.

I wouldn’t normally recommend this setup to anyone hosting a small or single user instance, as it’s a bit overkill for us at this stage, but in this case, I have decided to spin up a full production grade kubernetes cluster with a stacked etcd inside a dedicated HA control plane.

We have rented two bigger dedicated servers (64GB, 8 CPU, 2TB RAID 1, 1 Gbps bandwidth) to run our 2 databases (main/standby), redis, etc. on. The control plane is running on 3 smaller instances (2GB, 2 CPU each).
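
One likely reason for three control plane instances rather than two is how etcd reaches agreement: the cluster stays writable only while a majority (quorum) of its members is reachable. A quick sketch of that arithmetic, which is general to majority-based consensus and not specific to this cluster's tooling:

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Two members tolerate no failures at all, while three tolerate one, which is why three is the usual minimum for an HA control plane.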

All up this new infrastructure will cost around $9.20/day ($275/m).

Current infrastructure

The current AWS infrastructure is still running at full spec and (minus the excess bandwidth charges) is still costing around $50/day ($1500/m).
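
To put the old and new figures side by side, here is a small sketch of the arithmetic. It assumes a 30-day month, which roughly reproduces the rounded monthly numbers in this post; the variable names are just for illustration.

```python
DAYS_PER_MONTH = 30  # assumption: matches the post's rounded monthly figures

aws_per_day = 50.00                 # current AWS infrastructure, excluding excess bandwidth
new_per_day = 9.20                  # new cluster
bandwidth_saving_per_month = 1500   # saved by the caching server ("over $1,500/month")

aws_month = aws_per_day * DAYS_PER_MONTH
new_month = new_per_day * DAYS_PER_MONTH

hosting_ratio = aws_month / new_month
overall_ratio = (aws_month + bandwidth_saving_per_month) / new_month

print(f"AWS: ${aws_month:.0f}/m, new: ${new_month:.0f}/m")
print(f"hosting alone: {hosting_ratio:.1f}x cheaper")
print(f"including bandwidth saving: {overall_ratio:.1f}x cheaper")
```

On these numbers the move is a bit over 5x cheaper on hosting alone, and closer to 11x once the bandwidth saving is counted.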

Migration

Apart from setting up kubernetes, nothing has been migrated yet. This will be next.

The first step will be to get the databases off the AWS infrastructure, which will be the biggest bang for buck as the RDS is costing around $34/day ($1,000/m).

The second step will be the next biggest machine which is our Hajkey instance at Blåhaj zone, currently costing around $8/day ($240/m).

Then the pictrs installation, and lemmy itself.

And finally everything else will come off and we’ll shut the AWS account down.
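
The ordering above follows the daily cost of each piece. As a rough sketch of how the AWS bill should step down, assuming the $34/day and $8/day figures are both part of the $50/day total (the post doesn't break down the remainder):

```python
aws_total_per_day = 50  # current AWS spend, excluding excess bandwidth

# Migration steps with a stated daily cost, in planned order.
steps = [("RDS databases", 34), ("Hajkey instance", 8)]

remaining = aws_total_per_day
for name, cost in steps:
    remaining -= cost
    print(f"after moving {name}: ${remaining}/day left on AWS")
```

Under that assumption, the first two steps remove $42 of the $50/day, leaving about $8/day for pictrs, lemmy and everything else until the account is shut down.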

  • lapis
    English · 34 upvotes · 1 year ago

    Absolutely wild to me that moving off AWS + setting up the caching server will bring overall costs down by around a factor of ten. So glad y’all are capable of the advanced technical junk, and super thankful that you’re willing and able to host the various blahaj.zone instances!

  • moonsnotreal
    English · 30 upvotes · 1 year ago

    It’s amazing how much setting up a caching server saves

  • Norah - She/They
    English · 29 upvotes · 1 year ago (edited)

    Thank you for all your communication about how the server is being run. I always feel in good hands here on Blahaj :)

  • audiomodder
    English · 27 upvotes · 1 year ago

    How can we donate to keep the infrastructure up and running?

  • ezri
    English · 26 upvotes · 1 year ago

    Awesome! Keep up the good work

  • masukomi
    English · 20 upvotes · 1 year ago

    a) holy 💩 i had no idea this was so expensive b) please include the ko-fi link for us to help support in future updates.

    (link found in other comments)

    • Kaity AOPMA
      English · 13 upvotes · 1 year ago

      It’s not supposed to be, that’s the issue. :)

      • masukomi
        English · 7 upvotes · 1 year ago

        well yeah, but even once the costs are reduced 10x or whatever there will still be costs and it’d still be good to support its continued existence.

  • iso
    English · 17 upvotes · 1 year ago

    I’m glad you moved away from AWS, I wouldn’t even consider going for VM hosting and would’ve gone dedicated from the get go (or even self-hosting on a colo / using a good fiber connection at home, but I guess I live in a super privileged country when it comes to ISPs).

    Isn’t k8s a bit overkill tho? Front-loaded caching seems to make sense, but a single 10gbit dedi could probably resolve the issue easier and simpler, couldn’t it?

    • iso
      English · 13 upvotes · 1 year ago (edited)

      Just to add some more background on this: I used to work tightly with the Network Team in the website team of the biggest contender in its market (can’t disclose which one without people figuring out the company since the market is a bit niche).

      We had 20’000 Users a day with a lot of images served.

      The whole infrastructure consisted of 2 Firewall servers and the main DB (pSQL) on 2 self-hosted servers (think colo, it was sitting in a very remote location with 2 big diesel generators that would’ve run the whole datacenter for a week iirc), with 14 Hetzner backend mirrors that ran the whole PHP code, served images and the angular + some weird custom JavaScript. Scaling was done by simply throwing more Hetzners at it.

      Given that Lemmy runs super performance efficient in comparison to 20 year deprecated PHP code that was held together with duct tape, I feel like much less could make it work.

      • tallgirlvanessa
        English · 3 upvotes · 1 year ago

        I’m out of my depth but based on the cost savings, seems like a good situation? Kubernetes does scare me though. On the other hand it might be sensible to do this kind of overcorrection just in case the traffic takes another big spike. On the other other hand what you’re describing seems pretty dang effective.

        • iso
          English · 6 upvotes · 1 year ago

          yeah, you pretty much described the use case for k8s. It allows for rapid horizontal scaling, since you can easily throw another machine into the cluster if you need it. It mostly makes sense if you actually have multiple machines sitting idle to begin with, so this technology is mostly used in combination with managed quick rent servers (think AWS).

          Beyond that, k8s is kinda fancy for cluster management, but if you don’t have a cluster you kinda don’t need it to begin with. Using simple kernel VMs (think Proxmox) or just Docker works better there. You could still go for k8s since it’s pretty much docker with cluster functionalities, just in case you want to expand eventually (sidenote, docker allows for cluster functionalities too, but they put a price on it, while k8s is open source iirc).

          At the company where I worked, k8s was considered but ultimately not implemented since it was considered a bit overkill. We already had everything set up with a bunch of bash scripts anyway, so it didn’t matter too greatly to begin with.

          • MsPenguinette
            English · 2 upvotes · 1 year ago

            I think it’s smart to start with k8s. Better than having to switch over to it later. Since lemmy is growing and will continue to grow.

            Learning k8s is the more difficult part. If you know k8s well, it’s much easier to deploy than an ec2 deployment. Especially if you need an ASG and ELBs

  • katy ✨
    English · 15 upvotes · 1 year ago

    I have no idea what any of this means but I’m glad you were able to figure out and make it cheaper (Hopefully it’s not the emojis causing it because I love my blobcat in a box emoji :))

  • Lanthanae
    English · 12 upvotes · 1 year ago

    It’s cool to see behind the curtain on this stuff, thanks for the update!

  • LuckingFurker (Any/All)
    English · 10 upvotes · 1 year ago

    I have no idea what a lot of this means but I’m glad to know that our admins are so cool and knowledgeable about it ❤️

  • Josie
    English · 6 upvotes · 1 year ago

    appreciate the dedication and transparency! I’m a developer myself but I’m still learning the basics when it comes to clusters and scaling

  • NoStressyJessie
    English · 4 upvotes · 1 year ago (edited)

    I tried to update my profile picture for the first time since the migration and now I don’t have a profile picture at all, anyone else noticed issues? Image upload attempted from the webui settings page at lemmy.blahaj.zone.

    Throws the error “{“data”:{“msg”:“Couln’t upload file, Couldn’t save file, No space left on device (os error 28)”,“files”:null},“state”:“success”}” in a toast notification on bottom left of page.

    • Hexlynn
      English · 2 upvotes · 1 year ago

      Yeah same here, assuming it’s just a migration hiccup

        • NoStressyJessie
          English · 2 upvotes · 1 year ago (edited)

          I’m trying, maybe I messed up when I converted the file, but it shows as a broken image, when I go to the web address where the image should be hosted it says

          {“msg”:“Error in MagickWand, ImproperImageHeader `/data/pict-rs/files/jhLII3k5jz.png' @ error/png.c/ReadPNGImage/4286”}

          Edit: Same kind of error for jpg

          {“msg”:“Error in MagickWand, InsufficientImageDataInFile `/data/pict-rs/files/gnUPYJkCuT.jpg’ @ error/jpeg.c/ReadJPEGImage_/1112”}

          the first image I exported from gimp, 2nd picture was converted online. Seems unlikely I botched 2 separate conversion attempts using separate utilities

  • bdonvr@thelemmy.club
    3 upvotes · 1 year ago

    Once pict-rs updates to allow directly serving images from object storage, wouldn’t it be beneficial to migrate it to an object storage that allows unlimited egress like Cloudflare R2?

    • AdaMA
      English · 20 upvotes · 1 year ago

      Cloudflare is a non starter

      • tallgirlvanessa
        English · 3 upvotes · 1 year ago

        Can you say why? I might wanna move some of my stuff if they’re being shitty

        I mean, Cloudbleed sucked, and their constant refrain of “we’re not HOSTING bigoted websites, just caching all their stuff and handing it to whoever asks for it”, is that it?