In the last 24 hours you’ve likely noticed that we’ve had some performance issues on Blåhaj Lemmy.
The initial issue occurred as a result of our hosting provider having technical problems. We use Hetzner, who provides hosting for approximately a third of the fediverse, so there was widespread chaos well beyond just us.
As of Lemmy 0.19.x, messages queue rather than getting silently dropped when an instance is down, so once Hetzner resolved their issues, we had a large backlog of jobs to process. Whilst we were working through the queues, we were operational but laggy, and our messages were an hour or more behind. These queues aren’t just posts and replies; they also include votes, so the volume can be large. Each item needs to be remotely verified with the sending instance as we process it, so geographical latency also plays a part.
As you can see from the graph, we are finally through the majority of the queues.
The exception is lemmy.world. Unfortunately, the Lemmy platform processes incoming messages sequentially (think of it as one sequential queue for each remote instance), which means Blåhaj Lemmy can’t start processing a second lemmy.world message until it has finished processing the first.
Due to the size of lemmy.world, they are sending us new queue items almost as fast as our instance can process them, so the queue is coming down, but slowly! In practical terms, this means that lemmy.world communities are going to be several hours behind for the next few days.
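To give a feel for why the queue shrinks so slowly, here is a rough back-of-the-envelope sketch. The function names and all of the numbers are made up for illustration; they are not measurements from our instance or from lemmy.world. It assumes each message costs one round trip to the sending instance plus some local work, and that nothing else overlaps.

```python
def max_sequential_rate(round_trip_s: float, local_work_s: float) -> float:
    """Upper bound on messages per second when each message must
    finish before the next one from the same instance can start."""
    return 1.0 / (round_trip_s + local_work_s)


def hours_to_drain(backlog: int, incoming_per_s: float, processed_per_s: float) -> float:
    """Hours until the backlog is empty, or infinity if it never shrinks."""
    net = processed_per_s - incoming_per_s
    if net <= 0:
        return float("inf")
    return backlog / net / 3600


# Hypothetical numbers, chosen only to show the shape of the problem.
rate = max_sequential_rate(round_trip_s=0.3, local_work_s=0.1)
print(f"processing ceiling: {rate:.2f} messages/s")
print(f"time to clear: {hours_to_drain(100_000, 2.0, rate):.1f} hours")
```

The point is that only the gap between the processing rate and the incoming rate drains the backlog. When the two are close, even a modest backlog takes a long time to clear, and a longer round trip lowers the ceiling directly.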
For those who are interested, there is a detailed technical breakdown of a similar problem currently being experienced by reddthat, which explores the impact of sequential processing and geographical latency.
Once we get the ability to process incoming queues in parallel, rather than in sequence, this will be fine. But until that happens, yeah, this will become more of a problem.
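For the curious, the difference between the two approaches can be shown with a toy simulation. This is not how Lemmy is implemented; `handle`, `process_sequentially`, `process_in_parallel` and the delay are all hypothetical stand-ins, with a short sleep playing the part of the round trip to the sending instance.

```python
import asyncio
import time


async def handle(message: int, delay_s: float) -> int:
    # Stand-in for verifying one message with the remote instance.
    await asyncio.sleep(delay_s)
    return message


async def process_sequentially(messages, delay_s):
    # Each message waits for the previous one to finish.
    return [await handle(m, delay_s) for m in messages]


async def process_in_parallel(messages, delay_s, limit):
    # Up to `limit` messages are in flight at the same time.
    gate = asyncio.Semaphore(limit)

    async def guarded(m):
        async with gate:
            return await handle(m, delay_s)

    return await asyncio.gather(*(guarded(m) for m in messages))


def timed(coro):
    # Run a coroutine to completion and report how long it took.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, seq = timed(process_sequentially(range(20), 0.05))
    _, par = timed(process_in_parallel(range(20), 0.05, limit=10))
    print(f"sequential: {seq:.2f}s, parallel: {par:.2f}s")
```

With the waits overlapping, the round trip stops being paid once per message. A real implementation would also have to decide what to do about messages that depend on each other, which this sketch ignores.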