Yep, that’s the new idea. The sad part is that with this method there’s no way to get historical data, only new posts. So if a server goes down, gets DDoSed, etc., I’ll lose posts forever.
Also building an ActivityPub implementation from scratch isn’t trivial either. So that’ll take some time.
I’ve got a few other ideas I’m playing with as well. Like just assuming that internal post IDs are all sequential and literally fetching them one by one. Or maybe some combination of both?
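To make the sequential-ID idea concrete, here is a rough sketch, not a working crawler. It assumes post IDs only count upward, that gaps (deleted or never-federated posts) are short, and that you supply your own `fetch_post` function. `fetch_post`, `max_consecutive_misses` and `delay_seconds` are all hypothetical names; `fetch_post` should return `None` when an ID doesn't exist. The delay is there so the crawl doesn't add to the load problem.

```python
import time
from typing import Callable, Iterator, Optional


def crawl_sequential(
    fetch_post: Callable[[int], Optional[dict]],
    start_id: int = 1,
    max_consecutive_misses: int = 50,
    delay_seconds: float = 1.0,
) -> Iterator[dict]:
    """Walk post IDs upward, stopping after a long run of missing IDs."""
    post_id = start_id
    misses = 0
    while misses < max_consecutive_misses:
        # fetch_post is a hypothetical callable: None means "no such post".
        post = fetch_post(post_id)
        if post is None:
            misses += 1
        else:
            misses = 0  # a hit resets the gap counter
            yield post
        post_id += 1
        # Throttle every request so the instance isn't hammered.
        time.sleep(delay_seconds)


if __name__ == "__main__":
    # Stand-in for a real instance: IDs 1, 2 and 4 exist, 3 is a gap.
    fake_instance = {1: {"id": 1}, 2: {"id": 2}, 4: {"id": 4}}
    for post in crawl_sequential(
        fake_instance.get, max_consecutive_misses=3, delay_seconds=0
    ):
        print(post["id"])
```

Stopping on a run of misses rather than the first miss is what lets this survive gaps in the ID space. Remembering the last ID seen would let the same loop pick up new posts later.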
Instead of building a new ActivityPub implementation, you could just run a regular instance of Lemmy and pull data from its database directly? Or use its API for searches?
I was using its APIs. But new restrictions have effectively been put in place that prevent me from using them for what I need. Similar API calls were being made that were causing DDoS attacks on lemmy.world.
As for running a Lemmy instance itself: that’s a thought, but I need the data in a different format to do efficient searches. It’s a tricky problem.
Why not talk to the instance admins directly and ask for their database dumps (minus the user accounts table and DMs) so you can ingest them into your search index? You’re doing this for the benefit of the fediverse, right? I’m sure most instance admins would help you if you ask (and it’s easier on their servers too, because your scraper isn’t bombarding them). I know I would, though my instance only has 2 users right now, so it’ll probably be useless to index. This should take care of the historical data problem, and you can use ActivityPub for obtaining new data going forward without scraping the instances.