hi my dears, I have an issue at work where we have to work with millions (~150 million) of product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 million rows, the server becomes quite slow and wait/buffer time exceeds 7000 ms/sec, which is dragging down our whole setup of microservices that continuously read, write, and delete from these tables. All the Stack Overflow answers boil down to "it's complex, read a 2000-page book."

The thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed further, because the processing takes time (and we thought that would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare each batch against the SQL table. The other culprit is that our server runs on an HDD, which at roughly 150 MB/s read and write is probably at its limit.
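
To give a rough idea, the duplicate check is essentially doing something like this (heavily simplified; table and column names are placeholders, not our real schema):

```sql
-- Simplified illustration of the duplicate check: compare an incoming
-- batch against the existing products table and keep only the new rows.
-- Without an index on the compared column, SQL Server has to scan the
-- whole 10M+ row table for every batch.
SELECT b.external_id, b.payload
FROM   #incoming_batch AS b
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.products AS p
                   WHERE  p.external_id = b.external_id);
```

So every batch ends up reading through the whole table just to decide what to skip.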

The question is: is there a wizard move to bypass any of my restrictions, or is a change in the setup and algorithm inevitable?

edit: I know that my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic is a lifetime of know-how by itself. Thanks for every bit of feedback.

  • dandelion · 3 days ago

    sounds like some changes would be a good idea 😅

      • dandelion · edited · 2 days ago

        sounds like lots of directions:

        • why are duplicates such a frequent problem in the first place? sounds like an upstream solution is needed there (see the sketch below this list)
        • SSD would be faster read/write, yes (your data shouldn’t be on a single hard-drive, it should be regularly backed up at least - make the HDD the backup and copy the main database to SSD?); you might even consider a cloud service like AWS RDS
        • for some use cases, a NoSQL database can be faster for reads - but it’s contextual
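
        for example (table and column names are just placeholders): a unique index can make SQL Server reject duplicates at insert time, so there is no separate full-table duplicate scan at all:

        ```sql
        -- Placeholder names; the point is the unique index.
        -- With IGNORE_DUP_KEY = ON, rows whose key already exists are
        -- silently skipped at insert time instead of being hunted down
        -- by a separate duplicate scan later.
        CREATE UNIQUE NONCLUSTERED INDEX UX_products_external_id
            ON dbo.products (external_id)
            WITH (IGNORE_DUP_KEY = ON);

        -- The API fetcher can then insert everything and let the index
        -- drop the duplicates:
        INSERT INTO dbo.products (external_id, payload)
        SELECT external_id, payload
        FROM   #incoming_batch;
        ```
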
        • PoisonedPrisonPanda@discuss.tchncs.de (OP) · 21 hours ago

          yes. the problem is, we are fetching products from an API, and since we thought processing power would be the limiting factor, we figured that sorting out duplicates would reduce the load.

          but since the different microservices that process the data take different amounts of time, we are using the SQL tables as a pool. this should help with scaling up by running multiple microservices.
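
          to illustrate the pattern (very simplified, table and column names are made up, and the locking hints are just one way to do it): each microservice claims a chunk of rows from the pool roughly like this:

          ```sql
          -- One way to let several workers pull from the same pool table:
          -- READPAST skips rows that another worker has already locked,
          -- ROWLOCK keeps the locks granular, and the DELETE itself takes
          -- exclusive locks on the rows it claims, so workers neither
          -- block each other nor pick up the same product twice.
          DELETE TOP (100)
          FROM   dbo.work_pool WITH (ROWLOCK, READPAST)
          OUTPUT deleted.product_id, deleted.payload  -- the claimed rows go to the worker
          WHERE  status = 'pending';
          ```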

          cloud services are not yet an option for us, as we are still in development.