Is it possible to "manage up" on customer expectations? Or am I doomed to unreasonable SLAs? (Database as a Service Company)

th3raid0r@programming.dev · edit-2 5 months ago


RonSijm@programming.dev · 5 months ago

The number of times I’ve been alerted in the middle of the night because CPU was running high for 5 minutes is too damn high.

I’d suggest just setting up automation to fix those things automatically. Let’s say 80% CPU for 5 minutes is too high. OK, add an auto-scale rule at 65% CPU for 3 minutes that adds an extra node to the cluster to spread the CPU load.
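To make the rule concrete, here is a minimal sketch of the "sustained threshold" check, assuming CPU samples arrive as `(timestamp_seconds, cpu_percent)` pairs sorted by time. The `Rule` class, the `breached` function and the rule names are hypothetical; only the 80%/5-minute and 65%/3-minute numbers come from the suggestion above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, CPU percent)


@dataclass(frozen=True)
class Rule:
    """A hypothetical 'CPU at or above X% for Y seconds' rule."""
    name: str
    threshold_pct: float
    sustain_seconds: int


def breached(samples: List[Sample], rule: Rule) -> bool:
    """True if CPU has stayed at or above the threshold, without a dip,
    for at least rule.sustain_seconds up to the most recent sample."""
    run_start = None
    for ts, cpu in samples:
        if cpu >= rule.threshold_pct:
            if run_start is None:
                run_start = ts  # a new run above the threshold begins
        else:
            run_start = None    # any dip resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= rule.sustain_seconds


# Thresholds taken from the comment; the names are made up.
SCALE_RULE = Rule("add_node", threshold_pct=65.0, sustain_seconds=3 * 60)
PAGE_RULE = Rule("page_on_call", threshold_pct=80.0, sustain_seconds=5 * 60)

samples = [(0, 50.0), (60, 70.0), (120, 72.0), (180, 71.0), (240, 69.0)]
print(breached(samples, SCALE_RULE))  # True: 65%+ from t=60 to t=240
print(breached(samples, PAGE_RULE))   # False: never reached 80%
```

The idea is that the lower, shorter rule triggers the automated action before the higher, longer rule would wake anyone up.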

It’s like we’re trying to prevent outages by monitoring for potential issues rather than actually making our system more robust and automate-able.

Like, it sounds like you’re saying the issues are caused by systems not being robust and by a lack of automation… If they’re this scared of outages and breaking SLAs, they should work on having fewer outages, or on having fallbacks when they occur.

But it could be pretty difficult to get management to do this kind of thing based on random suggestions from some SRE. I’d probably talk with the team lead about this, and with other people on your team, because you’re probably not the only one with these issues. Then have a meeting with the entire dev/SRE team and management to point out that the way it’s going isn’t sustainable, and bring suggestions to improve it.

th3raid0r@programming.dev · 5 months ago

I’d suggest just setting up automation to fix those things automatically. Let’s say 80% CPU for 5 minutes is too high. OK, add an auto-scale rule at 65% CPU for 3 minutes that adds an extra node to the cluster to spread the CPU load.

Sure, if it were a normal service and not a distributed database that requires days to scale. Days. It’s not, “add one node” and we’re good. It’s Add Node - Migrate Data - Add Node - Migrate Data… And in many cases, we have explicit instructions NOT to scale the customer because they won’t be able to afford the larger cluster.

Also, would you auto-scale for a 5 minute blip that goes away in that time and doesn’t consistently recur? I certainly wouldn’t. The customer might not be able to pay for the size we put them on.
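The "blip versus consistent recurrence" distinction could be sketched roughly like this, assuming the same `(timestamp_seconds, cpu_percent)` samples over some lookback window. The function names and the `min_episodes=3` default are assumptions for illustration, not anything our tooling actually does.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, CPU percent)


def breach_episodes(samples: List[Sample], threshold_pct: float,
                    min_seconds: int) -> List[Tuple[int, int]]:
    """Return (start, end) for every continuous run at or above the
    threshold that lasted at least min_seconds."""
    episodes = []
    start = None
    last_high = None
    for ts, cpu in samples:
        if cpu >= threshold_pct:
            if start is None:
                start = ts
            last_high = ts
        else:
            if start is not None and last_high - start >= min_seconds:
                episodes.append((start, last_high))
            start = None
    # Close a run that is still open at the end of the window.
    if start is not None and last_high - start >= min_seconds:
        episodes.append((start, last_high))
    return episodes


def is_recurring(samples: List[Sample], threshold_pct: float = 80.0,
                 min_seconds: int = 5 * 60, min_episodes: int = 3) -> bool:
    """Only treat high CPU as actionable once it has happened several
    times in the window, rather than on a single blip."""
    found = breach_episodes(samples, threshold_pct, min_seconds)
    return len(found) >= min_episodes


def minute_samples(cpu_values: List[float]) -> List[Sample]:
    """Helper for the example: one sample per minute."""
    return [(i * 60, v) for i, v in enumerate(cpu_values)]


blip = [90.0] * 6 + [20.0] * 4  # 5 minutes high, then back to normal
print(is_recurring(minute_samples(blip)))      # False: one episode
print(is_recurring(minute_samples(blip * 3)))  # True: three episodes
```

Even then, a recurring breach only tells you something is worth a look. It says nothing about whether the customer can pay for a bigger cluster.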

Our customers can simultaneously demand that we respond to all alerts AND not scale their cluster. Whose fuckin’ idea this was, I’ve no clue.

Like, it sounds like you’re saying the issues are caused by systems not being robust and by a lack of automation… If they’re this scared of outages and breaking SLAs, they should work on having fewer outages, or on having fallbacks when they occur.

No. That’s reading far more into my statement than I hoped. The reliability is indeed there - it’s VERY unlikely our managed database goes down due to a technology issue in our control. If it does, it’s usually an operator error thing. However, if it were down to just operator error alerts and things actually impacting the end users, my job would be a dream!

Automation is somewhat there, but there are a few stakeholders who insist on human-validated steps. So, while I have an Ansible playbook for most issues, operating that playbook takes hours.

But it could be pretty difficult to get management to do this kind of thing based on random suggestions from some SRE. I’d probably talk with the team lead about this, and with other people on your team, because you’re probably not the only one with these issues. Then have a meeting with the entire dev/SRE team and management to point out that the way it’s going isn’t sustainable, and bring suggestions to improve it.

Sure, if it were technical. But this is largely not a technical issue, as you had assumed. The issue is that there is someone, with power, who gets to say that we must follow unreasonable customer requests to the letter. Even if those requests run counter to our sustainability.