
Wrap up monitoring and alerting for new user onboarding
Closed, ResolvedPublic2 Estimated Story Points

Description

We're planning on adding at least 50 new users to Wikibase.cloud, possibly 67, each with the ability to create up to 6 wikis. We've been implementing several dashboards with metrics to keep an eye on during the onboarding of these users.

To make sure we're ready, we want to set aside time to double-check whether the metrics/monitoring/alerts are complete. Some ideas:

  • Potentially add alerting on sql-logic-backup scratch disk space, or bump it up before the new users start signing up
  • We only have one replica for mariadb secondary, and are running out of space here, so this might need bumping up
  • Do we need metrics that give us a better idea of how the cache is used?
  • Review the incident list from the migration and see if this is all covered in monitoring or in a good place to keep an eye on

Link to existing dashboard: https://console.cloud.google.com/monitoring/dashboards/builder/fd1bf4b9-3b5b-4cd0-9529-3f1f75c3bdbd?project=wikibase-cloud&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=6h

AC:

  • Everyone has reviewed this ticket and added their thoughts
    • Deniz
    • Dat
    • Rosalie
    • Tobias
    • Tom
  • We can conclude whether monitoring and alerting are sufficient for onboarding >50 new users
  • We can confidently send out all the invite codes at once

Event Timeline

Evelien_WMDE set the point value for this task to 2.

We only have one replica for mariadb secondary, and are running out of space here, so this might need bumping up

So, to clarify this problem:

Our replica has been falling over from time to time, mostly because of the load spikes caused by taking the logical backup. It's not a disk-space problem but mostly a memory problem. We can increase the replica count and keep taking backups from the secondaries, but that will most likely still keep tipping them over from time to time. Since failed backups are currently retried 4 times, this could still lead to both replicas falling over as they restart.

The second problem with taking backups from the secondaries is that this load makes it hard to follow recommendations on how to tune mariadb: most of them rely on leaving the server running under normal usage for a while and then inspecting its variables / status, for example for open_table_cache.
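As a rough illustration of the kind of steady-state check those tuning guides describe (the status counter names are real MariaDB ones, but the sample values below are invented, not measurements from our cluster):

```python
# Sketch: estimate table-cache efficiency from MariaDB status counters,
# as tuning guides suggest doing after a period of normal load.
# The counter values below are made-up examples, not real measurements.
status = {
    "Table_open_cache_hits": 980_000,
    "Table_open_cache_misses": 20_000,
    "Table_open_cache_overflows": 1_500,
}

hits = status["Table_open_cache_hits"]
misses = status["Table_open_cache_misses"]
hit_ratio = hits / (hits + misses)

print(f"table cache hit ratio: {hit_ratio:.1%}")  # → 98.0%
# A low ratio or growing overflow count would suggest raising
# table_open_cache, but only if the counters reflect normal traffic
# rather than backup-induced spikes.
```

The point of the backup-replica proposal below is exactly that these counters would then reflect application traffic only.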

Instead of simply increasing the replicaCount variable for the sql deployment, I would propose we set up a dedicated dump/backup replica, in a similar way as described here. That way we would avoid the load spikes that can tip the replicas over (out of memory), and the remaining replicas would only see "normal application usage", potentially making them easier to tune.

Potentially add alerting on sql-logic-backup scratch disk space or bumping it up before the new users start signing up

image.png (200×543 px, 12 KB)

This is what the current usage looks like. When that space runs out, backups will no longer be taken.

I used the UI to add this chart (displayed as a ratio) for the sql-logic-backup cronjob on production to the Volume utilization dashboard

image.png (327×914 px, 19 KB)

Maybe this will be enough for us to keep monitoring how this increases without requiring alerting.
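One way to sanity-check whether monitoring alone is enough is a back-of-the-envelope headroom projection. All numbers here are illustrative placeholders, not values read from the dashboard:

```python
# Sketch: project when the sql-logic-backup scratch volume fills up,
# assuming roughly linear growth as dumps get bigger.
# All numbers are illustrative, not taken from the actual dashboard.
volume_gib = 100.0        # assumed provisioned scratch size
used_gib = 60.0           # assumed current usage
growth_gib_per_day = 0.5  # assumed growth rate

days_left = (volume_gib - used_gib) / growth_gib_per_day
print(f"~{days_left:.0f} days of headroom at current growth")  # → ~80 days
```

If the projected headroom is months rather than days, periodic eyeballing of the chart is probably fine; if it shrinks toward weeks, that's the signal to add an alert or bump the volume.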

Do we need metrics that give us a better idea of how the cache is used?

Great question. It would be nice to see cache hits / misses, but I'm not sure this is required to onboard more users right now.
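If we do want a cheap cache signal later, one common option is the InnoDB buffer pool hit ratio, derived from two real MariaDB status counters. The sample values here are invented for illustration:

```python
# Sketch: InnoDB buffer pool hit ratio from MariaDB status counters.
# Sample values are invented, not real measurements from our cluster.
read_requests = 5_000_000  # Innodb_buffer_pool_read_requests (logical reads)
disk_reads = 50_000        # Innodb_buffer_pool_reads (reads that missed the pool)

hit_ratio = 1 - disk_reads / read_requests
print(f"buffer pool hit ratio: {hit_ratio:.2%}")  # → 99.00%
```

A ratio that drops noticeably as the new users come online would be a hint that the cache is under pressure; until then this probably doesn't block onboarding.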

Potentially add alerting on sql-logic-backup scratch disk space or bumping it up before the new users start signing up

I don't think this is needed for the next ~70 users, but I think it's a great idea to get alerting on this. If we hit the limit, we would probably risk 1–3 days without logical backups until someone notices and increases the disk (though the physical backups would still be in place).

The secondary mariadb falling over because of the backups is a bit worrying, but (without a deeper understanding of the issue at this moment) it looks like it recovers on its own, so probably not a blocker for more users?

Added my tick. I would like to see us sort the OOM from running the backups problem (T316214).

Looked through the previous incidents and I'm happy that nothing exceptional feels outstanding.

We're going to tackle the mariaDB problem in this sprint, and after that I think we are good to go. Let's get them all aboard!

Evelien_WMDE claimed this task.