
Wrap up monitoring and alerting for new user onboarding
Closed, ResolvedPublic2 Estimated Story Points

Description

We're planning on adding at least 50 new users to Wikibase.cloud, possibly 67, each with the ability to create up to 6 wikis. We've been implementing several dashboards with metrics to keep an eye on during the onboarding of these users.

To make sure we're ready, we want to set aside time to double-check whether the metrics/monitoring/alerts are complete. Some ideas:

  • Potentially add alerting on sql-logic-backup scratch disk space, or bump it up before the new users start signing up
  • We only have one replica for mariadb secondary, and are running out of space here, so this might need bumping up
  • Do we need metrics that give us a better idea of how the cache is used?
  • Review the incident list from the migration and see if this is all covered in monitoring or in a good place to keep an eye on

Link to existing dashboard: https://console.cloud.google.com/monitoring/dashboards/builder/fd1bf4b9-3b5b-4cd0-9529-3f1f75c3bdbd?project=wikibase-cloud&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=6h

AC:

  • Everyone has reviewed this ticket and added their thoughts
    • Deniz
    • Dat
    • Rosalie
    • Tobias
    • Tom
  • We can conclude whether monitoring and alerting are sufficient for onboarding >50 new users
  • We can confidently send out all the invite codes at once

Event Timeline

Evelien_WMDE set the point value for this task to 2.

We only have one replica for mariadb secondary, and are running out of space here, so this might need bumping up

So, to clarify this problem:

Our replica has been falling over from time to time, mostly because of the load spikes caused by taking the logical backup. It's not a disk-space problem but mostly a memory problem. We can increase the replica count and keep taking backups from the secondaries, but that will most likely still keep tipping them over from time to time. Since failed backups are currently retried 4 times, this could still lead to both replicas falling over as they restart.

The second problem with taking backups from the secondaries is that this load makes it hard to follow recommendations on how to tune mariadb: most of them rely on leaving the server running under normal usage for a while and then inspecting its variables / status, for example for open_table_cache.
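As a rough illustration of the kind of steady-state check those tuning guides describe (the status counter names are real MariaDB ones, but the sample values below are invented, not measurements from our cluster):

```python
# Sketch: estimate table-cache efficiency from MariaDB status counters,
# as tuning guides suggest doing after a period of normal load.
# The counter values below are made-up examples, not real measurements.
status = {
    "Table_open_cache_hits": 980_000,
    "Table_open_cache_misses": 20_000,
    "Table_open_cache_overflows": 1_500,
}

hits = status["Table_open_cache_hits"]
misses = status["Table_open_cache_misses"]
hit_ratio = hits / (hits + misses)

print(f"table cache hit ratio: {hit_ratio:.1%}")  # → 98.0%
# A low ratio or growing overflow count would suggest raising
# table_open_cache, but only if the counters reflect normal traffic
# rather than backup-induced spikes.
```

The point of the backup-replica proposal below is exactly that these counters would then reflect application traffic only.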

Instead of simply increasing the replicaCount variable for the sql deployment, I would propose we set up a dedicated dump/backup replica, in a similar way as described here. That way we would avoid the load spikes that can tip the replicas over (out of memory), and the remaining replicas would only see "normal application usage", potentially making them easier to tune.

Potentially add alerting on sql-logic-backup scratch disk space or bumping it up before the new users start signing up

image.png (200×543 px, 12 KB)

This is what the current usage looks like. When that space runs out, backups will no longer be taken.

I used the UI to add this chart (displayed as a ratio) for the sql-logic-backup cronjob on production to the Volume utilization dashboard

image.png (327×914 px, 19 KB)

Maybe this will be enough for us to keep monitoring how this increases without requiring alerting.
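One way to sanity-check whether monitoring alone is enough is a back-of-the-envelope headroom projection. All numbers here are illustrative placeholders, not values read from the dashboard:

```python
# Sketch: project when the sql-logic-backup scratch volume fills up,
# assuming roughly linear growth as dumps get bigger.
# All numbers are illustrative, not taken from the actual dashboard.
volume_gib = 100.0        # assumed provisioned scratch size
used_gib = 60.0           # assumed current usage
growth_gib_per_day = 0.5  # assumed growth rate

days_left = (volume_gib - used_gib) / growth_gib_per_day
print(f"~{days_left:.0f} days of headroom at current growth")  # → ~80 days
```

If the projected headroom is months rather than days, periodic eyeballing of the chart is probably fine; if it shrinks toward weeks, that's the signal to add an alert or bump the volume.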

Do we need metrics that give us a better idea of how the cache is used?

Great question. It would be nice to see cache hits / misses, but I'm not sure this is required to onboard more users right now.
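If we do want a cheap cache signal later, one common option is the InnoDB buffer pool hit ratio, derived from two real MariaDB status counters. The sample values here are invented for illustration:

```python
# Sketch: InnoDB buffer pool hit ratio from MariaDB status counters.
# Sample values are invented, not real measurements from our cluster.
read_requests = 5_000_000  # Innodb_buffer_pool_read_requests (logical reads)
disk_reads = 50_000        # Innodb_buffer_pool_reads (reads that missed the pool)

hit_ratio = 1 - disk_reads / read_requests
print(f"buffer pool hit ratio: {hit_ratio:.2%}")  # → 99.00%
```

A ratio that drops noticeably as the new users come online would be a hint that the cache is under pressure; until then this probably doesn't block onboarding.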

Potentially add alerting on sql-logic-backup scratch disk space or bumping it up before the new users start signing up

I don't think this is needed for the next ~70 users, but I think it's a great idea to get alerting on this. If we hit the limit, we would probably risk 1–3 days without logical backups until someone notices and increases the disk (though the physical backups would still be in place).

The secondary mariadb falling over because of the backups is a bit worrying, but (without a deeper understanding of the issue at this moment) it looks like it recovers on its own, so probably not a blocker for more users?

Added my tick. I would like to see us sort the OOM from running the backups problem (T316214).

Looked through the previous incidents and I'm happy that nothing exceptional feels outstanding.

We're going to tackle the mariaDB problem in this sprint, and after that I think we are good to go. Let's get them all aboard!

Evelien_WMDE claimed this task.