
Leftover schemas in shared DB for envs
Closed, Resolved | Public | 3 Estimated Story Points

Description

Right now there's a significant discrepancy between the actual Patchdemo wiki envs running on the cluster and the corresponding schemas in the DB shared by Patchdemo envs:

$ kubectl -n cat-env get po | grep mediawiki | grep ^wiki | wc -l
163
$ kubectl -n cat-env exec -it envdb-mariadbop-0 -- bash -c $'mariadb -uroot -p$(echo $MARIADB_ROOT_PASSWORD) -NBe "select count(*) from information_schema.schemata where schema_name regexp \'^wiki[-_][^_]+(__main)?$\';"'
271

All 163 running envs are accounted for: their schemas exist in the shared DB. The remaining 271 − 163 = 108 schemas belong exclusively to deleted envs whose schemas were never dropped. A cursory check turned up envs that are months old, so these have probably been accumulating for some time.

Full list of (normalized) schemas can be seen here: P90359.
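The comparison above can be sketched as a small shell helper that, given the list of running env names and the list of schema names, prints the schemas with no matching env. This is only an illustration: the normalization rule (dropping an optional `__main` suffix and mapping `_` to `-`) is an assumption inferred from the regexp in the query, and `normalize_schema` and `orphan_schemas` are hypothetical names, not part of Catalyst.

```shell
# Hypothetical normalization: drop an optional "__main" suffix and map "_" to "-".
# The real rule used for P90359 may differ.
normalize_schema() {
  printf '%s\n' "$1" | sed -e 's/__main$//' -e 's/_/-/g'
}

# Print every schema name that has no matching running env.
#   $1: file with one running env name per line
#   $2: file with one schema name per line
orphan_schemas() {
  while IFS= read -r schema; do
    [ -n "$schema" ] || continue
    # -x matches whole lines, -F treats the name as a fixed string
    if ! grep -qxF -e "$(normalize_schema "$schema")" "$1"; then
      printf '%s\n' "$schema"
    fi
  done < "$2"
}
```

For example, `orphan_schemas envs.txt schemas.txt` would print only the leftover schemas, assuming both files were produced from the `kubectl` and `mariadb` output beforehand.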

Note only Patchdemo envs use the shared DB at the moment.

Unless we commit to building a proper distributed system with eventual consistency, we can't guarantee that we will always be able to clean up an environment successfully (think, for example, of situations like the incidents back in February this year). Because of this, I propose that we create a periodic job in Catalyst that checks data consistency in the shared DB and cleans up old schemas.
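As a rough sketch of the cleanup half of such a job, the helper below turns a list of leftover schema names into `DROP DATABASE` statements, skipping anything that does not look like a Patchdemo schema. The function name `drop_statements` is hypothetical, and the character class is deliberately stricter than the `[^_]+` used in the query above (an assumption made here so that names can be safely quoted with backticks). Printing the statements instead of executing them leaves room for a manual review step.

```shell
# Read schema names from stdin (one per line) and print a DROP statement
# for each name that matches the expected Patchdemo pattern.
# Assumption: env names only contain letters, digits and "-".
drop_statements() {
  while IFS= read -r name; do
    if printf '%s\n' "$name" | grep -Eq '^wiki[-_][A-Za-z0-9-]+(__main)?$'; then
      printf 'DROP DATABASE IF EXISTS `%s`;\n' "$name"
    else
      # Never touch schemas we do not recognize (e.g. system schemas)
      printf 'skipping unexpected schema name: %s\n' "$name" >&2
    fi
  done
}
```

The output could then be reviewed and piped into the `mariadb` client on the DB pod.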

In the future we should also monitor the output of that cron job. If leftover schemas keep appearing and there's no system-wide incident to trace them back to, we should try to figure out whether a bug somewhere is also causing the issue.

Details

Related Changes in GitLab:
Title: jobs: Replay recent env deletions
Reference: repos/test-platform/catalyst/catalyst-api!190
Author: jnuche
Source Branch: T422938
Dest Branch: main


Event Timeline

jnuche moved this task from Backlog to Luka Ijo Pimeja Jan on the Catalyst board.
jnuche edited projects, added Catalyst (Luka Ijo Pimeja Jan); removed Catalyst.
jnuche moved this task from Backlog to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.
jnuche set the point value for this task to 3. (Apr 16 2026, 1:49 PM)
jnuche closed this task as Resolved. (Edited; Mon, Apr 27, 4:04 PM)

I have deployed a new cron job that replays the deletion of any env deleted recently ("recently" meaning 3 days, but it's configurable). This should ensure we eventually succeed in reaping the resources associated with an env if the original failure was transient (i.e. a problem with the infra somewhere).
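The selection step of such a replay can be illustrated with a small filter. This is not the code from the merge request, only a sketch under assumptions: the input format (`<deleted_at_epoch_seconds> <env_name>` per line) and the name `recent_deletions` are made up for the example, and the window defaults to the 3 days mentioned above.

```shell
# Print the names of envs deleted within the last N days.
#   $1: current time in epoch seconds
#   $2: window in days (defaults to 3)
# stdin: lines of "<deleted_at_epoch_seconds> <env_name>" (hypothetical format)
recent_deletions() {
  now=$1
  window_days=${2:-3}
  awk -v now="$now" -v days="$window_days" '
    # Keep records whose age is between 0 and the window, inclusive
    NF >= 2 && (now - $1) >= 0 && (now - $1) <= days * 86400 { print $2 }
  '
}
```

A job could call `recent_deletions "$(date +%s)" 3 < deletions.txt` and re-run the deletion for each name printed. Replaying is only safe if the deletion itself is idempotent, so that repeating it for an env that was already fully cleaned up is a no-op.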

After checking a number of the leftover schemas mentioned in P90359, I'm convinced that the majority, if not all, failed to be dropped because of changes to the function that provides the name used for the deletion. When we changed that function, we forgot to stay backwards-compatible with the old name pattern. This means other old envs deleted in the future will suffer from the same problem and leave their schemas behind. I will continue monitoring the shared DB to clean these up from time to time. In the future we should make sure that changes to that function remain backwards-compatible.
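One way to keep deletion backwards-compatible is to have it try every name pattern an env could have been created under, rather than only the current one. The sketch below is purely illustrative: the four variants are guessed from the regexp used in the task description (`-` or `_` as separator, optional `__main` suffix), and `candidate_schema_names` is a hypothetical helper, not the actual Catalyst function.

```shell
# Print every schema name an env might have been created under.
#   $1: env identifier without the "wiki" prefix (hypothetical input shape)
# Assumption: the old and new patterns differ only in the separator
# and in the optional "__main" suffix.
candidate_schema_names() {
  for sep in - _; do
    for suffix in '' __main; do
      printf 'wiki%s%s%s\n' "$sep" "$1" "$suffix"
    done
  done
}
```

Combined with `DROP DATABASE IF EXISTS`, attempting all candidates is harmless for the names that do not exist.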

All of the schemas mentioned in P90359 have now been dropped.