Modify EL purging script to not use limit/offset
Closed, Resolved · Public · 5 Estimated Story Points

Description

After the performance analysis in https://phabricator.wikimedia.org/T156933#3352689, it seems that batch selection using uuids and limit/offset won't work well enough for the one-off historical purge of the EL database, even though the script seems perfectly fine for day-to-day execution. This task is about optimizing the purging script so that it does not use the uuid + limit/offset approach.
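
For context, a minimal sketch of the pattern being replaced (hypothetical names, not the actual script code), assuming a MySQLdb-style cursor:

```
def get_old_uuids_offset(cursor, table, end_ts, batch_size, offset):
    # OFFSET forces the server to scan and discard every previously
    # processed row before returning the next batch: fine for a small
    # daily purge, pathological over months of historical data.
    query = ('SELECT uuid FROM {} WHERE timestamp < %s '
             'LIMIT %s OFFSET %s').format(table)
    cursor.execute(query, (end_ts, batch_size, offset))
    return [row[0] for row in cursor.fetchall()]
```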

Event Timeline

Change 359442 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] [WIP] Modify EL purging script to not use limit/offset

https://gerrit.wikimedia.org/r/359442

Trial 1)
The initial idea was to quickly alter the _get_old_uuids method to return not only the list of uuids, but also the max timestamp of the events corresponding to those uuids, so that the next call to _get_old_uuids could use that timestamp to select the proper slice without using offset. However, a SELECT by itself does not guarantee the order of the results, so the max-timestamp idea could fail unless we used ORDER BY. And ORDER BY did not look feasible on big tables like Edit (we can sort 1 day of data, but not 1 month or the whole history). So I abandoned this (but see trial 4).
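
A sketch of what trial 1 aimed for, with hypothetical names (not the real _get_old_uuids): return the batch's max timestamp so the next call can resume from it instead of using OFFSET. Without ORDER BY, the max timestamp says nothing about which rows earlier batches covered, which is why the idea stalls here and only becomes viable in trial 4:

```
def get_old_uuids_keyset(cursor, table, start_ts, end_ts, batch_size):
    # ORDER BY makes the resume point (max_ts) meaningful: every row
    # with a smaller timestamp is guaranteed to be in an earlier batch.
    query = ('SELECT uuid, timestamp FROM {} '
             'WHERE timestamp >= %s AND timestamp < %s '
             'ORDER BY timestamp LIMIT %s').format(table)
    cursor.execute(query, (start_ts, end_ts, batch_size))
    rows = cursor.fetchall()
    uuids = [row[0] for row in rows]
    max_ts = rows[-1][1] if rows else None  # resume point for next call
    return uuids, max_ts
```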

Trial 2)
The second idea was to split the data into batches by timestamp, with an adaptive batch size: start with a 1-minute window and grow it until the volume of events approaches the desired batch_size. The problem with this approach is that a table in the db can have no events registered until a certain date, and then start receiving a big throughput of events (the client started sending events on that date). If the script starts before that date, the window can already be huge when it reaches the first batch that contains events, and it may try to process too many events in a single batch. So, also abandoned.
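
A sketch of the adaptive-window idea, and of where it breaks (all names hypothetical; count_events stands in for a COUNT(*) query over the window):

```
from datetime import timedelta

def iter_adaptive_windows(count_events, start, end, target=10000):
    window = timedelta(minutes=1)
    current = start
    while current < end:
        batch_end = min(current + window, end)
        n = count_events(current, batch_end)
        yield current, batch_end
        # Grow after sparse batches, shrink after dense ones. The flaw:
        # a long empty stretch keeps doubling the window, so the first
        # window that does contain events can cover far too many rows.
        if n < target // 2:
            window *= 2
        elif n > 2 * target:
            window //= 2
        current = batch_end
```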

Trial 3)
As a variation on trial 2, the idea was to pre-select the count of events that would be updated in the next batch. If that count is too big, shrink the batch size proportionally so that the batch stays within the expected range (and likewise grow it if the count is too small). Once the batch size has proven to be good, proceed with the update. I think this would work, and I wrote the full code: https://gerrit.wikimedia.org/r/#/c/359442/. It is a big refactor of the sanitize method, though, and the tests are still missing. I didn't continue because of trial 4.
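
Roughly, the count-first resizing looks like this (a sketch with hypothetical helpers, not the code in the patch): the window is resized proportionally before any UPDATE runs, so no batch is ever committed at a bad size.

```
from datetime import timedelta

MIN_WINDOW = timedelta(minutes=1)

def sized_window(count_events, start, window, target=10000):
    # Pre-count the batch and resize the window proportionally until
    # the count lands near the target (e.g. 40k rows at a 10k target
    # quarters the window); only then is it safe to run the UPDATE.
    for _ in range(10):  # bound the resizing loop
        n = count_events(start, start + window)
        if n == 0 or target // 2 <= n <= 2 * target:
            break
        window = max(MIN_WINDOW, window * target // n)
    return window
```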

Trial 4)
It turns out that ORDER BY can be very fast on big tables if the column you're ordering on is indexed. So trial 1 can actually work after all, and it is a minor change. Will proceed with that in a new Gerrit patch.
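
With the timestamp column indexed, ORDER BY timestamp LIMIT n becomes an index range scan rather than a full sort, so the keyset sketch under trial 1 is usable as-is. A sketch of the resulting loop (hypothetical names; purge_batch stands for the UPDATE that NULLs the sensitive columns):

```
def purge(cursor, table, start_ts, end_ts, batch_size, purge_batch):
    current = start_ts
    while True:
        uuids, max_ts = get_old_uuids_keyset(
            cursor, table, current, end_ts, batch_size)
        if not uuids:
            break
        purge_batch(uuids)  # e.g. UPDATE ... SET <sensitive cols> = NULL
        if len(uuids) < batch_size:
            break  # final partial batch: nothing left before end_ts
        # Resume at max_ts (inclusive): rows sharing that timestamp are
        # re-selected and re-purged, which is harmless if the purge is
        # idempotent, but assumes < batch_size events share one timestamp.
        current = max_ts
```

A more robust variant would page on the composite (timestamp, uuid) key, which removes the shared-timestamp edge case entirely.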

Change 359938 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Change EL purging script to avoid limit/offset

https://gerrit.wikimedia.org/r/359938

Change 359938 abandoned by Mforns:
Change EL purging script to avoid limit/offset

Reason:
This is already being implemented in another patch: https://gerrit.wikimedia.org/r/#/c/356383

https://gerrit.wikimedia.org/r/359938

Change 359442 abandoned by Mforns:
[WIP] Modify EL purging script to not use limit/offset

Reason:
This is being implemented in another patch: https://gerrit.wikimedia.org/r/#/c/356383

https://gerrit.wikimedia.org/r/359442

mforns set the point value for this task to 5.