
Implement a continuous sanitization process
Closed, Resolved · Public

Description

The saneitizer script (saneitize.php) is designed to clean up oddities that may appear in the index.
It has been run in the past, but we should do a quick review/test before running it in production.
We should also implement a continuous process so that pages are constantly checked.

Details

Related Gerrit Patches:
mediawiki/extensions/CirrusSearch : master : Add a continuous sanitize process using the JobQueue
mediawiki/extensions/CirrusSearch : master : Optimize saneitize.php

Event Timeline

dcausse created this task. Jun 6 2016, 3:29 PM
Restricted Application added projects: Discovery, Discovery-Search. Jun 6 2016, 3:29 PM
Restricted Application added subscribers: Zppix, Aklapper.
Danny_B renamed this task from Review and eventually fix the saneitizer script to Review and eventually fix the sanitizer script. Jun 6 2016, 4:13 PM

Any known problems here? I took a look over the implementation and it looks pretty simple and straightforward. The downside is that running against enwiki looks like it will take ~140-200 hours in a single process. Given the from/to IDs, though, we could relatively easily run this in parallel to get the time down to a day or two.

One other thing we could do is farm out the actual checking to the job queue. We would have to use a system like forceSearchIndex does to push jobs into the queue, wait for it to drain below a threshold, then push more jobs. This would allow for much more parallelism than we can do on terbium directly.
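The throttled push described above could look roughly like the following. This is a minimal Python sketch, not CirrusSearch or forceSearchIndex code: `id_batches`, `push_with_backpressure`, and the `queue_size`/`push` callables are hypothetical stand-ins for the real job queue, and the threshold and polling interval are arbitrary.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

IdRange = Tuple[int, int]


def id_batches(min_id: int, max_id: int, batch_size: int) -> Iterator[IdRange]:
    """Yield inclusive (from_id, to_id) ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (start, end)
        start = end + 1


def push_with_backpressure(
    batches: Iterable[IdRange],
    queue_size: Callable[[], int],    # hypothetical: current number of queued jobs
    push: Callable[[IdRange], None],  # hypothetical: enqueue one checker job
    threshold: int,
    sleep: Callable[[float], None] = time.sleep,
    poll_seconds: float = 5.0,
) -> int:
    """Push one job per batch, pausing while the queue is at or above threshold."""
    pushed = 0
    for batch in batches:
        # Wait for the job runners to drain the queue before adding more work.
        while queue_size() >= threshold:
            sleep(poll_seconds)
        push(batch)
        pushed += 1
    return pushed
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting; a production version would also need error handling and a way to resume after interruption.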

EBernhardson added a subscriber: TJones. Edited Jun 7 2016, 4:47 PM

One other idea that came up before (I think from Justin?) was to have a process that runs regularly and semi-randomly chooses a selection of pages to check, such that over the course of a week or two we know the entire index has been checked. Would have to poke @TJones about how to determine the right sizes such that with random page selection we can guarantee (not exactly, but with some level of confidence) the whole index is visited in a particular timeframe.

No, I'm not aware of any known problems here. I created this task just to be sure that nothing changed in the update process that could confuse this script.
I have not looked at how it works, but if we have an option to run it just to delete zombie pages like this:

  • fetch all ids+namespace with a scroll from elastic
  • check db if the page exists
  • if it does not exist, send delete
  • if namespace mismatch (content vs general) send delete to content/general

I think that would cover the most problematic index issues we have today (duplicate pages).
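The zombie-page check listed above can be sketched as a pure function. This is an illustrative Python sketch under stated assumptions, not the saneitize.php logic: `plan_deletes` is a hypothetical name, the scroll results are represented as plain `(page_id, index_name)` pairs, and the database lookup is a dict mapping each existing page ID to the index ("content" or "general") its namespace belongs in. Deleting the copy found in the wrong index is one reading of the last bullet.

```python
from typing import Dict, Iterable, List, Tuple


def plan_deletes(
    indexed_docs: Iterable[Tuple[int, str]],
    db_pages: Dict[int, str],
) -> List[Tuple[int, str, str]]:
    """Return (page_id, index_name, reason) for every document to delete.

    indexed_docs: (page_id, index_name) pairs, e.g. gathered from a scroll.
    db_pages: page_id -> expected index for pages that exist in the database.
    """
    deletes = []
    for page_id, index_name in indexed_docs:
        expected_index = db_pages.get(page_id)
        if expected_index is None:
            # The page no longer exists in the database: a zombie document.
            deletes.append((page_id, index_name, "missing_from_db"))
        elif expected_index != index_name:
            # The page exists, but this copy sits in the wrong index.
            deletes.append((page_id, index_name, "wrong_index"))
    return deletes
```

For example, a page that exists in the database but is indexed in both "content" and "general" yields a single delete for the copy in the index it does not belong to, which addresses the duplicate-page case.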

TJones added a comment. Jun 7 2016, 5:59 PM

One other idea that came up before (I think from Justin?) was to have a process that runs regularly and semi-randomly chooses a selection of pages to check, such that over the course of a week or two we know the entire index has been checked.

For large values of n, 95% confidence works out to taking 3n samples. Why not methodically (though maybe in a random order) work through everything and do n tasks instead of 3n?
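The difference between the two approaches can be illustrated with a small simulation. This is a hedged Python sketch with hypothetical function names: one helper produces a shuffled order that visits every bucket exactly once (n tasks), and the other measures what fraction of buckets is hit by a given number of random draws with replacement.

```python
import random
from typing import List, Optional


def shuffled_bucket_order(num_buckets: int, seed: Optional[int] = None) -> List[int]:
    """Return every bucket index exactly once, in random order."""
    order = list(range(num_buckets))
    random.Random(seed).shuffle(order)
    return order


def coverage_with_replacement(
    num_buckets: int, draws: int, seed: Optional[int] = None
) -> float:
    """Fraction of buckets hit at least once by `draws` uniform random picks."""
    rng = random.Random(seed)
    seen = {rng.randrange(num_buckets) for _ in range(draws)}
    return len(seen) / num_buckets
```

With `draws` set to three times `num_buckets`, the simulated coverage should land near 95%, whereas the shuffled order reaches full coverage after `num_buckets` visits.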

EBernhardson added a comment. Edited Jun 7 2016, 5:59 PM

Random guess:

enwiki (likely the largest) has a max page ID just under 51,000,000. If we were to queue jobs with batches of 1000 contiguous IDs, that means 51,000 possible buckets. Randomly selecting buckets is basically sampling with replacement.

When sampling with replacement, the probability p that a specific bucket is selected at least once is:

r: number of samples taken
n: total number of buckets

p = 1 - (1 - 1/n)^r

Guessing that a 95% chance of visiting any given bucket is good enough: for enwiki with ~51M page IDs, if we take buckets of 1000 contiguous IDs, this works out to 152,781 random bucket selections. Using two weeks as our baseline, that would be 454 buckets to visit per hour. Over 4 weeks at the same rate, this gives a 99.75% chance that any given bucket (and so any given page) has been visited.
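The figures above follow from solving p = 1 - (1 - 1/n)^r for r. A short Python sketch of that arithmetic, with hypothetical function names and the thread's numbers as inputs:

```python
import math


def samples_needed(n_buckets: int, p: float) -> int:
    """Smallest r such that 1 - (1 - 1/n)^r >= p for a single bucket."""
    return math.ceil(math.log(1 - p) / math.log(1 - 1 / n_buckets))


def visit_probability(n_buckets: int, r: int) -> float:
    """Probability that one specific bucket is picked at least once in r draws."""
    return 1 - (1 - 1 / n_buckets) ** r


n = 51_000                            # buckets of 1000 IDs over ~51M page IDs
r = samples_needed(n, 0.95)           # random buckets to select in total
per_hour = r / (14 * 24)              # spread over two weeks
p_four_weeks = visit_probability(n, 2 * r)  # same hourly rate for four weeks

print(r, int(per_hour), round(p_four_weeks, 4))
```

Since ln(0.05) is roughly -3 and ln(1 - 1/n) is roughly -1/n for large n, r comes out close to 3n, which matches the "3n samples" rule of thumb quoted earlier in the thread.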

For reference on processing time, visiting 10k IDs on terbium took 3m 20s, or 50 per second. 454k IDs would then be ~150 minutes of processing time per hour (parallelized across the job runners).
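The processing-time estimate can be reproduced with a couple of lines of arithmetic. A minimal Python sketch, with hypothetical helper names and the measurements quoted above as inputs:

```python
def ids_per_second(ids_checked: int, elapsed_seconds: float) -> float:
    """Observed checking throughput."""
    return ids_checked / elapsed_seconds


def processing_minutes(ids_to_check: int, rate: float) -> float:
    """Total single-process minutes needed to check ids_to_check IDs."""
    return ids_to_check / rate / 60


rate = ids_per_second(10_000, 3 * 60 + 20)  # 10k IDs in 3m 20s
minutes_per_hour = processing_minutes(454 * 1000, rate)  # 454 buckets of 1000 IDs

print(rate, round(minutes_per_hour, 1))
```

Because the result exceeds 60 minutes of work per wall-clock hour, a single process could not keep up, which is why the estimate assumes the work is spread across job runners.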

An alternative solution would be to maintain a position (in Redis?) per wiki and not do random selections. Then we would be visiting 152k IDs per hour, which is only 50 minutes of processing time.
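A per-wiki position could be advanced along these lines. This is a minimal Python sketch under stated assumptions: `next_range` is a hypothetical helper, a plain dict stands in for whatever store (Redis or otherwise) would hold the position, and wrapping back to page ID 1 after reaching the maximum is an assumed policy, not something the thread specifies.

```python
from typing import MutableMapping, Tuple


def next_range(
    positions: MutableMapping[str, int],  # stand-in for the per-wiki position store
    wiki: str,
    max_page_id: int,
    batch_size: int,
) -> Tuple[int, int]:
    """Return the next inclusive (from_id, to_id) range and advance the position."""
    start = positions.get(wiki, 1)
    if start > max_page_id:
        # Finished a full pass over the wiki: start again from the beginning.
        start = 1
    end = min(start + batch_size - 1, max_page_id)
    positions[wiki] = end + 1
    return (start, end)
```

Each call hands out the next contiguous range, so every ID is visited exactly once per pass. With a shared store and several concurrent callers, the read and update would need to be atomic.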

Will talk about this at our Wednesday weekly meeting to nail down more concrete details.

No, I'm not aware of any known problems here. I created this task just to be sure that nothing changed in the update process that could confuse this script.
I have not looked at how it works, but if we have an option to run it just to delete zombie pages like this:

  • fetch all ids+namespace with a scroll from elastic
  • check db if the page exists
  • if it does not exist, send delete
  • if namespace mismatch (content vs general) send delete to content/general

I think that would cover the most problematic index issues we have today (duplicate pages).

I ran a 10k page sample (page IDs 50,700,000 to 50,710,000) and there were quite a few pages missing from the index as well.

Change 293503 merged by jenkins-bot:
Optimize saneitize.php

https://gerrit.wikimedia.org/r/293503

Change 295556 had a related patch set uploaded (by DCausse):
Add a continuous sanitize process using the JobQueue

https://gerrit.wikimedia.org/r/295556

dcausse renamed this task from Review and eventually fix the sanitizer script to Implement a continuous sanitization process. Jun 30 2016, 8:24 AM
dcausse updated the task description.

Change 295556 merged by jenkins-bot:
Add a continuous sanitize process using the JobQueue

https://gerrit.wikimedia.org/r/295556

debt closed this task as Resolved. Jul 21 2016, 6:23 PM