
Create maintenance script for cleaning up stale indexes
Closed, Duplicate, Public

Description

If a reindex fails or is interrupted for some reason, an unused index may be left behind that prevents further index updates, e.g.:

Looks like the index has more than one identifier. You should delete all
but the one of them currently active. Here is the list: frwiki_content_1494688722,frwiki_content_1495187567

We should make a maintenance script (or an option in one of the existing scripts) that detects which of these indexes is currently active and deletes the extra ones. This can be done manually today, but the manual process is error-prone, and deleting the wrong index by mistake may cause search downtime.

Event Timeline

There is some documentation about how we've manually handled this on wikitech: https://wikitech.wikimedia.org/wiki/Search#Removing_Duplicate_Indices

It wouldn't hurt to have a more direct maintenance script for these things though.

debt triaged this task as Low priority. Jun 1 2017, 5:12 PM
debt moved this task from needs triage to search-icebox on the Discovery-Search board.
debt subscribed.

Looks like a maintenance script could fix this, maybe half a day to a day's worth of effort. It's not quite a slam-dunk easy task, but I've tagged this ticket with good first task and patch-welcome.

There's some code in maintenance/updateSuggesterIndex.php (method checkAndDeleteBrokenIndices) that does something similar, with some extra checks against the stats API. I think this code can easily be adapted to run on the main indices.
Worth noting that duplicate indices are almost always a consequence of a failed reindex.

> Worth noting that duplicate indices are almost always a consequence of a failed reindex.

Yes, I have also encountered it when I start a reindex and then realize I am not using the correct config, or am indexing the wrong wiki, or something like that. Stopping the wrong reindex does not delete the new index, and it's not easy to figure out its name if the output of the tool was not captured.
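When the tool output was not captured, the index names themselves may still help. The sketch below assumes that the numeric suffix (e.g. 1494688722) is a Unix timestamp taken when the index was created; that is an inference from the example names in this task, and it would not hold for identifiers chosen by hand. The helper name identifier_time is hypothetical.

```python
from datetime import datetime, timezone


def identifier_time(index_name: str) -> datetime:
    """Interpret the last '_'-separated part of an index name as a Unix timestamp.

    Assumption: the identifier is numeric and was generated from the clock.
    A non-numeric identifier raises ValueError.
    """
    identifier = index_name.rsplit("_", 1)[-1]
    return datetime.fromtimestamp(int(identifier), tz=timezone.utc)


# The two names from the error message in the task description.
names = ["frwiki_content_1494688722", "frwiki_content_1495187567"]

# List candidates oldest first; the last one is the most recently created.
for name in sorted(names, key=identifier_time):
    print(name, identifier_time(name).isoformat())
```

The most recently created index is not necessarily the stale one, so this only narrows down which reindex run produced which index; the alias check is still what decides which index is active.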

@Aklapper I talked with Eugene and he does; I will co-mentor with him.

@Eugene233, @Zppix: It's unclear to me which codebase this is about, and the task description misses pointers / links to documentation or example implementations. Please also see the project description of good first task for how to improve the task description. Thanks!

> It's unclear to me which codebase this is about,

CirrusSearch extension, specifically maintenance scripts. I've added the tag.

> the task description misses pointers / links to documentation or example implementations

Well, we don't exactly have any implementations of this. But if you look at pickIndexIdentifierFromOption in includes/Maintenance/ConfigUtils.php, that's how it checks whether there is more than one index. When the new script is run, it needs to check which of them is linked to the main alias (e.g. frwiki_content would be an alias for frwiki_content_1494688722), list the others, ask the user for confirmation, and delete the others if the user confirms (or if it is run with the --yes option).

The script should do this for all indexes belonging to the wiki: content, general, suggester and archive. Also check UpdateOneSearchIndexConfig, and the validateAlias function there, for the general code that handles the updates.
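The flow described above (find the index the alias points to, list the others, confirm, delete) could be sketched roughly as follows. This is an illustration in Python, not CirrusSearch code: the function names, the shape of the `aliases` mapping and the `delete` callback are all hypothetical, and the "<alias>_<identifier>" naming is inferred from the frwiki_content example. A real script would get this data from the cluster and would run once per index type.

```python
import re
from typing import Callable, Dict, Iterable, List, Set


def find_stale_indexes(alias: str, all_indexes: Iterable[str],
                       aliases: Dict[str, Set[str]]) -> List[str]:
    """Return indexes named '<alias>_<identifier>' that the alias does not point to."""
    # Identifier is assumed to contain no underscore (e.g. a timestamp).
    pattern = re.compile(re.escape(alias) + r"_[^_]+")
    candidates = sorted(n for n in all_indexes if pattern.fullmatch(n))
    active = aliases.get(alias, set())
    # Refuse to continue if the alias points at none of the candidates:
    # deleting anything in that state could remove the only usable index.
    if candidates and not active & set(candidates):
        raise RuntimeError(f"alias {alias} points to none of {candidates}")
    return [n for n in candidates if n not in active]


def cleanup(alias: str, all_indexes: Iterable[str], aliases: Dict[str, Set[str]],
            delete: Callable[[str], None], assume_yes: bool = False,
            ask: Callable[[str], str] = input) -> List[str]:
    """List stale indexes, confirm (unless assume_yes), then delete them."""
    stale = find_stale_indexes(alias, all_indexes, aliases)
    if not stale:
        return []
    if not assume_yes:
        answer = ask(f"Delete {', '.join(stale)}? [y/N] ")
        if answer.strip().lower() != "y":
            return []
    for name in stale:
        delete(name)
    return stale


# Example data; which index is active here is an assumption for illustration.
indexes = ["frwiki_content_1494688722", "frwiki_content_1495187567",
           "frwiki_general_1494688722"]
aliases = {"frwiki_content": {"frwiki_content_1494688722"}}
deleted: List[str] = []
print(cleanup("frwiki_content", indexes, aliases, deleted.append, assume_yes=True))
```

Raising when the alias matches no candidate is a deliberately conservative choice: the task description notes that deleting the wrong index may cause search downtime, so the sketch prefers to stop rather than guess.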

Thanks Smalyshev!
@Eugene233, @Zppix: If you definitely do feel knowledgeable enough to mentor this task and any potential implementation questions here, feel free to improve the task description to make clear what exactly is expected from a contributor. Then we could offer this task in Google Code-in. But that is up to your judgement.