Maniphest T203622

Upgrade saneitizer to constantly re-index documents
Closed, Resolved (Public)
Assigned To

Authored By

	EBernhardson
	Sep 5 2018, 11:56 PM

Description

When adding new fields to the search indices, or making minor changes to how the content of a field is generated, the only way currently to ensure that the change has been applied to everything is to run a long (week+) maintenance script to rebuild the indices from the database. This is generally avoided (it has not been run on the largest wikis in years), which means we have documents that don't have all the fields they should.

Resolve this by constantly reindexing documents. This must provide a guarantee on the oldest possible last indexed date. It would be nice to have finer grained information about when documents were indexed and proportions after some deployment date, but only a guarantee on the oldest possible indexed document is required.

Details

	Subject	Repo	Branch	Lines +/-
	Update saneitizer to constantly re-index documents	mediawiki/extensions/CirrusSearch	master	+128 -26

Related Objects

Status	Assigned	Task
Resolved	Gehel	T147505 [tracking] CirrusSearch: what is updated during re-indexing
Declined	None	T200516 Re-index all pages on all wikis (insource and contentmodel don't play well together)
Resolved	EBernhardson	T203622 Upgrade saneitizer to constantly re-index documents

Event Timeline

EBernhardson triaged this task as Medium priority. Sep 5 2018, 11:56 PM

EBernhardson created this task.

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). Sep 5 2018, 11:56 PM

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board. Sep 5 2018, 11:56 PM

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Sample Indexed Document:

{
    field1: 10,
    version: 3,
    indexTimestamp: 3
}

assuming handlers:

{
    field1: within 10,
    version: versionDoc,
    indexTimestamp: newType
}

Expected Result, where update means to merge all provided source fields into the most recently indexed document source:

id	update	action	reason
ex1	`{ field1: 10, version: 4, indexTimestamp: 4 }`	update	version (revision id) changed
ex2	`{ field1: 30, version: 4, indexTimestamp: 4 }`	update	field1 and version changed
ex3	`{ field1: 30, indexTimestamp: 4 }`	update	field1 changed
ex4	`{ field1: 12, indexTimestamp: 4 }`	NOOP	field1 didn't change enough
ex5	`{ field1: 10, version: 3, indexTimestamp: 3 }`	NOOP	document is exactly the same
ex6	`{ field1: 11, version: 3, indexTimestamp: 4 }`	NOOP	only field1 changed, but not by enough. Should not happen in theory.
ex7	`{ field1: 10, version: 3, indexTimestamp: 4 }`	NOOP	only indexTimestamp changed. See below

Everything from ex1 to ex3 seems pretty straightforward. ex4 to ex7 is where we have an open question. Start with the goal: we want to add new fields, or make small changes to how a field is generated, and have a guarantee that after some time the new/updated value has been indexed for all documents.
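To make the table concrete, here is a minimal sketch of the noop decision it describes. The handler semantics are assumptions read off the table, not confirmed behaviour of the noop plugin: `field1` counts as changed only when it moves by more than 10, `version` counts as changed when it differs, and `indexTimestamp` never triggers an update on its own. The function name `decide` is hypothetical.

```python
from typing import Any, Dict, Tuple

# The sample indexed document from above.
INDEXED = {"field1": 10, "version": 3, "indexTimestamp": 3}

# The seven incoming updates from the table.
EXAMPLES = {
    "ex1": {"field1": 10, "version": 4, "indexTimestamp": 4},
    "ex2": {"field1": 30, "version": 4, "indexTimestamp": 4},
    "ex3": {"field1": 30, "indexTimestamp": 4},
    "ex4": {"field1": 12, "indexTimestamp": 4},
    "ex5": {"field1": 10, "version": 3, "indexTimestamp": 3},
    "ex6": {"field1": 11, "version": 3, "indexTimestamp": 4},
    "ex7": {"field1": 10, "version": 3, "indexTimestamp": 4},
}


def decide(indexed: Dict[str, Any], update: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return ("update", merged_doc) or ("NOOP", unchanged_doc)."""
    # Assumed "within 10" semantics: a difference of 10 or less is ignored.
    field1_changed = (
        "field1" in update and abs(update["field1"] - indexed["field1"]) > 10
    )
    # Assumed version semantics: any different value counts as a change.
    version_changed = "version" in update and update["version"] != indexed["version"]

    # Assumed indexTimestamp semantics: it is only written when another
    # field triggers the update, so it is not consulted here.
    if not (field1_changed or version_changed):
        return "NOOP", dict(indexed)

    # "update" merges all provided source fields into the indexed source.
    merged = dict(indexed)
    merged.update(update)
    return "update", merged


if __name__ == "__main__":
    for name, doc in EXAMPLES.items():
        action, _ = decide(INDEXED, doc)
        print(name, action)
```

Under these assumptions ex1 to ex3 come out as updates and ex4 to ex7 as NOOPs, matching the action column; whether ex7 should really be a NOOP is the open question discussed below.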

The high-level plan is:

Add a property to all documents containing the date the document was last indexed.
As the saneitizer loops through all the documents on its 14-day cycle, check the last indexed date. If the document hasn't been indexed in the last N days, issue a reindex job.
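The check in the second step could look roughly like the sketch below. The function name `needs_reindex` is hypothetical, and treating a document without the property as stale is an assumption (it would cover documents indexed before the property existed).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_reindex(last_indexed: Optional[datetime], now: datetime, max_age_days: int) -> bool:
    """True when the document was last indexed more than max_age_days ago."""
    if last_indexed is None:
        # Assumption: a missing last-indexed date means "never indexed
        # since the property was introduced", so reindex it.
        return True
    return now - last_indexed > timedelta(days=max_age_days)


if __name__ == "__main__":
    now = datetime(2018, 9, 5, tzinfo=timezone.utc)
    recent = now - timedelta(days=10)
    stale = now - timedelta(days=40)
    print(needs_reindex(recent, now, 28))
    print(needs_reindex(stale, now, 28))
    print(needs_reindex(None, now, 28))
```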

When the saneitizer tries to index the document we have two primary states to think about:

Something changed. Great! Index the new version of the document.
Nothing changed. Probably the most common case. If we ship this to elasticsearch the last indexed timestamp will be different. We will not only be checking that documents have the latest values but also performing a, from the user perspective, no-op update to all these documents.

Whether all these updates matter depends on the volume. If we are talking about 10 docs/s it doesn't really matter. If we are talking about 10k docs/s it might be a bit of a problem. A quick look at the last edit timestamp gives the following document counts across all indexes, by last edited date.

index	last week	last 2 weeks	last 4 weeks	last 8 weeks	all docs
everything in codfw	5,995k	11,487k	20,066k	34,790k	332,019k

Data collected with:

curl search.svc.codfw.wmnet:9200/_search -d '{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                { "range": { "timestamp": { "gte": "now-7d" } } },
                { "type": { "value": "page" } }
            ]
        }
    }
}'
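Only the 7-day query is shown above. As a sketch, the request bodies for the other table columns could be generated as follows, assuming they differ only in the date-math window (`now-14d`, `now-28d`, `now-56d`); the function name `edited_since_query` is hypothetical. The sketch only builds the JSON and does not contact the cluster.

```python
import json


def edited_since_query(window: str) -> dict:
    """Build a count-only query for pages edited within the given window."""
    return {
        "size": 0,  # only the hit count is needed, not the documents
        "query": {
            "bool": {
                "filter": [
                    {"range": {"timestamp": {"gte": f"now-{window}"}}},
                    {"type": {"value": "page"}},
                ]
            }
        },
    }


if __name__ == "__main__":
    # Windows assumed from the table columns: 1, 2, 4 and 8 weeks.
    for window in ("7d", "14d", "28d", "56d"):
        print(json.dumps(edited_since_query(window)))
```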

For the full cluster this is basically 6% of pages updated in the last 28 days. Even pushing out to 8 weeks only gets us 10%. For purposes of estimation, 94% of pages being updated through this process and 100% are about the same, so let's assume this process adds an additional 332M indexing operations per 28 days which, in the most common case (after the first loop to get everything up to date), will only update the last indexed timestamp. That works out to, if well distributed, 140 updates per second. Counting 3x for replicas, let's round that up to 500 updates per second.
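The arithmetic behind these figures, as a small sketch using the counts from the table. The variable and function names are illustrative only; the factor of 3 is the replica multiplier used in the text.

```python
SECONDS_PER_DAY = 86_400

# Counts from the table above ("everything in codfw").
all_docs = 332_019_000
edited_last_4_weeks = 20_066_000
edited_last_8_weeks = 34_790_000


def updates_per_second(docs: int, cycle_days: int) -> float:
    """Average rate if `docs` updates are spread evenly over `cycle_days`."""
    return docs / (cycle_days * SECONDS_PER_DAY)


share_4_weeks = edited_last_4_weeks / all_docs   # roughly 6%
share_8_weeks = edited_last_8_weeks / all_docs   # roughly 10%

primary_rate = updates_per_second(all_docs, 28)  # roughly 137/s, "140" in the text
with_replicas = primary_rate * 3                 # roughly 412/s, rounded up to 500

if __name__ == "__main__":
    print(f"edited in last 4 weeks: {share_4_weeks:.1%}")
    print(f"edited in last 8 weeks: {share_8_weeks:.1%}")
    print(f"updates/s over 28 days: {primary_rate:.0f}")
    print(f"updates/s counting 3x for replicas: {with_replicas:.0f}")
```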

Current production load is a bit harder to calculate as, iirc, elasticsearch's reported indexing rate is all over the place. The dashboard shows ~1.6k/s typical, ~6k/s peak (popularity import), and >10k/s overall peak (daily comp suggest rebuild, not comparable to typical indexing operations). We can take a rough guess from the cirrus metrics though, which report a noop rate of 60-75%. Using the 1.4-1.6k/s baseline (the 6k/s from the popularity import likely has a higher noop rate, but that's not yet imported into prometheus to check), this gives a "true" indexing rate of around 400-650 docs/s.

What I'm not sure about is if that indexing load matters. 30% feels minimal, and we actually have indexing throttled back with only 6 threads per server in the indexing thread pool. Maybe we don't worry about it and let the timestamp be updated even if it's the only thing that changed. In that case though we know the document can't be nooped, so maybe we simply disable the noop script for most updates, reserving it only to the job that counts incoming_links and pushes those updates into elastic?

Strike the last paragraph, as I've updated the numbers to take noop reporting into account. Updating all documents on an approximately 28 day cycle would increase our current rate of 400-650/s by 500/s, or roughly double the indexing rate. It's important to remember that no matter which solution we go with for the common case of no updates, this doubling of the update rate will apply whenever the process is correctly adding/modifying document fields. Our only lever there is really the time to update: pushing it out to 8 weeks would bring the increase down to 250/s, which is something I'm comfortable shipping to the prod clusters.
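A rough sketch of that lever: starting from the rounded 500/s figure for a 28 day cycle and the 400-650/s baseline, the added rate scales inversely with the cycle length. The function name `added_rate` is hypothetical.

```python
BASELINE_LOW, BASELINE_HIGH = 400.0, 650.0  # estimated "true" indexing rate, docs/s


def added_rate(cycle_days: int, rate_at_28_days: float = 500.0) -> float:
    """Extra updates/s when the full reindex is spread over cycle_days."""
    return rate_at_28_days * 28 / cycle_days


if __name__ == "__main__":
    for cycle_days in (28, 56):
        extra = added_rate(cycle_days)
        # Relative increase against the high and low ends of the baseline.
        low = extra / BASELINE_HIGH
        high = extra / BASELINE_LOW
        print(f"{cycle_days} days: +{extra:.0f}/s, "
              f"an increase of {low:.0%} to {high:.0%} over baseline")
```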

Another option for ex7 is that we noop a document that would only update the last indexed date. Because the last indexed date is then not updated, every time the saneitizer passes by a document (every 14 days) that was not edited in the last 28 days, a new document will need to be generated and sent to elasticsearch for comparison. Following the previous estimates, that works out to around 332M docs per 14 days, or 275 docs/s that need to be generated and sent to elasticsearch to make the noop decision. That should be relatively cheap though, as the parser output for old revisions should come from the on-disk parser caches, and I'm pretty sure we push the load significantly more when using forceSearchIndex. The problem with this solution is that roughly 95% of documents will be beyond the 28 day threshold, meaning a deploy with a new property will build all 330M documents in a single 14 day saneitizer cycle, giving an increase to the indexing rate of almost 1000 operations per second (vs the baseline of 400-650).

Overall the numbers involved all seem significant no matter how we slice it. To summarize the two options:

1 - Update documents even if the only change is the last indexed timestamp

Pros:

No change to noop plugin.
Relatively minor changes to CirrusSearch for basic implementation. New property gets no special handling, and saneitizer needs a little work to check it.

Cons:

Noop basically becomes pointless for regular document updates, relegating it to special cases like popularity_score/incoming_links updates and other index handling. Likely need to update CirrusSearch to stop asking for these to be nooped (or maybe we already do? not sure).
Unless specifically handled, the first run through will give all documents a last indexed timestamp in a single 14 day cycle, and future saneitizer runs will then maintain that clumping of updates regardless of whether we set the max last indexed date at 28 days or 200 days.

2 - Don't update documents if the only change is the last indexed timestamp

Pros:

Reduced load when nothing is changing

Cons:

Requires updates to noop plugin
Unless specifically handled all updates will be applied in a single 14 day saneitizer cycle. This is basically the same problem as option 1, but forever instead of a special case to spread out the first run.

Suggestion?

I'm tempted to push N from 28 to 56 and go with the simpler option 1. The update rate at 28 days feels too high, but cutting it in half and forcing a 2 month wait seems a plausible middle ground?

@dcausse @Gehel thoughts?

EBernhardson renamed this task from Add handler to super detect noop to update a field only if other updates are applied to Upgrade saneitizer to constantly re-index documents. Sep 7 2018, 4:13 PM

EBernhardson updated the task description.

EBernhardson merged a task: T192616: Update saneitizer to reindex documents that havn't been indexed in N days. Sep 7 2018, 4:20 PM

EBernhardson added a subscriber: TJones.

After talking to @dcausse on irc:

3 - Option 1 but better

While saneitizer is looping through documents, when loopId % N == pageId % N queue a job to reindex the page. N then controls the number of saneitizer loops necessary to reindex everything. Saneitizer loops are at 2 weeks, so N = 4 would guarantee all documents have been reindexed in the last 14 * 4 days. Nothing needs to explicitly track when documents were last indexed, so we don't need the pointless updates options 1 and 2 were trying to deal with. We won't have direct insight into what documents are reindexed, although it could be determined from the current loopId if truly necessary.

Pros:

No new properties to store
Current page update jobs stay exactly the same
Saneitizer change is simple and straightforward

Cons:

No direct record of the last time a document was indexed
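The selection rule from option 3 can be sketched as below. This is an illustration of the `loopId % N == pageId % N` idea only, not the code from the patch; the function name `should_reindex` and the sample page ids are hypothetical.

```python
def should_reindex(loop_id: int, page_id: int, n: int) -> bool:
    """Select a page for reindexing on the loops where the remainders match."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return loop_id % n == page_id % n


if __name__ == "__main__":
    n = 4                      # with 14 day loops, everything is covered in 14 * 4 days
    page_ids = range(1, 101)   # hypothetical page ids
    seen = set()
    for loop_id in range(n):
        batch = {p for p in page_ids if should_reindex(loop_id, p, n)}
        print(f"loop {loop_id}: {len(batch)} pages queued")
        seen |= batch
    # After n consecutive loops every page has been queued exactly once.
    print("all pages covered:", seen == set(page_ids))
```

Because the remainder of `pageId` is fixed, each page is picked once in every N consecutive loops, and evenly distributed page ids spread the work evenly across loops without storing a last indexed date.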

Change 458897 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Update saneitizer to constantly re-index documents

https://gerrit.wikimedia.org/r/458897

gerritbot added a project: Patch-For-Review. Sep 7 2018, 8:51 PM

EBernhardson moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Sep 7 2018, 9:00 PM

EBernhardson mentioned this in T200516: Re-index all pages on all wikis (insource and contentmodel don't play well together). Sep 7 2018, 9:03 PM

EBernhardson mentioned this in T155523: re-index multimedia files after deployment of ogg filetype detection updates. Sep 10 2018, 6:01 PM

EBernhardson mentioned this in T195071: Add chronological sorting by-page-creation-timestamp for search results. Sep 11 2018, 4:57 PM

EBernhardson mentioned this in T195192: Missing search results when using contentmodel filter. Sep 11 2018, 5:23 PM

EBernhardson mentioned this in T164288: CirrusSearch should be able to keep its index upto date in most cases. Sep 11 2018, 5:26 PM

Change 458897 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Update saneitizer to constantly re-index documents

https://gerrit.wikimedia.org/r/458897

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)). Sep 18 2018, 4:00 PM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Sep 19 2018, 10:52 PM