
Upgrade saneitizer to constantly re-index documents
Closed, ResolvedPublic


When adding new fields to the search indices, or making minor changes to how the content of a field is generated, the only way currently to ensure that change has been applied to everything is to run a long (week+) maintenance script that rebuilds the indices from the database. This is generally avoided (it hasn't been run on the largest wikis in years), which means we have documents that are missing fields they should have.

Resolve this by constantly reindexing documents. This must provide a guarantee on the oldest possible last indexed date. It would be nice to have finer grained information about when documents were indexed and proportions after some deployment date, but only a guarantee on the oldest possible indexed document is required.

Event Timeline

EBernhardson created this task.

Sample Indexed Document:

    field1: 10,

assuming handlers:

    field1: within 10
    version: versionDoc
    indexTimestamp: newType

Expected Result, where update means to merge all provided source fields into the most recently indexed document source:

ex1 | { field1: 10, version: 4, indexTimestamp: 4 } | update | version (revision id) changed
ex2 | { field1: 30, version: 4, indexTimestamp: 4 } | update | field1 and version changed
ex3 | { field1: 30, indexTimestamp: 4 }             | update | field1 changed
ex4 | { field1: 12, indexTimestamp: 4 }             | NOOP   | field1 didn't change enough
ex5 | { field1: 10, version: 3, indexTimestamp: 3 } | NOOP   | document is exactly the same
ex6 | { field1: 11, version: 3, indexTimestamp: 4 } | NOOP   | only field1 changed, but not by enough; should not happen in theory
ex7 | { field1: 10, version: 3, indexTimestamp: 4 } | NOOP   | only indexTimestamp changed; see below
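The decision logic in the examples above can be sketched roughly as follows. This is a hypothetical model of the handler semantics (`within 10`, `versionDoc`, `newType`), not the actual noop plugin code:

```python
def decide(stored: dict, incoming: dict) -> str:
    """Model of the noop decision: apply the update only if some field
    changed beyond its handler's tolerance.

    - field1 uses 'within 10': only a change of more than 10 counts.
    - version uses 'versionDoc': any change counts.
    - indexTimestamp uses 'newType': it is written when an update
      happens, but never triggers an update by itself (ex7).
    """
    changed = False
    if "version" in incoming and incoming["version"] != stored.get("version"):
        changed = True
    if "field1" in incoming and abs(incoming["field1"] - stored["field1"]) > 10:
        changed = True
    return "update" if changed else "NOOP"

stored = {"field1": 10, "version": 3, "indexTimestamp": 3}
decide(stored, {"field1": 10, "version": 4, "indexTimestamp": 4})  # ex1: "update"
decide(stored, {"field1": 12, "indexTimestamp": 4})                # ex4: "NOOP"
```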

Everything from ex1-ex3 seems pretty straightforward; ex4-ex7 is where we have an open question. Start first with the goal: we want to add new fields, or make small changes to how a field is generated, and have a guarantee that after some time the new/updated value has been indexed for all documents.

The high level idea on the plan is:

  • Add a property to all documents containing the date the document was last indexed
  • As the saneitizer loops through all the documents on its 14-day cycle, check the last indexed date. If a document hasn't been indexed in the last N days, issue a reindex job.

When the saneitizer tries to index the document we have two primary states to think about:

  • Something changed. Great! index the new version of the document.
  • Nothing changed. Probably the most common case. If we ship this to elasticsearch anyway, the last indexed timestamp will still differ, so we will not only be checking that documents have the latest values but also performing a (from the user's perspective) no-op update to all these documents.

Whether all these updates matter depends on the volume. If we are talking about 10 docs/s it doesn't really matter; at 10k/s it might be a bit of a problem. A quick look at the last edit timestamp shows the following counts across all indexes for the last edited date.

index               | last week | last 2 weeks | last 4 weeks | last 8 weeks | all docs
everything in codfw | 5,995k    | 11,487k      | 20,066k      | 34,790k      | 332,019k

Data collected with:

curl search.svc.codfw.wmnet:9200/_search -d '{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                { "range": { "timestamp": { "gte": "now-7d" } } },
                { "type": { "value": "page" } }
            ]
        }
    }
}'

For the full cluster this is roughly 6% of pages updated in the last 28 days; even pushing out to 8 weeks only gets us to 10%. For purposes of estimation, 94% of pages being updated through this process and 100% are about the same, so let's assume this process adds an additional 332M indexing operations per 28 days which, in the most common case (after the first loop brings everything up to date), will only update the last indexed timestamp. If well distributed, that works out to 140 updates per second. Counting 3x for replicas, let's round that up to 500 updates per second.
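As a quick sanity check on that arithmetic, using the figures from the table above:

```python
docs = 332_019_000           # all docs in codfw
period = 28 * 24 * 3600      # 28-day cycle in seconds

primary_rate = docs / period          # updates/s on primaries
with_replicas = primary_rate * 3      # count 3x for replicas

print(round(primary_rate))    # ~137/s, i.e. the ~140 quoted above
print(round(with_replicas))   # ~412/s, rounded up to ~500/s
```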

Current production load is a bit harder to calculate since, iirc, elasticsearch's reported indexing rate is all over the place. The dashboard shows ~1.6k/s typical, ~6k/s peak (popularity import), and >10k/s overall peak (daily completion suggester rebuild, not comparable to typical indexing operations). We can take a rough guess from the cirrus metrics though, which report a noop rate of 60-75%. Using the 1.4-1.6k/s baseline (the 6k/s from the popularity import likely has a higher noop rate, but that's not yet imported into prometheus to check), this gives a "true" indexing rate of around 400-650 docs/s.
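Back-of-the-envelope for that "true" indexing rate, combining the ends of the reported baseline and noop ranges (illustrative only):

```python
baseline = (1400, 1600)   # reported ~1.4-1.6k/s indexing rate
noop = (0.60, 0.75)       # cirrus-reported noop fraction

# non-noop ("true") rate at each combination of the range endpoints
low = baseline[0] * (1 - noop[1])    # 1400 * 0.25 = 350/s
high = baseline[1] * (1 - noop[0])   # 1600 * 0.40 = 640/s
# roughly the 400-650 docs/s range quoted above, depending on
# which end of each range you take
```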

What I'm not sure about is whether that indexing load matters. 30% feels minimal, and we actually have indexing throttled back, with only 6 threads per server in the indexing thread pool. Maybe we don't worry about it and let the timestamp be updated even when it's the only thing that changed. In that case, though, we know the document can't be nooped, so maybe we simply disable the noop script for most updates, reserving it for the job that counts incoming_links and pushes those updates into elastic?

Strike last paragraph, as I've updated the numbers taking noop reporting into account. Updating all documents on an approximately 28-day cycle would increase our current rate of 400-650/s by 500/s, roughly doubling the indexing rate. It's important to remember that no matter which solution we choose for the common case of no updates, in the case where the process is correctly adding/modifying document fields this doubling of the update rate will apply. Our only lever there is really the time to update; pushing it out to 8 weeks would bring the increase down to 250/s, which is something I'm comfortable shipping to the prod clusters.

Another option for ex7 is that we noop a document that would only update the last indexed date. Without updating the last indexed date, every time the saneitizer passes by a document (every 14 days) that was not edited in the last 28 days, a new document will need to be generated and sent to elasticsearch for comparison. Following the previous estimates, that works out to around 332M docs per 14 days, or 275 docs/s that need to be generated and sent to elasticsearch to make the noop decision. That should be relatively cheap though, as the parser output for old revisions should come from the on-disk parser caches, and I'm pretty sure we push the load significantly harder when using forceSearchIndex. The problem with this solution, though, is that roughly 95% of documents will be beyond the 28-day threshold, meaning a deploy with a new property will rebuild all 330M documents in a single 14-day saneitizer cycle, giving an increase to the indexing rate of almost 1000 operations per second (vs a baseline of 400-650).
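The option-2 numbers can be checked the same way, again using the assumed figures from above:

```python
docs = 332_019_000
cycle = 14 * 24 * 3600        # one 14-day saneitizer loop, in seconds

regen_rate = docs / cycle     # docs/s regenerated for the noop comparison
# ≈ 274.5, i.e. the ~275 docs/s quoted above

first_cycle_ops = regen_rate * 3   # first cycle indexes everything, 3x replicas
# ≈ 823 ops/s, i.e. "almost 1000" on top of the 400-650/s baseline
```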

Overall the numbers involved all seem significant no matter how we slice it. To summarize the two options:

1 - Update documents even if the only change is the last indexed timestamp


Pros:

  • No change to noop plugin.
  • Relatively minor changes to CirrusSearch for basic implementation. The new property gets no special handling, and the saneitizer needs a little work to check it.


Cons:

  • Noop basically becomes pointless for regular document updates, relegating it to special cases like popularity_score/incoming_links updates and other index handling. We likely need to update CirrusSearch to stop asking for these to be nooped (or maybe we already do? not sure).
  • Unless specifically handled, the first run through will give all documents a last indexed timestamp in a single 14-day cycle, and future saneitizer runs will maintain that clumping of updates regardless of whether we set the max last indexed date at 28 days or 200 days.

2 - Don't update documents if the only change is the last indexed timestamp


Pros:

  • Reduced load when nothing is changing


Cons:

  • Requires updates to noop plugin
  • Unless specifically handled, all updates will be applied in a single 14-day saneitizer cycle. This is basically the same problem as option 1, but forever, instead of a special case to spread out the first run.


I'm tempted to push N from 28 to 56 and go with the simpler option 1. The update rate at 28 days feels too high, but cutting it in half and accepting a 2-month wait seems a plausible middle ground?

@dcausse @Gehel thoughts?

EBernhardson renamed this task from Add handler to super detect noop to update a field only if other updates are applied to Upgrade saneitizer to constantly re-index documents.Sep 7 2018, 4:13 PM
EBernhardson updated the task description. (Show Details)

After talking to @dcausse on irc:

3 - Option 1 but better

While the saneitizer is looping through documents, when loopId % N == pageId % N, queue a job to reindex the page. N then controls the number of saneitizer loops necessary to reindex everything. Saneitizer loops take 2 weeks, so N = 4 would guarantee all documents have been reindexed within the last 14 * 4 days. Nothing needs to explicitly track when documents were last indexed, so we don't need the pointless updates options 1 and 2 were trying to deal with. We won't have direct insight into which documents have been reindexed, although it could be determined from the current loopId if truly necessary.


Pros:

  • No new properties to store
  • Current page update jobs stay exactly the same
  • Saneitizer change is simple and straightforward


Cons:

  • No direct record of the last time a document was indexed
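The selection rule for option 3 can be sketched as follows. This is a hypothetical illustration of the loopId % N == pageId % N idea, not the actual CirrusSearch code:

```python
def should_reindex(page_id: int, loop_id: int, n: int = 4) -> bool:
    """Queue a reindex job for this page on this saneitizer loop?

    With 14-day loops and n == 4, every page is reindexed at least
    once every 14 * 4 = 56 days, with no per-document timestamp to
    store or update.
    """
    return loop_id % n == page_id % n

# Over any n consecutive loops, each page is selected exactly once,
# so the reindex load spreads evenly across the n loops.
n = 4
for page_id in range(100):
    hits = sum(should_reindex(page_id, loop_id, n) for loop_id in range(n))
    assert hits == 1
```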

Change 458897 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Update saneitizer to constantly re-index documents

Change 458897 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Update saneitizer to constantly re-index documents