An analysis of the lucene-search-2 code suggests an explanation for reports that page updates are occasionally missed, leaving the previous version of a page in the index indefinitely.
The main loop of IncrementalUpdater fetches OAI records from MediaWiki, 50 pages at a time, and uses the "from" timestamp parameter to advance through the update list. After each batch of pages, it takes the date from the <responseDate> element of the response as the next value to send in the "from" parameter. This has the following flaws:
- responseDate is the time at which the response is generated. If there is replication lag, the most recent revision visible on the chosen slave may be some seconds older than responseDate. Events committed on the master during that lag window are skipped: the "from" parameter advances past their timestamps before they replicate.
- responseDate and the "from" parameter have a one-second resolution. The English Wikipedia sees about 5 edits per second at peak, so several events can share a timestamp, and an event committed later in the same second can appear in the database after IncrementalUpdater has already advanced past that timestamp. Such events are never indexed.
- Using the revision timestamp instead of responseDate would be an improvement. However, rev_timestamp and up_timestamp are generated before the transaction commits, and the commit may take an arbitrary amount of time, so rev_timestamp and up_timestamp values will typically not appear in monotonic order in the replication log. The approach would also be highly sensitive to clock skew between the Apache servers.
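The one-second-resolution flaw can be illustrated with a small sketch. This is not lucene-search-2 code; the poll() function and its exclusive-cursor behaviour are assumptions made for the illustration, standing in for a timestamp-driven OAI fetch.

```python
# Toy model of a timestamp-driven incremental poll. Timestamps have
# one-second resolution; the cursor ("from") is treated as exclusive,
# as an updater must do to avoid re-fetching the previous batch.

def poll(events, from_ts):
    """Return events strictly newer than from_ts, plus the new cursor
    (the responseDate, modelled as the newest timestamp seen)."""
    batch = [e for e in events if e["ts"] > from_ts]
    response_date = max((e["ts"] for e in events), default=from_ts)
    return batch, response_date

# First poll: two edits in second 100 are already committed.
committed = [{"id": 1, "ts": 100}, {"id": 2, "ts": 100}]
batch1, cursor = poll(committed, from_ts=99)   # fetches ids 1 and 2

# A third edit commits later within the same second 100.
committed.append({"id": 3, "ts": 100})

# Next poll starts strictly after 100, so edit 3 is never returned.
batch2, cursor = poll(committed, from_ts=cursor)
# batch2 is empty: edit 3 is permanently skipped.
```

The same shape of loss occurs with replication lag, except that the missing events are those committed on the master after the slave's newest replicated row.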
The obvious solution is to use the sequence number (resumptionToken) to advance through the update list, instead of the timestamp.
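The fix can be sketched as follows. The FakeOAIServer class and fetch_all() function are illustrative assumptions, not the actual lucene-search-2 or OAI repository interface; the point is that an opaque, monotonically increasing sequence token has no resolution or clock-skew problems, and the harvest terminates when the server stops returning a token.

```python
class FakeOAIServer:
    """Toy OAI-style repository: records are ordered by a strictly
    increasing sequence number, paged via an opaque resumption token."""

    def __init__(self, records, page_size=2):
        self.records = records          # list of (seq, payload) pairs
        self.page_size = page_size

    def list_records(self, token):
        """Return (payloads, next_token); next_token is None when the
        update list is exhausted."""
        start = 0 if token is None else token
        page = self.records[start:start + self.page_size]
        done = start + self.page_size >= len(self.records)
        next_token = None if done else start + self.page_size
        return [payload for _, payload in page], next_token


def fetch_all(server, token=None):
    """Drain the update list, resuming from `token`. Because the token
    identifies a position rather than a time, no event can be skipped
    by lag, same-second commits, or clock skew."""
    while True:
        batch, token = server.list_records(token)
        yield from batch
        if token is None:               # no resumptionToken: list complete
            break
```

A persistent updater would store the last token it processed and pass it back on restart, exactly as it currently stores the "from" timestamp.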