As a user, when there is a failed update, I want quick automatic retries instead of manual retries.
As a maintainer of the wdqs streaming updater I want requests to Special:EntityData receiving a 404 response to be retried so that there are fewer items to reconcile (T279541).
There is a race between the events flowing to Kafka and MySQL replication. This race might cause an event to be processed before the data it points to is available on the MySQL replica being queried.
One simple approach to circumvent the problem would be to retry on 404. The retry could be guarded by a check on the difference between the processing time and the event time: if the difference is less than e.g. 10 seconds, a retry is performed.
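A minimal sketch of what such a guard could look like (the class and method names below are hypothetical, not the updater's actual API):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the proposed guard, not actual updater code.
final class Retry404Guard {
    // 10s threshold discussed below: enough to cover the observed replication lag.
    private static final Duration MAX_EVENT_AGE = Duration.ofSeconds(10);

    // A 404 from Special:EntityData is worth retrying only while the event is
    // still "fresh", i.e. processing_time - event_time < 10 seconds.
    static boolean shouldRetry404(Instant eventTime, Instant processingTime) {
        return Duration.between(eventTime, processingTime).compareTo(MAX_EVENT_AGE) < 0;
    }
}
```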
Looking at the side output data of the streaming updater for the first seven days of April, we see the following (the range is the delta between the ingestion time and the event time):
+-------+------+
|range  |events|
+-------+------+
|0: 0-1s|65    |
|1: 1-3s|137   |
|2: 3-5s|38    |
|3: 5-7s|9     |
+-------+------+
which translates to: over this period of Wikidata edits, 249 events failed with a 404 even though their data is actually available now (most probably due to replication lag at the time), and these events were ingested between 0 and 7 seconds after their event time.
There are 141 events for which we received a 404 and which still return a 404 now:
+--------+------+
|range   |events|
+--------+------+
|1: 1-3s |4     |
|3: 5-7s |1     |
|4: 7-10s|2     |
|5: >10s |134   |
+--------+------+
So retrying 404 responses for events with processing_time - event_time < 10 seconds seems to be the right threshold: it will add extra latency for only a few hundred events per week.
AC:
- retry 404 responses until the event time is more than 10 seconds older than the processing time (see the sketch below)
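A rough sketch of how this acceptance criterion could translate into a retry loop (the fetcher interface and all names here are made up for illustration; the real integration point in the updater will differ):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration of the acceptance criterion: keep retrying a 404
// until the event is more than 10 seconds older than the processing time.
final class EntityDataFetchWithRetry {
    private static final Duration MAX_EVENT_AGE = Duration.ofSeconds(10);
    private static final Duration BACKOFF = Duration.ofSeconds(1);

    // fetch.get() returns the entity data, or empty on a 404.
    static Optional<String> fetchEntityData(Supplier<Optional<String>> fetch, Instant eventTime)
            throws InterruptedException {
        while (true) {
            Optional<String> result = fetch.get();
            if (result.isPresent()) {
                return result;
            }
            // Re-check the event age against the current processing time before
            // every retry, so retries stop once the 10 second window has closed.
            Instant processingTime = Instant.now();
            if (Duration.between(eventTime, processingTime).compareTo(MAX_EVENT_AGE) >= 0) {
                // Event is already more than 10s old: give up and leave the
                // item to the reconciliation process (T279541).
                return Optional.empty();
            }
            Thread.sleep(BACKOFF.toMillis());
        }
    }
}
```

Re-evaluating the processing time on every attempt keeps the extra latency bounded to roughly the 10 second window, which per the numbers above should only affect a few hundred events per week.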