Page MenuHomePhabricator

Revisions missing from mediawiki_revision_create
Closed, ResolvedPublic

Description

Investigating what data sources to use for T201656, I considered options for getting data more frequently than what wmf.mediawiki_history would enable. One idea was to use the MW database, another was to use event.mediawiki_revision_create. I wanted to check whether mediawiki_revision_create stores all revisions or not. Comparing the random month of October 2018 against the history table for English Wikipedia, I found 70,719 revisions were missing (or 1.5%). Investigating some of them in the MW database, I did not find a clear pattern, but on the other hand I have not been digging very deeply.

I'm unsure about what the expected reliability of an EventBus table should be, so I thought I'd open this task to discuss it.

Update

After improvements to envoy proxies, we found that we are missing 0.0014% of events. The remaining missing events don't seem possible to fix in the current architecture, we would need some other way to send them more reliably or reconcile them later.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
fdans moved this task from Incoming to Data Quality on the Analytics board.

@JAllemandou, is that something you're working on? cc @Milimetric

mforns raised the priority of this task from High to Needs Triage.Apr 22 2019, 3:40 PM
mforns triaged this task as High priority.

I'm not working on this (yet?) - Seems event related.

@Ottomata / @Pchelolo: short description is event.mediawiki_revision_create is missing about 1.5% of the revisions that are in the revision table. My first thought was that maybe page restores create revision rows in the revision table without firing mediawiki_revision_create. Then I wondered if this should fire an event, so I figured I'd ping you anyway. Do you have plans for like revision_restored events [1]? Or any other thoughts on how to deal with this discrepancy? Also, it's of course possible but less likely that events are being dropped.

[1] right now there are no restore/delete events at the revision level: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki/revision

@nettrom_WMF
Hi, can you provide the query you used to get that data, so we can more easily determine where the problem is? Thanks!

mforns raised the priority of this task from High to Needs Triage.Nov 21 2019, 6:13 PM
mforns triaged this task as High priority.

Now is my time for this. Here is some data for simplewiki only in presto for July and August 2020:

  • July
select count(distinct rev_id) from event.mediawiki_revision_create where year = 2020 and ((month = 6 and day = 30) OR (month = 7) OR (month = 8 and day = 1)) and rev_timestamp like '2020-07%' and database = 'simplewiki';
 _col0 
-------
 38771 
(1 row)

select count(distinct rev_id) from wmf_raw.mediawiki_revision where snapshot = '2020-07' and wiki_db = 'simplewiki' and rev_timestamp like '202007%';
 _col0 
-------
 37491 
(1 row)

select count(distinct ar_rev_id) from wmf_raw.mediawiki_archive where snapshot = '2020-07' and wiki_db = 'simplewiki' and ar_timestamp like '202007%';
 _col0 
-------
  2030 
(1 row)

(37491 + 2030) - 38771 = 750 revisions missing for July 2020.

  • August
select count(distinct rev_id) from event.mediawiki_revision_create where year = 2020 and ((month = 7 and day = 31) OR (month = 8) OR (month = 9 and day = 1)) and rev_timestamp like '2020-08%' and database = 'simplewiki';
 _col0 
-------
 34087 
(1 row)

select count(distinct rev_id) from wmf_raw.mediawiki_revision where snapshot = '2020-08' and wiki_db = 'simplewiki' and rev_timestamp like '202008%';
 _col0 
-------
 32380 
(1 row)

select count(distinct ar_rev_id) from wmf_raw.mediawiki_archive where snapshot = '2020-08' and wiki_db = 'simplewiki' and ar_timestamp like '202008%';
 _col0 
-------
  2705 
(1 row)

(32380 + 2705) - 34087 = 998 revisions missing for August.

Things I have checks:

  • Similar ratio of revision made by anonymous users vs registered users between missing revisions and not missing ones. Doens't seem anonymous/registered related
  • Similar ratio of namespaces - doesn't seem namespace related
  • Missing revisions are usually happening by small bulks in both revision_id/time (100 rows, can show all if needed):
+---------------------+----------------+                                        
|hudi_event_timestamp |hudi_revision_id|
+---------------------+----------------+
|2020-07-01 03:47:00.0|7013511         |
|2020-07-01 03:47:00.0|7013512         |
|2020-07-01 14:43:02.0|7025295         |
|2020-07-01 17:24:12.0|7023563         |
|2020-07-01 18:04:52.0|7014298         |
|2020-07-01 18:04:52.0|7014299         |
|2020-07-01 18:04:52.0|7014300         |
|2020-07-01 18:04:52.0|7014301         |
|2020-07-01 18:16:19.0|7014313         |
|2020-07-01 18:16:20.0|7014314         |
|2020-07-01 18:16:59.0|7014316         |
|2020-07-01 18:16:59.0|7014317         |
|2020-07-01 18:17:00.0|7014318         |
|2020-07-01 18:17:00.0|7014319         |
|2020-07-01 18:17:20.0|7014322         |
|2020-07-01 18:17:20.0|7014323         |
|2020-07-01 18:17:21.0|7014324         |
|2020-07-01 18:17:21.0|7014325         |
|2020-07-01 18:17:41.0|7014326         |
|2020-07-01 18:17:41.0|7014327         |
|2020-07-01 18:18:20.0|7014328         |
|2020-07-01 18:18:21.0|7014329         |
|2020-07-01 18:21:05.0|7014340         |
|2020-07-01 18:21:05.0|7014341         |
|2020-07-01 20:22:37.0|7014496         |
|2020-07-01 20:22:38.0|7014497         |
|2020-07-01 21:07:12.0|7014578         |
|2020-07-01 21:07:13.0|7014579         |
|2020-07-01 21:07:13.0|7014580         |
|2020-07-01 21:07:13.0|7014581         |
|2020-07-01 22:00:41.0|7014667         |
|2020-07-02 12:10:11.0|7015244         |
|2020-07-02 12:10:11.0|7015245         |
|2020-07-02 13:20:18.0|7015316         |
|2020-07-02 15:33:01.0|7015455         |
|2020-07-02 15:33:02.0|7015457         |
|2020-07-02 15:33:32.0|7015459         |
|2020-07-02 17:59:26.0|7015621         |
|2020-07-02 23:18:03.0|7021670         |
|2020-07-03 00:25:20.0|7016025         |
|2020-07-03 01:22:05.0|7021668         |
|2020-07-03 04:49:09.0|7016147         |
|2020-07-03 04:49:09.0|7016148         |
|2020-07-03 13:45:46.0|7016538         |
|2020-07-03 13:48:02.0|7016554         |
|2020-07-03 13:48:02.0|7016555         |
|2020-07-03 17:00:10.0|7016867         |
|2020-07-03 17:00:15.0|7016869         |
|2020-07-03 17:00:15.0|7016871         |
|2020-07-03 17:00:20.0|7016874         |
|2020-07-03 17:00:21.0|7016876         |
|2020-07-03 17:00:53.0|7016878         |
|2020-07-03 17:00:54.0|7016880         |
|2020-07-03 17:09:54.0|7016901         |
|2020-07-03 17:10:07.0|7016904         |
|2020-07-03 17:10:16.0|7016906         |
|2020-07-03 17:11:49.0|7016910         |
|2020-07-03 17:25:45.0|7016931         |
|2020-07-03 17:28:34.0|7016940         |
|2020-07-03 17:29:10.0|7016941         |
|2020-07-03 17:29:32.0|7016942         |
|2020-07-03 17:29:32.0|7016943         |
|2020-07-03 17:33:34.0|7016955         |
|2020-07-03 18:00:36.0|7021672         |
|2020-07-03 20:31:46.0|7032969         |
|2020-07-03 20:49:38.0|7033012         |
|2020-07-03 21:11:12.0|7017212         |
|2020-07-03 23:00:55.0|7017325         |
|2020-07-04 01:53:41.0|7017466         |
|2020-07-04 01:53:41.0|7017467         |
|2020-07-04 13:06:25.0|7018133         |
|2020-07-04 13:06:25.0|7018134         |
|2020-07-04 17:00:27.0|7018257         |
|2020-07-04 17:00:28.0|7018258         |
|2020-07-04 18:44:52.0|7018334         |
|2020-07-04 18:44:52.0|7018335         |
|2020-07-04 19:07:34.0|7018359         |
|2020-07-04 19:07:34.0|7018360         |
|2020-07-04 22:23:46.0|7018853         |
|2020-07-04 22:23:46.0|7018854         |
|2020-07-04 22:52:17.0|7018953         |
|2020-07-04 22:52:17.0|7018954         |
|2020-07-04 23:09:07.0|7019128         |
|2020-07-04 23:09:07.0|7019129         |
|2020-07-05 02:45:22.0|7019780         |
|2020-07-05 02:45:22.0|7019781         |
|2020-07-05 02:45:34.0|7019782         |
|2020-07-05 02:45:34.0|7019783         |
|2020-07-05 03:42:44.0|7019821         |
|2020-07-05 03:42:44.0|7019822         |
|2020-07-05 03:58:08.0|7019830         |
|2020-07-05 03:58:08.0|7019831         |
|2020-07-05 06:03:42.0|7021674         |
|2020-07-05 07:03:56.0|7019956         |
|2020-07-05 07:23:00.0|7019972         |
|2020-07-05 16:43:50.0|7020410         |
|2020-07-05 16:44:08.0|7020411         |
|2020-07-05 18:44:35.0|7020539         |
|2020-07-05 18:44:35.0|7020540         |
|2020-07-05 18:45:36.0|7020542         |
+---------------------+----------------+

@JAllemandou I assume the last table is the missing revisions. Which wiki are these from?

@Jalemayehu I assume the last table is the missing revisions. Which wiki are these from?

I assume that was for me :) Indeed the table are the first 100 missing revisions ordered by rev_id for simplewiki in July 2020.

We should understand the reasons that lead to these events to be lost , 1% seems a lot of events to be missing. Pinging @WDoranWMF about this cause we could use some of @Pchelolo 's time next quarter (Q2) when troubleshooting issues such as these that fundamentally impare our ability to have events to be the primary source of data

note to self: re-evaluate after the 503s above got better

Interesting finding: number of missing revisions seems independent from the number of revisions sent over a period. For instance, the daily median number of missing revisions in December 2020 were, 13 for simplewiki (average = 26), 15 for wikidatawiki (average = 23). This is weird, as the flow of edits for those two projects are orders of magnitude different: ~1400 edits per day for simplewiki and ~540000 for wikidatawiki for December 2020.

Interesting. Is there any correlation as to the kind of revisions that are missing? Maybe there is some way that these revisions are created in MediaWiki that isn't firing the onPageContentSaveComplete hook.

Since we're close to deploying WDQS streaming updater that uses, among others, revision-create - are there any updates on this?

We might be able to mitigate this if we find the reason it is happening now, but I don't think we can rely on 100% revision-create consistency without T120242: Eventually Consistent MediaWiki State Change Events. I'd like to work on that Q1 next FY, and will try to put together a plan and proposal for that next quarter.

(tl;dr; no simple answers)

As this is becoming a pivotal issue I wanted to make sure to exhaust simple explanations. My query is at the bottom, here are my results for missing revisions:

In etwiki:

event_revmw_revarchivedin_timeframeduplication
NULL5826252falseNULLNULL
NULL5826251falseNULLNULL
NULL5826124falseNULLNULL
NULL5826250falseNULLNULL
NULL5826253falseNULLNULL
NULL5826125falseNULLNULL
NULL5826134falseNULLNULL

4 of these are minor edits, the other 3 are page-create revisions (rev_parent_id is 0).

In enwiki: found 2163, all missing from events, none the other way, all but 181 of them are minor edits or page-create revisions. I did a check to see how this compared to all the edits for this day:

  • 197844 total revisions
  • 19077 page-create revisions
  • 45917 minor edits

On one hand, minor edits and page creates definitely seem over-represented among the missing events. I checked ct_tag for any clues that the rest of the 181 missing events might have some tags in common, but only five had any tags at all and those didn't seem special. So on the other hand, these seem like normal revisions that we just missed somehow. And it's not like we're missing *all* page-create revisions or all minor edits, just some. This might help track the bug down, but doesn't give us any simple answers.

My query (on presto):

   with event_revisions as (

         select rev_id,
                sum(if(rev_timestamp like '2021-02-17%', 1, 0)) in_timeframe,
                count(*) duplication

           from event.mediawiki_revision_create
          where year = 2021 and month = 2 and day = 17
            and database = 'enwiki'
          group by rev_id

        ),
        mediawiki_revisions as (

        -- checked there are no duplicate revisions here or in archive
         select rev_id,
                false as archived
           from wmf_raw.mediawiki_revision
          where snapshot = '2021-02'
            and wiki_db = 'enwiki'
            and rev_timestamp like '20210217%'

          union all

         select ar_rev_id as rev_id,
                true as archived
           from wmf_raw.mediawiki_archive
          where snapshot = '2021-02'
            and wiki_db = 'enwiki'
            and ar_timestamp like '20210217%'
        )

 select e.rev_id event_rev,
        m.rev_id mw_rev,
        m.archived,
        e.in_timeframe,
        e.duplication

   from event_revisions e
            full outer join
        mediawiki_revisions m   on e.rev_id = m.rev_id

  where e.rev_id is null
     or m.rev_id is null
;

After going a little overboard still no easy answers. I did slice and dice the data based on the query that @Milimetric provided above.

Some observations:

  • wikidatawiki has the most revisions (4x enwiki), but very few missing events (0.003% vs the 1.3% on average for all others)
  • some wikis miss more events. e.g. ukwikisource misses 34% of ~60k revisions for Feb21
  • minor edits are more likely to have missing events: 2.3% vs 0.4%
  • revision with rev_parent_id==0 are more likely to be missing (3% vs 1%). Also, events are missing for revisions with (minor_revision==true and parent_id!=0) and (minor_revision==false and parent_id==0), not for other combinations - why this strange xor behavior?
  • archived revisions are more likely to be missing (5.9% vs 1.2%), but <2% of revisions are archived
  • there are spikes in missing events when looking at hourly timeseries (for Feb21), especially for the fields (minor edit, archived, no parent id) seemingly correlated with the occurence of missing events. @JAllemandou in a previous comment also observed bulks of missing revisions.
  • nothing stands out looking at missing events as a function of 'hour of day' / 'minute of hour'

Attached the notebook used as html, including graphs/tables.

Thanks for the background on where the special:tag info is stored @Milimetric on irc, though since it is in the refined revisions and not the raw revisions table used so far, pulling that in is a bit more work. That said, just looking at the actual revisions that are missing does seem to indicate that multiple specific scenarios don't result in kafka events being created.

See the links in the html attached.

rev_parent_is_0rev_minor_editlinks
TrueFalsemostly tagged with "Tag: New redirect"
FalseFalse"File:" most diffs involve pages that are files
FalseTruemostly edits "(No difference)"

I think I know what the issue is.

Actually inserting into the revision table in MW happens in RevisionStore::insertRevisionOn. However we emit revision-create event from PageSaveComplete hook, which is not called in some edge-cases:

  • Page moves
  • Page undeletes
  • FileImporter imports

We need to switch the hook used to RevisionRecordInserted - that one is called always on every insertion into revision table.

Slight issue here is that this hook is not really providing us with all the data we need to build the revision-create event. Lemme see. Specifically the problem is revert data.

Ok. Houston, we do have an implementation problem. I'm fairly certain sending the event from RevisionRecordInserted hook would mitigate the missing revisions problem. However, we don't have the EditResult object there, which is used for rev_is_revert as other revert-related data..

The problem is that we have 2 hooks:

  • RevisionRecordInserted - always called when a new revision is inserted into revision table. But doesn't have access to EditResult which is needed for rev_is_revert fields.
  • PageSaveComplete - only called when the revision is the result of an actual edit. Has EditResult object.

We can subscribe to both hooks, but that will mean we duplicate the events most of the time. So, we need some deduplication semantics there - if PageContentSaveCompelete hook is called, RevisionRecordInserted should be ignored. This means holding some state in hook handler. However, some hooks can be called from a DeferredUpdates so it's hard to figure out when we need. We can try to implement some deduplication of events on EventBus side, maybe by sending events from a MergeableUpdate (probably not very reliable), or holding some state in hook handler (quite awful).

Alternatively we can rearchitect the events a bit and create a separate revision-revert stream?

Would it be possible to get EditResult or whatever info is needed into RevisionRecordInserted we need? Is rev_is_revert the only thing we'd miss out on?

Would it be possible to get EditResult or whatever info is needed into RevisionRecordInserted we need?

Possible - yes. Desirable - hell no. That would mean dragging high-level EditResult class through multiple layers of MW internals just to pass it into the hook. Not good.

Is rev_is_revert the only thing we'd miss out on?

yup.

I'm not sure what to recommend here yet, I need to think more about it.

We just added rev_is_revert a few months ago. It really only indicated if the user reverted via the UI, so maybe it isn't worth holding on to? A revision-revert stream might be the way to go.

Change 679353 had a related patch set uploaded (by Clarakosi; author: Clarakosi):

[mediawiki/extensions/EventBus@master] Use RevisionRecordInserted hook for revision-create event

https://gerrit.wikimedia.org/r/679353

It's been a couple of weeks since I sent an email asking if anyone needed or used revert info in mediawiki.revision-create. No response, so I think we can move forward with this without blocking on T280538: Capture rev_is_revert event data in a stream different than mediawiki.revision-create.

@Protsack.stephan - do we use the reverts in revision-create?

@Ottomata sorry i didn't respond to your email, I actually had forwarded to the team excitedly but forgot to respond to you :/

Hey, sorry for late reply. We do use revision-create but don't use rev_is_revert field, so removing that field should not be a problem for us at all.

Change 679353 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Use RevisionRecordInserted hook for revision-create event

https://gerrit.wikimedia.org/r/679353

https://gerrit.wikimedia.org/r/679353 has is out on all wikis in wmf.5!

After the next mediawiki history import we should do another analysis and see how it changed things.

Just a quick note to say that I ran the query for May 17th, and still found mismatches on both sides. I will find a way to do a better analysis that we can easily re-run every time we make improvements.

Change 724173 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/deployment-charts@master] Eventgate TLS proxy: lower local idle_timeout

https://gerrit.wikimedia.org/r/724173

Change 724173 merged by jenkins-bot:

[operations/deployment-charts@master] Eventgate TLS proxy: lower local idle_timeout

https://gerrit.wikimedia.org/r/724173

So, after mangling keepAlive timeouts in envoy, we've had 0 503/504 in eventgate over that last day.

@Milimetric would it be possible to rerun the analysis? Or we should wait till October to get the new scooped data?

I did an analysis for both 2021-09-29 and 2021-09-30 on all projects (we don't have revision data on hadoop for October yet - only a few days to wait).
I found:

DateEvent revisionsmediawiki_history revisionsRevisions in mediawiki-history not in eventsRevisions in Events not Mediawiki-history
2021-09-29130916913081369956
2021-09-301272924127160641310

We are missing almost no revision - 0.0007% !!

I checked for various random days earlier in the year and the year before and the number of missing revisions per day was higher (some hundreds earlier this year, thousands in February last year).

This analysis also made me investigate the reason for the revisions being in events and not in mediawiki-history: it is due to pages being deleted between the data imports of the revision and archive tables. We first import archive, and then other tables, and finally revision - this makes enough time for big projects (a couple of hours) from some page to get deleted and the revisions not showing up not in the imported revision nor archive tables. I'm creating a task now to change the ordering of the imports: first import revisions, then archive and finally page, and devise a mechnism to triple check deletion/restore times when revisions are present in both tables.

I'm sorry this update is a bit late. I'm calling this task done, but I'll update the description with the following information:

  • After Petr's envoy updates, we looked at a month worth of revisions. We queried the analytics replicas to get all the revision ids from the revision and archive tables. There were 4759917 of them from Sep. 30 to Oct 27. Joining these to the mediawiki_revision_create events that we had in HDFS we found 67 missing events. That comes out to 99.9986% reliable, or 0.0014% missing events.
  • A spot check on some of these events shows that they're most likely missing because of the global kill timer. Because the edits were on huge templates that threw errors even just rendering the page, let alone old revisions of the page.
  • So the reason I'm moving this to done is because it seems to me we are as good as we can possibly be with the current setup. We'll never get 100% without being in the transaction itself and with mechanisms like the global kill timer.

My quick opinion here is that we can use EventBus -> EventGate for pretty much everything except for revisions, where I think we need some way to get 100% of the events or some mechanism to reconcile to 100%. This is all being discussed in other tasks so I won't repeat it here.