Page MenuHomePhabricator

Duplicated revision_create events
Open, MediumPublic

Description

While investigating for T258532 I found data duplication in July 2020 for the event.mediawiki_revision_create data. This duplication is probably due to the event-producer since the only changing fields in the various rows are meta.dt. meta.id, meta.request_id.

-- In presto

-- Checking meta.id has no duplicate
SELECT COUNT(1) as dup_meta_ids
FROM (
  SELECT COUNT(1) AS meta_id_count_gt_1
  FROM event.mediawiki_revision_create
  WHERE year = 2020 and month = 9 and day = 6
  GROUP BY meta.id
  HAVING COUNT(1) > 1
) subquery;
-- dup_meta_ids 
-- --------------
--            0 

-- Checking rev_id has duplicates
SELECT COUNT(1) as dup_rev_ids
FROM (
  SELECT COUNT(1) AS rev_id_count_gt_1
  FROM event.mediawiki_revision_create
  WHERE year = 2020 and month = 9 and day = 6
  GROUP BY database, rev_id
  HAVING COUNT(1) > 1
) subquery;
-- dup_rev_ids 
-- -------------
--         176 

-- Looking for wiki, rev_id to check
SELECT
  database,
  rev_id,
  COUNT(1) AS rev_id_count_gt_1
FROM event.mediawiki_revision_create
WHERE year = 2020 and month = 9 and day = 6
GROUP BY database, rev_id
HAVING COUNT(1) > 1
LIMIT 10;
--    database    |  rev_id   | rev_id_count_gt_1 
-- ---------------+-----------+-------------------
--  mediawikiwiki |   4086810 |                 2 
--  dawiki        |  10455662 |                 2 
--  enwiki        | 977066159 |                 2 
--  ruwikinews    |   8066753 |                 2 
--  commonswiki   | 449919046 |                 2 
--  plwiki        |  60787792 |                 2 
--  plwiki        |  60790202 |                 2 
--  plwiki        |  60787723 |                 2 
--  elwiki        |   8421375 |                 2 
--  enwiki        | 977066086 |                 2 


-- Take an exemple
SELECT meta
FROM event.mediawiki_revision_create
WHERE year = 2020 and month = 9 and day = 6
  AND database = 'enwiki' AND rev_id = 977066159;

--  {domain=en.wikipedia.org, dt=2020-09-06T18:38:44Z, id=47b330c8-2524-4601-ac16-97e359074527, request_id=7454961e-f74a-4928-b141-be6c7120feb8, schema_uri=null, topic=null, uri=https://en.wikipedia.org/wiki/Silvia
--  {domain=en.wikipedia.org, dt=2020-09-06T18:38:44Z, id=4e0f81bc-e025-4fee-bd68-0d18bc5bc90e, request_id=525e6f86-0401-4cdf-9044-a36c1a5dd372, schema_uri=null, topic=null, uri=https://en.wikipedia.org/wiki/Silvia

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 7 2020, 12:34 PM
  • i think* this duplication on events might mean that we are expiring cache more frequently than we should for those revisions, not sure but seems likely. This is an issue of data quality in the analytics end but it could be a more serious issue on the consumption side by the purged daemon. CC @Pchelolo and @WDoranWMF

More info on how frequently it happened:

 month | dup_rev_ids | not_dup_rev_ids 
-------+-------------+-----------------
     1 |           0 |        44391145 
     2 |           0 |        43851710 
     3 |           0 |        49377389 
     4 |           0 |        52554134 
     5 |           0 |        54332474 
     6 |          51 |        52631940 
     7 |      527416 |        46948609 
     8 |        4352 |        54912421 
     9 |        1294 |        19872714
Nuria added a comment.Sep 10 2020, 4:09 PM

so 0.006% of revision create events * might be* unnecessary purged

so 0.006% of revision create events * might be* unnecessary purged

1% in July - Can be interesting to understand what happened.

In this case it's pretty clear the duplicates are a bug in MW code since request_id/id/dt are different and those AFAIK are set in MW code - this needs fixing.

However, Kafka as we use it has 'at least once' semantics, so analytics systems should not depend on any event being globally unique, duplicates should be tolerated.

duplicates should be tolerated.

They are :) The event-refine process de-duplicates events by meta.id.

Nuria added a comment.Sep 10 2020, 4:19 PM

However, Kafka as we use it has 'at least once' semantics, so analytics systems should not depend on any event being globally unique, duplicates should be tolerated.

Right, in the analytics end duplicated events are an issue for data quality but it is understood that they can happen and our systems deal with them accordingly.

kafkacat -b localhost:9092 -p 0 -t 'codfw.mediawiki.revision-create' -o beginning | grep 977066159 | jq .

1{
2 "$schema": "/mediawiki/revision/create/1.1.0",
3 "meta": {
4 "uri": "https://en.wikipedia.org/wiki/Silvia_Lemus",
5 "request_id": "525e6f86-0401-4cdf-9044-a36c1a5dd372",
6 "id": "4e0f81bc-e025-4fee-bd68-0d18bc5bc90e",
7 "dt": "2020-09-06T18:38:44Z",
8 "domain": "en.wikipedia.org",
9 "stream": "mediawiki.revision-create"
10 },
11 "database": "enwiki",
12 "page_id": 23217054,
13 "page_title": "Silvia_Lemus",
14 "page_namespace": 0,
15 "rev_id": 977066159,
16 "rev_timestamp": "2020-09-06T18:38:44Z",
17 "rev_sha1": "3t9st7420dx28l0y1ydm3l9fg70bm1n",
18 "rev_minor_edit": false,
19 "rev_len": 4051,
20 "rev_content_model": "wikitext",
21 "rev_content_format": "text/x-wiki",
22 "performer": {
23 "user_text": "Boleyn",
24 "user_groups": [
25 "autoreviewer",
26 "extendedconfirmed",
27 "extendedmover",
28 "patroller",
29 "reviewer",
30 "*",
31 "user",
32 "autoconfirmed"
33 ],
34 "user_is_bot": false,
35 "user_id": 6127189,
36 "user_registration_dt": "2008-01-05T21:27:12Z",
37 "user_edit_count": 266709
38 },
39 "page_is_redirect": false,
40 "comment": "+stub tag",
41 "parsedcomment": "+stub tag",
42 "rev_parent_id": 975915783,
43 "rev_content_changed": true,
44 "rev_is_revert": false
45}
46{
47 "$schema": "/mediawiki/revision/create/1.1.0",
48 "meta": {
49 "uri": "https://en.wikipedia.org/wiki/Silvia_Lemus",
50 "request_id": "7454961e-f74a-4928-b141-be6c7120feb8",
51 "id": "47b330c8-2524-4601-ac16-97e359074527",
52 "dt": "2020-09-06T18:38:44Z",
53 "domain": "en.wikipedia.org",
54 "stream": "mediawiki.revision-create"
55 },
56 "database": "enwiki",
57 "page_id": 23217054,
58 "page_title": "Silvia_Lemus",
59 "page_namespace": 0,
60 "rev_id": 977066159,
61 "rev_timestamp": "2020-09-06T18:38:44Z",
62 "rev_sha1": "3t9st7420dx28l0y1ydm3l9fg70bm1n",
63 "rev_minor_edit": false,
64 "rev_len": 4051,
65 "rev_content_model": "wikitext",
66 "rev_content_format": "text/x-wiki",
67 "performer": {
68 "user_text": "Boleyn",
69 "user_groups": [
70 "autoreviewer",
71 "extendedconfirmed",
72 "extendedmover",
73 "patroller",
74 "reviewer",
75 "*",
76 "user",
77 "autoconfirmed"
78 ],
79 "user_is_bot": false,
80 "user_id": 6127189,
81 "user_registration_dt": "2008-01-05T21:27:12Z",
82 "user_edit_count": 266709
83 },
84 "page_is_redirect": false,
85 "comment": "+stub tag",
86 "parsedcomment": "+stub tag",
87 "rev_parent_id": 975915783,
88 "rev_content_changed": true,
89 "rev_is_revert": false
90}

Wow that looks like you found it @Pchelolo! I wish I cou;d find bugs as fast as you do :)

I found the example events. The reason why they are duplicated is still a mystery...

@Ottomata I've dug into this quite a lot walking up and up the call stack, and couldn't find where could this be duplicated in MW.

Honestly, the most reasonable explanation would be if the MW->eventgate HTTP request is being retried but both events make it into kafka, maybe it times out just before ACKs are received.

What is not fitting into this picture is different request_id. Service-template code in eventgate can generate x-request-id header, but eventgate shouldn't be using it if the meta.request_id field is set in the event. I've been eyeballing the code that sets it in eventgate-wikimedia a lot and I can't see how it could possibly be wrong.. Can you have a look at that code too and see if you have ideas on how it could be incorrect?

Is it possible that something is producing revision-create events other than EventBus? Not likely, right?

I just looked through EventBus code and I don't see anything particular about retries...is that happening via http->runMulti?

service-template code in eventgate can generate x-request-id header, but eventgate shouldn't be using it if the meta.request_id field is set in the event.

Just looked at MW WebRequest getRequestId. It will either return X-Request-Id, else, env var UNIQUE_ID, else wfRandomString( 24 ). So, if something called EventFactory createRevisionCreateEvent more than once, it is possible for meta.id and meta.request_id to be unique each time.

So hm, either the PageSaveComplete hook is being called more than once for the same revision, or the revision-create event is being produced twice without setting meta.request_id and meta.id, causing eventgate-wikimedia to set those as defaults.

@JAllemandou there are also events where meta.dt is slightly different too?

@Pchelolo is it possible that JobQueue is somehow responsible? Is some JobQueue retry somehow firing PageCompleteHook more than once; or somehow calling createRevisionCreateEvent more than once for the same revision?

There are events with different timetamps. In July 2020 (when the number of dups peaked) I found a revision with up to 16 events, and different timestamp (1h diff between every).
Recent data:

presto> SELECT
     ->   database,
     ->   rev_id,
     ->   array_distinct(array_agg(meta.dt)) as timestamps,
     ->   COUNT(1) AS rev_id_count_gt_1
     -> FROM event.mediawiki_revision_create
     -> WHERE year = 2020 and month = 9
     -> GROUP BY database, rev_id
     -> HAVING COUNT(1) > 1 AND cardinality(array_distinct(array_agg(meta.dt))) > 1
     -> LIMIT 10;
  database   |  rev_id   |                  timestamps                  | rev_id_count_gt_1 
-------------+-----------+----------------------------------------------+-------------------
 commonswiki | 451735365 | [2020-09-08T06:25:50Z, 2020-09-08T06:25:51Z] |                 2 
 enwiki      | 977973886 | [2020-09-12T03:08:38Z, 2020-09-12T03:08:39Z] |                 2 
 arwiki      |  50310422 | [2020-09-09T22:56:19Z, 2020-09-09T22:56:20Z] |                 2 
 azwiki      |   5441991 | [2020-09-11T06:45:47Z, 2020-09-11T06:45:41Z] |                 2 
 dewiki      | 203613472 | [2020-09-12T18:41:03Z, 2020-09-12T18:41:04Z] |                 2 
 enwiki      | 977958479 | [2020-09-12T01:02:55Z, 2020-09-12T01:02:53Z] |                 2 
 enwiki      | 977080082 | [2020-09-06T20:17:26Z, 2020-09-06T20:17:59Z] |                 2 
 commonswiki | 454236590 | [2020-09-10T16:45:15Z, 2020-09-10T16:45:16Z] |                 2 
 enwiki      | 978150065 | [2020-09-13T05:54:09Z, 2020-09-13T05:54:08Z] |                 2 
 enwiki      | 977839636 | [2020-09-11T08:32:07Z, 2020-09-11T08:32:06Z] |                 2 
(10 rows)

Ping @Ottomata and @Pchelolo - I found something even more problematic: the duplication can happen with a change of performer!

presto> SELECT
     ->   database,
     ->   rev_id,
     ->   array_distinct(array_agg(performer.user_text)) as users,
     ->   COUNT(1) AS rev_id_count_gt_1
     -> FROM event.mediawiki_revision_create
     -> WHERE year = 2020 and month = 9
     -> GROUP BY database, rev_id
     -> HAVING COUNT(1) > 1 AND cardinality(array_distinct(array_agg(performer.user_text))) > 1
     -> LIMIT 10;
 database |  rev_id   |             users             | rev_id_count_gt_1 
----------+-----------+-------------------------------+-------------------
 enwiki   | 977034865 | [104.235.36.59, 96.33.68.122] |                 2 
(1 row)

This occurred on 9249 revision-ids on July (over 527416 revision-id duplications), and only on anonymous users.

Milimetric added a subscriber: Milimetric.EditedSep 15 2020, 7:32 PM
databaserev_idusersrev_id_count_gt_1
enwiki977034865[104.235.36.59, 96.33.68.122]2

Looked into which one mediawiki went with:

select * from revision where rev_id = 977034865;

rev_idrev_pagerev_comment_idrev_actorrev_timestamprev_minor_editrev_deletedrev_lenrev_parent_id
9770348654084385010208288646202009061505280043774977034756

select * from actor where actor_id = 208288646;

actor_idactor_useractor_name
208288646NULL104.235.36.59

I geolocated the two IPs, and they're 1300+ km apart from each other. I was going to say maybe a user pre-saving is triggering the hook, but I understand that doesn't generate a rev_id.

eprodromou triaged this task as Medium priority.Sep 15 2020, 8:24 PM