Page MenuHomePhabricator

Eventbus revisions are duplicated in event.mediawiki_revision_tags_change
Closed, InvalidPublic

Description

The results of the following query seems to be duplicated:

select *
from event.mediawiki_revision_tags_change
where
    year =2019 and month = 3
    and array_contains(tags, "campaign-external-machine-translation")
order by rev_timestamp
limit 1000

The rev_id and rev_timestamp are the same, but prior_state and tags are some other fields are different between the duplicated records.

Tickets that are related to the generation of this tag: T209132 and T216123

Event Timeline

@Pchelolo we looked at this and the request id and id were different as well. Any idea where the duplication occurs? I'd be curious to help debug this if you're not sure right away where it is.

Milimetric triaged this task as Medium priority.Mar 18 2019, 3:41 PM
Milimetric moved this task from Incoming to Ops Week on the Analytics board.

Moving to radar and ping to @Pchelolo that tags changed events look to be sent twice.

@Milimetric The entry point for generating the tags change events is located at https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/includes/EventBusHooks.php#L533 if you wanna debug it yourself

I will look into it soon.

I haven't looked at the data, but are meta.id and meta.dt different? If so, these are likely just different events. IIRC revision tags can change after the fact (right?). If so, each time they do would result in a different event (with the same rev id) but different tags.

I came across this issue when looking to use mediawiki_revision_tags_change to count tagged edits for the Growth Team and chatting with the Product Analytics team about it. Based on that discussion, and as @Ottomata mentions above, multiple rows for the same revision ID are expected because tags can change.

Digging into the last year's worth of data suggests that something isn't the way we'd expect it to be. Consistently from 2019-01 through 2020-01, I can find a few hundred thousand revisions per month (some millions in July–September) that have the same values for the revision ID and meta.dt columns. However, if we also include the meta.id column, none of these are duplicates, meaning that EventBus does regard these as separate events. I take this to mean that there are multiple parts of the MW code that call the entry point on a single edit, resulting in duplicate rows with the same timestamp, rather than combine all these changes into a single event. One of the examples I did find when looking at the Growth Team's data is that the second row added the "visualeditor" tag.

Below are a couple of the queries and results from running them. I use Python's format() call to insert the year and month. Let me know if I've missed something.

Query to get duplicates by rev ID and timestamp:

SELECT "{year}-{month}" AS log_month, SUM(IF(num_rows > 1, 1, 0)) AS num_duplicated_ts
FROM (
    SELECT rev_id, meta.dt, count(*) AS num_rows
    FROM event.mediawiki_revision_tags_change
    WHERE year = {year}
    AND month = {month}
    GROUP BY rev_id, meta.dt
) AS mrtc

Query to get duplicates by rev ID, timestamp, and event ID:

SELECT "{year}-{month}" AS log_month, SUM(IF(num_rows > 1, 1, 0)) AS num_duplicated_ids
FROM (
    SELECT rev_id, meta.dt, meta.id, count(*) AS num_rows
    FROM event.mediawiki_revision_tags_change
    WHERE year = {year}
    AND month = {month}
    GROUP BY rev_id, meta.dt, meta.id
) AS mrtc

Resulting data from 2019:

The two duplicate count columns corresponds to the two queries above.

log_monthnum_duplicated_tsnum_duplicated_ids
2019-1170,8780
2019-2281,5160
2019-3328,4110
2019-4322,3900
2019-5384,8190
2019-6339,0420
2019-73,383,8400
2019-85,168,2540
2019-92,420,0560
2019-10214,3580
2019-11232,8970
2019-122531370

Seems like any fix here requires digging into mediawiki code to see where those multiple events are being generated. CC-ing @WDoranWMF for prioritization on their end.

Hm yeah if the events have different meta.ids, that indicates that the MW hook that generates these events is being fired multiple times.

Hm yeah if the events have different meta.ids, that indicates that the MW hook that generates these events is being fired multiple times.

Hey are the meta.ids and meta.dts being filled in at send time by the MW hook or are they blank and filled in by EventBus/EventGate at receive/processing time?

EventBus is the MW extension that registers the hook handler to send the event to EventGate. meta.id and meta.dt are set by EventBus in MW.

I finally got to look into this. How and where are you running these queries? I would need some examples of events that are considered duplicates to figure out what's happening here.

Queries can be run in hadoop (ssh to 1107) , data is on event database. Remember to do >knit to run queries as kerberos authentication is now required.

Here's an example of a duplicated events pair:

1{
2 "$schema": "/mediawiki/revision/tags-change/1.0.0",
3 "meta": {
4 "uri": "https://www.wikidata.org/wiki/Q18109620",
5 "request_id": "XnAL2wpAIDYAAEnO9ssAAADR",
6 "id": "e9633c3c-bead-4e1a-809f-661bedeac632",
7 "dt": "2020-03-16T23:29:32Z",
8 "domain": "www.wikidata.org",
9 "stream": "mediawiki.revision-tags-change"
10 },
11 "database": "wikidatawiki",
12 "page_id": 19643820,
13 "page_title": "Q18109620",
14 "page_namespace": 0,
15 "rev_id": 1136790319,
16 "rev_timestamp": "2020-03-16T23:29:31Z",
17 "rev_sha1": "abdwq48kv6xtg07i9dtys4s6np5ylwk",
18 "rev_minor_edit": false,
19 "rev_len": 3064,
20 "rev_content_model": "wikibase-item",
21 "rev_content_format": "application/json",
22 "performer": {
23 "user_text": "世界首都环游",
24 "user_groups": [
25 "*",
26 "user",
27 "autoconfirmed"
28 ],
29 "user_is_bot": false,
30 "user_id": 1842367,
31 "user_registration_dt": "2015-04-04T07:57:39Z",
32 "user_edit_count": 9706
33 },
34 "page_is_redirect": false,
35 "comment": "/* wbsetdescription-add:1|zh */ 双边关系, #suggestededit-add 1.0",
36 "parsedcomment": "<span dir=\"auto\"><span class=\"autocomment\">wbsetdescription-add:1|zh: </span> 双边关系, #suggestededit-add 1.0</span>",
37 "rev_parent_id": 863742694,
38 "tags": [
39 "mobile edit",
40 "mobile app edit",
41 "android app edit"
42 ],
43 "prior_state": {
44 "tags": []
45 }
46}
47{
48 "$schema": "/mediawiki/revision/tags-change/1.0.0",
49 "meta": {
50 "uri": "https://www.wikidata.org/wiki/Q18109620",
51 "request_id": "XnAL2wpAIDYAAEnO9ssAAADR",
52 "id": "1d12fc30-fe11-4312-b8b3-f5f4c372e1b4",
53 "dt": "2020-03-16T23:29:32Z",
54 "domain": "www.wikidata.org",
55 "stream": "mediawiki.revision-tags-change"
56 },
57 "database": "wikidatawiki",
58 "page_id": 19643820,
59 "page_title": "Q18109620",
60 "page_namespace": 0,
61 "rev_id": 1136790319,
62 "rev_timestamp": "2020-03-16T23:29:31Z",
63 "rev_sha1": "abdwq48kv6xtg07i9dtys4s6np5ylwk",
64 "rev_minor_edit": false,
65 "rev_len": 3064,
66 "rev_content_model": "wikibase-item",
67 "rev_content_format": "application/json",
68 "performer": {
69 "user_text": "世界首都环游",
70 "user_groups": [
71 "*",
72 "user",
73 "autoconfirmed"
74 ],
75 "user_is_bot": false,
76 "user_id": 1842367,
77 "user_registration_dt": "2015-04-04T07:57:39Z",
78 "user_edit_count": 9706
79 },
80 "page_is_redirect": false,
81 "comment": "/* wbsetdescription-add:1|zh */ 双边关系, #suggestededit-add 1.0",
82 "parsedcomment": "<span dir=\"auto\"><span class=\"autocomment\">wbsetdescription-add:1|zh: </span> 双边关系, #suggestededit-add 1.0</span>",
83 "rev_parent_id": 863742694,
84 "tags": [
85 "android app edit",
86 "mobile app edit",
87 "mobile edit",
88 "apps-suggested-edits"
89 ],
90 "prior_state": {
91 "tags": [
92 "android app edit",
93 "mobile app edit",
94 "mobile edit"
95 ]
96 }
97}

As you can see, these events are not really duplicates - first, some android app edit tags are added, and then, almost instantly, apps-suggested-edits tag is added. The latter tag is added by the WikimediaEditorTasks extension via an edit hook. So, basically what is happening is that if there's 2 or more extensions adding tags to a certain edit, we emit 2 or more events about this edit, since the tags changes are not merged together between the extensions.

To further proof that this is what's happening, the following query

SELECT * FROM ( SELECT rev_id, meta.dt, count(*) AS num_rows FROM event.mediawiki_revision_tags_change WHERE year = 2020 AND month = 03 GROUP BY rev_id, meta.dt ) as rows sort by rows.num_rows desc limit 5;

Gives the result:

401981919	2020-03-07T01:18:18Z	2
401477581	2020-03-04T13:35:25Z	2
402529653	2020-03-08T14:32:52Z	2
403044128	2020-03-10T02:43:00Z	2
404756992	2020-03-17T10:16:28Z	2

Thus we at most have 2 tags-change events in the same revision/timestamp.

Trying to group it by tags as well:

SELECT * FROM ( SELECT rev_id, meta.dt, count(*) AS num_rows FROM event.mediawiki_revision_tags_change WHERE year = 2020 AND month = 03 GROUP BY rev_id, meta.dt, tags ) as rows sort by rows.num_rows desc limit 5;

Gives the result:

404532630	2020-03-16T08:13:12Z	2
405022725	2020-03-18T11:35:21Z	2
404532643	2020-03-16T08:13:17Z	2
404532677	2020-03-16T08:13:31Z	2
404509786	2020-03-16T05:04:56Z	2

Which means that true duplicates do exist. Poking deeper.

So, I have established that 'true' duplicates do exist in the previous comments.

Running:

SELECT SUM(rows.num_rows) FROM ( SELECT count(*) AS num_rows FROM event.mediawiki_revision_tags_change WHERE year = 2019 AND month = 12 GROUP BY rev_id, meta.dt, tags ) as rows WHERE rows.num_rows > 1;

We get 2900 - which is much lower and nicer then 253137 reported in a comment above.

Running

SELECT * FROM ( SELECT rev_id, tags, count(*) AS num_rows FROM event.mediawiki_revision_tags_change WHERE year = 2019 AND month = 12 GROUP BY rev_id, meta.dt, tags ) as rows WHERE rows.num_rows > 1;

shows that all the duplicates are for the 'fileimporter' tag.

FileImporter extension indeed adds the exact same tag to the revision twice, probably not on purpose. Since it's done twice within the same execution context, I guess sometimes we submit the queries twice due to db lag, and thus emit the tags-change event twice. This seems harmless to me, so I will close this ticket as invalid. Please reopen if disagree.