
Interwiki imports and their effects on revision data
Closed, Resolved | Public

Description

==Overview==

When an interwiki import happens, edit history is brought over from one wiki to another. By default, performers of the imported edits appear with an interwiki prefix, like meta>Example. This is documented at https://www.mediawiki.org/wiki/Help:Import.

(FYI, we see the same issue with other cross-wiki operations. The one we found in T425443 involves users changing user groups from CentralAuth on metawiki, which can affect users on potentially all sister projects.)

For interwiki imports of revisions, the imported revisions show up with usernames in a few different forms:

  • wiki prefix (e.g. en>USERTEXT, de>USERTEXT, meta>USERTEXT, strategywiki>USERTEXT)
  • “import” prefix (import>USERTEXT)
  • letter prefix (e.g. b>USERTEXT, w>USERTEXT)
  • miscellaneous prefix (e.g. regiowiki.at>USERTEXT, *>USERTEXT)
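
All of the forms above share the same `prefix>username` shape, so they can be separated with a single split. The following is a minimal Python sketch, assuming the first `>` is always the delimiter between the import prefix and the original username. The function name `split_import_prefix` is hypothetical and not part of any existing pipeline.

```python
from typing import Optional, Tuple


def split_import_prefix(user_text: str) -> Tuple[Optional[str], str]:
    """Split 'prefix>name' into (prefix, name).

    Assumption: the first '>' separates the import prefix from the
    original username. Text without '>' is returned with prefix None.
    """
    prefix, sep, name = user_text.partition(">")
    if not sep:
        return None, user_text
    return prefix, name


# Sample values taken from the prefix forms listed in this task.
examples = [
    "meta>Example",
    "import>USERTEXT",
    "w>USERTEXT",
    "regiowiki.at>USERTEXT",
    "*>USERTEXT",
    "Example",
]
for text in examples:
    print(text, "->", split_import_prefix(text))
```

Grouping rows by the returned prefix would be one way to see which of the four forms dominates on a given wiki.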

==Implications==

These imported revisions may be inflating anonymous revision counts, as well as any revision counts that don't exclude anonymous revisions, because they all show up in mediawiki_history with event_user_is_anonymous = TRUE:

SELECT event_user_is_anonymous, COUNT(*)
FROM wmf.mediawiki_history
WHERE snapshot = '2026-04'
 AND event_user_text LIKE '%>%'
 AND event_entity = 'revision'
 AND event_type = 'create'
GROUP BY event_user_is_anonymous

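The imported revisions share the same revision_text_sha1 as the revisions they were imported from, as a comparison of the same page on dewiki and enwiki shows: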

SELECT wiki_db, page_title, page_id, event_timestamp, event_entity, event_type, event_user_text, event_user_is_anonymous, event_user_is_permanent, revision_text_sha1
FROM wmf.mediawiki_history
WHERE snapshot = '2026-04'
AND wiki_db IN ('dewiki', 'enwiki')
AND page_title = 'Battle_for_Dream_Island'
AND event_timestamp < '2026-03-10'
ORDER BY event_timestamp DESC , wiki_db
LIMIT 50

Now that temporary accounts have rolled out, imported temp account revisions are labeled as event_user_is_anonymous = TRUE and event_user_is_temporary = FALSE:

SELECT wiki_db, event_entity, event_type, event_user_is_permanent, event_user_is_anonymous, event_user_is_temporary, event_user_text
FROM wmf.mediawiki_history
WHERE snapshot = '2026-04'
 AND wiki_db = 'enwiki'
 AND event_user_text LIKE '%>~%'
 AND event_entity = 'revision'
 AND event_type = 'create'
LIMIT 50
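
The filter in the query above can be mirrored outside SQL. This is a minimal Python sketch, assuming (as the `LIKE '%>~%'` pattern implies) that temporary account names start with `~`. The function name and the sample rows, including their usernames and flag values, are made up for illustration.

```python
def looks_like_imported_temp_account(user_text: str) -> bool:
    """Mirror the SQL filter: event_user_text LIKE '%>~%'."""
    return ">~" in user_text


# Made-up sample rows; usernames and flag values are illustrative only.
rows = [
    {"event_user_text": "en>~2026-12345-6",
     "event_user_is_anonymous": True, "event_user_is_temporary": False},
    {"event_user_text": "en>Example",
     "event_user_is_anonymous": True, "event_user_is_temporary": False},
    {"event_user_text": "~2026-98765-4",
     "event_user_is_anonymous": False, "event_user_is_temporary": True},
]

# Keep only rows whose username looks like an imported temp account.
imported_temp = [
    r["event_user_text"]
    for r in rows
    if looks_like_imported_temp_account(r["event_user_text"])
]
print(imported_temp)
```

Only the first sample row matches: the second is an imported regular username, and the third is a temp account name with no import prefix.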

==Impact==

Looking at overall impact across all years, the following query shows that dewiki is affected the most, with more than 6 million revision rows in mediawiki_history (MWH) whose event_user_text matches LIKE '%>%'.

SELECT wiki_db, event_entity, event_type, COUNT(*) as count
FROM wmf.mediawiki_history
WHERE snapshot = '2026-04'
 AND event_user_text LIKE '%>%'
 AND event_entity = 'revision'
 AND event_type = 'create'
GROUP BY wiki_db, event_entity, event_type
ORDER BY count DESC

Here are the top five rows of that query’s output:

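| wiki_db | event_entity | event_type | count |
| ----- | ----- | ----- | ----- |
| dewiki | revision | create | 6031772 |
| mlwiki | revision | create | 698182 |
| bhwiki | revision | create | 341711 |
| tewiki | revision | create | 289112 |
| newiki | revision | create | 219307 |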

Looking at monthly impact, we can see that for January 2026, for example, dewiki was the most impacted, with 5,033 revision rows in MWH whose event_user_text matches LIKE '%>%'.

SELECT wiki_db, substr(event_timestamp,1,7) as month, event_entity, event_type, COUNT(*) as count
FROM wmf.mediawiki_history
WHERE snapshot = '2026-04'
 AND event_user_text LIKE '%>%'
 AND event_entity = 'revision'
 AND event_type = 'create'
 AND substr(event_timestamp,1,7) = '2026-01'
GROUP BY wiki_db, substr(event_timestamp,1,7), event_entity, event_type
ORDER BY count DESC

Here are the top five rows of that query’s output:

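| wiki_db | month | event_entity | event_type | count |
| ----- | ----- | ----- | ----- | ----- |
| dewiki | 2026-01 | revision | create | 5033 |
| tcywiki | 2026-01 | revision | create | 240 |
| tewiki | 2026-01 | revision | create | 102 |
| siwiktionary | 2026-01 | revision | create | 69 |
| mlwiki | 2026-01 | revision | create | 47 |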

Impact note: Many of these instances may occur on pages that were moved and whose original page was deleted; on pages whose original (the page translated from) gets deleted; or on translated pages (i.e. the new page) that subsequently get deleted. This affects whether and how these edits show up in various counts.

==Suggestions==
We might consider:

  • Adding an event_user_is_cross_wiki field to make these instances explicit, as suggested in T425443
  • Whether we should exclude instances where event_user_is_cross_wiki = TRUE in downstream tables, including the wikistats tables and analytics tables that have edit counts (e.g. geoeditors monthly, edits hourly, etc.)
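
To illustrate what the second suggestion would change, here is a small Python sketch that counts anonymous revisions per wiki with and without cross-wiki rows. It assumes each row already carries an event_user_is_cross_wiki flag. The function name and the sample rows are made up, and this is not how the downstream tables are actually computed.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def anonymous_revision_counts(
    rows: Iterable[Mapping[str, object]], exclude_cross_wiki: bool
) -> Dict[str, int]:
    """Count anonymous revisions per wiki_db, optionally skipping
    rows flagged as cross-wiki (interwiki-imported) edits."""
    counts: Counter = Counter()
    for row in rows:
        if not row["event_user_is_anonymous"]:
            continue
        if exclude_cross_wiki and row["event_user_is_cross_wiki"]:
            continue
        counts[row["wiki_db"]] += 1
    return dict(counts)


# Made-up sample rows for illustration only.
sample = [
    {"wiki_db": "dewiki", "event_user_is_anonymous": True, "event_user_is_cross_wiki": True},
    {"wiki_db": "dewiki", "event_user_is_anonymous": True, "event_user_is_cross_wiki": False},
    {"wiki_db": "dewiki", "event_user_is_anonymous": False, "event_user_is_cross_wiki": False},
    {"wiki_db": "enwiki", "event_user_is_anonymous": True, "event_user_is_cross_wiki": True},
]

print(anonymous_revision_counts(sample, exclude_cross_wiki=False))
print(anonymous_revision_counts(sample, exclude_cross_wiki=True))
```

The difference between the two results is the portion of the anonymous count attributable to imported revisions.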

Event Timeline

Discussed this with @JAllemandou.

We agree that an event_user_is_cross_wiki field makes sense. Since we are already making high-priority mediawiki_history DDL changes as part of T425986, we may as well include this one.

The implementation looks quite simple: event_user_text LIKE '%>%' AND event_user_is_anonymous AND NOT event_user_is_temporary AS event_user_is_cross_wiki.
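
For readers who want to check the rule against individual rows, the expression above can be restated as a small Python predicate. This is only a sketch of the logic, not the refinery implementation. The function name is hypothetical, and missing (None) inputs are treated as not cross-wiki, whereas SQL would yield NULL.

```python
from typing import Optional


def is_cross_wiki(
    user_text: Optional[str],
    is_anonymous: Optional[bool],
    is_temporary: Optional[bool],
) -> bool:
    """Python restatement of:
    event_user_text LIKE '%>%' AND event_user_is_anonymous
    AND NOT event_user_is_temporary
    Simplification: None inputs give False instead of SQL NULL.
    """
    return (
        user_text is not None
        and ">" in user_text
        and is_anonymous is True
        and is_temporary is False
    )


# Sample usernames are illustrative.
print(is_cross_wiki("meta>Example", True, False))      # imported edit
print(is_cross_wiki("Example", False, False))          # regular local user
print(is_cross_wiki("en>~2026-12345-6", True, False))  # imported temp account edit
```

Because imported temp account revisions are labeled anonymous and not temporary, the predicate flags them as cross-wiki too.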

xcollazo changed the task status from Open to In Progress. May 12 2026, 2:10 PM
xcollazo claimed this task.

Change #1286383 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] Add event_user_is_cross_wiki to mediawiki_history DDL

https://gerrit.wikimedia.org/r/1286383

Change #1286385 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] Add event_user_is_cross_wiki to wmf.mediawiki_history

https://gerrit.wikimedia.org/r/1286385

Change #1286397 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] Refactor MediawikiEvent.fromRow to use named column access

https://gerrit.wikimedia.org/r/1286397

Change #1286397 merged by jenkins-bot:

[analytics/refinery/source@master] Refactor MediawikiEvent.fromRow to use named column access

https://gerrit.wikimedia.org/r/1286397

Change #1286385 merged by jenkins-bot:

[analytics/refinery/source@master] Add event_user_is_cross_wiki to wmf.mediawiki_history

https://gerrit.wikimedia.org/r/1286385

Change #1286383 merged by Xcollazo:

[analytics/refinery@master] Add event_user_is_cross_wiki to mediawiki_history DDL

https://gerrit.wikimedia.org/r/1286383

BTW, we are dealing with a similar issue in T426198: Event schemas - mediawiki user entity should be wiki aware.

For events, I will be adding a user.wiki_id field.

New field now available on snapshot='2026-04' of wmf.mediawiki_history:

`event_user_is_cross_wiki`                      boolean       COMMENT 'True if the event_user is an interwiki-imported editor (usertext contains ">", is anonymous, and not temporary). NULL for user/page events.',

@CMyrick-WMF please check it out when you have some time and let us know if it looks good to you.