
Identify imported revisions in mediawiki_history
Open, Medium, Public

Description

Revision importing is weird. It essentially allows users to insert arbitrary amounts of fake history at arbitrary points in the past. This is useful for building on work from other wikis, but when this fake history is analyzed, it can produce misleading or even ridiculous results (e.g. T123313).

There has never been an easy, systematic way to pick out imported revisions, so our analyses generally just ignore them and trust that the effect is acceptably small. This has worked reasonably well so far, but it would still be a great feature if mediawiki_history was able to take care of this for us and provide a revision_is_imported flag.

Unfortunately, this would take some significant inference since MediaWiki doesn't clearly identify imported revisions. We do have two sources of information:

  • The import log (e.g. Simple English Wikipedia's log) which records the date of import (log_timestamp), the destination page (log_namespace and log_title), and the number of revisions imported (in the log_comment).
  • Revision IDs: imported revisions will almost always have much larger IDs than normal revisions with the same timestamp. This happens because the original timestamp is imported along with the revision content, but a new revision ID is assigned on import to avoid clashes.
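
As a rough illustration of the revision ID signal, here is a minimal sketch in Python. It assumes revisions are available as (rev_id, timestamp) pairs for a whole wiki; the function name `flag_probably_imported` and the one-day `tolerance` are hypothetical choices for this example, not anything that exists in mediawiki_history. It walks the revisions in ID order and flags any revision whose timestamp is well behind the latest timestamp already seen among smaller IDs.

```python
from datetime import datetime, timedelta


def flag_probably_imported(revisions, tolerance=timedelta(days=1)):
    """Return the set of rev_ids whose timestamp lags far behind earlier IDs.

    revisions: iterable of (rev_id, timestamp) pairs with unique rev_ids.
    tolerance: hypothetical slack for clock skew and small reorderings.
    """
    flagged = set()
    latest_seen = None
    # Revision IDs are assigned at save time, so normal revisions have
    # roughly non-decreasing timestamps when sorted by ID.
    for rev_id, ts in sorted(revisions):
        if latest_seen is not None and ts < latest_seen - tolerance:
            flagged.add(rev_id)
        if latest_seen is None or ts > latest_seen:
            latest_seen = ts
    return flagged


# Made-up sample data: revisions 3 and 4 carry 2015 timestamps
# but received IDs in 2019.
sample = [
    (1, datetime(2019, 1, 1)),
    (2, datetime(2019, 1, 2)),
    (3, datetime(2015, 6, 1)),
    (4, datetime(2015, 6, 2)),
    (5, datetime(2019, 1, 3)),
]
probably_imported = flag_probably_imported(sample)
```

In practice the candidates could first be narrowed to pages that appear in the import log, with this check applied only to their revisions.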

Event Timeline

nshahquinn-wmf renamed this task from Identify imported revisions in `mediawiki_history` to Identify imported revisions in mediawiki_history. Apr 19 2019, 10:20 PM
fdans triaged this task as Medium priority. Apr 22 2019, 3:17 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.

Super good idea and good presentation of the difficulty :)
Maybe one day ;)

Some information in that respect is provided as part of T221825 with the new field page_is_from_before_page_creation. But this is incomplete as it only accounts for pages imported before the page creation, not after.

I think we should do this. We can limit the pages we look at with the import log as Neil says, and then just mark all the revisions that have much larger revision ids than their parent (via rev_parent_id) as revision_is_probably_imported.

then just mark all the revisions that have much larger revision ids than their parent (via rev_parent_id) as revision_is_probably_imported

I'm not sure this would work. Imported revisions will usually have much larger revision IDs than normal revisions on the wiki with a similar timestamp, but that wouldn't necessarily work at the article level. I think the usual case is a new page being created with a whole bunch of imported edits. In that case, the first revision should have a rev_parent_id of 0, and then the rest will have the previous revision in the chain as parent. Since all are imported, the page won't have any normal revision ID that makes the imported ones stand out. If a user comes and makes a normal edit to the imported page later, that should be a detectable situation, but I can imagine that frequently doesn't happen.
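
To make the objection concrete, here is a small Python sketch of the parent-gap check. The function name `flag_by_parent_gap`, the `min_gap` threshold, and the revision IDs are all made up for illustration, and the sketch ignores timestamps entirely. On a page created wholly by import, every revision's parent is the previous imported revision, so no gap appears and nothing is flagged.

```python
def flag_by_parent_gap(revisions, min_gap=1000):
    """Flag revisions whose ID is at least min_gap larger than their parent's.

    revisions: iterable of (rev_id, rev_parent_id) pairs.
    min_gap: hypothetical threshold for "much larger".
    """
    return {
        rev_id
        for rev_id, parent_id in revisions
        # A parent of 0 means there is no parent to compare against.
        if parent_id != 0 and rev_id - parent_id >= min_gap
    }


# New page made entirely of imported revisions: IDs are consecutive.
imported_page = [(9000001, 0), (9000002, 9000001), (9000003, 9000002)]

# Existing page where one imported revision follows an old, low-ID parent.
mixed_page = [(100, 0), (9000010, 100)]

missed = flag_by_parent_gap(imported_page)
caught = flag_by_parent_gap(mixed_page)
```

Here `missed` is empty even though all three revisions were imported, while `caught` contains only 9000010. The wiki-wide comparison against revisions with similar timestamps does not depend on a normal revision being present on the same page.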