Revision importing is weird. It essentially allows users to insert arbitrary amounts of fake history at arbitrary points in the past. This is useful for building on work from other wikis, but when this fake history is analyzed, it can produce misleading or even ridiculous results (e.g. T123313).
There has never been an easy, systematic way to pick out imported revisions, so our analyses generally just ignore the problem and trust that the effect is acceptably small. This has worked reasonably well so far, but it would still be a great feature if mediawiki_history could take care of this for us and provide a revision_is_imported flag.
Unfortunately, this would take some significant inference, since MediaWiki doesn't clearly identify imported revisions. We do have two sources of information:
- The import log (e.g. Simple English Wikipedia's log) which records the date of import (log_timestamp), the destination page (log_namespace and log_title), and the number of revisions imported (in the log_comment).
- Revision IDs: imported revisions will almost always have much larger IDs than normal revisions with the same timestamp. This happens because the original timestamp is imported along with the revision content, but a new revision ID is assigned on import to avoid clashes.
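
To illustrate the revision ID signal, here is a minimal sketch of one possible heuristic, not a proposal for how mediawiki_history should actually implement this. It walks revisions in ID order and flags any revision whose timestamp is well behind the latest timestamp already seen at a lower ID. The `Revision` type, the `flag_likely_imports` function, and the one-day `tolerance` are all hypothetical names and values chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Set


class Revision(NamedTuple):
    """Hypothetical minimal record: just the two fields the heuristic needs."""
    rev_id: int
    timestamp: datetime


def flag_likely_imports(
    revisions: Iterable[Revision],
    tolerance: timedelta = timedelta(days=1),  # assumed threshold, not tuned
) -> Set[int]:
    """Return the IDs of revisions that look imported.

    Normal revisions get IDs in roughly chronological order, so a revision
    whose timestamp is much older than the newest timestamp already seen
    at a smaller ID was probably assigned its ID at import time.
    """
    latest_seen = None
    flagged = set()
    for rev in sorted(revisions, key=lambda r: r.rev_id):
        if latest_seen is not None and rev.timestamp < latest_seen - tolerance:
            flagged.add(rev.rev_id)
        elif latest_seen is None or rev.timestamp > latest_seen:
            # Only unflagged revisions advance the running maximum.
            latest_seen = rev.timestamp
    return flagged
```

This sketch would miss imports whose original timestamps are close to the import date, and a single revision with a bogus future timestamp would cause everything after it to be flagged, so in practice the result would probably need to be cross-checked against the import log entries for the same page.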