Page MenuHomePhabricator

WikiExporter::dumpPages() batching/paging logic is wrong in edge cases
Open, Needs TriagePublic

Description

WikiExporter::dumpPages() batching/paging logic looks questionable and probably wrong in certain edge cases. I think they are almost impossible to hit in practical usage, but because the questionable code was added to fix performance issues in T203424 and then T207628, I don't want to just go in and change it.

I noticed the problem when trying to migrate the code to use Database::buildComparison() (T321422).

This is the relevant code:

		$op = '>';
		if ( is_array( $this->history ) ) {
			# Time offset/limit for all pages/history...
			# Set time order
			if ( $this->history['dir'] == 'asc' ) {
				$opts['ORDER BY'] = 'rev_timestamp ASC';
			} else {
				$op = '<';
				$opts['ORDER BY'] = 'rev_timestamp DESC';
			}
			...
		} else ...


		$opts['LIMIT'] = self::BATCH_SIZE;

		...

		while ( !$done ) {
			// If necessary, impose the overall maximum and stop looping after this iteration.
			if ( !empty( $maxRowCount ) && $rowCount + self::BATCH_SIZE > $maxRowCount ) {
				$opts['LIMIT'] = $maxRowCount - $rowCount;
				$done = true;
			}

			$queryConds = $conds;
			$queryConds[] = 'rev_page>' . intval( $revPage ) . ' OR (rev_page=' .
				intval( $revPage ) . ' AND rev_id' . $op . intval( $revId ) . ')';

			# Do the query and process any results, remembering max ids for the next iteration.
			$result = $this->db->select(
				$tables,
				$fields,
				$queryConds,
				__METHOD__,
				$opts,
				$join
			);
			...
		}
	}

It looks like we're ordering the results by rev_timestamp within the query, and by rev_page between the queries. These orders are not the same, and so it looks like this may fail to export some revisions, and may export others more than once.

The reason this isn't a big problem is:

  • Almost nothing passes an array to $this->history. The only code path I could find is in Special:Export, in a "secret" mode that is not accessible from the user interface, but only used for trans-wiki imports. You can test it by sending some POST requests manually, e.g. from the browser developer tools: await $.post( '/wiki/Special:Export', { pages: 'Foo', dir: 'desc' } ).
  • BATCH_SIZE is 50000, so you'd have to export more than this many revisions using this fiddly method to run into the problem.

Event Timeline

In general we try to order by revid since that's known to be unique, and lets us do nice things like re-use old pge content dumps for creating the new ones, searching a previous dump output file that is also ordered by page id and within each page by rev id. So if we were going to make a change, I'd want to go that direction.

I'm inclined to agree that this is very much a corner case, but having consistency is probably worth it for future maintainers.

Previous discussion about the order we dump things: T29112