
Use RevisionStore::newRevisionsFromBatch in WikiExporter
Open, Needs Triage, Public

Description

WikiExporter creates revision objects in batches, so it is a perfect use case for the new RevisionStore::newRevisionsFromBatch method.
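
For context, here is a rough sketch (not the actual patch) of what the batched construction could look like in WikiExporter. The 'slots'/'content' option names and the StatusValue return shape are assumptions about the RevisionStore API of that era and may not match the final code exactly:

    use MediaWiki\MediaWikiServices;
    use MediaWiki\Revision\RevisionRecord;
    use MediaWiki\Revision\SlotRecord;

    // Sketch only: build RevisionRecords for a whole batch of revision+page
    // rows at once, instead of calling newRevisionFromRow() per row.
    // $results is the result wrapper of rows WikiExporter already fetches.
    $revisionStore = MediaWikiServices::getInstance()->getRevisionStore();

    $status = $revisionStore->newRevisionsFromBatch(
        $results,
        [ 'slots' => [ SlotRecord::MAIN ], 'content' => true ]
    );

    /** @var RevisionRecord[] $revisionRecords keyed by rev_id */
    $revisionRecords = $status->isOK() ? $status->getValue() : [];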

Event Timeline

Change 537570 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Use RevisionStore::newRevisionsFromBatch in WikiExporter

https://gerrit.wikimedia.org/r/537570

@daniel/@Pchelolo was this task moved to the Done column in error? I see that it still has a patch set open.

@CCicalese_WMF yeah. Some unexpected complications have been found.

daniel added a comment. Oct 9 2019, 9:32 AM

@daniel/@Pchelolo was this task moved to the Done column in error? I see that it still has a patch set open.

oops, sorry about that.

I've done profiling for abstract dumps and, after several hours of scrying xhprof results, have not yet been able to tease out the part of the code where the extra time is spent. Back at it again later today. The difference in times is exacerbated if I add --namespaces=0 to the command-line args. I'm hoping that will give me a lead.

Just a short update, I have been doing profiling and logging on one of the currently idle snapshot hosts, and I think I have a lead.

I'm profiling abstracts, and I see both that many text blobs are looked up in the new code but not in the old code, and that most blobs are looked up two or three times in the new code but not in the old code (where by 'looked up' I mean a call to WANObjectCache::getWithSetCallback, with most of those going through WANObjectCache::fetchOrRegenerate, in the new code as compared to the old).

There are nearly twice as many calls to mysqli_result::fetch_object in the new code, so a lot of these lookups must be failing to find the entry. I've got a pile of stack traces I'll be looking at to see where in the new code we are getting the unneeded blobs and/or requesting blobs multiple times.

Adding here for posterity that the command I run (after making live mods to the xhprof and/or WAN cache code, with and without the patch) is

/usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=simplewiki /srv/mediawiki/php-1.35.0-wmf.1 --plugin=AbstractFilter:/srv/mediawiki/php-1.35.0-wmf.1/extensions/ActiveAbstract/includes/AbstractFilter.php --current --report=1000 --output=gzip:/mnt/dumpsdata/temp/dumpsgen/simple-abstracts-old-profiling.gz --filter=namespace:NS_MAIN --filter=noredirect --filter=abstract  --start=1 --end=100000 --namespaces=0

and that with and without the patch/logging/profiling, the output is the same.

All right, one of them is clear(ish) to me: abstracts should only run on the main namespace (0), and we skip over anything not in namespace 0 both in the extension and in WikiExporter.php, via command line args that so specify.

In outputPageStreamBatch, the processing of revisions and filling in of content used to be done after this check:

    foreach ( $results as $revRow ) {
        $revRecord = $revisionRecords[$revRow->rev_id];

        if ( $this->limitNamespaces &&
            !in_array( $revRow->page_namespace, $this->limitNamespaces ) ) {
            $lastRow = $revRow;
            continue;
        }
        ...

Now it's been moved up into the retrieval of the rows themselves, I guess. So that's a problem that needs to be fixed before the patch can go in. The other issue (most text blobs being looked up two or three times) is still under investigation.
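
One conceivable shape for a fix, sketched here as an assumption rather than the actual follow-up patch: apply the namespace filter to the raw rows before handing them to the batch loader, so content for skipped pages is never requested.

    // Sketch only (not the actual fix): filter rows on page_namespace before
    // batch-building RevisionRecords / loading content, mirroring the old
    // per-row check shown above. Assumes this runs inside WikiExporter.
    $filteredRows = [];
    foreach ( $results as $revRow ) {
        if ( $this->limitNamespaces &&
            !in_array( $revRow->page_namespace, $this->limitNamespaces ) ) {
            $lastRow = $revRow;
            continue;
        }
        $filteredRows[] = $revRow;
    }

    $status = MediaWikiServices::getInstance()->getRevisionStore()
        ->newRevisionsFromBatch( $filteredRows, [ 'content' => true ] );
    $revisionRecords = $status->isOK() ? $status->getValue() : [];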

ArielGlenn added a comment. Edited Oct 15 2019, 9:50 AM

I have found the reason for the triple lookups of content: the third lookup, performed from within the AbstractFilter extension, determines whether or not the revision is a redirect. In the past, filtering from within the extension has not been a problem, because content was not loaded until after all filters were applied. Now that content is retrieved early, there's a substantial hit to performance. Pages whose selected revision is a redirect should not have content loaded even once; those revisions should be discarded from the result set before processing, just like the ones belonging to the wrong namespaces.

We could consider having special logic for filtering revisions right in outputPageStreamBatch before content is loaded. I'm not sure how that would work with the new batch loading that is desired.
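
As a purely illustrative sketch of that idea (assuming the dump query joins the page table and selects page_is_redirect, which may not hold): redirects could be skipped based on the row data itself, so their content is never fetched just to discover that they are redirects.

    // Sketch only: skip redirect pages using the joined page_is_redirect
    // field before the batch content load. Assumes page_is_redirect is
    // selected by the dump query.
    $rowsToLoad = [];
    foreach ( $results as $revRow ) {
        if ( !empty( $revRow->page_is_redirect ) ) {
            $lastRow = $revRow;
            continue;
        }
        $rowsToLoad[] = $revRow;
    }
    // ... hand $rowsToLoad to newRevisionsFromBatch() as in the sketch above ...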

I have not yet looked at stubs dumps; this is only for abstracts.

Thank you for investigating, Ariel!

I'm profiling abstracts, and I see both that many text blobs are looked up in the new code but not in the old code, and that most blobs are looked up two or three times in the new code but not in the old code (where by 'looked up' I mean a call to WANObjectCache::getWithSetCallback, with most of those going through WANObjectCache::fetchOrRegenerate, in the new code as compared to the old).

Due to T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache, caching has (temporarily) been removed from batch lookups for content objects (which didn't have caching before either). That should radically reduce the number of calls to WANObjectCache::getWithSetCallback and WANObjectCache::fetchOrRegenerate.

I'm not sure what it does to overall performance, and the caching is probably going to come back. Just letting you know that the behavior changed with regard to caching, and may change again in the future. Relevant patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/542328
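
For readers unfamiliar with the API being discussed, here is a minimal, self-contained illustration of the WANObjectCache batch lookup pattern that T235188 concerns. The keys and values are demo placeholders only; the real SqlBlobStore code is different:

    use MediaWiki\MediaWikiServices;

    // Minimal illustration of the batch lookup API (sketch only): one cache
    // key per blob address, with a regeneration callback run on cache miss.
    $cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

    $blobAddresses = [ 'tt:12345', 'tt:67890' ];
    $keyedIds = $cache->makeMultiKeys(
        $blobAddresses,
        static function ( $address ) use ( $cache ) {
            return $cache->makeKey( 'demo-blob', $address );
        }
    );

    $blobs = $cache->getMultiWithSetCallback(
        $keyedIds,
        $cache::TTL_DAY,
        static function ( $address, $oldValue, &$ttl, &$setOpts ) {
            // Placeholder for a real fetch from external/text storage.
            return "blob for $address";
        }
    );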

I have found the reason for the triple lookups of content: the third lookup, performed from within the AbstractFilter extension, determines whether or not the revision is a redirect.

So, this is a problem only when dumping abstracts? Do regular dumps perform OK?

We could consider having special logic for filtering revisions right in outputPageStreamBatch before content is loaded. I'm not sure how that would work with the new batch loading that is desired.

I haven't looked at the code just now, but in general, calling code is free to make the query used to retrieve rows of the revision table as restrictive as it likes. The batch loading code is triggered after the query against the revision table has finished.
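
To make that concrete, a hypothetical sketch (not WikiExporter's actual query assembly) of restricting the revision/page query so that rows from unwanted namespaces never reach the batch loader:

    use MediaWiki\MediaWikiServices;

    // Sketch only: restrict the revision/page query itself so the batch
    // loader never sees rows from other namespaces. Condition building is
    // hypothetical; WikiExporter's real query assembly differs.
    $services = MediaWikiServices::getInstance();
    $revisionStore = $services->getRevisionStore();
    $dbr = $services->getDBLoadBalancer()->getConnection( DB_REPLICA );

    $queryInfo = $revisionStore->getQueryInfo( [ 'page' ] );
    $conds = [];
    if ( $this->limitNamespaces ) {
        $conds['page_namespace'] = $this->limitNamespaces;
    }

    $results = $dbr->select(
        $queryInfo['tables'],
        $queryInfo['fields'],
        $conds,
        __METHOD__,
        [],
        $queryInfo['joins']
    );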

...

So, this is a problem only when dumping abstracts? Do regular dumps perform OK?

Stubs are also slow but I'd like to have this fixed up and then do new profiling with the new code. Likewise I'd like to have caching re-enabled before running profiles, as not having it will surely skew the results. Note that the above test runs were done before the patch disabling caches was merged.
...

I haven't looked at the code just now, but in general, calling code is free to make the query used to retrieve rows of the revision table as restrictive as it likes. The batch loading code is triggered after the query against the revision table has finished.

That would be great! It means adding another argument to dumpBackup.php but that's the way it goes.