So, this is a problem only when dumping abstracts? Do regular dumps perform OK?
I have found the reason for the triple lookups of content; the third lookup, performed from within the Abstract Filter extension, determines whether or not the revision is a redirect. In the past, filtering from within the extension was not a problem, because content was not loaded until after all filters had been applied. Now that content is retrieved early, there's a substantial performance hit. Pages whose selected revision is a redirect should not have content loaded even once; those revisions should be discarded from the result set before processing, just like revisions belonging to the wrong namespaces.
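For illustration, here's roughly the shape of the filter I mean, as a minimal SQL sketch assuming the standard page/revision schema; the real selection query lives in WikiExporter.php and is more involved than this:

```sql
-- Sketch only: drop redirect pages from the result set up front,
-- so their content never has to be loaded at all.
SELECT rev_id, rev_page
FROM revision
JOIN page ON page_id = rev_page
WHERE page_is_redirect = 0;  -- page_is_redirect is the flag on the page row
```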
Mon, Oct 14
All right, one of them is clear(ish) to me: abstracts should only run on the main namespace (0), and we skip over anything not in namespace 0 both in the extension and in WikiExporter.php, via command line args that specify this.
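For the record, the namespace restriction amounts to the same kind of up-front filter on the page row; a minimal sketch, not the literal query WikiExporter.php builds:

```sql
-- Sketch only: restrict the dump query to the main namespace (0),
-- so non-article revisions are never selected in the first place.
SELECT rev_id, rev_page
FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = 0;
```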
Adding here for posterity that the command I run (after making live mods to the xhprof and/or WANCache code, with and without the patch) is:
Just a short update, I have been doing profiling and logging on one of the currently idle snapshot hosts, and I think I have a lead.
Wed, Oct 9
I've done profiling for abstract dumps and, after several hours of scrying xhprof results, have not yet been able to tease out the part of the code where the extra time is spent. Back at it again later today. The difference in run times is exacerbated if I add --namespaces=0 to the command line args; I'm hoping that will give me a lead.
As an interested party, I'm curious where things stand on going ahead with the trial.
Thu, Oct 3
Still true for .wmf25.
I can start on this once the new dumpsdata host is racked and has a base install.
Wed, Oct 2
@Ebonetti90 I'd like to close this task as declined, meaning that we won't update the schema and you'll take steps on your end to adjust your script. OK by you?
Tue, Oct 1
Adding @Bstorm because the labstore servers are WMCS boxes.
I'd like to request that both eth interfaces be cabled, as I'd like to try to set up bonding for this host.
Fri, Sep 27
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539498/ was merged in response and kicked in about 10 minutes ago, with good results on the graph.
At around 6:50 UTC this morning we began seeing this:
Tue, Sep 24
Good morning Enrico!
The SQL in maintenance/tables.sql for creating the pagelinks table is the following:
```sql
--
-- Track page-to-page hyperlinks within the wiki.
--
CREATE TABLE /*_*/pagelinks (
  -- Key to the page_id of the page containing the link.
  pl_from int unsigned NOT NULL default 0,
  -- Namespace for this page
  pl_from_namespace int NOT NULL default 0,
```
Sun, Sep 22
How big are these dumps for one set, and how many sets do we intend to keep? Adding @Bstorm since the host behind dumps.wikimedia.org is a WMCS server.
Sat, Sep 21
Everything back to how it was, 20th run going everywhere. Closing.
Status files copied over manually, cron job set to go off in about ten minutes. Once that's running I'll re-enable puppet there and be done with this task.
The multistream dumps are complete, which means the wikidata run is complete. I'll wait a little to see if the rsync picks up the status files; if not, I'll manually send them around to the labstore and other dumpsdata hosts.
Multistream dumps for wikidata are still running but we're closing in on the end.
I've disabled puppet on snapshot1006 and turned off the 20th dumps run for wikidata in the crontab which would start today, to be re-enabled once the current run completes.
Fri, Sep 20
All bz2, 7z page meta history files are done, and sha1/md5 sum files produced for them.
The pages logging job is complete, along with the recombine job.
The only jobs remaining are the multistream job and the multistream recombine.
I have marked all other jobs as complete in the dumpruninfo.txt file.
- pages-meta-history 56x - 5803x
- pages-meta-history 5803x - 60x
- 7z's/hashes for all currently completed pages-meta-history bz2, 7z files
Thu, Sep 19
Meh, the above comment is getting too hard to read. Here's what's running this evening:
- bz2 pages-meta-history from 40x... to the end (it should be interrupted when it reaches the 50x files)
- bz2 pages-meta-history from 50x to 599x
- more part 27 7z files
- bz2 pages-meta-history for parts 22,23 - once this completes we can start part 27 50x - 600x
- bz2 pages-meta-history for part 27 39x - end (will be interrupted when it begins to duplicate completed output from other processes)
- bz2 pages-meta-history 27 60x - end
- 7z for parts 26, 27 (partial) - once this completes we can start 21 and rerun 26 to completion; after that we can generate md5/sha1 sums for bz2/7z files that don't have them.
This issue is indeed resolved. Closing.
Tue, Sep 17
I guess by the closure of the subtask that the server has arrived? What's the outlook for getting it racked?
Sep 12 2019
Pretty sure you don't need all that checklist. Can whoever does the grant clean up the description to just leave whatever's necessary? Thanks in advance!
Sep 11 2019
I see dumpTextPass running for one of the wikis so things are at last back on track. Closing this ticket.
Now that the above is deployed, I will watch for the dump scheduler to start succeeding at some of these jobs...
Sep 8 2019
The errors I see are related to a MediaWiki commit and not to this, but since we won't have verification that connecting works until that issue is resolved (see T232268), I'm leaving this open for now and downgrading its priority.
Note that I'm on vacation, so I might not be near a keyboard when the fix is pushed out for testing; please don't wait for me.
Sep 7 2019
I'll be looking into this a bit later today/tomorrow (I'm on vacation!). In theory nothing needs to be done; the dump scripts all ask MediaWiki for the password. But I see errors, so something changed.
Sep 2 2019
These wikis ran the Aug 20th dump run successfully with the new config, so closing.
This is merged and live on at least some wikis without incident, so closing.
Sep 1 2019
Aug 29 2019
These two pages are aliases for the same contributor, and the problematic revisions were added in 2016 on each page, so this is some sort of regression (PHP? Babel? a combination?).
Wikitext for Gangleri: https://meta.wikimedia.org/w/index.php?title=User:Gangleri&action=edit Just look at all those babel entries. Same for the other user: https://meta.wikimedia.org/w/index.php?title=User:%D7%91%D7%B2%D6%B7_%D7%9E%D7%99%D7%A8_%D7%91%D7%99%D7%A1%D7%98%D7%95_%D7%A9%D7%99%D7%99%D7%9F&action=edit
Aug 28 2019
Which XML dump file did you import? Can you provide a link? And can you let us know which version of MediaWiki you have installed? Also, can you provide the full stack trace from the error output? Thank you.
When I look at that image it looks pretty empty; am I missing something?
Aug 27 2019
I think Reedy was away and didn't see my pings. Anyways, thanks for moving forward on this, and we'll see how it looks in a week!
It’s part of the serialization. Not sure why that would be a new issue, though – this seems like a fairly fundamental issue (tying the page ID to the page content even though it’s not stable across delete+restore). Is it possible that File:Bolsonaro_etc is just the first file with structured data that was deleted and then restored?
https://commons.wikimedia.org/wiki/Special:Log?type=&user=&page=File%3ABolsonaro_with_Israeli_PM_Benjamin_Netanyahu%2C_Tel_Aviv%2C_31_March_2019.jpg&wpdate=&tagfilter= It was deleted and restored at 02:45 on 26 August 2019, so I guess something isn't handled quite right in MediaInfo entities for these cases.
Aug 23 2019
@elukey On our previous server we let people pull from us and it was very difficult to manage upgrades or any sort of maintenance. Somewhere there's a ticket with the awfulness.
https://github.com/apergos/misc-wmf-crap/tree/master/glyph-image-generator Starting to get clever about this: ability to generate 50k small images with metadata that can be extracted for use in depicts and/or caption statements.
Aug 22 2019
Aug 21 2019
I'm looking at deployment-db05 now, and there are 63332 rows in the revision table, with 53250 rows in the content table. I guess we need to double the number of revisions and then add the structured data for those entries. We can probably be clever about this via a script (see the sketch below).
@Smalyshev Do you know how many entries have structured data on deployment-prep? Is that a useful testing ground right now or should we be populating the data over there first?
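For anyone following along, a hedged sketch of how the gap between the two tables could be measured, assuming the MCR schema where the slots table ties revisions to content rows (illustrative only, not a query I've run on that host):

```sql
-- Sketch only: list revisions that have no content row attached,
-- via the slots table that links the two under MCR.
SELECT rev_id
FROM revision
LEFT JOIN slots ON slot_revision_id = rev_id
WHERE slot_revision_id IS NULL;
```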
Aug 12 2019
This appears to be working as it should. Closing.
I'm not thinking about the amount of time it takes, but rather the load on the database servers. Reasonably sized batched queries will be better, as I've seen already with stub dumps and slot retrieval.
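To make "reasonably sized batched queries" concrete, this is the pattern I mean, keyed on the primary key so each batch is a cheap range scan; the batch size of 500 is an arbitrary example:

```sql
-- Sketch only: walk the revision table in fixed-size batches
-- instead of one huge query.
SELECT rev_id, rev_page, rev_timestamp
FROM revision
WHERE rev_id > 0       -- replace 0 with the last rev_id from the previous batch
ORDER BY rev_id
LIMIT 500;
```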
I think T222497 should be resolved before this goes live. I can test it in deployment-prep before then, but I don't want to do production tests until there is some sort of batching.
Aug 9 2019
Thanks a lot! I've updated the patch above to remove those entries. Now just waiting on the wb_terms migration to get further along.
The entry in the text row points to a non-existent blob in a cluster.
```
firstname.lastname@example.org(zhwiki)> select * from text where old_id = 9375723;
+---------+---------------+-----------+------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
| old_id  | old_namespace | old_title | old_text         | old_comment | old_user | old_user_text | old_timestamp | old_minor_edit | old_flags           | inverse_timestamp |
+---------+---------------+-----------+------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
| 9375723 | 0             |           | DB://cluster20/0 |             |        0 |               |               |              0 | utf-8,gzip,external |                   |
+---------+---------------+-----------+------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
1 row in set (0.00 sec)
```
The id of 0 after the cluster20 address is the issue, just like other entries on this ticket.
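A hedged sketch of how one might hunt for other rows with the same malformed address; the LIKE pattern assumes all the bad entries share the trailing /0 form:

```sql
-- Sketch only: find text rows whose external storage address
-- ends in /0, i.e. points at a nonexistent blob id.
SELECT old_id, old_text
FROM text
WHERE old_flags LIKE '%external%'
  AND old_text LIKE 'DB://cluster%/0';
```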
Aug 8 2019
The python scripts at the dump end are (mostly) protected against exceptions from MediaWiki generally and from this failure case in particular. Since we have problematic data in production I've re-opened the ticket so that the WikiBase issue can somehow be resolved.
Most of these were handled in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/524666/ but not quite all.
I should clarify; I think the only thing that is needed is to set the content model column for those rows in the content table to 1 (or whichever model is 'wikitext'). The dewikiversity tickets are similar.
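If it really is just flipping the model on those rows, something along these lines should do it; the content_id values here are placeholders, and the subquery assumes the content_models mapping table has a 'wikitext' entry on that wiki:

```sql
-- Sketch only: point the bad content rows at the wikitext model.
UPDATE content
SET content_model = (
    SELECT model_id FROM content_models WHERE model_name = 'wikitext'
)
WHERE content_id IN (123, 456);  -- placeholder ids; substitute the real ones
```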
I expected that; this requires direct intervention at the db level. I was sort of hoping you were volunteering to do it :-D
Aug 7 2019
@MarcoAurelio Wonderful! It's not the pages though; in the page entry for each of the bad revisions, the content model is listed as wikitext. It's only in the content table that the wrong content model (4) is shown.
The entries in the content table are listed in T207627#5105046 (double-checked just now to be sure the list is still the same). The slot, revision and page info corresponding to each of those is listed in the same comment, double-checked just now to be sure none of that has changed either.
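For double-checking this sort of mismatch in one pass, a hedged sketch joining page through to content along the MCR tables (simplified, and the model id of 1 for 'wikitext' is an assumption; check content_models on that wiki):

```sql
-- Sketch only: list rows where the content table's model is not wikitext,
-- following page -> revision -> slots -> content.
SELECT page_id, rev_id, content_id, content_model
FROM page
JOIN revision ON rev_page = page_id
JOIN slots    ON slot_revision_id = rev_id
JOIN content  ON content_id = slot_content_id
WHERE content_model <> 1;  -- 1 assumed to be 'wikitext' here
```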
Aug 5 2019
@aaron This revision rEFLR848ef073fa89036c40c440016a8092690ddcf56b for FlaggedRevs seems to indicate that flaggedrevs_stats and flaggedrevs_stats2 are no longer used. Do you know or can you point me to someone who could verify that this is the case? If they aren't used, I will add them to my 'don't ever dump these' list. Thanks!
I'm going to decline this because we would have to walk through and decide which entries can be published and which ones cannot.
This should be done by letting MediaWiki do the work, rather than re-implementing the logic in the python scripts and needing to keep it in sync.
The pages-logging xml dumps already do that for us.
I sincerely apologize: this weekend the heat baked my brain and I did nothing related to computers at all. And Friday evening I was out. I'll set a notification to remind me this coming Friday earlier in the day, so that this gets done.
Aug 2 2019
Aug 1 2019
Jul 30 2019
The category links are the fallback as designed, so this is a net positive. Going to go update the CR on the patch.
Ran the following with old and new code for abstracts:
```
/usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=anwiki /srv/v/mediawiki/php-1.34.0-wmf.15/extensions/ActiveAbstract/includes/AbstractFilter.php --full --report=1 --output=file:/mnt/dumpsdata/temp/dumpsgen/abstracts-anwiki-cr-testing.txt.test --filter=noredirect --filter=abstract --skip-header --start=36142 --skip-footer --end 36150
```
Is this still an issue? It wasn't even on my radar, but I saw it just now via a change search in Phab.
The wikis ran to completion, but I forgot to close this. Doing so now!