
Dumps twisted in several languages
Closed, Resolved · Public

Description

Author: risanecek

Description:
Many dumps are twisted (corrupted) in many languages.

While syntactically correct, titles do not correspond to content.

e.g. "A mír na Zemi!" in the Czech wiki has the text of "Singapore" in the dump. I've found this across all the languages, although it does not seem to affect
all articles. (cswiki dump as of 20100411)
If you need more examples, I can provide them.

<page>
  <title>A mír na Zemi!</title>
  <id>70749</id>
  <revision>
    <id>5178497</id>
    <timestamp>2010-04-03T22:56:32Z</timestamp>
    <contributor>
      <username>Chalupa</username>
      <id>3656</id>
    </contributor>
    <comment>obrázek z commons</comment>
    <text xml:space="preserve">{{Infobox stát|
  genitiv = Singapuru
| úřední název = Republic of Singapore&lt;br /&gt;新加坡共和国&lt;br /&gt;Republik Singapura&lt;br /&gt;சிங்கப்பூர் குடியரசு
| vlajka = Flag of Singapore.svg
| článek o vlajce = Singapurská vlajka
| znak =
| mapa umístění = LocationSingapore.png

...


Version: unspecified
Severity: critical
URL: upload.wikimedia.org/wikipedia/commons/9/95/Image-Dadd|

Details

Reference
bz23264

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:10 PM)
bzimport set Reference to bz23264.

The recent dumps for cswiki, ltwiktionary, thwiki and elwiki had to be interrupted, as they were hung. They were restarted by forcefully shooting threads. Please use files from the previous dumps. You'll see on the index page (http://dumps.wikimedia.org/backup-index.html) messages like "Dump complete, 2 items failed".

If you are seeing this in some other dump than the above, please note it here. Thanks.

abxabx wrote:

I see it in pl.wiktionary dump and reported a few days ago in bug #18651

I'm seeing this show up consistently for any stalled run:

Warning: XMLReader::read(): compress.bzip2:///mnt/dumps/public/frwiktionary/20100422/frwiktionary-20100422-pages-articles.xml.bz2:6208817: parser error : Extra content at the end of the document in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151
Warning: XMLReader::read(): [[zh:extravasa in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151
Warning: XMLReader::read(): ^ in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151
Warning: XMLReader::read(): An Error Occured while reading in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151

We're going along two paths for this right now.

For the first, we are:

  1. Turning off prefetch, since some previous snapshots are bad
  2. Turning off spawning of child processes, since we are seeing inter-process messaging break down

This is now being tested on snapshot2 and if we see no issues will be propagated to production.

For the second we are testing out a potential bugfix to the core issue and making sure it has no unexpected consequences.

This is being tested on snapshot3

Ascander wrote:

It could be a new instance of the same problem reported last June for the Spanish Wikipedia (see Bug 18694 [https://bugzilla.wikimedia.org/show_bug.cgi?id=18694]).

Our problem was:

Most articles last modified between mid-January 2009 and mid-April 2009 appeared with the wrong content in dumps. The reason was never discovered, but as this problem affected several kinds of users, we dealt with it by updating all articles last modified in that period (spell checking, cosmetic changes, and finally, useless changes).

Just an update: I'm taking the opportunity to refactor dumpTextPass, fetchText, backup.inc and dumpBackup.php so that the Maintenance class is used appropriately and so that we can add timeouts to reads and writes properly. I should be testing the new code on one of the snapshot hosts tomorrow afternoon. This should address the "revisions out of sync" as well as the "backup processes hang indefinitely on write" issues.
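For readers following along, here is a minimal sketch of what "adding a timeout to a read" can look like. It is not the actual PHP change described above; it is a Python illustration, and the function name `read_with_timeout` and its parameters are made up for this example. It assumes a Unix-like system where `select` works on pipe file descriptors.

```python
import os
import select


def read_with_timeout(fd, max_bytes, timeout_s):
    """Read up to max_bytes from fd, or raise TimeoutError if no data arrives.

    Hypothetical helper: waits at most timeout_s seconds for fd to become
    readable instead of blocking forever on a stalled child process.
    """
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        raise TimeoutError(f"no data on fd {fd} within {timeout_s}s")
    return os.read(fd, max_bytes)
```

A parent process that talks to a text-fetching child over a pipe could call this in place of a plain blocking read, and then restart the child or skip the revision when `TimeoutError` is raised, rather than hanging indefinitely.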

Any news on this? Fresher dumps would be welcome :).

tests look good, going to try to run some production dumps this afternoon.

Is there any news about this?

I am running cswiki now; when it's done I'll make it available via the downloads page. It should be inspected closely by a regular user of the dumps to see if it's correct. If someone else watching this bug is on a smaller project and would be interested in getting dumps now and checking them for accuracy, I'd be happy to run a set. Once I have a few dumps verified as ok, I'll do a full run through all the projects.

If you run on Rowiki pages-articles, I can do some ad hock checking against previous versions. Not sure how much help that would be. And I have to say pages-articles dumps would still be useful even if they are somewhat broken, as long as we know!

(In reply to comment #12)

If you run on Rowiki pages-articles, I can do some ad hock checking against

: Ad hock? Means I will be drinking I suppose.

Egmontaz wrote:

I can check el.wikipedia dump against the last erroneous and the previous 2 good ones, I usually work with pages-meta-current, but will do it with pages-articles too.

(In reply to comment #11)

I am running cswiki now; when it's done I'll make it available via the
downloads page. It should be inspected closely by a regular user of the dumps
to see if it's correct. If someone else watching this bug is on a smaller
project and would be interested in getting dumps now and checking them for
accuracy, I'd be happy to run a set. Once I have a few dumps verified as ok,
I'll do a full run through all the projects.

If you make ptwikt (wikt, not wiki) available, I'll be happy to analyze it too.

Currently running: cs, ro, el. I'll start up ptwikt once one of those completes. We won't be updating the central index page but I'll add a note here as they become available.

elwiki is temporarily on hold. I ran across a revision with unretrievable text: see http://el.wikipedia.org/w/index.php?title=%CE%A3%CF%85%CE%B6%CE%AE%CF%84%CE%B7%CF%83%CE%B7_%CF%87%CF%81%CE%AE%CF%83%CF%84%CE%B7:Geraki/%CE%91%CF%81%CF%87%CE%B5%CE%AF%CE%BF_9&oldid=1422393 for the particular revision. I'll probably restart the job tomorrow and ignore this one revision's text. At some point we should decide how to mark up pages for which there are errors. We already put 'deleted' in some fields, perhaps we should have error indications as well.

If folks have other thoughts, please chime in during the next 6-7 hours or so (while I sleep).

You may have forgotten about pt.wikt's dump... ;)

No, I didn't forget. I was hoping for the folks with ro and cs to look at those before I continued on. However, since you asked again (and they haven't commented yet), I'm running it now.

Please see (and check closely) http://dumps.wikimedia.org/ptwiktionary/20100524/ and let me know if there are any issues. Thanks.

Ariel,
I didn't find any inconsistencies so far, but I never found any in previous pt.wikt dumps either. I will let you know if I find anything meanwhile.
Thanks.

OK, the notable difference is that the file namespace seems to have been renamed since April: it is now Fisier. However, I developed signatures for both dumps and compared them. The correlation is very high, and internal consistency also seems good. Spot checks of differences were supported by the history pages of the wiki.
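The comment above does not say how the signatures were built. One possible approach, sketched here purely as an assumption, is to hash each page's text per title in both dumps and list the titles whose hashes differ, so that those can be spot-checked against the wiki history. The names `page_signatures` and `compare_signatures` are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace prefix such as '{http://...}title'."""
    return tag.rsplit("}", 1)[-1]


def page_signatures(source):
    """Map each page title to a SHA-1 of its text.

    source is a file name or a binary file object holding dump XML.
    If a page has several revisions, the last one in the file wins.
    """
    sigs = {}
    title, text = None, ""
    for _, elem in ET.iterparse(source, events=("end",)):
        name = _local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            sigs[title] = hashlib.sha1(text.encode("utf-8")).hexdigest()
            title, text = None, ""
            elem.clear()  # drop the page's children to limit memory use
    return sigs


def compare_signatures(old, new):
    """Report titles whose text changed, appeared, or disappeared."""
    common = old.keys() & new.keys()
    return {
        "changed": sorted(t for t in common if old[t] != new[t]),
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
    }
```

A high share of unchanged titles between a known-good dump and a new one, plus changed titles that match real edits in the page history, is the kind of evidence described above. A dump with shifted content would instead show nearly every title as changed.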

Attending to Tomasz's request and confirming what I said above (#23), no inconsistencies were found in the latest ptwiktionary dump. Obviously, I didn't check everything but did a few random checks, so it's possible that inconsistencies exist and were not noticed.

glwiki run completed, see http://dumps.wikimedia.org/glwiki/20100524/

Content of glwiki-20100524-pages-meta-current seems good. I fully scanned the main namespace and all pages seemed related to the purported title.

Egmontaz wrote:

elwiki seems good. I did random checks and the usual queries I do, and everything seems consistent; none of the problems I had with the previous dump showed up.

Running one worker (one queue of dumps) now. They should be showing up on http://dumps.wikimedia.org/backup-index.html already.

Yes, testwiki won't run correctly until my fixes are deployed in the production branch (special case).

I have moved all of the previous bad dumps (April 11 through May 2 2010) to a separate location so they will no longer show up on the download page. Dumps will continue running, doing projects with the oldest good dump first.

Some fixes which should help to prevent an occurrence of this bug have been committed to trunk.

2010-06-21 08:04:07 enwiki (new): missing status record

Not sure what's happening here, the date seems to be today's date - i.e. when I looked on the 16th it said

2010-06-15 08:04:07 enwiki (new): missing status record

enwiki dumps are not running right now (any job that might have started can be ignored). We expect to start a job for it later in the week once migration to the new storage server has been completed.

Yahoo extracts failed

2010-06-29 20:54:09 failed Extracted page abstracts for Yahoo

Database returned error "0: "

  • abstract.xml

As of today, several dumps (simplewiki, elwiki, cswiki) seem to have been stuck since 5th July.

g33kdyoo wrote:

simplewiki is stuck on rev 1832000, which I am able to fetch myself.

Closing this since there have been no new reports of text content drift after putting the text length check in place (and fixing the underlying bug in mid 2010 that caused the issue).
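The thread does not show the text length check itself. As a rough illustration only, a check of this kind could compare the byte length of the fetched text with the length recorded for the revision (MediaWiki stores this in `rev_len`) and retry on a mismatch. The function `checked_text`, the `fetch` callable and the retry count below are assumptions for the sketch, not the deployed code.

```python
def checked_text(fetch, rev_id, expected_len, retries=3):
    """Return the text for rev_id only if its UTF-8 byte length matches.

    fetch is any callable taking a revision id and returning a str or None
    (hypothetical; stands in for the real text-fetching step).
    """
    for _ in range(retries):
        text = fetch(rev_id)
        if text is not None and len(text.encode("utf-8")) == expected_len:
            return text
    raise ValueError(
        f"revision {rev_id}: text length never matched "
        f"{expected_len} bytes after {retries} tries"
    )
```

A length match does not prove the text belongs to the revision, but a response that has drifted to a different revision will usually have a different length, so the mismatch is caught before the wrong text is written under the wrong title.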