In T364250#9790566, @xcollazo wrote:
Bunch of recoverable errors reported on https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/efjbIJHS--Q/m/LekH2tawAgAJ. A sample:

*** Wiki: wikidatawiki
=====================
[20240512132606]: Skipping bad text id 44423a2f2f636c757374657232392f3739303236383533 of revision 2148925565
[20240512132620]: Skipping bad text id 44423a2f2f636c757374657232392f3739303435343333 of revision 2148962571
[20240512133014]: Skipping bad text id 44423a2f2f636c757374657232382f3739303137383638 of revision 2148962836
[20240512133027]: Skipping bad text id 44423a2f2f636c757374657232392f3739303336323538 of revision 2148944452
[20240512133121]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339343635 of revision 2148950677
[20240512133833]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339373430 of revision 2148951238
[20240512134223]: Skipping bad text id 44423a2f2f636c757374657232382f3739303035343633 of revision 2148937955
[20240512134256]: Skipping bad text id 44423a2f2f636c757374657232382f3739303033373434 of revision 2148934567

I have not seen these errors before, and they coincide with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020948 being merged, which is part of T362566.
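The "bad" text ids in the sample are not arbitrary: each one is a hex-encoded ASCII string. The following is a minimal Python sketch that decodes two ids copied from the log; `decode_text_id` is a hypothetical helper name used only for illustration and is not part of MediaWiki or the dumps code.

```python
# Hex-encoded text ids copied verbatim from the log sample.
SAMPLE_IDS = [
    "44423a2f2f636c757374657232392f3739303236383533",
    "44423a2f2f636c757374657232382f3739303137383638",
]


def decode_text_id(hex_id: str) -> str:
    """Decode a hex string into ASCII text (hypothetical helper, not MediaWiki code)."""
    return bytes.fromhex(hex_id).decode("ascii")


for hex_id in SAMPLE_IDS:
    # Prints DB://cluster29/79026853 and DB://cluster28/79017868
    print(decode_text_id(hex_id))
```

Both ids decode to strings of the form DB://clusterNN/NNNNNNNN, which look like blob addresses rather than numeric text ids. That is only an observation about the logged values; it does not by itself show where the hex encoding is introduced.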
Perusing the code, the message is coming from:
...
protected function getText( $id, $model = null, $format = null, $expSize = null ) {
    if ( !$this->isValidTextId( $id ) ) {
        $msg = "Skipping bad text id " . $id . " of revision " . $this->thisRev;
        $this->progress( $msg );
        return '';
    }
...

and isValidTextId() is defined as:

private function isValidTextId( $id ) {
    if ( preg_match( '/:/', $id ) ) {
        return $id !== 'tt:0';
    } elseif ( preg_match( '/^\d+$/', $id ) ) {
        return intval( $id ) > 0;
    }
    return false;
}
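To see why the ids in the log are rejected, here is a minimal Python sketch that mirrors the logic of isValidTextId() as quoted above. The name `is_valid_text_id` is a hypothetical port for illustration only, not MediaWiki code, and the sample id is taken from the log.

```python
import re


def is_valid_text_id(text_id: str) -> bool:
    """Illustrative Python port of the quoted PHP isValidTextId()."""
    if ":" in text_id:  # mirrors preg_match( '/:/', $id )
        return text_id != "tt:0"
    if re.match(r"^\d+$", text_id, re.ASCII):  # mirrors preg_match( '/^\d+$/', $id )
        return int(text_id) > 0
    return False


# One of the ids from the log, and the ASCII text its hex digits decode to.
hex_id = "44423a2f2f636c757374657232392f3739303236383533"
decoded = bytes.fromhex(hex_id).decode("ascii")  # "DB://cluster29/79026853"

print(is_valid_text_id(hex_id))   # False: no colon, and not purely decimal
print(is_valid_text_id(decoded))  # True: contains a colon and is not 'tt:0'
print(is_valid_text_id("12345"))  # True: positive integer id
print(is_valid_text_id("0"))      # False: not greater than zero
```

The hex-encoded id contains no literal colon and includes the letters a-f, so it matches neither branch and falls through to `return false`, while its decoded form would pass the check. This only explains why the message is emitted for these values.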
In T364250#9791419, @xcollazo wrote:
In T364250#9790650, @Ladsgroup wrote:
We should get rid of that part and bump the XML version as Daniel suggested. Wanna do it?
I'm afraid I am not versed in MediaWiki development. I just happen to be the guy that inherited Dumps 1.0 via a series of unfortunate events.
I am happy to bump the XML version and get rid of the text id as part of the Dumps 2.0 work, though (Dumps 2.0 is a tech stack of Hadoop/Flink/Spark, which I am very comfortable with).
In T364250#9792093, @Ladsgroup wrote:
We all have had such misfortunes! Fixing the issue in mw is rather easy: we should just drop that piece of code, bump the version number in several places, update the schema validation, and update the tests. Here is an example: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/464768
I'd be more than happy to review or help getting it done.