BatchRowIterator slow query on commonswiki
Closed, Resolved, Public

Description

This kind of query might need tuning:

SELECT /* BatchRowIterator::next  */  cl_from,cl_to  FROM `categorylinks`    WHERE cl_type = 'subcat' AND cl_from IN (92967652,234494,24559020,960551,3007520,76398273,6972234,363488,2257260,4157420,89319925,84920900,41797907,61421859,92055128,9221880,14562,26762776,33298380,65449552,3795363,66235719,42442426,89319828,27708617,2563533,66701920,22548996,108484,25232065,6846286,43665564,2257433,8811984,84203487,3837544,5324927,8645978,'',805218,1078394,81978764,391851)   ORDER BY cl_from ASC,cl_to ASC LIMIT 200

root@db1121.eqiad.wmnet[commonswiki]> show explain for 751641971;
+------+-------------+---------------+-------+---------------+---------+---------+------+-----------+-----------------------------+
| id   | select_type | table         | type  | possible_keys | key     | key_len | ref  | rows      | Extra                       |
+------+-------------+---------------+-------+---------------+---------+---------+------+-----------+-----------------------------+
|    1 | SIMPLE      | categorylinks | index | PRIMARY       | PRIMARY | 261     | NULL | 396988884 | Using where; Using filesort |
+------+-------------+---------------+-------+---------------+---------+---------+------+-----------+-----------------------------+
1 row in set, 1 warning (0.348 sec)

I am not sure who owns this piece of code; the only thing I have been able to find on Gerrit is: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/299578
If anyone knows better tags for this, please feel free to change/add them!

Event Timeline

Umherirrender subscribed.

Someone is running maintenance/dumpCategoriesAsRdf.php, which is where the query comes from.

That someone is probably cron: modules/snapshot/files/cron/dumpcategoriesrdf.sh - T173892: Setup dump for categories RDF representation and T198356: Generate daily diffs for recently changed categories

None of the indexes on categorylinks covers this use case of filtering on cl_type and cl_from:

--
-- Track category inclusions *used inline*
-- This tracks a single level of category membership
--
CREATE TABLE /*_*/categorylinks (
  -- Key to page_id of the page defined as a category member.
  cl_from int unsigned NOT NULL default 0,

  -- Name of the category.
  -- This is also the page_title of the category's description page;
  -- all such pages are in namespace 14 (NS_CATEGORY).
  cl_to varchar(255) binary NOT NULL default '',

  -- A binary string obtained by applying a sortkey generation algorithm
  -- (Collation::getSortKey()) to page_title, or cl_sortkey_prefix . "\n"
  -- . page_title if cl_sortkey_prefix is nonempty.
  cl_sortkey varbinary(230) NOT NULL default '',

  -- A prefix for the raw sortkey manually specified by the user, either via
  -- [[Category:Foo|prefix]] or {{defaultsort:prefix}}.  If nonempty, it's
  -- concatenated with a line break followed by the page title before the sortkey
  -- conversion algorithm is run.  We store this so that we can update
  -- collations without reparsing all pages.
  -- Note: If you change the length of this field, you also need to change
  -- code in LinksUpdate.php. See T27254.
  cl_sortkey_prefix varchar(255) binary NOT NULL default '',

  -- This isn't really used at present. Provided for an optional
  -- sorting method by approximate addition time.
  cl_timestamp timestamp NOT NULL,

  -- Stores $wgCategoryCollation at the time cl_sortkey was generated.  This
  -- can be used to install new collation versions, tracking which rows are not
  -- yet updated.  '' means no collation, this is a legacy row that needs to be
  -- updated by updateCollation.php.  In the future, it might be possible to
  -- specify different collations per category.
  cl_collation varbinary(32) NOT NULL default '',

  -- Stores whether cl_from is a category, file, or other page, so we can
  -- paginate the three categories separately.  This only has to be updated
  -- when moving pages into or out of the category namespace, since file pages
  -- cannot be moved to other namespaces, nor can non-files be moved into the
  -- file namespace.
  cl_type ENUM('page', 'subcat', 'file') NOT NULL default 'page',
  PRIMARY KEY (cl_from,cl_to)
) /*$wgDBTableOptions*/;


-- We always sort within a given category, and within a given type.  FIXME:
-- Formerly this index didn't cover cl_type (since that didn't exist), so old
-- callers won't be using an index: fix this?
CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to,cl_type,cl_sortkey,cl_from);

-- Used by the API (and some extensions)
CREATE INDEX /*i*/cl_timestamp ON /*_*/categorylinks (cl_to,cl_timestamp);

-- Used when updating collation (e.g. updateCollation.php)
CREATE INDEX /*i*/cl_collation_ext ON /*_*/categorylinks (cl_collation, cl_to, cl_type, cl_from);

The Wikidata team doesn't maintain WDQS; that's the Search Platform team.

The table has PRIMARY KEY (cl_from,cl_to)

So it could use a range scan on the primary key? The ordering (ORDER BY cl_from ASC, cl_to ASC) follows the primary key to make the query ready for unique pagination.

But for the long list of cl_from values it needs a filesort to find the subcats among them, which are only a few rows compared to the pages.

As this is a maintenance script and the run duration is not a big problem, the subcat filter may need to move into the code instead. That can, however, result in selecting millions of rows of which none are needed.

Or trust cat_subcats and only select cl_from values where cat_subcats is non-zero, to reduce the list of cl_from (see the sketch below).
Or, more generally, reduce the batch size of the first query (the one that fetches the cl_from values) to get a smaller "cl_from IN" list.
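
A literal sketch of that cat_subcats idea (hypothetical, not an actual patch; it assumes the batch rows already carry cat_subcats, as the first query of the maintenance script does):

// Keep only page ids whose category row reports subcategories
// (cat_subcats > 0), shrinking the later "cl_from IN (...)" list.
$subcatCandidates = [];
foreach ( $batch as $row ) {
	if ( (int)$row->cat_subcats > 0 ) {
		$subcatCandidates[$row->page_id] = $row->page_title;
	}
}
// ... and pass array_keys( $subcatCandidates ) to the categorylinks query.
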
I wonder why the query contains a '' (empty string) in the IN clause; the page ids should all be integers. The cl_from values come from the page table via a previous select.

Following the category tree is always expensive, and is done on wmflabs for this purpose (PetScan)

@Cparle is this something the SD team should look at?

Daily RDF dumps are probably broken until this is resolved; just an FYI for folks importing these for search purposes.

SELECT /* BatchRowIterator::next  */  cl_from,cl_to  FROM `categorylinks`    WHERE cl_type = 'subcat' AND cl_from IN (92967652,234494,24559020,960551,3007520,76398273,6972234,363488,2257260,4157420,89319925,84920900,41797907,61421859,92055128,9221880,14562,26762776,33298380,65449552,3795363,66235719,42442426,89319828,27708617,2563533,66701920,22548996,108484,25232065,6846286,43665564,2257433,8811984,84203487,3837544,5324927,8645978,'',805218,1078394,81978764,391851)   ORDER BY cl_from ASC,cl_to ASC LIMIT 200

What is that empty thing in there? I see ' ' in the list of ids.

At a quick guess, it's a null (or something similar) that's being cast to a string so comes out as ''...

Presumably the validation of the list being built isn't doing a very good job

When I ran the above query on db2073 (codfw dumps and vslow host) without the crap ' ' field in there, it returned in 0.00 seconds. Maybe the bad entries are a new development?

Looking at the code...

		foreach ( $this->getCategoryIterator( $dbr ) as $batch ) {
			$pages = [];
			foreach ( $batch as $row ) {
				$this->categoriesRdf->writeCategoryData(
					$row->page_title,
					$row->pp_propname === 'hiddencat',
					(int)$row->cat_pages - (int)$row->cat_subcats - (int)$row->cat_files,
					(int)$row->cat_subcats
				);
				$pages[$row->page_id] = $row->page_title;
			}

			foreach ( $this->getCategoryLinksIterator( $dbr, array_keys( $pages ) ) as $row ) {
				$this->categoriesRdf->writeCategoryLinkData( $pages[$row->cl_from], $row->cl_to );
			}
			fwrite( $output, $this->rdfWriter->drain() );
		}

The answer would seemingly be from array_keys( $pages ) ... But how would a page_id end up not being a number if pulled from the DB?
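
For what it's worth, plain PHP would explain a null showing up as '': a null array key is cast to the empty string, so array_keys() later returns '' for it. A minimal demonstration:

$pages = [];
$pages[null] = 'Some_title';       // the null key is stored under ''
var_dump( array_keys( $pages ) );  // array(1) { [0]=> string(0) "" }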

Even more so as in getCategoryIterator it's using page as the main table, and joining in others... ie the page table isn't the one LEFT JOIN'd in...

I guess it'd be useful to get the actual SQL query being generated by this (ie by the code, not by what it should be if a human built it from that):

	/**
	 * Produce row iterator for categories.
	 * @param IDatabase $dbr Database connection
	 * @return RecursiveIterator
	 */
	public function getCategoryIterator( IDatabase $dbr ) {
		$it = new BatchRowIterator(
			$dbr,
			[ 'page', 'page_props', 'category' ],
			[ 'page_title' ],
			$this->getBatchSize()
		);
		$it->addConditions( [
			'page_namespace' => NS_CATEGORY,
		] );
		$it->setFetchColumns( [
			'page_title',
			'page_id',
			'pp_propname',
			'cat_pages',
			'cat_subcats',
			'cat_files'
		] );
		$it->addJoinConditions(
			[
				'page_props' => [
					'LEFT JOIN', [ 'pp_propname' => 'hiddencat', 'pp_page = page_id' ]
				],
				'category' => [
					'LEFT JOIN', [ 'cat_title = page_title' ]
				]
			]

		);
		return $it;
	}

And then see what the result set looks like...
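
One way to capture that SQL without executing it would be Database::selectSQLText(), which takes the same arguments as select() but returns the built statement; a sketch mirroring getCategoryIterator() above:

$sql = $dbr->selectSQLText(
	[ 'page', 'page_props', 'category' ],
	[ 'page_title', 'page_id', 'pp_propname', 'cat_pages', 'cat_subcats', 'cat_files' ],
	[ 'page_namespace' => NS_CATEGORY ],
	__METHOD__,
	[ 'ORDER BY' => 'page_title ASC', 'LIMIT' => 200 ],
	[
		'page_props' => [ 'LEFT JOIN', [ 'pp_propname' => 'hiddencat', 'pp_page = page_id' ] ],
		'category' => [ 'LEFT JOIN', [ 'cat_title = page_title' ] ],
	]
);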

The two sets of code in question don't seem to have changed in any related way recently, so it could just be some bad stuff in the actual DB

The question would be how far through (ie after how many batches) this error occurs

I saw multiple queries with this string in them while camping on the production vslow and looking at the processlist. I don't know how many of the queries have this issue.

Just for completeness, on db2073 I also ran the original query with the crap entry; the show explain showed use of a filesort as above, and the execution time was... well, it's still going, 330 seconds in. I killed it.

I guess it'd be useful to get the actual SQL query being generated by this (ie by the code, not by what it should be if a human built it from that):

SELECT  page_title,page_id,pp_propname,cat_pages,cat_subcats,cat_files  FROM `page` LEFT JOIN `page_props` ON (pp_propname = 'hiddencat' AND (pp_page = page_id)) LEFT JOIN `category` ON ((cat_title = page_title))   WHERE page_namespace = 14  ORDER BY page_title ASC LIMIT 200

Thanks Cormac!

I'd basically done the same thing in the background too. That query on commons, without a limit, gives 9197995 rows

Something else I don't understand about this - if I take the ids (minus the empty string) from that query that fails and run this:

select page_title,page_id from page where page_id in (92967652,234494,24559020,960551,3007520,76398273,6972234,363488,2257260,4157420,89319925,84920900,41797907,61421859,92055128,9221880,14562,26762776,33298380,65449552,3795363,66235719,42442426,89319828,27708617,2563533,66701920,22548996,108484,25232065,6846286,43665564,2257433,8811984,84203487,3837544,5324927,8645978,805218,1078394,81978764,391851) order by page_title;

I get this

+------------------------------------------------------------------------------------+----------+
| page_title                                                                         | page_id  |
+------------------------------------------------------------------------------------+----------+
| 1865_births                                                                        |   234494 |
| Am_Südpark_(Köln)                                                                  | 24559020 |
| Artists_from_Germany                                                               |   960551 |
| Association_football_clubs_in_the_Czech_Republic                                   |  3007520 |
| Association_football_in_Benešov_District                                           | 76398273 |
| Badges_of_rank_of_the_Schutzstaffel                                                |  6972234 |
| Breweries                                                                          |   363488 |
| Brick_walls                                                                        |  2257260 |
| Broken_category_redirects                                                          |  4157420 |
| Cemeteries_in_Berezivka_Raion                                                      | 89319925 |
| Cemeteries_in_Odesa_Oblast_by_raion                                                | 84920900 |
| Cold_soups                                                                         | 41797907 |
| Cold_soups_of_Turkey                                                               | 61421859 |
| Collective_works_of_Suriname                                                       | 92055128 |
| Creator_template_home_categories                                                   |  9221880 |
| Culture                                                                            |    14562 |
| Enrico_Ruggeri                                                                     | 26762776 |
| Files_reused_by_external_parties_out_of_compliance_with_licensing_terms            | 33298380 |
| Fomes_fomentarius_in_the_Netherlands                                               | 65449552 |
| Gate_of_Salutation_(Topkapı_Palace)                                                |  3795363 |
| Gomphaceae_by_country                                                              | 66235719 |
| Gomphales_by_country                                                               | 42442426 |
| Graves_in_Berezivka_Raion                                                          | 89319828 |
| Graves_in_Odessa_Oblast                                                            | 27708617 |
| Hees                                                                               |  2563533 |
| Hees_(Nijmegen)                                                                    | 66701920 |
| Hotels_in_Taichung                                                                 | 22548996 |
| Interior_decoration                                                                |   108484 |
| Language_templates_with_no_text_displayed                                          | 25232065 |
| Members_of_the_Communist_Party_of_the_Russian_Federation                           |  6846286 |
| Members_of_the_Communist_Party_of_the_Russian_Soviet_Federative_Socialist_Republic | 43665564 |
| Mud_walls                                                                          |  2257433 |
| Newspaper_clippings                                                                |  8811984 |
| Organizations_of_Benešov_District                                                  | 84203487 |
| PD_text                                                                            |  3837544 |
| Porte_Sante_cemetery                                                               |  5324927 |
| Screenshots_of_free_software                                                       |  8645978 |
| Sogndal                                                                            |   805218 |
| Sorbus_alnifrons                                                                   | 92967652 |
| Stone_walls                                                                        |  1078394 |
| Syndicalists                                                                       | 81978764 |
| Trujillo_(Peru)                                                                    |   391851 |
+------------------------------------------------------------------------------------+----------+

Why do I have such a big range of page_title strings, if the original query selecting the ids is ordered by page_title?

Hacking in a search for '' in the SQL... The only results I get are like the below, where the '' comes from an escaped ' as part of a page name

  string(90) "( cl_from = '60585988' AND cl_to > 'Ulmus_\'Concavaefolia\'' ) OR ( cl_from > '60585988' )"
string(1983) "SELECT  cl_from,cl_to  FROM `categorylinks`    WHERE cl_type = 'subcat' AND cl_from IN (6917262,11731108,11731115,6917321,6917308,22999510,22999506,6917324,19618347,6917263,3901226,36136947,71803983,15481013,15643460,15643457,4168969,23720883,11731128,26544904,6917270,6917541,81853210,6917487,6917504,11355945,81853101,6917492,14751164,8831727,41214113,6917575,19146951,11731159,6917395,48699998,11731135,8843251,6917470,6917359,80175799,81853112,6917526,48700003,6917586,26543831,11355935,7417687,9744646,6917532,6917569,75438180,6917605,6917458,6917273,45631507,30329485,90128015,7100286,80347216,13335754,70286641,15643569,17462056,79164124,87026969,7270189,1503240,8374419,8480627,2088961,2088986,35410655,39175699,77851420,77851406,11301195,27675596,30481165,11301152,50148533,42139045,27675562,35626227,77851336,84656381,49129500,49129533,49129498,57618508,48948238,48948388,48948149,49243838,48948380,48948139,49129706,49129519,49129494,57618498,8300337,27675358,9814610,8300363,42139117,27675328,42139176,42107188,29042464,15723045,40463923,87003026,69747457,69747412,69747483,81202445,90219115,90218996,90219590,90218955,90219659,90219453,90219094,87310587,87321275,69747470,90219917,90219597,90219023,59283448,59285253,59284734,59284549,59308876,59609905,90219677,90219816,90220014,59285239,90219159,69747392,59308911,59309432,69747449,69747363,59283466,8300374,18294650,9814499,81518518,59607943,59608094,30302635,59614610,59611723,59611876,59612109,59608959,60670799,59610338,59610473,59609608,59609722,59608716,59609973,30302676,59614318,59723405,59687995,59723471,30218594,30218635,59608352,59688006,30218483,60671152,30218507,30258073,3537699,90663308,90663255,90663227,59643008,59610599,90663158,90662992,59643024,90663496,90663454,59610687,77851187,77851130,77851349,1503250,6609354,19838991,34323319,19838988,66147515,84866975)  AND (( cl_from = '30329485' AND cl_to > 'Bo\'+Bo\'Bo\'+Bo\'' ) OR ( cl_from > '30329485' ))  ORDER BY cl_from ASC,cl_to ASC LIMIT 200  "
  string(85) "( cl_from = '30329485' AND cl_to > 'Bo\'+Bo\'Bo\'+Bo\'' ) OR ( cl_from > '30329485' )"

But these queries have an extra condition on the end, which is not what the originally reported example had

	/**
	 * Uses the primary key list and the maximal result row from the
	 * previous iteration to build an SQL condition sufficient for
	 * selecting the next page of results.  All except the final key use
	 * `=` conditions while the final key uses a `>` condition
	 *
	 * Example output:
	 *     [ '( foo = 42 AND bar > 7 ) OR ( foo > 42 )' ]
	 *
	 * @return array The SQL conditions necessary to select the next set
	 *  of rows in the batched query
	 */
	protected function buildConditions() {
		if ( !$this->current ) {
			return $this->conditions;
		}

		$maxRow = end( $this->current );
		$maximumValues = [];
		foreach ( $this->primaryKey as $alias => $column ) {
			$name = is_numeric( $alias ) ? $column : $alias;
			$maximumValues[$column] = $this->db->addQuotes( $maxRow->{$name} );
		}

		$pkConditions = [];
		// For example: If we have 3 primary keys
		// first run through will generate
		//   col1 = 4 AND col2 = 7 AND col3 > 1
		// second run through will generate
		//   col1 = 4 AND col2 > 7
		// and the final run through will generate
		//   col1 > 4
		while ( $maximumValues ) {
			$pkConditions[] = $this->buildGreaterThanCondition( $maximumValues );
			array_pop( $maximumValues );
		}

		$conditions = $this->conditions;
		$conditions[] = sprintf( '( %s )', implode( ' ) OR ( ', $pkConditions ) );

		return $conditions;
	}
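
For reference, a sketch of how the categorylinks iterator is presumably set up (the exact code isn't quoted in this task, so the setup is inferred from the patterns above); with the two-column primary key this yields exactly the appended conditions seen in the logged queries:

$it = new BatchRowIterator( $dbr, 'categorylinks', [ 'cl_from', 'cl_to' ], $this->getBatchSize() );
$it->addConditions( [
	'cl_type' => 'subcat',
	'cl_from' => $pageIds, // array_keys( $pages ) from the caller
] );
$it->setFetchColumns( [ 'cl_from', 'cl_to' ] );
// After a batch ending at (cl_from = 30329485, cl_to = "Bo'+Bo'Bo'+Bo'"),
// buildConditions() appends:
//   ( cl_from = '30329485' AND cl_to > 'Bo\'+Bo\'Bo\'+Bo\'' ) OR ( cl_from > '30329485' )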

I wonder if I'm looking in the wrong place; maintenance/categoryChangesAsRdf.php also does a similar query (but from a different input), which might explain Cormac's confusion about the results not looking like the output of ORDER BY page_title ASC (ie the id list is being generated from recent changes instead)

That is also run by cron https://github.com/wikimedia/puppet/blob/c54076e866b49e3a737dc728a6d063f2b31ae076/modules/snapshot/files/cron/dumpcategoriesrdf-daily.sh

Obviously it'd be very helpful if BatchRowIterator::next wasn't coming up as the calling function... but I guess that's harder with the iterator function calls... I wonder if we can do much about that?

Setting a caller, as GenderCache or LinkBatch allow, and passing it through to the database as the function name would help - T260354
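
A hypothetical sketch of that suggestion (setCaller() here is the proposed addition from T260354, not an API that existed at the time):

$it = new BatchRowIterator( $dbr, 'categorylinks', [ 'cl_from', 'cl_to' ], $this->getBatchSize() );
$it->setCaller( __METHOD__ ); // the query comment would then name the real origin
// ... instead of every query being attributed to BatchRowIterator::next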

I took the brute-force approach of writing all queries to a log file by adding the appropriate fopen/fputs/fclose calls in Database::select (live on snapshot1010, a testbed host). I then ran:

dumpsgen@snapshot1010:/srv/mediawiki$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php maintenance/categoryChangesAsRdf.php --wiki=commonswiki -s 20200815200001 -e 20200817050001  | gzip > /srv/tmp/categories-out.gz
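
The hack itself was roughly the following (a sketch: the log path is illustrative, and $sql stands for the statement Database::select() is about to issue):

$fh = fopen( '/srv/tmp/query-log.txt', 'a' ); // illustrative path
if ( $fh !== false ) {
	fputs( $fh, $sql . "\n" );
	fclose( $fh );
}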

I examined the output and found numerous examples of queries with the ' ' string in them (without the space).

The following two queries were back-to-back, indicating that one was used to generate input for the next:

SELECT  page_id,cat_title AS `rc_title`,pp_propname,cat_pages,cat_subcats,cat_files  FROM `category` LEFT JOIN `page` ON ((page_title = cat_title) AND page_namespace = 14) LEFT JOIN `page_props` ON (pp_propname = 'hiddencat' AND (pp_page = page_id))   WHERE cat_title IN ('Bridges_over_Kunar_River_(Pakistan)','People_of_the_University_of_Wyoming','University_of_Wyoming','Bus_routes_numbered_144','Churches_in_the_Roman_Catholic_Archdiocese_of_Benevento','August_2020_in_Cardiff','Cardiff_Coach_Station,_Sophia_Gardens','Bus_stations_in_Cardiff','Sophia_Gardens','Logos_of_companies_based_in_Mecklenburg-Vorpommern','Rameswaram','Media_needing_categories_as_of_18_March_2018','All_media_needing_categories_as_of_2018','Pages_with_local_object_coordinates_and_missing_SDC_coordinates','CC-BY-SA-4.0','Self-published_work','Photographs_by_LigaDue','Civitella_Marittima','Pages_with_maps','Scans_from_the_Internet_Archive','CC-PD-Mark','PD_US_Government','FEDLINK_-_United_States_Federal_Collection','Books_uploaded_by_Fæ','Files_with_no_machine-readable_author','Former_bus_lines_in_Budapest','Bus_lines_in_Budapest','Plzeň_1','Plzeň','Plzeň-City_District','Kaufland_Plzeň-Roudná','Epta_Piges_(Rhodes)','PD_US_expired','Books_in_the_Library_of_Congress','Trains_at_Inuyama_Yuen_Station','Inuyama_Yuen_Station','People_in_1910','2_men','OCR_detected_cover_page','1910_photographs','Iwakura_Station_(Aichi)','Unidentified_subjects_in_Japan','名古屋鉄道の画像','駅名板画像','Alumni_of_the_University_of_Wyoming','Lety_memorial','Cultural_buildings_in_Burgos','Iwateken_Kotsu','岩手県交通の画像','Piet_Retief,_Mpumalanga','Quality_images_missing_SDC_source_of_file','Quality_images_missing_SDC_copyright_status','Quality_images_missing_SDC_copyright_license','Quality_images_missing_SDC_inception','Media_requiring_renaming','Media_requiring_renaming_-_rationale_6','Bus_routes_numbered_148','Stained-glass_windows_in_Burgenland','Stained-glass_windows_in_Austria_by_district','Rust_(Burgenland)','PD_NASA','Tropical_Storm_Josephine_(2020)','Quality_images_missing_SDC_Commons_quality_assessment','PD-old-100-expired','Medical_Heritage_Library','Nominated_valued_image_candidates','Iwate_Kyūkō_Bus','バス画像','Bus_routes_numbered_149','Quality_images_missing_SDC_creator','Bus_routes_numbered_150','1926-03-27','Breda,_Netherlands','一関市の画像','Bus_routes_numbered_147','Hernán_Cortés','Augusto_Belvedere','Ichinoseki_Station','1926_photographs','Items_with_OTRS_permission_confirmed','Files_with_PermissionOTRS_template_but_without_P6305_SDC_statement','Stolpersteine_in_Oslo-Gamle','Images_uploaded_by_Donna_Gedenk','Pages_with_local_camera_coordinates_and_missing_SDC_coordinates','1926_photographs_of_the_United_States','Schools_in_Quebec_City','Railway_photographs_by_Geof_Sheppard','Photographs_by_Geof_Sheppard')

SELECT  cl_from,cl_to  FROM `categorylinks`    WHERE cl_type = 'subcat' AND cl_from IN (16427435,77160905,29237265,5273988,93171207,8292833,49598671,48452708,73514884,73514913,93141746,73514933,73514942,5229557,65375295,89119256,49325694,2371050,11740061,71765819,2581799,12178689,16468547,3355416,92207293,56860321,45788180,4127763,47563334,102952,4314089,25108543,93119689,5062995,2255349,6788554,'',62189827,93056961)   ORDER BY cl_from ASC,cl_to ASC LIMIT 200

And lo and behold, when I run the first query, what do I get:

+----------+--------------------------------------------------------------------+-------------+-----------+-------------+-----------+
| page_id  | rc_title                                                           | pp_propname | cat_pages | cat_subcats | cat_files |
+----------+--------------------------------------------------------------------+-------------+-----------+-------------+-----------+
|  8057514 | 1910_photographs                                                   | NULL        |       509 |          12 |       497 |
| 93176596 | 1926-03-27                                                         | NULL        |         3 |           0 |         3 |
... other normal-looking stuff ...
| 93137742 | Stained-glass_windows_in_Austria_by_district                       | NULL        |        98 |          98 |         0 |
| 24821400 | Stained-glass_windows_in_Burgenland                                | NULL        |         7 |           7 |         0 |
|     NULL | Stolpersteine_in_Oslo-Gamle                                        | NULL        |        51 |           0 |        51 |
| 62189827 | Trains_at_Inuyama_Yuen_Station                                     | NULL        |        15 |           0 |        15 |
| 93056961 | Tropical_Storm_Josephine_(2020)                                    | NULL        |        37 |           0 |        37 |
| 18050441 | Unidentified_subjects_in_Japan                                     | NULL        |       494 |          13 |       481 |
|  4892503 | University_of_Wyoming                                              | NULL        |        65 |           7 |        58 |
|     NULL | バス画像                                                           | NULL        |         1 |           0 |         1 |
|     NULL | 一関市の画像                                                       | NULL        |         2 |           0 |         2 |
|     NULL | 名古屋鉄道の画像                                                   | NULL        |         3 |           0 |         3 |
|     NULL | 駅名板画像                                                         | NULL        |        14 |           0 |        14 |
+----------+--------------------------------------------------------------------+-------------+-----------+-------------+-----------+
85 rows in set (0.03 sec)

NULLs for page_id. Here's a sample: https://commons.wikimedia.org/wiki/Category:Stolpersteine_in_Oslo-Gamle (things in the category indeed, but the page does not yet exist). We need to filter those out of the subsequent query.
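
A minimal sketch of the needed filter (the actual change is the Gerrit patch below; field names follow the categoryChangesAsRdf query above):

$pages = [];
foreach ( $batch as $row ) {
	if ( $row->page_id === null ) {
		// The category row exists, but the category page itself does not yet;
		// skip it so no null leaks into the cl_from IN (...) list as ''.
		continue;
	}
	$pages[$row->page_id] = $row->rc_title;
}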

Change 620775 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/620775

Change 620900 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/core@master] Add test for dumping category without page

https://gerrit.wikimedia.org/r/620900

Change 623692 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] disable category rdf dumps for now

https://gerrit.wikimedia.org/r/623692

Change 623692 merged by ArielGlenn:
[operations/puppet@production] disable category rdf dumps for now

https://gerrit.wikimedia.org/r/623692

Change 620775 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/620775

I have tested the above patch by doing a manual run of the cron script on snapshot1008 as the dumpsgen user:

dumpsgen@snapshot1008:~$ /usr/local/bin/dumpcategoriesrdf.sh --config /etc/dumps/confs/wikidump.conf.other --list /srv/mediawiki/dblists/categories-rdf.dblist

It completed in a little under 4 hours for all wikis. What is needed to get the patch merged?

Change 625642 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] don't pass page id 0 to page related queries for category change rdf dumps

https://gerrit.wikimedia.org/r/625642

Change 620775 merged by jenkins-bot:
[mediawiki/core@master] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/620775

Change 625642 merged by jenkins-bot:
[mediawiki/core@master] don't pass null page id to page related queries for category change rdf dumps

https://gerrit.wikimedia.org/r/625642

Change 626041 had a related patch set uploaded (by Reedy; owner: ArielGlenn):
[mediawiki/core@REL1_35] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/626041

Change 626042 had a related patch set uploaded (by Reedy; owner: ArielGlenn):
[mediawiki/core@REL1_34] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/626042

Change 626043 had a related patch set uploaded (by Reedy; owner: ArielGlenn):
[mediawiki/core@REL1_31] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/626043

Change 626041 merged by jenkins-bot:
[mediawiki/core@REL1_35] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/626041

Change 626042 merged by jenkins-bot:
[mediawiki/core@REL1_34] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/626042

Change 626043 merged by jenkins-bot:
[mediawiki/core@REL1_31] don't include null page ids in query list for category dumps

https://gerrit.wikimedia.org/r/626043

Change 625642 merged by jenkins-bot:
[mediawiki/core@master] don't pass null page id to page related queries for category change rdf dumps

https://gerrit.wikimedia.org/r/625642

When this is deployed on the wikis I'll be able to re-enable category dumps, both dailies and weeklies, which should mean at the end of the week.

ArielGlenn claimed this task.

Re-enabled, checked daily runs, they look good, so I'm resolving this. Thanks, everybody!