
[Task] Review WikidataQuality DB schema
Closed, Resolved · Public · 1 Estimated Story Points

Description

The Wikibase-Quality extensions define a couple of database tables that need review before deployment. The table definitions can be found here (linking the v1 branch designated for deployment):

Code running queries against these can be found here:

  • WikidataQuality: includes/Violations/SqlViolationRepo.php
  • WikidataQualityConstraints: includes/ConstraintRepository.php
  • WikidataQualityExternalValidation: includes/ExternalDataRepo.php
  • WikidataQualityExternalValidation: includes/DumpMetaInformation/DumpMetaInformationRepo.php

Quick overview of definitions and queries in V1:

ConstraintRepository / wbqc_constraints

		CREATE TABLE IF NOT EXISTS /*_*/wbqc_constraints (
			constraint_guid        VARCHAR(255)  PRIMARY KEY,
			pid                    INT(11)       NOT NULL,
			constraint_type_qid    VARCHAR(255)  NOT NULL,
			constraint_parameters  TEXT          DEFAULT NULL
		) /*$wgDBTableOptions*/;

		CREATE INDEX /*i*/wbqc_constraints_pid_index
		ON /*_*/wbqc_constraints (pid);

		CREATE INDEX /*i*/wbqc_constraints_constraint_type_qid_index
		ON /*_*/wbqc_constraints (constraint_type_qid);

Queries:

		SELECT * FROM wbqc_constraints WHERE pid = "xxxxxxxxx";
		
		INSERT INTO wbqc_constraints VALUES ....; -- batch size in UpdateTable defaults to 1000
		
		DELETE FROM wbqc_constraints LIMIT 1000; -- looped until the table is empty; MySQL only.

insertBatch() does a $db->commit( __METHOD__, "flush" ). That's probably not a good idea; UpdateTables could do the commit instead.

ExternalDataRepo / wbqev_external_data

		CREATE TABLE IF NOT EXISTS /*_*/wbqev_external_data (
			row_id          BIGINT UNSIGNED  PRIMARY KEY AUTO_INCREMENT,
			dump_id         VARBINARY(25)    NOT NULL,
			external_id     VARBINARY(100)   NOT NULL,
			pid             VARBINARY(15)    NOT NULL,
			external_value  VARBINARY(255)   NOT NULL
		) /*$wgDBTableOptions*/;
		
		CREATE INDEX /*i*/dump_id_external_id_pid ON /*_*/wbqev_external_data (dump_id, external_id, pid);

Queries:

		SELECT dump_id, external_id, pid, external_value
		FROM wbqev_external_data
		WHERE dump_id IN ( ... )
		AND external_id IN ( ... )
		AND pid IN ( ... );
		
		INSERT INTO wbqev_external_data VALUES ....; -- batch size in UpdateTable defaults to 1000
		
		DELETE FROM wbqev_external_data WHERE dump_id IN ( ... ) LIMIT 1000; -- looped until no matching rows remain; MySQL only.

insertBatch() does a $db->commit( __METHOD__, "flush" ). That's probably not a good idea; UpdateTables could do the commit instead.

DumpMetaInformationRepo / wbqev_dump_information / wbqev_identifier_properties

		CREATE TABLE IF NOT EXISTS /*_*/wbqev_dump_information (
			dump_id      VARBINARY(25)   PRIMARY KEY NOT NULL,
			source_qid   VARBINARY(15)   NOT NULL,
			import_date  VARBINARY(25)   NOT NULL,
			language     VARBINARY(10)   NOT NULL,
			source_url   VARBINARY(300)  UNIQUE NOT NULL,
			size         INT UNSIGNED    NOT NULL,
			license_qid  VARBINARY(15)   NOT NULL
		) /*$wgDBTableOptions*/;

		CREATE TABLE IF NOT EXISTS /*_*/wbqev_identifier_properties (
			identifier_pid  VARBINARY(15)  NOT NULL,
			dump_id         VARBINARY(25)  NOT NULL,
			PRIMARY KEY (identifier_pid, dump_id)
		) /*$wgDBTableOptions*/;

Queries:

		SELECT * FROM wbqev_dump_information
		LEFT JOIN wbqev_identifier_properties
		ON wbqev_dump_information.dump_id = wbqev_identifier_properties.dump_id;
		
		SELECT * FROM wbqev_dump_information
		LEFT JOIN wbqev_identifier_properties
		ON wbqev_dump_information.dump_id = wbqev_identifier_properties.dump_id
		WHERE wbqev_dump_information.dump_id IN ( ... );
		
		SELECT dump_id FROM wbqev_identifier_properties
		WHERE identifier_pid IN ( ... );
		
		SELECT DISTINCT source_qid FROM wbqev_dump_information;
		
		INSERT INTO wbqev_dump_information ( ... );
		UPDATE wbqev_dump_information SET ...;
		DELETE FROM wbqev_identifier_properties WHERE dump_id = '...';
		INSERT INTO wbqev_identifier_properties VALUES ...; -- no batch limit in the code, but maybe 10 in practice, max 2000 (the number of properties on Wikidata)

wbq_violations

		CREATE TABLE IF NOT EXISTS /*_*/wbq_violations (
			entity_id                     VARBINARY(15)     NOT NULL,
			pid                           VARBINARY(15)     NOT NULL,
			claim_guid                    VARBINARY(63)     NOT NULL,
			constraint_id                 VARBINARY(63)     NOT NULL,
			constraint_type_entity_id     VARBINARY(15)     NOT NULL,
			additional_info               TEXT              DEFAULT NULL,
			updated_at                    VARBINARY(31)     NOT NULL,
			revision_id                   INT(10) UNSIGNED  NOT NULL,
			status                        VARBINARY(31)     NOT NULL,
			PRIMARY KEY (claim_guid, constraint_id)
		) /*$wgDBTableOptions*/;

		CREATE INDEX /*i*/claim_guid ON /*_*/wbq_violations (claim_guid);
		CREATE INDEX /*i*/constraint_id ON /*_*/wbq_violations (constraint_id);

Unused in v1.

wbq_evaluation

		CREATE TABLE IF NOT EXISTS /*_*/wbq_evaluation (
			special_page_id               int               NOT NULL,
			entity_id                     VARCHAR(15)       NOT NULL,
			insertion_timestamp           int               NOT NULL,
			reference_timestamp           int               DEFAULT NULL,
			result_string                 TEXT              DEFAULT NULL
		) /*$wgDBTableOptions*/;

No index defined at all! Unused??


Event Timeline

daniel raised the priority of this task from to High.
daniel updated the task description. (Show Details)
daniel subscribed.
daniel renamed this task from Review WikidataQuery DB schema to Review WikibaseQuery DB schema. Jun 18 2015, 6:59 PM
daniel set Security to None.
hoo renamed this task from Review WikibaseQuery DB schema to Review WikidataQuality DB schema. Jun 18 2015, 9:03 PM
hoo added a subscriber: JAnstee_WMF.

The URLs of the repositories have changed; see the description.

First comments from a MySQL DBA, assuming this goes to production:

  • Some tables lack a PRIMARY KEY. Having one should be a requirement for new tables.
  • Character strings do not have a specified CHARACTER SET. Check whether those have to be BINARY fields for compatibility with current wiki fields, or set a character set otherwise (by default, they may be restricted to Latin characters on some systems).
  • Unless there is a reason not to, add the index definitions along with the table creation rather than with a separate CREATE INDEX afterwards (see the sketch after this list). While the results are the same from the application's point of view, they are not from the point of view of maintaining hundreds of wikis.
  • Some PRIMARY KEYs use GUID strings. Those should be justified by external dependencies; otherwise stick to int or bigint for better InnoDB performance.
  • VARBINARY(15) for import_date seems like a bad idea; there are specific data types for that.
  • There is a tendency to use VARCHAR(255) for variable-length fields. Please note that VARCHAR(255) is no advantage over VARCHAR(256), and that smaller sizes are usually preferred (again, unless justified by existing types).
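
A minimal sketch of how the first points could look for wbqev_dump_information; the surrogate integer key, the inline index definitions, and the MediaWiki-style binary(14) timestamp are illustrative assumptions, not the actual patch:

		-- Sketch only: surrogate int key, inline indexes, TS_MW-style timestamp.
		CREATE TABLE IF NOT EXISTS /*_*/wbqev_dump_information (
			id           INT UNSIGNED    NOT NULL AUTO_INCREMENT PRIMARY KEY,
			dump_id      VARBINARY(25)   NOT NULL,
			source_qid   VARBINARY(15)   NOT NULL,
			import_date  BINARY(14)      NOT NULL, -- instead of VARBINARY(25)
			language     VARBINARY(10)   NOT NULL,
			source_url   VARBINARY(300)  NOT NULL,
			size         INT UNSIGNED    NOT NULL,
			license_qid  VARBINARY(15)   NOT NULL,
			UNIQUE KEY dump_id (dump_id),        -- indexes declared with the table
			UNIQUE KEY source_url (source_url)
		) /*$wgDBTableOptions*/;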

I was subscribed to this task, likely in error, as I do not know how it relates to my work. I will unsubscribe; if it was intentional rather than an error, please let me know by resubscribing me.

The index names are off:

CREATE INDEX /*i*/wbqc_constraints_pid_index
ON /*_*/wbqc_constraints (pid);

is redundant; it should be

CREATE INDEX /*i*/pid
ON /*_*/wbqc_constraints (pid);

On MySQL, this will create a pid index on the wbqc_constraints table. On SQLite, /*i*/ expands to the table name, making the index name unique and resulting in an index called wbqc_constraints_pid, which is what we want. The current code creates an index called wbqc_constraints_wbqc_constraints_pid_index, which seems a bit silly.

Index names are pretty irrelevant unless you use FORCE INDEX somewhere. In that case, make sure to get them right...

A first review of the schema and queries:

Queries run against wbqc_constraints all seem to use the (pid) index. What's the (constraint_type_qid) index for?

Queries run against wbqev_external_data all seem to use the (dump_id, external_id, pid) index, unless ExternalDataRepo->get is called with no dump ID or no external ID. In those cases, an exception should be thrown instead of issuing extremely inefficient queries.
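
To illustrate the leftmost-prefix rule behind that remark (the literal values are made up):

		-- Can use the (dump_id, external_id, pid) index: the leftmost columns are constrained.
		SELECT dump_id, external_id, pid, external_value
		FROM wbqev_external_data
		WHERE dump_id IN ( 'enwiki-20150602' ) AND external_id IN ( '123' ) AND pid IN ( 'P227' );

		-- Cannot use it: no leftmost prefix, so this degrades to a full table scan.
		SELECT dump_id, external_id, pid, external_value
		FROM wbqev_external_data
		WHERE pid IN ( 'P227' );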

Calling $db->commit( __METHOD__, "flush" ) should not be done in DAO objects. Similarly, calling wfWaitForSlaves should be avoided in the DAO layer. These things should be kept at the level of updaters/importers.

wbqev_identifier_properties should have an index on dump_id, so the query planner is free to choose which side of the join to evaluate first. It's also needed for the DELETE queries that select by dump_id.
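
A minimal sketch of such an index in the extensions' patch-file style; the index name is a suggestion:

		CREATE INDEX /*i*/dump_id
		ON /*_*/wbqev_identifier_properties (dump_id);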

wbq_violations seems to be unused in v1, is that correct? The (claim_guid) index is redundant to the composite primary key (claim_guid, constraint_id), since prefixes of composite keys can be used like separate keys. The (constraint_id) index, on the other hand, is not redundant.
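
Spelled out with two hypothetical lookups:

		-- Served by the primary key (claim_guid, constraint_id); a separate
		-- (claim_guid) index adds nothing.
		SELECT * FROM wbq_violations WHERE claim_guid = '...';

		-- constraint_id is not a prefix of the primary key, so this one
		-- does need its own index.
		SELECT * FROM wbq_violations WHERE constraint_id = '...';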

It seems very likely that wbq_violations needs more indexes, at least one on (entity_id) for getting (or clearing) all violations for a given entity. Or is the plan to do this by prefix-matching on the claim_guid?
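
Both options sketched as SQL; the prefix match assumes statement GUIDs start with the entity ID (e.g. 'Q42$...'), which should be verified:

		-- Option 1: a dedicated index (the name is a suggestion).
		CREATE INDEX /*i*/entity_id ON /*_*/wbq_violations (entity_id);

		-- Option 2: prefix-match on the primary key.
		SELECT * FROM wbq_violations WHERE claim_guid LIKE 'Q42$%';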

wbq_evaluation does not define any indexes at all. That's bad; at the very least give it an auto-increment row ID. But is it used at all? It seems like the evaluation data is going to a log file at the moment.
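
A minimal sketch of the suggested fix; the column name is an assumption:

		ALTER TABLE /*_*/wbq_evaluation
		ADD COLUMN row_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;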

In version two we will have a data issues special page where we want to be able to filter all violations by constraint_type_qid, so we think an index makes sense.
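
A hypothetical query of the kind that special page might issue (the QID is a placeholder):

		SELECT * FROM wbqc_constraints WHERE constraint_type_qid = 'Q12345';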

Yes, wbq_violations is unused in v1, and it has actually been removed from that version.

wbq_evaluation has also been removed and will not come back. Instead, we use wfDebugLog for the evaluation.

Change 222722 had a related patch set uploaded (by Soeren.oldag):
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222722

Change 222861 had a related patch set uploaded (by Soeren.oldag):
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222861

Change 222862 had a related patch set uploaded (by Soeren.oldag):
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222862

Change 222861 abandoned by Soeren.oldag:
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222861

Change 222867 had a related patch set uploaded (by Soeren.oldag):
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222867

Change 222869 had a related patch set uploaded (by Soeren.oldag):
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222869

For all tables except wbq_violations, the issues should now be resolved. Since wbq_violations was removed from v1, this should not block the deployment.

@jcrespo

  • Only wbq_evaluation, which was removed recently, had no PRIMARY KEY.
  • Index definitions along with the table creation are not supported by SQLite; for compatibility reasons we separated them.
  • The stored GUID strings represent Wikibase statement GUIDs, so those keys cannot be switched to int or bigint.

@daniel The schema and indexes of wbq_violations on master differ from the ones you posted in the description of this ticket. Maybe you could have a look at those, too? However, this is not urgent, since the table is not part of v1.

Change 222722 merged by jenkins-bot:
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222722

When looking at the v1 branch at https://github.com/wikimedia/mediawiki-extensions-WikibaseQualityConstraints/blob/v1/sql/create_wbqc_constraints.sql there seems to be one issue left:

  • The field constraint_parameters is still TEXT with no character set. Why is that? The field contains a JSON blob. MediaWiki uses a binary mediumblob for this.

On production, TEXT makes the DB create a column of type blob. Perhaps use that or mediumblob explicitly instead?
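
A hedged sketch of that suggestion, assuming mediumblob is chosen; not a patch from this task:

		ALTER TABLE /*_*/wbqc_constraints
		MODIFY constraint_parameters MEDIUMBLOB DEFAULT NULL;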

There are 2 open changes (https://gerrit.wikimedia.org/r/#/c/222869, https://gerrit.wikimedia.org/r/#/c/222867) addressing the mentioned issues for ExternalValidation that need to be reviewed. Maybe you guys can have a look?

Jonas renamed this task from Review WikidataQuality DB schema to [Task] Review WikidataQuality DB schema. Aug 13 2015, 3:33 PM

Change 222869 merged by jenkins-bot:
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/222869

I have approved both pending patches. If they go in, this can probably be closed, but we should double-check.

Change 233981 had a related patch set uploaded (by JanZerebecki):
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/233981

Change 233981 merged by jenkins-bot:
Database schema improvements (T102992)

https://gerrit.wikimedia.org/r/233981

Change 222862 abandoned by Jonas Kress (WMDE):
Database schema improvements (T102992)

Reason:
All changes have been merged previously

https://gerrit.wikimedia.org/r/222862

Change 222867 abandoned by Thiemo Kreuz (WMDE):
Database schema improvements (T102992)

Reason:
This is no more than a cherry-pick from the v1 branch, see I05048ca. There is no need to keep this open as a separate pull request for so long. The v1 branch was essentially abandoned via a PM decision long ago.

https://gerrit.wikimedia.org/r/222867