Create SQL replica table for Wikidata statements in Labs
Open, Needs Triage, Public

Description

Currently, Toolforge tools that want to query Wikidata data have the following options:

  • Wikidata database replica: cannot efficiently query data stored inside the JSON
  • Wikibase API: same as above, plus API requests are subject to timeouts
  • WDQS: subject to timeouts; requires web requests
  • Search feature (WikibaseCirrusSearch): subject to timeouts; requires web requests; only able to make simple queries; cannot return more than 10000 results

Therefore I propose setting up an SQL replica of Wikidata data (proposed database name: wikidata-replica). The proposed schemas are described below. Indexes will be needed to run queries efficiently; they are not described here.

Possible schema 1

wbr_statements

  • statements_id: primary key.
  • statements_entity: full entity ID of the entity.
  • statements_property: property ID of the property.
  • statements_sid: ID of the statement, in the form of Q123$00000000-0000-0000-0000-000000000000.
  • statements_main_snak: a reference to wbr_snaks.
  • statements_rank: rank of the statement.

wbr_snaks

  • snak_id: primary key.
  • snak_property: property ID of the property.
  • snak_value_type: value_type of the value.
  • snak_value: the string representation of the value.
  • snak_value_detail: a reference to a value-specific table, such as wbr_external_id. The layout of these tables is omitted.

wbr_reference_group

  • reference_group_id: primary key.
  • reference_group_statement: a reference to wbr_statements.

wbr_reference

  • reference_id: primary key.
  • reference_reference_group: a reference to wbr_reference_group.
  • reference_snak: a reference to wbr_snaks.

wbr_qualifier

  • qualifier_id: primary key.
  • qualifier_statement: a reference to wbr_statements.
  • qualifier_snak: a reference to wbr_snaks.
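
To make schema 1 more concrete, here is a minimal sketch that builds the five tables in an in-memory SQLite database and runs one lookup. SQLite only stands in for the replica database here. The column types, the UNIQUE constraint, the index, and the sample rows are all assumptions for illustration, since the proposal only names the columns.

```python
import sqlite3

# Hypothetical DDL for "Possible schema 1". Only the table and column names
# come from the proposal; types, constraints and the index are assumptions.
SCHEMA_1 = """
CREATE TABLE wbr_snaks (
    snak_id           INTEGER PRIMARY KEY,
    snak_property     TEXT NOT NULL,
    snak_value_type   TEXT NOT NULL,
    snak_value        TEXT,
    snak_value_detail INTEGER  -- would point into a per-type table (omitted)
);
CREATE TABLE wbr_statements (
    statements_id        INTEGER PRIMARY KEY,
    statements_entity    TEXT NOT NULL,
    statements_property  TEXT NOT NULL,
    statements_sid       TEXT NOT NULL UNIQUE,
    statements_main_snak INTEGER NOT NULL REFERENCES wbr_snaks (snak_id),
    statements_rank      TEXT NOT NULL
);
CREATE TABLE wbr_reference_group (
    reference_group_id        INTEGER PRIMARY KEY,
    reference_group_statement INTEGER NOT NULL
        REFERENCES wbr_statements (statements_id)
);
CREATE TABLE wbr_reference (
    reference_id              INTEGER PRIMARY KEY,
    reference_reference_group INTEGER NOT NULL
        REFERENCES wbr_reference_group (reference_group_id),
    reference_snak            INTEGER NOT NULL REFERENCES wbr_snaks (snak_id)
);
CREATE TABLE wbr_qualifier (
    qualifier_id        INTEGER PRIMARY KEY,
    qualifier_statement INTEGER NOT NULL
        REFERENCES wbr_statements (statements_id),
    qualifier_snak      INTEGER NOT NULL REFERENCES wbr_snaks (snak_id)
);
-- One possible index; the proposal leaves indexes unspecified.
CREATE INDEX wbr_statements_property
    ON wbr_statements (statements_property, statements_entity);
"""


def build_demo():
    """Create the tables and insert a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA_1)
    conn.executemany(
        "INSERT INTO wbr_snaks VALUES (?, ?, ?, ?, NULL)",
        [
            (1, "P31", "wikibase-entityid", "Q5"),
            (2, "P580", "time", "+1990-01-01T00:00:00Z"),
        ],
    )
    conn.execute(
        "INSERT INTO wbr_statements VALUES (?, ?, ?, ?, ?, ?)",
        (1, "Q123", "P31", "Q123$00000000-0000-0000-0000-000000000000", 1, "normal"),
    )
    conn.execute("INSERT INTO wbr_qualifier VALUES (1, 1, 2)")
    return conn


def entities_with_value(conn, prop, value):
    """Entities that have a statement on `prop` whose main snak equals `value`."""
    rows = conn.execute(
        """
        SELECT s.statements_entity
        FROM wbr_statements AS s
        JOIN wbr_snaks AS m ON m.snak_id = s.statements_main_snak
        WHERE s.statements_property = ? AND m.snak_value = ?
        ORDER BY s.statements_entity
        """,
        (prop, value),
    )
    return [row[0] for row in rows]


def qualifiers_of(conn, sid):
    """(property, value) pairs of all qualifiers on the statement with this sid."""
    rows = conn.execute(
        """
        SELECT q_snak.snak_property, q_snak.snak_value
        FROM wbr_statements AS s
        JOIN wbr_qualifier AS q ON q.qualifier_statement = s.statements_id
        JOIN wbr_snaks AS q_snak ON q_snak.snak_id = q.qualifier_snak
        WHERE s.statements_sid = ?
        ORDER BY q_snak.snak_id
        """,
        (sid,),
    )
    return list(rows)
```

With this layout every lookup by value needs at least one join from wbr_statements to wbr_snaks, and qualifier or reference lookups need two or more, which is the cost of storing main snaks, qualifiers and references uniformly.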

Possible schema 2

We create one database table per property. The schema of each table is determined by the datatype of the property. Each row represents one statement. This schema does not replicate references or qualifiers, but it is easier to query.
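
A rough sketch of how schema 2 could look, again using in-memory SQLite as a stand-in. The table naming scheme (wbr_p31 and so on), the column names, and the mapping from datatype to value columns are all hypothetical; the proposal does not specify any of them.

```python
import re
import sqlite3

# Hypothetical mapping from Wikibase datatype to value columns.
# The proposal only says the table schema depends on the datatype.
VALUE_COLUMNS = {
    "wikibase-item": "value_entity TEXT NOT NULL",
    "external-id": "value_string TEXT NOT NULL",
    "string": "value_string TEXT NOT NULL",
    "time": "value_time TEXT NOT NULL, value_precision INTEGER NOT NULL",
    "quantity": "value_amount TEXT NOT NULL, value_unit TEXT",
}


def table_name(property_id):
    """Return the (hypothetical) table name for a property ID such as 'P31'."""
    # The ID is interpolated into SQL as an identifier, so validate it strictly.
    if not re.fullmatch(r"P[1-9][0-9]*", property_id):
        raise ValueError(f"not a property ID: {property_id!r}")
    return "wbr_" + property_id.lower()


def create_property_table(conn, property_id, datatype):
    """Create the per-property table; each row holds one statement."""
    name = table_name(property_id)
    conn.execute(
        f"""
        CREATE TABLE {name} (
            sid       TEXT PRIMARY KEY,  -- statement ID, e.g. Q123$...
            entity    TEXT NOT NULL,     -- entity the statement belongs to
            stmt_rank TEXT NOT NULL,     -- preferred / normal / deprecated
            {VALUE_COLUMNS[datatype]}
        )
        """
    )
    return name
```

A query such as "all entities whose P31 value is Q5" then touches a single table with no joins, for example `SELECT entity FROM wbr_p31 WHERE value_entity = 'Q5'`, at the price of one table per property and no access to qualifiers or references.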