The code can be found at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Cognate
The extension has been coded so that the cluster & db are configurable. For all wiktionaries a single db table would be used. The table would include a single row for each wiktionary page, initially in the main namespace. Based on https://stats.wikimedia.org/wiktionary/EN/TablesWikipediaZZ.htm Wiktionary has roughly 27 million articles which would mean the main cognate database table would initially have roughly 27 million rows. This may later be extended to other namespaces, but that would likely not increase the row count too dramatically.
- One table for the Wiktionary group of wikis
- Roughly 27 million rows in the table.
In the schemas below cgti_title is the dbkey for the title as stored on the local site. cgti_key is a normalized version of this title currently based on some simply rules at https://github.com/wikimedia/mediawiki-extensions-Cognate/blob/master/src/StringNormalizer.php#L11
DELETES query on cgti_site, cgti_title, cgti_namespace
SELECTS query on cgti_site, cgti_key, cgti_namespace
Current Schema
The initial DB schema can be seen at https://github.com/wikimedia/mediawiki-extensions-Cognate/blob/master/db/addCognateTitles.sql
CREATE TABLE IF NOT EXISTS /*_*/cognate_titles ( cgti_site VARBINARY(32) NOT NULL, cgti_namespace INT NOT NULL, cgti_title VARBINARY(255), cgti_key VARBINARY(255) NOT NULL, PRIMARY KEY (cgti_site, cgti_namespace, cgti_title) )/*$wgDBTableOptions*/; CREATE INDEX /*i*/cgti_keys ON /*_*/cognate_titles (cgti_site, cgti_namespace, cgti_key);
Current example data would be:
gti_site, cgti_namespace, cgti_title, cgti_key en, 0, Berlin, Berlin fr, 0, Berlin, Berlin
etc.
This works for the usecase presented to Cognate, how does it look for the DBAs?
Another option
To remove some nasty assumptions from the code an alternate schema would be something like:
CREATE TABLE IF NOT EXISTS /*_*/cognate_titles ( cgti_site VARBINARY(32) NOT NULL, cgti_interwiki VARBINARY(32) NOT NULL, cgti_namespace INT NOT NULL, cgti_title VARBINARY(255), cgti_key VARBINARY(255) NOT NULL, PRIMARY KEY (cgti_site, cgti_namespace, cgti_title) )/*$wgDBTableOptions*/; CREATE INDEX /*i*/cgti_keys ON /*_*/cognate_titles (cgti_site, cgti_namespace, cgti_key); CREATE INDEX /*i*/cgti_interwikis ON /*_*/cognate_titles (cgti_interwiki);
With example data would be:
gti_site, cgti_interwiki, cgti_namespace, cgti_title, cgti_key en, enwiktionary, 0, Berlin, Berlin fr, frwiktionary, 0, Berlin, Berlin
Do DBAs see an issue with the extra column / how wide it is?