Cognate DB review
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Addshore
	Oct 24 2016, 4:30 PM

Description

The code can be found at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Cognate

The extension has been coded so that the cluster & db are configurable. For all wiktionaries a single db table would be used. The table would include a single row for each wiktionary page, initially in the main namespace. Based on https://stats.wikimedia.org/wiktionary/EN/TablesWikipediaZZ.htm Wiktionary has roughly 27 million articles which would mean the main cognate database table would initially have roughly 27 million rows. This may later be extended to other namespaces, but that would likely not increase the row count too dramatically.

One table for the Wiktionary group of wikis
Roughly 27 million rows in the table.

In the schemas below cgti_title is the dbkey for the title as stored on the local site. cgti_key is a normalized version of this title currently based on some simply rules at https://github.com/wikimedia/mediawiki-extensions-Cognate/blob/master/src/StringNormalizer.php#L11

DELETES query on cgti_site, cgti_title, cgti_namespace
SELECTS query on cgti_site, cgti_key, cgti_namespace

Current Schema

The initial DB schema can be seen at https://github.com/wikimedia/mediawiki-extensions-Cognate/blob/master/db/addCognateTitles.sql

CREATE TABLE IF NOT EXISTS /*_*/cognate_titles (
  cgti_site VARBINARY(32) NOT NULL,
  cgti_namespace INT NOT NULL,
  cgti_title VARBINARY(255),
  cgti_key VARBINARY(255) NOT NULL,
  PRIMARY KEY (cgti_site, cgti_namespace, cgti_title)
  )/*$wgDBTableOptions*/;

CREATE INDEX /*i*/cgti_keys ON /*_*/cognate_titles (cgti_site, cgti_namespace, cgti_key);

Current example data would be:

gti_site, cgti_namespace, cgti_title, cgti_key
en, 0, Berlin, Berlin
fr, 0, Berlin, Berlin

etc.

This works for the usecase presented to Cognate, how does it look for the DBAs?

Another option

To remove some nasty assumptions from the code an alternate schema would be something like:

CREATE TABLE IF NOT EXISTS /*_*/cognate_titles (
  cgti_site VARBINARY(32) NOT NULL,
  cgti_interwiki VARBINARY(32) NOT NULL,
  cgti_namespace INT NOT NULL,
  cgti_title VARBINARY(255),
  cgti_key VARBINARY(255) NOT NULL,
  PRIMARY KEY (cgti_site, cgti_namespace, cgti_title)
  )/*$wgDBTableOptions*/;

CREATE INDEX /*i*/cgti_keys ON /*_*/cognate_titles (cgti_site, cgti_namespace, cgti_key);
CREATE INDEX /*i*/cgti_interwikis ON /*_*/cognate_titles (cgti_interwiki);

With example data would be:

gti_site, cgti_interwiki, cgti_namespace, cgti_title, cgti_key
en, enwiktionary, 0, Berlin, Berlin
fr, frwiktionary, 0, Berlin, Berlin

Do DBAs see an issue with the extra column / how wide it is?

Related Objects
Search...

Status	Subtype	Assigned	Task
Open	Feature	None	T13996 A way to select which parts of Wiktionary articles to show
Open	Feature	None	T14213 Following a link to a language entry in Wiktionary should display only that entry
Open	Feature	None	T13998 A way to show only those languages on Wiktionary that the user is interested in
Open	Feature	None	T38881 Wiktionary needs usable API
Open		None	T31229 Extension to provide access via the dict protocol
Resolved		Lydia_Pintscher	T146637 Wikidata 2016 Q4 goals
Resolved		Lydia_Pintscher	T150179 Wikidata 2017 Q1 goals
Open		None	T109579 [Epic] Give more sister projects access to Wikidata
Resolved		Lydia_Pintscher	T986 Use structured data on Wiktionary
Resolved		Lydia_Pintscher	T988 Phase 1: Represent Wiktionary lexicon using structured data
Resolved		Lydia_Pintscher	T987 [Story] Phase 0: Automate interwiki language links for Wiktionary
Resolved		Lydia_Pintscher	T169708 Wikidata 2017 Q3 goals
Resolved		Lydia_Pintscher	T150178 Wikidata 2017 Q2 goals
Resolved		Lydia_Pintscher	T159316 Enable arbitrary access on English Wiktionary
Resolved		aude	T158323 enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1)
Resolved		Addshore	T150182 Deploy Cognate extension to production
Resolved		jcrespo	T148988 Cognate DB review
Resolved		aude	T158324 Enable phase 1 for Wiktionary at beta cluster
Resolved		None	T158325 Enable phase 1 for Wiktionary at test Wikidata
Open		None	T166765 Investigate the case of custom namespaces

Event Timeline

Addshore created this task.Oct 24 2016, 4:30 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 24 2016, 4:30 PM

hoo subscribed.Oct 24 2016, 4:31 PM

Addshore added a project: User-Addshore.Oct 25 2016, 11:06 AM

Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.

Addshore updated the task description. (Show Details)Oct 25 2016, 11:23 AM

Addshore updated the task description. (Show Details)Oct 25 2016, 11:26 AM

@Addshore I have some questions, they are not long, but they depend on each other, so I would love to chat with you when you find the time, as that will simplify the interaction. I am at Europe Timezone, so if you can find some time to meet at IRC, it would be great.

The main blocker is: is everything that this table has 100% public, or will it contain some private information. If it is fully public, it will be replicate do labs, assuming it will be useful there (I would assume yes). If there is private information, or things that could derive private information, it needs to be filtered before going to labs.

For the colums and indexes advice, please meet me on IRC.

In T148988#2742025, @jcrespo wrote:

The main blocker is: is everything that this table has 100% public, or will it contain some private information. If it is fully public, it will be replicate do labs, assuming it will be useful there (I would assume yes). If there is private information, or things that could derive private information, it needs to be filtered before going to labs.

Everything in this table will be public information. Labs replication would likely be useful.

So, with the feedback @Addshore gave me on IRC, I would suggest:

Converting, in the first case:

CREATE TABLE IF NOT EXISTS /*_*/cognate_titles (
  cgti_site VARBINARY(32) NOT NULL,
  cgti_namespace INT NOT NULL,
  cgti_title VARBINARY(255),
  cgti_key VARBINARY(255) NOT NULL,
  PRIMARY KEY (cgti_site, cgti_namespace, cgti_title)
  )/*$wgDBTableOptions*/;

CREATE INDEX /*i*/cgti_keys ON /*_*/cognate_titles (cgti_site, cgti_namespace, cgti_key); // this could be a PK candidate too, or unique key, it depends which way queries are more often sent

into something more or less like:

CREATE TABLE IF NOT EXISTS /*_*/cognate_titles (
  ct_site VARBINARY(32) NOT NULL,
  ct_namespace INT NOT NULL,
  ct_title VARBINARY(255),
  ct_key INT UNSIGNED NOT NULL,
  PRIMARY KEY (ct_site, ct_key)
  )/*$wgDBTableOptions*/;

CREATE INDEX /*i*/ct_site_namespace_title ON /*_*/cognate_titles (ct_site, ct_namespace, ct_title);
// add the following if you will want to query all sites with a key (indexes depend on the SELECTs):
// CREATE INDEX /*i*/ct_key ON /*_*/cognate_titles (ct_key);

CREATE TABLE IF NOT EXISTS /*_*/cognate_keys (
  ck_id INT UNSIGNED PRIMARY KEY,
  ck_name VARBINARY(255),
  PRIMARY KEY (cgti_site, cgti_namespace, cgti_title)
  )/*$wgDBTableOptions*/;

CREATE INDEX /*i*/ck_name ON /*_*/cognate_keys (ck_name); // names can be unique or not, up to you

We didn't go deep into (site) and (namespace, title), I would definitely normalize site/interwiki on separate table on the second option you mention, it is sufficiently small in the first case.

The (namespace,title) denormalization will depend on if this allows to get rid of a having extra connections to the regular shards; I am ok if it allows that.

I will put this in progress, but I will be waiting for further feedback.

Addshore added a project: WMDE-TechWish.Oct 27 2016, 12:26 PM

Addshore moved this task from Proposed to Watching/Blocked/External on the WMDE-TechWish board.

So I have a second proposal now which is slightly more complex than the initial single table design.

The below SQL includes querys that would be run against the scheme, including on page create, move, delete & load!

use cognate;

DROP TABLE cognate_sites;
DROP TABLE cognate_titles;
DROP TABLE cognate_normalizations;


/*
 *Create the tables
 */


CREATE TABLE IF NOT EXISTS cognate_sites (
  cs_dbname VARBINARY(32) PRIMARY KEY NOT NULL,
  cs_group VARBINARY(32) NOT NULL,
  cs_interwiki VARBINARY(32) NOT NULL
);

CREATE TABLE IF NOT EXISTS cognate_titles (
  ct_site VARBINARY(32) NOT NULL,
  ct_namespace INT NOT NULL,
  ct_title VARBINARY(255),
  PRIMARY KEY (ct_site, ct_namespace, ct_title)
);

CREATE TABLE IF NOT EXISTS cognate_normalizations (
  cn_raw VARBINARY(255) PRIMARY KEY,
  cn_normalized VARBINARY(255)
);

CREATE INDEX cn_normalized ON cognate_normalizations (cn_normalized);


/*
 *Insert some sample data / example production queries
 */


/*Sites will be added by a maint script*/

INSERT IGNORE INTO cognate_sites (cs_dbname, cs_group, cs_interwiki)
VALUES
('enwiktionary', 'wiktionary', 'en'),
('dewiktionary', 'wiktionary', 'de'),
('ptwiktionary', 'wiktionary', 'pt'),
('zh_min_nanwiktionary', 'wiktionary', 'zh-min-nan');


/*On page create (mediawiki hook)*/
/*If the normalized value differs from the raw value add an entry*/

INSERT IGNORE INTO cognate_titles (ct_site, ct_namespace, ct_title)
VALUES('enwiktionary', 0, 'Apple');

INSERT IGNORE INTO cognate_titles (ct_site, ct_namespace, ct_title)
VALUES('zh_min_nanwiktionary', 0, 'Apostrophe’');
INSERT IGNORE INTO cognate_normalizations (cn_raw, cn_normalized)
VALUES('Apostrophe’', 'Apostrophe''');

INSERT IGNORE INTO cognate_titles (ct_site, ct_namespace, ct_title)
VALUES('enwiktionary', 0, 'Apostrophe’');
INSERT IGNORE INTO cognate_normalizations (cn_raw, cn_normalized)
VALUES('Apostrophe’', 'Apostrophe''');

INSERT IGNORE INTO cognate_titles (ct_site, ct_namespace, ct_title)
VALUES('dewiktionary', 0, 'Apostrophe''');
INSERT IGNORE INTO cognate_titles (ct_site, ct_namespace, ct_title)
VALUES('ptwiktionary', 0, 'Apostrophe''');


/*On page delete*/
DELETE FROM cognate_titles
WHERE ct_site = 'enwiktionary' AND ct_namespace = 0 AND ct_title = 'Apostrophe222’';
/*Deferred, delete normalizations that may have existed but are no longer needed?*/
SELECT COUNT(*) FROM cognate_titles WHERE ct_title = 'Apostrophe222’';
/*If no rows are returned*/
DELETE IGNORE FROM cognate_normalizations WHERE cn_raw = 'Apostrophe222’';


/*On page load*/

SET @wgCognateGroup = 'wiktionary';
SET @cognateNormalizedTitle = 'Apostrophe''';

SELECT cn_raw AS title, ct_site AS site, cs_interwiki AS interwiki
FROM cognate_normalizations
LEFT JOIN cognate_titles ON cn_raw = ct_title
LEFT JOIN cognate_sites ON ct_site = cs_dbname
WHERE cn_normalized = @cognateNormalizedTitle
AND cs_group = @wgCognateGroup

UNION

SELECT ct_title AS title, ct_site AS site, cs_interwiki AS interwiki
FROM cognate_titles
LEFT JOIN cognate_sites ON ct_site = cs_dbname
WHERE ct_title = @cognateNormalizedTitle
AND cs_group = @wgCognateGroup

DROP TABLE cognate_sites;
DROP TABLE cognate_titles;
DROP TABLE cognate_normalizations;


/*
 *Create the tables
 */


CREATE TABLE IF NOT EXISTS cognate_sites (

This is strange syntax. Normally, we want to DROP TABLE IF EXISTS, and force its creation, but anyway.

For the next tables, I do not fully understand:

CREATE TABLE IF NOT EXISTS cognate_sites (

Ok, you have defined a db name <-> wiki type + lang, which I assume it may be handy, but very small table. Not too relevant for the design decisions- if it is helpful, it is, end of discussion.

CREATE TABLE IF NOT EXISTS cognate_titles (

You separate the trio site, namespace, title- ok, but you do not give that a unique identifier, so this table is useless?

CREATE TABLE IF NOT EXISTS cognate_normalizations (

This is the only table creating a relationship, but it goes against everything we talked before- you use full titles (so no space is saved), and it is impossible, as it is now, to establish any kind of relationship between actual titles in this title column.

I am not saying this is wrong, but if it is right, you have changed completely the model you explained to me before, and I no longer understand it. There is an n:n relationship between raw title and cognate titles, and that makes no sense, because it is cross-wiki.

In T148988#2762557, @jcrespo wrote:
CREATE TABLE IF NOT EXISTS cognate_titles (
You separate the trio site, namespace, title- ok, but you do not give that a unique identifier, so this table is useless?

Any sort of uniquie identifier would be useless / never used. As can be seen in the sample queries above.

CREATE TABLE IF NOT EXISTS cognate_normalizations (
This is the only table creating a relationship, but it goes against everything we talked before- you use full titles (so no space is saved), and it is impossible, as it is now, to establish any kind of relationship between actual titles in this title column.

I am not saying this is wrong, but if it is right, you have changed completely the model you explained to me before, and I no longer understand it. There is an n:n relationship between raw title and cognate titles, and that makes no sense, because it is cross-wiki.

There is a 1:n relationship between the normalizations and the titles.

For example a normalization of "Ellipsis..." using 3 dots relates to the titles "Ellipsis..." using 3 dots and "Ellipsis…" using the single char ellipsis.

Out of the 27 million rows in the titles table, probably only a few thousand (defiantly under 100k) will result in normalizations that differ from the original title.
Also no matter which wiki the title comes from the normalization process would always be the same.

If on the en and de site there are titles of "Ellipsis…" using the single char ellipsis, but on the fr and pt site there are titles of "Ellipsis..." using 3 dots then the data in the tables would look like the below:

titles
site, ns, title
enwiktionary, Ellipsis…
dewiktionary, Ellipsis…
frwiktionary Ellipsis...
ptwiktionary Ellipsis...

normalizations
raw, normalized
Ellipsis…, Ellipsis...

I think the key here that may have been missed in our previous chat is that the normalization step is the same for all wikis.

I think the key here that may have been missed in our previous chat is that the normalization step is the same for all wikis.

Ok, assuming that, which can create issues in the future, but that is your decision. Why do you need the other tables? You are maintaining core tables (page) creating redundant data and extra maintenance work.

Also, if you have:

enwiktionary, Ellipsis…
dewiktionary, Ellipsis...
frwiktionary, Ellipsis......

Ellipsis…, Ellipsis...
Ellipsis......, Ellipsis...

How do you search the page for frwiktionary, if you start with @site='enwiktionary', @title='Ellipsis…'? Please do not tell me you re going to do a range of > 800 items and then a join of that + UNION, instead of a single row fetch.

I would beg you to optimize first for simple point selects, then for size. I would suggest you to create an n:n relationship with all information ready to be easily read (we do 1000x times more reads than writes). Just that UNION creates a temporary table that, even small, makes unnecessary overhead.

In T148988#2762644, @jcrespo wrote:

I think the key here that may have been missed in our previous chat is that the normalization step is the same for all wikis.

Ok, assuming that, which can create issues in the future, but that is your decision. Why do you need the other tables? You are maintaining core tables (page) creating redundant data and extra maintenance work.

Also, if you have:

enwiktionary, Ellipsis…
dewiktionary, Ellipsis...
frwiktionary, Ellipsis......

Ellipsis…, Ellipsis...
Ellipsis......, Ellipsis...

This would actually be:

enwiktionary, Ellipsis…
dewiktionary, Ellipsis...
frwiktionary, Ellipsis......

Ellipsis…, Ellipsis...

As … is the only character there that is normalized.

How do you search the page for frwiktionary, if you start with @site='enwiktionary', @title='Ellipsis…'? Please do not tell me you re going to do a range of > 800 items and then a join of that + UNION, instead of a single row fetch.

A SELECT query would never be made by the extension looking for an entry for a given site, it would always query for a list of all titles / normalized titles that match a given string.
It would also be trivial to remove the 'group' from the SELECT example (and the sites table)

I would beg you to optimize first for simple point selects, then for size. I would suggest you to create an n:n relationship with all information ready to be easily read (we do 1000x times more reads than writes). Just that UNION creates a temporary table that, even small, makes unnecessary overhead.

It could be done as 2 separate selects (avoiding the UNION)?
I may also be able to rework the query to get rid of the UNION all together with some more JOINS.

SET @cognateNormalizedTitle = 'Apostrophe''';

SELECT ct_title AS title, ct_site AS site, cs_interwiki AS interwiki
FROM cognate_normalizations
RIGHT JOIN cognate_titles ON cn_raw = ct_title
LEFT JOIN cognate_sites ON ct_site = cs_dbname
WHERE cn_normalized = @cognateNormalizedTitle
OR ct_title = @cognateNormalizedTitlecognate_normalizations

RIGHT JOIN cognate_titles ON cn_raw = ct_title
LEFT JOIN cognate_sites ON ct_site = cs_dbname

No thanks, please continue with your original proposal, which was way better than this.

In T148988#2762720, @jcrespo wrote:
RIGHT JOIN cognate_titles ON cn_raw = ct_title
LEFT JOIN cognate_sites ON ct_site = cs_dbname
No thanks, please continue with your original proposal, which was way better than this.

I assume this refers to the 1 big table approach?

Is there a reason why T148988#2744793, with some fixes based on your latest feedback could not work?

I do not know what you are trying to do achieve here, honestly, only you know- and I do not have time to work on this unless you schedule it appropriately some months in advance, only to review it afterwards.
Create a design where your most common queries are:

SELECT X FROM table1 WHERE PK=A;

SELECT Y FROM table1 JOIN table2 ON table1.indexed_colum = table2.PK_column WHERE PK=B;

Then make sure you do not create redundant data (writing the same data on 2 places) and follow https://en.wikipedia.org/wiki/Database_normalization best practices.

Some links:

When discussing with people here in the office having the relation to a second table as in T148988#2744793 seems odd, as said above, normalization only changes the title in maybe 1% of cases (well, much less)... Hence the attempt to put the normalization in a table of their own, that would have a much smaller row count than compared with the titles table.

I think we will aim for a modified first version (If of course there are no strong oppositions):

The sites table will be added to make the dbname vs interwiki prefix lookups easy.

The sites table could either use the dbname as the primary key or use and int ID. this would then of course mean the site field in the titles table would either be a string (the dbname) or an int (the id)
In the case of an int ID DELETES would require a lookup in the sites table, as would inserts. SELECTS would also join the table.

group has been dropped from the table, the reasoning behind this is right now we only have the usecase for a single group, and if we every have more groups they can simply be configured to use a different database / tables.

CREATE TABLE IF NOT EXISTS cognate_sites (
  cs_dbname VARBINARY(32) PRIMARY KEY NOT NULL,
  cs_interwiki VARBINARY(32) NOT NULL
);

The titles table would contain the same basic fields.

title_key would be a normalized version of the title.
title_source would be the origional title in cases that it differs from title_key. NULL if they are the same.

CREATE TABLE IF NOT EXISTS cognate_titles (
  cgti_site VARBINARY(32) NOT NULL,
  cgti_namespace INT NOT NULL,
  cgti_title_source VARBINARY(255),
  cgti_title_key VARBINARY(255) NOT NULL,
  );

Example data can be seen below:

cgti_site, cgti_namespace, cgti_title_source, cgti_title_key
1,0,NULL,Berlin
2,0,NULL,Berlin
1,0,Ellipsis…, Ellipsis...
2,0,NULL, Ellipsis...

SELECTS here would thus always select using namespace, and title_key, joining on the sites table.

A final question here is regarding the keys for the table.

A strictly unique key (possible PK) in the table would be a combination of all 4 fields, Do you see this as being too much? Would the id vs string for site make a difference here?
Another option could be a PK on the site, namespace and title_key (but this could run into an edge case where a single site contains 2 different titles that get normalized to the same key.

Addshore added subscribers: Tobi_WMDE_SW, daniel.Nov 2 2016, 4:41 PM

A strictly unique key (possible PK) in the table would be a combination of all 4 fields, Do you see this as being too much? Would the id vs string for site make a difference here?
Another option could be a PK on the site, namespace and title_key (but this could run into an edge case where a single site contains 2 different titles that get normalized to the same key.

Both are not really good.

I have no idea what you are doing (this is not a complain, I just do not understand it). First you sent a proposal, then you sent a complete different one doing LEFT JOINs and RIGHT JOINs. I have no idea what you are trying to model here. I would suggest asking someone who knows better what you are trying to do and gives you some help on how to modeling that, while following the 2 general rules I told you: simple SELECTs and normalized content.

Something that usually helps is https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model and writing realistic, large example data.

Once you have a final proposal, send a gerrit review and I will give it another look.

jcrespo removed jcrespo as the assignee of this task.Nov 2 2016, 5:17 PM

jcrespo moved this task from In progress to Blocked external/Not db team on the DBA board.

In T148988#2764019, @jcrespo wrote:

I have no idea what you are doing (this is not a complain, I just do not understand it).

Most of what the extension is doign was in the IRC chat (I can't find it to add it to this ticket)
The short description @ https://www.mediawiki.org/wiki/Extension:Cognate should help. Also the testing system may help.

Once you have a final proposal, send a gerrit review and I will give it another look.

The extension is already entirely written, and will work with any of the schemas listed in this ticket.
As it is unclear which of the schemas you prefer (considering the team here doesn't really like the sound of T148988#2744793) I don't really have a clear path forward.

Something that usually helps is https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model and writing realistic, large example data.

I think an ER diagram for these concepts is overly complicating things. The only thing the extension knows about are sites, titles, and a normalized title string.

A site has a dbname and an interwiki prefix
A title has a site, namespace and value(page_title in terms of mediawiki)
A titles value also exists in a normalized form, this normalization step is the same for ALL sites. 99.9% of the time the normalization step outputs the exact same title value string that was input.

As for queries:

DELETES will always happen using a site dbname, namespace, and title_value
SELECTS will always happen using a namespace and the normalized title_value

As for uniqueness:

site,namespace,title will always be unique
site,namespace,normalized_title will NOT always be unique

As I am struggling to grasp a clear path forward from this ticket so far I will likely push ahead with T148988#2763922 as from this side this makes the most sense (given the comments above and our understanding of the extension). This schema is very similar to the one currently in the extension with the addition of the sites table and the NULL values in the title field where the title is the same as the normalized title (to cut back on duplicated data)
As for the PKs I will simply go for the initial PK over everything for now.

And then I guess we can look through all of this again in a few more weeks.

the NULL values in the title field where the title is the same as the normalized title (to cut back on duplicated data)

This is the kind of thing what makes SELECT not simple- you have to do an IF on the code based on if the value is NULL or not. That is not normalized. Even if it takes less space, it is not a good practice.

Duplicate data here means having:

title1, tag_1
title1, tag_2
title2, tag_2
title2, tag_3

instead of the simpler:

1, title1
2, title2

1, tag_1
2, tag_2
3, tag_3

1,1
1,2
2,2
2,3

And despite having more tables, it avoids duplication. Renaming a title, or a tag is changing a single row. And it saves a lot of space by doing references instead of full contents, repeated many times.

99.9% of the time the normalization step outputs the exact same title value

If that is the case, store only the ones that have been normalized, do not add NULL values.
And rename the table to something like cognate_normalized_titles.

When using the extension, check for the title on this table, if it doesn't exist, it means it has not been normalized (yet?). I would be ok with that, but that is a very different usage than the originally proposed.

Okay, I'm getting very confused as to where these 'tags' have come from

In T148988#2764212, @jcrespo wrote:

the NULL values in the title field where the title is the same as the normalised title (to cut back on duplicated data)

This is the kind of thing what makes SELECT not simple- you have to do an IF on the code based on if the value is NULL or not. That is not normalised. Even if it takes less space, it is not a good practice.

Actual selects never use title, they always use the normalised title. Deletes and will use title, but naturally, there is a much lower flow of deletes.

And despite having more tables, it avoids duplication. Renaming a title, or a tag is changing a single row. And it saves a lot of space by doing references instead of full contents, repeated many times.

If a title is renamed, the normalised title also has to be updated. If it is in 2 tables, 2 tables / 2 rows must be updated.

99.9% of the time the normalisation step outputs the exact same title value

If that is the case, store only the ones that have been normalised, do not add NULL values.
And rename the table to something like cognate_normalized_titles.

This is exactly what I proposed in T148988#2759329, but this makes the selects very hard either requiring the UNION query in that comment or the multi-join query in T148988#2762718, but of which you have said are bad.

As the selects always happen on the normalised titles to make the selects easy all normalised values should be stored (no nulls in that field), thus in cases where the normalisation and the original title match we do not need the original title. This is what is proposed in T148988#2763922.

Lydia_Pintscher added a parent task: T150182: Deploy Cognate extension to production.Nov 7 2016, 3:52 PM

@jcrespo Tags? What tags? And why multiple tags for the same title? We are comparing pages titles between wikis, with some minimal string normalization applied. That's it.

(Btw, note that we are discussing two completely unrelated kinds of normalization here: string normalization, and schema normalization. Let's try and not get confused about that.)

@Addshore I just realized that we don't need the key (i.e. the normalized title) as a string at all. We just need a hash of the normalized title. A 64 bit hash can be represented as an integer. You may still want to have a table for titles, so you don't have to store the same title 20 times if the title exists on 20 wikis. If you do this, I suggest to again use the (unnormalized) title's hash as the numeric representation of the title. This ensures consistency between wikis, and reduces the need for lookups.

@jcrespo Tags? What tags? And why multiple tags for the same title? We are comparing pages titles between wikis, with some minimal string normalization applied. That's it.

The above was an unrelated example, with columns that had nothing to do with the current design to avoid mistakes of me saying what should be actually done.

string normalization, and schema normalization.

There is nothing I mentioned about string normalization. Such a thing does not exists to me. If you are calling title/string normalization what you are doing, it is ok, but I never entered or commented on that.

In T148988#2782164, @jcrespo wrote:

There is nothing I mentioned about string normalization. Such a thing does not exists to me. If you are calling title/string normalization what you are doing, it is ok, but I never entered or commented on that.

You didn't, but Adam and I did - that's kind of the point. Comparing (normalized) titles between wikis is the entire purpose of having these tables, and "normalized" in the proposed schema refers to string normalization. I wasn't pointing out any particular misunderstanding. It was just a note of caution about vocabulary.