
Review & work on Cognate extension
Closed, Resolved · Public · 34 Estimated Story Points

Event Timeline

For the record, some concerns that came up:

  • Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.
  • Normalization of page names before comparison.
  • Sorting of language links. We may want a separate extension for that.

For the record, some concerns that came up:

  • Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.

Right now it looks like this will be a central DB table that contains a mapping of language and title.
Some caching should probably be added here.
A primary key should also be added, to aid with the WMF replication setup / future row-based replication work!
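
Just to make the caching idea concrete, a rough sketch of a lookup with a thin cache in front of the central table (the table name, column names, and plain-PDO setup here are illustrative assumptions, not the actual schema):

		// Illustrative sketch only: memoize the central-table lookup so repeated
		// requests for the same title do not hit the database again.
		function getCognateLinks( PDO $pdo, $title ) {
			static $cache = [];
			if ( !isset( $cache[$title] ) ) {
				$stmt = $pdo->prepare(
					'SELECT cgnt_language, cgnt_title FROM cognate_pages WHERE cgnt_title = ?'
				);
				$stmt->execute( [ $title ] );
				$cache[$title] = $stmt->fetchAll( PDO::FETCH_ASSOC );
			}
			return $cache[$title];
		}

In production this would presumably be a shared cache rather than a static array, and the caller would drop its own language from the result, but the shape of the lookup is the same.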

  • Normalization of page names before comparison.

Yes, and this needs to be consistent across sites!

  • Sorting of language links. We may want a separate extension for that.

Indeed, there is a note currently in the code saying:

		// TODO Move InterwikiSorter class from the wikibase extension to its own extension and use it to sort the language links
		// See https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FWikibase/master/client%2Fincludes%2FInterwikiSorter.php

There are a bunch of other TODOs in the code too! :)

Addshore added a project: Patch-For-Review.

So I have been powering through some refactoring and adding functionality:

While looking at implementing the moves in https://gerrit.wikimedia.org/r/#/c/311696/ I realised that the current db table only covers a single namespace; however, the config allows multiple namespaces to be defined as cognate namespaces for a single site.

@daniel @Lydia_Pintscher @gabriel-wmde Do we have to cover the use case of multiple namespaces per site? Or is a single namespace okay?

Addshore set the point value for this task to 55. (Sep 20 2016, 1:45 PM)
Addshore changed the point value for this task from 55 to 34.

While looking at implementing the moves in https://gerrit.wikimedia.org/r/#/c/311696/ I realised that the current db table only covers a single namespace; however, the config allows multiple namespaces to be defined as cognate namespaces for a single site.

@daniel @Lydia_Pintscher @gabriel-wmde Do we have to cover the use case of multiple namespaces per site? Or is a single namespace okay?

I think we need to look at the existing namespaces. The main namespace would be handled automatically via this extension. What about its discussion pages? Help and similar namespaces would be handled via the regular Wikidata items.

So @WMDE-leszek managed to find https://de.wiktionary.org/wiki/Reim:Deutsch:-a%CA%A6%C9%99 which links to https://en.wiktionary.org/wiki/Rhymes:German/ats%C9%99

Do we want the Cognate extension to link these pages?

Namespaces that are not default also include (on en wiktionary):

  • Appendix
  • Concordance
  • Index
  • Rhymes
  • Wikisaurus
  • Sign gloss
  • Reconstruction

Is there a reason why only "main namespace" pages are mentioned in the above proposal?

@Addshore linking the rhymes namespaces would be desirable, but not trivial. I would suggest leaving it for later.

Stripping prefixes during normalization isn't too hard, but Cognate would need to know that "German/" is the equivalent prefix to "Deutsch:" - and so on for all languages.

I would consider this a feature request for later.

In the light of the above comments regarding prefixes and namespaces, a few thoughts about the database schema for connecting the pages. It seems we need the following fields:

  • cgnt_wiki: the wiki ID
  • cgnt_title: the page title (including namespace)
  • cgnt_key: a normalized version of the title (including namespace)

Here, cgnt_wiki+cgnt_title is unique; cgnt_wiki+cgnt_key is also unique. Pages to link are all rows with the same cgnt_key.
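
Purely to make those constraints concrete, a rough sketch (the table name, column types, and plain-PDO setup are illustrative assumptions, not a final schema):

		// Illustrative only: a "wide" variant of the table described above.
		// cgnt_wiki + cgnt_title and cgnt_wiki + cgnt_key are each unique,
		// as in the field list above.
		$pdo = new PDO( 'mysql:host=localhost;dbname=cognate', 'user', 'pass' );
		$pdo->exec( "
			CREATE TABLE cognate_titles (
				cgnt_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
				cgnt_wiki  VARBINARY(32)  NOT NULL,
				cgnt_title VARBINARY(255) NOT NULL,
				cgnt_key   VARBINARY(255) NOT NULL,
				UNIQUE KEY cgnt_wiki_title ( cgnt_wiki, cgnt_title ),
				UNIQUE KEY cgnt_wiki_key   ( cgnt_wiki, cgnt_key )
			)
		" );

		// Link targets for a page are then the rows sharing its normalized key.
		$stmt = $pdo->prepare(
			'SELECT cgnt_wiki, cgnt_title FROM cognate_titles WHERE cgnt_key = ?'
		);
		$stmt->execute( [ 'Bier' ] );
		$linkTargets = $stmt->fetchAll( PDO::FETCH_ASSOC );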

Note that potentially, multiple titles could get normalized to the same key, creating a conflict. This should be rare and would very likely be the result of a mistake, but the software needs to recover from such a situation gracefully, particularly when one of the conflicting pages gets renamed or deleted.

However, this table will become very tall, because it has roughly one entry per content page on all wiktionary projects combined. So we should try to make the rows less "broad", and remove redundant information. For instance:

  • cgnt_wiki: the wiki ID (int ID referencing another table)
  • cgnt_namespace: a "virtual" namespace id (int referencing another table that has namespace names and IDs for each virtual namespace, for each wiki)
  • cgnt_title: the page title (no namespace)
  • cgnt_key: a normalized version of the title (no namespace)

Here, cgnt_wiki+cgnt_namespace+cgnt_title is unique; cgnt_wiki+cgnt_namespace+cgnt_key is also unique. Pages to link are all rows with the same cgnt_namespace+cgnt_key. In order to construct the titles to link to, the actual per-wiki namespace IDs need to be looked up.

Also note that cgnt_title and cgnt_key are usually the same. We can potentially save a lot of room by setting cgnt_title to null unless it is different from cgnt_key. This means that we can't have a unique key on cgnt_wiki+cgnt_namespace+cgnt_title, so we can only query by key, not by title.

This approach makes constructing the actual page title more complex (use the title if not null, key otherwise), and updates for page deletion and renaming need to be based on the key, not the title. This may be problematic if there are multiple conflicting pages with the same key.
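
To illustrate that read path (a sketch only, reusing the invented table name from the earlier sketch but with the narrower columns described here; COALESCE stands in for "use the title if not null, key otherwise"):

		// Sketch: reconstruct link targets from the "narrow" layout. cgnt_title is
		// NULL whenever it equals cgnt_key, so the title is rebuilt on read.
		function getLinkTargets( PDO $pdo, $virtualNamespaceId, $normalizedKey ) {
			$stmt = $pdo->prepare(
				'SELECT cgnt_wiki, COALESCE( cgnt_title, cgnt_key ) AS title
				 FROM cognate_titles
				 WHERE cgnt_namespace = ? AND cgnt_key = ?'
			);
			$stmt->execute( [ $virtualNamespaceId, $normalizedKey ] );

			$targets = [];
			foreach ( $stmt->fetchAll( PDO::FETCH_ASSOC ) as $row ) {
				// The actual per-wiki namespace name still has to be looked up to
				// build the full title on the target wiki.
				$targets[ $row['cgnt_wiki'] ] = $row['title'];
			}
			return $targets;
		}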

To clarify: the result is that we need to be able to handle some other namespaces in the future but for now only do the main namespace. Correct?

Change 312257 had a related patch set uploaded (by Addshore):
Use new db schema

https://gerrit.wikimedia.org/r/312257

For reference: Langlink-entries not matching the page title, from the main namespace of de, en, and fr Wiktionary.

So the following library seems to be very useful here:
https://github.com/Behat/Transliterator
Looking at the lists linked above, this results in 2224/3238 cases normalized:

		// Transliterate the title to ASCII using the behat/transliterator library
		$string = Behat\Transliterator\Transliterator::transliterate( $string );
		// Strip hyphens (including separator characters introduced by the transliteration)
		$string = str_replace( '-', '', $string );

Methods in core just don't seem to cut it.
List of cases missed can be seen at P4120

While looking through that paste it looks like there are a few other special cases that can be handled, for example for French:

		if ( $sourceLanguage === 'fr' ) {
			// Strip French reflexive pronoun prefixes, e.g. "s’amuser" => "amuser"
			// and "se_lever" => "lever" (titles use "_" for spaces).
			// Only strip at the start of the title.
			if ( strpos( $string, 's’' ) === 0 ) {
				$string = substr( $string, strlen( 's’' ) );
			}
			if ( strpos( $string, 'se_' ) === 0 ) {
				$string = substr( $string, strlen( 'se_' ) );
			}
		}

Change 313003 had a related patch set uploaded (by Addshore):
Add normalization for titles => keys

https://gerrit.wikimedia.org/r/313003

I just realized: When an entry is added to or removed from the central table, all pages in the table with the same key (including the added or removed entry) need to trigger a purge for the respective wiki page.

There should probably be a ticket for this :)

Maybe I read the code wrong, but we don't want to normalize the case of the words, e.g. "Clause" and "clause": the interwiki from [[de:Clause]] to [[ar:clause]] is wrong.

Maybe I read the code wrong, but we don't want to normalize the case of the words, e.g. "Clause" and "clause": the interwiki from [[de:Clause]] to [[ar:clause]] is wrong.

Hmm, okay!
I'm starting to think that an initial implementation should perhaps not normalize at all, and then add normalization on a case-by-case basis?
@Lydia_Pintscher @daniel @WMDE-leszek thoughts?

I just realized: When an entry is added to or removed from the central table, all pages in the table with the same key (including the added or removed entry) need to trigger a purge for the respective wiki page.

There should probably be a ticket for this :)

Ahh yes, there should be! I will make some sub tickets today!

I'm starting to think that an initial implementation should perhaps not normalize at all, and then add normalization on a case-by-case basis?
@Lydia_Pintscher @daniel @WMDE-leszek thoughts?

Normalization should be limited to what is absolutely needed. This is not a search index where you want to find as many sensible matches as possible. The match needs to be exact, except for a very few cases dictated by differing local policies, such as using "`" instead of "'".
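
As a deliberately tiny example of the kind of single substitution meant here, taking the character pair from the sentence above (the function name is invented):

		// Minimal sketch: fold one apostrophe variant into the other before
		// computing the comparison key, and do nothing else. Further rules would
		// only be added when a concrete local policy requires them.
		function cognateComparisonKey( $title ) {
			return str_replace( '`', "'", $title );
		}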

A baseline implementation doesn't need to support normalization. But I do not think we can deploy without it, simply because Cognate would then not really work for French Wiktionary, which is one of the most active Wiktionaries, and one of the most vocal stakeholder groups regarding Wikidata integration of Wiktionary.

Hmm, okay!
I'm starting to think that an initial implementation should perhaps not normalize at all, and then add normalization on a case-by-case basis?
@Lydia_Pintscher @daniel @WMDE-leszek thoughts?

Yeah, that sounds sensible by now.

A baseline implementation doesn't need to support normalization. But I do not think we can deploy without it, simply because Cognate would then not really work for French Wiktionary, which is one of the most active Wiktionaries, and one of the most vocal stakeholder groups regarding Wikidata integration of Wiktionary.

That also makes sense. Do we have 1 or 2 normalization steps we can do to cover most of the French cases?

Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.

Why not solve the problem forever by making title a first-class entity in core, solving the space issues of title, *link, page_assessment, etc. at the same time? I am not saying you should solve this completely; just add the bare bones to make the extension work, and then let other features in core be added by someone else. This is not an uncommon request among extension developers, and right now they all have to create their own tables.

Change 312257 merged by jenkins-bot:
Use new db schema

https://gerrit.wikimedia.org/r/312257

Hi @jcrespo!

Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.

Why not solve the problem forever by making title a first-class entity in core, solving the space issues of title, *link, page_assessment, etc. at the same time?

Can you elaborate? I don't quite understand what you are proposing. Title is already a first-class entity; that's what the page table is. But that is per-wiki. What we need here is cross-wiki.

Are you proposing to use numeric IDs instead of title strings, even for pages that do not exist? That's worth considering. I have done this before on occasion, for instance for my thesis. Maintaining the mapping is a pain, though. I found that using a 64-bit hash of the title is a better solution than auto-increment: the mapping is implicit at least in one direction, and collisions are nearly impossible.

Do you mean the global lookup table for Cognate should be blocked on having a numeric representation for titles, to reduce the need for space? Actually, we could immediately use 64-bit hashes for the normalized keys... all we are interested in is equality anyway. That could work.
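
For what it's worth, deriving such a 64-bit key is a one-liner in PHP; a sketch (the FNV-1a algorithm and the function name are just example choices, not a decision):

		// Sketch: a 64-bit hash of the normalized key, usable as a fixed-width
		// column that only ever needs to be compared for equality.
		function cognateKeyHash( $normalizedKey ) {
			// 16 hex characters; could equally be stored as an unsigned BIGINT.
			return hash( 'fnv1a64', $normalizedKey );
		}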

But what does this have to do with page_assessment? That seems unrelated.

Title is already a first-class entity; that's what the page table is.

No, right now page is an entity that has a series of properties: a title, a text, etc. By setting title as a strong entity, it has meaning of its own: a page has exactly one title, and a title has zero or one pages. There are some issues with the namespace handling, but I will ignore those for now.

But yes, the idea is that a title can exist independently of whether a page exists. Titles are created when they are first referenced. While that can create a lot of garbage, I believe it will not be worse than the amount of duplicate references we have to non-existent page titles in the link* tables (e.g. red links), while being more efficient in space. E.g. linking to the template "Template:WPBannerMeta/importancescale" takes 28 bytes (I think the namespace prefix is not stored in templatelinks), and it is used 6 million times (that is 128 MB, without compression), while an identifier would only take 4 or 8 bytes. I do not have strong thoughts on the hashing, although I find "I found that using a 64-bit hash of the title is a better solution than auto-increment" strange, as in both cases the mapping would be one-way and impossible to revert. I do not intend to manage the IDs; we store them forever (append-only table).

What I mean is we could apply such a patch to core, and you could immediately use it for your own needs, and I would take care of other people using it (any help is welcome). For example, page assessment has its own duplicate table (for the projects namespace only) where normalization happens for titles, and it is not an uncommon request from other extensions, so duplicating efforts is a waste. I also predict it would halve our database storage requirements and I/O if it were applied to the link* tables. That is instantly double the db resources!

However, it is not an immediate change: there are cases where references to titles are actually references to pages, denormalized. In some cases references should change on page rename, in others they shouldn't (*links). Also, simple selects now become joins, and many people apparently don't like joins, even though in this case they would be faster, because they would run over smaller tables.
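
To sketch the shape of that idea (all names here are invented for illustration, not part of any actual patch): titles would live in one append-only table, and link rows would reference them by integer ID rather than repeating the text:

		// Hypothetical sketch of "title as a first-class entity".
		$pdo = new PDO( 'mysql:host=localhost;dbname=wiki', 'user', 'pass' );
		$pdo->exec( "
			CREATE TABLE title (
				title_id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
				title_namespace INT NOT NULL,
				title_text      VARBINARY(255) NOT NULL,
				UNIQUE KEY title_ns_text ( title_namespace, title_text )
			)
		" );
		// A link table would then store an 8-byte ID per reference instead of the
		// full title text, whether or not a page with that title currently exists.
		$pdo->exec( "
			CREATE TABLE pagelinks_by_id (
				pl_from     INT UNSIGNED NOT NULL,
				pl_title_id BIGINT UNSIGNED NOT NULL,
				PRIMARY KEY ( pl_from, pl_title_id )
			)
		" );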

Let's create a separate task for this.

This is now fully split down into sub tasks