
Review & work on Cognate extension
Closed, Resolved · Public · 34 Estimated Story Points

Event Timeline

For the record, some concerns that came up:

  • Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.
  • Normalization of page names before comparison.
  • Sorting of language links. We may want a separate extension for that.

For the record, some concerns that came up:

  • Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.

Right now it looks like this will be a central DB table that contains a mapping of language and title.
Some caching should probably be added here.
A primary key should also be added, to aid with the WMF replication setup / future row-based replication work!
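
Just to make the caching idea concrete, a rough sketch of a lookup with a thin cache in front of the central table (the table name, column names, and plain-PDO setup here are illustrative assumptions, not the actual schema):

		// Illustrative sketch only: memoize the central-table lookup so repeated
		// requests for the same title do not hit the database again.
		function getCognateLinks( PDO $pdo, $title ) {
			static $cache = [];
			if ( !isset( $cache[$title] ) ) {
				$stmt = $pdo->prepare(
					'SELECT cgnt_language, cgnt_title FROM cognate_pages WHERE cgnt_title = ?'
				);
				$stmt->execute( [ $title ] );
				$cache[$title] = $stmt->fetchAll( PDO::FETCH_ASSOC );
			}
			return $cache[$title];
		}

In production this would presumably be a shared cache rather than a static array, and the caller would drop its own language from the result, but the shape of the lookup is the same.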

  • Normalization of page names before comparison.

Yes, and this needs to be consistent across sites!

  • Sorting of language links. We may want a separate extension for that.

Indeed, there is a note currently in the code saying:

		// TODO Move InterwikiSorter class from the wikibase extension to its own extension and use it to sort the language links
		// See https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FWikibase/master/client%2Fincludes%2FInterwikiSorter.php

There are a bunch of other TODOs in the code too! :)

Addshore added a project: Patch-For-Review.

So I have been powering through some refactoring and adding functionality:

While looking at implementing the moves in https://gerrit.wikimedia.org/r/#/c/311696/ I realised that the current db table only covers a single namespace; however, the config allows multiple namespaces to be defined as cognate namespaces for a single site.

@daniel @Lydia_Pintscher @gabriel-wmde Do we have to cover the use case of multiple namespaces per site? Or is a single namespace okay?

Addshore set the point value for this task to 55. (Sep 20 2016, 1:45 PM)
Addshore changed the point value for this task from 55 to 34.

While looking at implementing the moves in https://gerrit.wikimedia.org/r/#/c/311696/ I realised that the current db table only covers a single namespace; however, the config allows multiple namespaces to be defined as cognate namespaces for a single site.

@daniel @Lydia_Pintscher @gabriel-wmde Do we have to cover the use case of multiple namespaces per site? Or is a single namespace okay?

I think we need to look at the existing namespaces. The main namespace would be handled automatically via this extension. What about its discussion pages? Help and similar namespaces would be handled via the regular Wikidata items.

So @WMDE-leszek managed to find https://de.wiktionary.org/wiki/Reim:Deutsch:-a%CA%A6%C9%99 which links to https://en.wiktionary.org/wiki/Rhymes:German/ats%C9%99

Do we want the Cognate extension to link these pages?

Namespaces that are not default also include (on en wiktionary):

  • Appendix
  • Concordance
  • Index
  • Rhymes
  • Wikisaurus
  • Sign gloss
  • Reconstruction

Is there a reason why only "main namespace" pages are mentioned in the above proposal?

@Addshore linking the rhymes namespaces would be desirable, but not trivial. I would suggest leaving it for later.

Stripping prefixes during normalization isn't too hard, but Cognate would need to know that "German/" is the equivalent prefix to "Deutsch:" - and so on for all languages.

I would consider this a feature request for later.

In the light of the above comments regarding prefixes and namespaces, a few thoughts about the database schema for connecting the pages. It seems we need the following fields:

  • cgnt_wiki: the wiki ID
  • cgnt_title: the page title (including namespace)
  • cgnt_key: a normalized version of the title (including namespace)

Here, cgnt_wiki+cgnt_title is unique; cgnt_wiki+cgnt_key is also unique. Pages to link are all rows with the same cgnt_key.
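
Purely to make those constraints concrete, a rough sketch (the table name, column types, and plain-PDO setup are illustrative assumptions, not a final schema):

		// Illustrative only: a "wide" variant of the table described above.
		// cgnt_wiki + cgnt_title and cgnt_wiki + cgnt_key are each unique,
		// as in the field list above.
		$pdo = new PDO( 'mysql:host=localhost;dbname=cognate', 'user', 'pass' );
		$pdo->exec( "
			CREATE TABLE cognate_titles (
				cgnt_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
				cgnt_wiki  VARBINARY(32)  NOT NULL,
				cgnt_title VARBINARY(255) NOT NULL,
				cgnt_key   VARBINARY(255) NOT NULL,
				UNIQUE KEY cgnt_wiki_title ( cgnt_wiki, cgnt_title ),
				UNIQUE KEY cgnt_wiki_key   ( cgnt_wiki, cgnt_key )
			)
		" );

		// Link targets for a page are then the rows sharing its normalized key.
		$stmt = $pdo->prepare(
			'SELECT cgnt_wiki, cgnt_title FROM cognate_titles WHERE cgnt_key = ?'
		);
		$stmt->execute( [ 'Bier' ] );
		$linkTargets = $stmt->fetchAll( PDO::FETCH_ASSOC );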

Note that potentially, multiple titles could get normalized to the same key, creating a conflict. This should be rare and would very likely be the result of a mistake, but the software needs to recover from such a situation gracefully, particularly when one of the conflicting pages gets renamed or deleted.

However, this table will become very tall, because it has roughly one entry per content page on all wiktionary projects combined. So we should try to make the rows less "broad", and remove redundant information. For instance:

  • cgnt_wiki: the wiki ID (int ID referencing another table)
  • cgnt_namespace: a "virtual" namespace id (int referencing another table that has namespace names and IDs for each virtual namespace, for each wiki)
  • cgnt_title: the page title (no namespace)
  • cgnt_key: a normalized version of the title (no namespace)

Here, cgnt_wiki+cgnt_namespace+cgnt_title is unique; cgnt_wiki+cgnt_namespace+cgnt_key is also unique. Pages to link are all rows with the same cgnt_namespace+cgnt_key. In order to construct the titles to link to, the actual per-wiki namespace IDs need to be looked up.

Also note that cgnt_title and cgnt_key are usually the same. We can potentially save a lot of room by setting cgnt_title to null unless it is different from cgnt_key. This means that we can't have a unique key on cgnt_wiki+cgnt_namespace+cgnt_title, so we can only query by key, not by title.

This approach makes constructing the actual page title more complex (use the title if not null, key otherwise), and updates for page deletion and renaming need to be based on the key, not the title. This may be problematic if there are multiple conflicting pages with the same key.
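
To illustrate that read path (a sketch only, reusing the invented table name from the earlier sketch but with the narrower columns described here; COALESCE stands in for "use the title if not null, key otherwise"):

		// Sketch: reconstruct link targets from the "narrow" layout. cgnt_title is
		// NULL whenever it equals cgnt_key, so the title is rebuilt on read.
		function getLinkTargets( PDO $pdo, $virtualNamespaceId, $normalizedKey ) {
			$stmt = $pdo->prepare(
				'SELECT cgnt_wiki, COALESCE( cgnt_title, cgnt_key ) AS title
				 FROM cognate_titles
				 WHERE cgnt_namespace = ? AND cgnt_key = ?'
			);
			$stmt->execute( [ $virtualNamespaceId, $normalizedKey ] );

			$targets = [];
			foreach ( $stmt->fetchAll( PDO::FETCH_ASSOC ) as $row ) {
				// The actual per-wiki namespace name still has to be looked up to
				// build the full title on the target wiki.
				$targets[ $row['cgnt_wiki'] ] = $row['title'];
			}
			return $targets;
		}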

To clarify: the result is that we need to be able to handle some other namespaces in the future but for now only do the main namespace. Correct?

Change 312257 had a related patch set uploaded (by Addshore):
Use new db schema

https://gerrit.wikimedia.org/r/312257

For reference: Langlink-entries not matching the page title, from the main namespace of de, en, and fr Wiktionary.

So the following library seems to be very useful here:
https://github.com/Behat/Transliterator
Looking at the lists linked above, this results in 2224/3238 cases normalized:

		// Transliterate the title to ASCII using the behat/transliterator library
		$string = Behat\Transliterator\Transliterator::transliterate( $string );
		// Strip hyphens (including separator characters introduced by the transliteration)
		$string = str_replace( '-', '', $string );

Methods in core just don't seem to cut it.
List of cases missed can be seen at P4120

While looking through that paste it looks like there are a few other special cases that can be handled, for example for French:

		if ( $sourceLanguage === 'fr' ) {
			// Strip French reflexive pronoun prefixes, e.g. "s’amuser" => "amuser"
			// and "se_lever" => "lever" (titles use "_" for spaces).
			// Only strip at the start of the title.
			if ( strpos( $string, 's’' ) === 0 ) {
				$string = substr( $string, strlen( 's’' ) );
			}
			if ( strpos( $string, 'se_' ) === 0 ) {
				$string = substr( $string, strlen( 'se_' ) );
			}
		}

Change 313003 had a related patch set uploaded (by Addshore):
Add normalization for titles => keys

https://gerrit.wikimedia.org/r/313003

I just realized: When an entry is added to or removed from the central table, all pages in the table with the same key (including the added or removed entry) need to trigger a purge for the respective wiki page.

There should probably be a ticket for this :)

Maybe I read the code wrong, but we don't want to normalize the case of the words, e.g. "Clause" and "clause": the interwiki from [[de:Clause]] to [[ar:clause]] is wrong.

Maybe I read the code wrong, but we don't want to normalize the case of the words, e.g. "Clause" and "clause": the interwiki from [[de:Clause]] to [[ar:clause]] is wrong.

Hmm, okay!
I'm starting to think that an initial implementation should perhaps not normalize at all, and then add normalization on a case-by-case basis?
@Lydia_Pintscher @daniel @WMDE-leszek thoughts?

I just realized: When an entry is added to or removed from the central table, all pages in the table with the same key (including the added or removed entry) need to trigger a purge for the respective wiki page.

There should probably be a ticket for this :)

Ahh yes, there should be! I will make some sub tickets today!

I'm starting to think that an initial implementation should perhaps not normalize at all, and then add normalization on a case-by-case basis?
@Lydia_Pintscher @daniel @WMDE-leszek thoughts?

Normalization should be limited to what is absolutely needed. This is not a search index where you want to find as many sensible matches as possible. The match needs to be exact, except for a very few cases dictated by differing local policies, such as using "`" instead of "'".
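
As a deliberately tiny example of the kind of single substitution meant here, taking the character pair from the sentence above (the function name is invented):

		// Minimal sketch: fold one apostrophe variant into the other before
		// computing the comparison key, and do nothing else. Further rules would
		// only be added when a concrete local policy requires them.
		function cognateComparisonKey( $title ) {
			return str_replace( '`', "'", $title );
		}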

A baseline implementation doesn't need to support normalization. But I do not think we can deploy without it, simply because Cognate would then not really work for French Wiktionary, which is one of the most active Wiktionaries, and one of the most vocal stakeholder groups regarding Wikidata integration of Wiktionary.

Hmm, okay!
I'm starting to think that an initial implementation should perhaps not normalize at all, and then add normalization on a case-by-case basis?
@Lydia_Pintscher @daniel @WMDE-leszek thoughts?

Yeah, that sounds sensible by now.

A baseline implementation doesn't need to support normalization. But I do not think we can deploy without it, simply because Cognate would then not really work for French Wiktionary, which is one of the most active Wiktionaries, and one of the most vocal stakeholder groups regarding Wikidata integration of Wiktionary.

That also makes sense. Do we have 1 or 2 normalization steps we can do to cover most of the French cases?

Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.

Why not solve the problem forever by making title a first-class entity in core, solving the space issues of title, *link, page_assessment, etc. at the same time? I am not saying you should solve this completely; just add the bare bones to make the extension work, and then let other features in core be added by someone else. This is not an uncommon request among extension developers, and right now they all have to create their own tables.

Change 312257 merged by jenkins-bot:
Use new db schema

https://gerrit.wikimedia.org/r/312257

Hi @jcrespo!

Performance and scalability. We need a way to efficiently track and query page names across all Wiktionaries.

Why not solve the problem forever by making title a first-class entity in core, solving the space issues of title, *link, page_assessment, etc. at the same time?

Can you elaborate? I don't quite understand what you are proposing. Title is already a first-class entity; that's what the page table is. But that is per-wiki. What we need here is cross-wiki.

Are you proposing to use numeric IDs instead of title strings, even for pages that do not exist? That's worth considering. I have done this before on occasion, for instance for my thesis. Maintaining the mapping is a pain, though. I found that using a 64-bit hash of the title is a better solution than auto-increment: the mapping is implicit at least in one direction, and collisions are nearly impossible.

Do you mean the global lookup table for Cognate should be blocked on having a numeric representation for titles, to reduce the need for space? Actually, we could immediately use 64-bit hashes for the normalized keys... all we are interested in is equality anyway. That could work.
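
For what it's worth, deriving such a 64-bit key is a one-liner in PHP; a sketch (the FNV-1a algorithm and the function name are just example choices, not a decision):

		// Sketch: a 64-bit hash of the normalized key, usable as a fixed-width
		// column that only ever needs to be compared for equality.
		function cognateKeyHash( $normalizedKey ) {
			// 16 hex characters; could equally be stored as an unsigned BIGINT.
			return hash( 'fnv1a64', $normalizedKey );
		}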

But what does this have to do with page_assessment? That seems unrelated.

Title is already a first-class entity; that's what the page table is.

No, right now page is an entity that has a series of properties: a title, a text, etc. By setting title as a strong entity, it has meaning of its own: a page has exactly one title, and a title has zero or one pages. There are some issues with the namespace handling, but I will ignore those for now.

But yes, the idea is that a title can exist independently of whether a page exists. Titles are created when they are first referenced. While that can create a lot of garbage, I believe it will not be worse than the amount of duplicate references we have to non-existent page titles in the link* tables (e.g. red links), while being more efficient in space. E.g. linking to the template "Template:WPBannerMeta/importancescale" takes 28 bytes (I think the namespace prefix is not stored in templatelinks), and it is used 6 million times (that is 128 MB, without compression), while an identifier would only take 4 or 8 bytes. I do not have strong thoughts on the hashing, although I find "I found that using a 64-bit hash of the title is a better solution than auto-increment" strange, as in both cases the mapping would be one-way and impossible to revert. I do not intend to manage the IDs; we store them forever (append-only table).

What I mean is we could apply such a patch to core, and you could immediately use it for your own needs, and I would take care of other people using it (any help is welcome). For example, page assessment has its own duplicate table (for the projects namespace only) where normalization happens for titles, and it is not an uncommon request from other extensions, so duplicating efforts is a waste. I also predict it would halve our database storage requirements and I/O if it were applied to the link* tables. That is instantly double the db resources!

However, it is not an immediate change: there are cases where references to titles are actually references to pages, denormalized. In some cases references should change on page rename, in others they shouldn't (*links). Also, simple selects now become joins, and many people apparently don't like joins, even though in this case they would be faster, because they would run over smaller tables.
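
To sketch the shape of that idea (all names here are invented for illustration, not part of any actual patch): titles would live in one append-only table, and link rows would reference them by integer ID rather than repeating the text:

		// Hypothetical sketch of "title as a first-class entity".
		$pdo = new PDO( 'mysql:host=localhost;dbname=wiki', 'user', 'pass' );
		$pdo->exec( "
			CREATE TABLE title (
				title_id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
				title_namespace INT NOT NULL,
				title_text      VARBINARY(255) NOT NULL,
				UNIQUE KEY title_ns_text ( title_namespace, title_text )
			)
		" );
		// A link table would then store an 8-byte ID per reference instead of the
		// full title text, whether or not a page with that title currently exists.
		$pdo->exec( "
			CREATE TABLE pagelinks_by_id (
				pl_from     INT UNSIGNED NOT NULL,
				pl_title_id BIGINT UNSIGNED NOT NULL,
				PRIMARY KEY ( pl_from, pl_title_id )
			)
		" );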

Let's create a separate task for this.

This is now fully split down into sub tasks