Fixing redirects for Cognate (step 1)
Open, High, Public

Description

Right now Cognate discards redirects and doesn't show sitelinks to redirect pages. This task is for the first part of fixing this issue. After this at least one direction of the linking should work. It is illustrated in red in the following sketch:

Wikitiki89 added a comment. Edited May 11 2017, 8:45 PM

Just to clarify, we don't want "color" in Wiki 1 to link directly to "colow" in Wiki 2, but rather it should link to "color" in Wiki 2 and then the redirect should take place as usual. In other words we just want redirects to be treated as ordinary pages.

Yes that is the goal of this task. Sorry if the sketch doesn't make that clear.

There is one thing to be careful about here: The combination of redirects and normalization.

As far as I know, it's quite frequent to have redirects to the normalized version of a title. For instance, if wiki 2 follows the convention of using the ellipsis character ("…") in titles instead of three dots ("..."), they may have a redirect from "Foo..." (with three dots) to "Foo…" with an ellipsis.

Cognate will also recognize these two titles as equivalent (redirect or not) because of the normalization rules. So, if wiki 1 has a page called "Foo..." (with dots), Cognate will add language links to both the actual page on wiki 2 ("Foo…" with an ellipsis) and the redirect on wiki 2 ("Foo..." with dots). That is the consequence of Cognate applying normalization while also treating redirects like normal pages.

Ideally, there would be a rule like "if you find an actual page to link to, ignore all the redirects to that page". But I currently do not see a way to do this efficiently, without asking each client database for redirect information. Cognate would have to track redirects in its own central database table - possible, but not trivial. And database changes need time.
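
As a rough illustration of the rule described above, here is a small Python sketch (Cognate itself is PHP). It assumes that a central record of redirects and their targets is available, which is exactly what Cognate does not have today. The `Page` type, its `redirect_target` field and the function `drop_redundant_redirects` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical record of one page on a wiki (not a Cognate class)."""
    title: str
    redirect_target: Optional[str] = None  # None means an actual, non-redirect page


def drop_redundant_redirects(candidates: List[Page]) -> List[Page]:
    """Drop every redirect whose target is already among the link candidates."""
    actual_titles = {p.title for p in candidates if p.redirect_target is None}
    return [
        p for p in candidates
        if p.redirect_target is None or p.redirect_target not in actual_titles
    ]


# Wiki 2 has the page "Foo…" and a redirect "Foo..." pointing at it.
wiki2 = [Page("Foo…"), Page("Foo...", redirect_target="Foo…")]
links = drop_redundant_redirects(wiki2)
```

With this input only "Foo…" remains as a link target. A redirect whose target is not among the candidates is kept, so it would still be treated like an ordinary page.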

I seem to recall that this issue was the original reason for ignoring redirects.

That brings us to another point. We (at English Wiktionary at least) do not want this normalization feature.

It was added on the specific request of (I think) the French Wiktionary community. Since the language links are supposed to be symmetrical and 1:1, you'll have to agree on doing it either way. It cannot be different per-wiki.

When discussing this, please keep in mind that any change to the normalization means we have to completely rebuild the Cognate database.

Can you explain a bit more why you do not want it?

Wikitiki89 added a comment. Edited May 12 2017, 2:10 PM

It's the result of many community discussions, both past ones and present ones (see this one, for example). Cognate doesn't really do anything that bots couldn't do previously. It is not as though we previously couldn't have done this normalization by bots, but rather we specifically chose not to allow it. The fact that Cognate enables us to generate links automatically without running bots does not nullify our previous decision.

Now if you want more details about the actual reasoning, one reason is that on English Wiktionary (I don't know the details of French Wiktionary) we can have an entry for both "Foo..." (with three dots) and "Foo…" (with an ellipsis). And this probably applies to almost any character you might want to normalize. This is one of the reasons we want the redirects working in the first place, so we can control these links by creating redirects in proper places.

Thanks for your feedback. I'd like to start a discussion across the 3 main Wiktionaries (en, fr and de) to resolve these questions about redirects. I'm sure we can find a common solution that will allow Cognate to be efficient while fitting the rules and processes of the Wiktionaries.
As the Wikimedia hackathon and Wikicite are happening this week and I'll be quite busy during this period, I'll start this discussion on June 1st. In the meantime, no changes other than bug fixes will be made.

@Wikitiki89 Do you have examples of a Foo... <-> Foo… article (or with different apostrophes)? The only articles I can think of would be those about the characters themselves.

If this is a problem for a relatively large number of articles, then we need to discuss it further. If not, then we can just override the interwikis manually in the handful of articles involved.

NB: fr did this normalization by bot.

Is there a list somewhere of the characters that would be normalized? That would help me find examples and assess their frequency.

This certainly needs to be discussed. It has been a longstanding policy on the English Wiktionary specifically not to allow such normalization and we do not appreciate having a feature like this forced down our throats without discussion.

-sche added a subscriber: -sche. May 18 2017, 10:01 PM

Whether it's bad to link pages on one wiki with one character to pages on another wiki with another character, depends on which characters would be treated as equivalent.

In T987, I mention some characters that are different: for geresh, he.Wikt standardizes apostrophes (ר') where en.Wikt standardizes geresh (ר׳); for palochkas, different wikis (sometimes internally-inconsistently) use upper- or lowercase palochka, I, І, 1, or other characters. I also mention an example of straight and curly apostrophes contrasting: there are separate pages on en.Wikt for Mopan Maya ka'an ("sky") vs Yucatec Maya ka’an ("sky"). However, I think based on discussion en.Wikt needs to fix the ka'ans by standardizing on one apostrophe. (And I think it's silly other wikis don't encode geresh as geresh and palochka as palochka.) I haven't seen a reason to think it'd cause problems for valid entries if the extension automatically linked straight-vs-curly apostrophes, and it'd save the trouble of continually needing to find which pages need redirects created for them. There may be more people on en.Wikt receptive to automatically linking different apostrophes than comments above suggest.

OTOH, hard redirects like vi.Wikt uses for xoá→xóa (mentioned in the above-linked ticket) would never be used by us on en.Wikt because they could be separate words in other languages, like Icelandic sóa and Hungarian soá are.

Here is the current normalization map from Cognate\StringNormalizer:

	private $replacements = [
		'’' => '\'',
		'…' => '...',
		' ' => '_',
	];

This maps:

  • right single quotation mark (U+2019) to the ASCII apostrophe
  • horizontal ellipsis (U+2026) to three dots
  • spaces to underscores, like MediaWiki always does.
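
For readers who want to try the mapping, here is a minimal Python sketch of the same three replacements (the extension itself is PHP). The name `normalize_title` is made up for this example, and the code is only an illustration of the map above, not a port of the real StringNormalizer.

```python
# The three replacements from the map above, written with explicit code points.
REPLACEMENTS = {
    "\u2019": "'",    # right single quotation mark -> ASCII apostrophe
    "\u2026": "...",  # horizontal ellipsis -> three dots
    " ": "_",         # space -> underscore
}


def normalize_title(title: str) -> str:
    """Apply each replacement in turn (hypothetical helper, for illustration only)."""
    for source, target in REPLACEMENTS.items():
        title = title.replace(source, target)
    return title
```

Under these rules "Foo…" and "Foo..." end up with the same normalized form, as do "ka’an" and "ka'an".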

According to our analysis of existing language links, these normalization rules seem to cover nearly all cases in which the link is between pages that don't have exactly the same title. The remaining handful of pages can be linked manually.

However, the question has now been raised whether these rules will cause too many language links to be inferred. This would happen if there are two (non-redirect) pages on the same wiki that would have the same title after applying these rules.
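
The condition can be sketched in a few lines of Python (not the extension's PHP): group the titles of one wiki by their normalized form and report every group with more than one member. The names `normalize_title` and `find_conflicts` are hypothetical, and the input is assumed to be a plain list of non-redirect titles from a single wiki.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Same three replacements as the normalization map, as a translation table.
_TABLE = str.maketrans({"\u2019": "'", "\u2026": "...", " ": "_"})


def normalize_title(title: str) -> str:
    """Hypothetical stand-in for the normalization step."""
    return title.translate(_TABLE)


def find_conflicts(titles: Iterable[str]) -> List[List[str]]:
    """Return groups of titles on one wiki that share a normalized form."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        groups[normalize_title(title)].append(title)
    return [sorted(group) for group in groups.values() if len(group) > 1]


conflicts = find_conflicts(["...", "\u2026", "foo", "'", "\u2019"])
```

Here `conflicts` contains two groups, one for the dots/ellipsis pair and one for the two apostrophes, while "foo" is not reported.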

Udo_T added a subscriber: Udo_T. May 19 2017, 9:00 AM

The following database query shows all pairs of page names on English wiktionary that would be conflicting based on these normalization rules:

mysql:wikiadmin@10.64.16.18 [cognate_wiktionary]> SELECT a.cgti_raw, b.cgti_raw
    -> FROM cognate_titles as a
    ->   JOIN cognate_titles as b ON a.cgti_normalized_key = b.cgti_normalized_key
    ->   JOIN cognate_pages as p ON p.cgpa_title = a.cgti_raw_key 
    ->     and p.cgpa_namespace = 0 and p.cgpa_site = 8711873510529828948
    ->   JOIN cognate_pages as q ON q.cgpa_title = b.cgti_raw_key 
    ->     and q.cgpa_namespace = 0 and q.cgpa_site = 8711873510529828948
    ->   WHERE a.cgti_raw_key < b.cgti_raw_key 
    -> LIMIT 10;
+---------------------------+-----------------------------+
| cgti_raw                  | cgti_raw                    |
+---------------------------+-----------------------------+
| дев'ятнадцять             | дев’ятнадцять               |
| ...                       | …                           |
| '_'                       | ’_’                         |
| ’                         | '                           |
| lu’um                     | lu'um                       |
| ni'                       | ni’                         |
+---------------------------+-----------------------------+
6 rows in set (39.45 sec)

I suppose it would be ok to manage the language links for 12 pages manually.

-sche added a comment. May 21 2017, 3:29 AM

I've almost finished standardizing the Maya entries I mentioned, consolidating ~4 pairs of entries prior to your post. :-) Now I just consolidated 3 of the pairs you mention. The other 6 entries, for the individual characters, will probably be kept separate.

I think no Latin-script entries on en.Wikt are supposed to use ’ except the entry ’ itself, so new pairs should not arise in Latin script. En.Wikt's Macedonian entries are currently standardized on ’ (though this could be changed) while Russian entries use ', so there is the potential for a few conflicting pairs to arise as our coverage of Macedonian and Russian increases, but the number of such pairs should always be small. Mg.Wikt and fr.Wikt may also have a few pairs of conflicting entries which you might alert them to fix.

Perhaps there should be a discussion/poll on Meta, advertised on all Wiktionaries, about whether or not to enable this feature, to ensure all wikis have their say?

Will the normalization/linking function keep track (accessibly) of cases where it encounters too many pages, e.g. encounters both curly ’ and straight ' on one wiki, so that wikis can know which pages they need to maintain manual interwiki links for?

The data for enwiktionary now looks as below:

+----------+----------+
| cgti_raw | cgti_raw |
+----------+----------+
| ...      | …        |
| '_'      | ’_’      |
| ’        | '        |
+----------+----------+
3 rows in set (35.95 sec)

Is this something that would be useful to have for all wikis?

Thibaut120094 added a comment. Edited May 21 2017, 11:24 AM

Btw, [[’]] should only show interwiki links to [[’]], not links to [[’]] and [[']] like on https://fr.wiktionary.org/w/index.php?title=%E2%80%99&oldid=23047369

Same for [[...]], [[…]] and [[']].

Showing two links for the same language doesn't make any sense for the reader (see capture).

I added manual interwiki links in the linked page: as a result all language links are unique (just like for [[...]]).

Showing two links for the same language doesn't make any sense for the reader (see capture).

True. Cognate could detect this situation, and only show the link that exactly matches the local page's title.

However, this would hide a potential error. Maybe it's better to have this visible, so people can notice and fix it?
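
One way to get both behaviours is sketched below in Python (not the extension's PHP): show only the exactly matching link, but also return the ambiguous cases so they could be listed somewhere for editors to fix. The function `pick_links` and its input shape (language code mapped to the titles that share the local page's normalized key) are assumptions made for this example. Leaving a language out when there is no exact match is likewise just one possible choice.

```python
from typing import Dict, List, Tuple


def pick_links(
    local_title: str, candidates: Dict[str, List[str]]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Choose one link per language and collect the ambiguous cases.

    Hypothetical helper: `candidates` maps a language code to all titles on
    that wiki that share the local page's normalized key.
    """
    shown: Dict[str, str] = {}
    ambiguous: Dict[str, List[str]] = {}
    for lang, titles in candidates.items():
        if not titles:
            continue
        if len(titles) == 1:
            shown[lang] = titles[0]
            continue
        # Several titles collide on this wiki: remember them for reporting.
        ambiguous[lang] = list(titles)
        if local_title in titles:
            # Prefer the title that matches the local page exactly.
            shown[lang] = local_title
    return shown, ambiguous


shown, ambiguous = pick_links("\u2019", {"en": ["'", "\u2019"], "de": ["'"]})
```

In this example the page [[’]] gets one link per language, and the English pair is still recorded in `ambiguous` so that the conflict stays visible to whoever reviews that list.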

Hello all, just to mention that I created a discussion topic here, so we can find a consensus within the different communities. Feel free to summarize your point of view and your concerns there. Thanks a lot!

If some wikis need to change something, they should be notified individually.

daniel added a comment. Jun 1 2017, 2:57 PM

@Nemo_bis whether or not they need to change something depends on their local conventions. We have no way to know and understand such local conventions for all Wiktionaries.

We did our best to make this non-disruptive, but there are always edge cases. The best we can do is inform people of what we intend to do, and listen to their feedback. Quite often, people only notice issues once a feature has gone live.