Fixing redirects for Cognate (step 1)
Closed, Resolved · Public · 5 Estimated Story Points

Description

As an editor I want to have as many Wiktionary pages connected to each other across language versions as possible in order to make accessing them easy.

Problem:
Right now Cognate discards redirects and doesn't show interwiki links to redirect pages. This task is for the first part of fixing this issue. After this at least one direction of the linking should work. It is illustrated in red in the following sketch:

[Image: IMG_20170511_180716031.jpg]

BDD
GIVEN Wiktionary 1 with a page titled "color"
AND Wiktionary 2 with two pages titled "color" and "colour"
AND the page "color" redirecting to "colour" on Wiktionary 2
WHEN viewing the interwiki links in Wiktionary 1
THEN an interwiki link to the page for "color" on Wiktionary 2 is shown in the sidebar

Acceptance criteria:

  • interwiki links to redirect pages are accepted by Cognate and shown in the sidebar

Notes:

  • We have an intentional check in the code to prevent this case. This task is about removing the check.
  • Feedback so far suggests the second step is probably not wanted, so for now we will only do this first step of the two.

Event Timeline

Just to clarify, we don't want "color" in Wiki 1 to link directly to "colow" in Wiki 2, but rather it should link to "color" in Wiki 2 and then the redirect should take place as usual. In other words we just want redirects to be treated as ordinary pages.

Yes that is the goal of this task. Sorry if the sketch doesn't make that clear.
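
As a rough illustration of the scenario in the description, the following Python sketch models pages in memory and shows the effect of dropping the redirect check. The `Page` class, the `language_links` function and the site names `wikt1`/`wikt2` are hypothetical and are not taken from the Cognate code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    site: str
    title: str
    is_redirect: bool = False


def language_links(local, all_pages, skip_redirects):
    """Return (site, title) pairs on other sites whose title equals the local title."""
    links = []
    for page in all_pages:
        if page.site == local.site or page.title != local.title:
            continue
        if skip_redirects and page.is_redirect:
            continue  # behaviour before this task: redirects are discarded
        links.append((page.site, page.title))
    return links


pages = [
    Page("wikt1", "color"),
    Page("wikt2", "color", is_redirect=True),  # redirects to "colour"
    Page("wikt2", "colour"),
]
local = pages[0]

print(language_links(local, pages, skip_redirects=True))
print(language_links(local, pages, skip_redirects=False))
```

With the check in place no link is produced; without it, "color" on Wiktionary 1 links to the redirect "color" on Wiktionary 2, never directly to the redirect target.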

There is one thing to be careful about here: The combination of redirects and normalization.

As far as I know, it's quite frequent to have redirects to the normalized version of a title. For instance, if wiki 2 follows the convention of using the ellipsis character ("…") in titles instead of three dots ("..."), they may have a redirect from "Foo..." (with three dots) to "Foo…" with an ellipsis.

Cognate will also recognize these two titles as equivalent (redirect or not) because of the normalization rules. So, if wiki 1 has a page called "Foo..." (with dots), Cognate will add language links to both the actual page on wiki 2 ("Foo…" with an ellipsis) and the redirect on wiki 2 ("Foo..." with dots). That's the consequence of Cognate applying normalization and at the same time treating redirects like normal pages.
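
The double link can be reproduced with a small Python sketch. It assumes the three replacement rules quoted later in this thread from Cognate\StringNormalizer; the `normalize` function and the sample titles are illustrative only.

```python
# Replacement rules as quoted in this thread; applying them in Python is an
# illustration, not the extension's actual implementation.
REPLACEMENTS = {"\u2019": "'", "\u2026": "...", " ": "_"}


def normalize(title):
    for old, new in REPLACEMENTS.items():
        title = title.replace(old, new)
    return title


# Wiki 2 has the real page (with an ellipsis) and a redirect (with three dots).
wiki2_titles = ["Foo\u2026", "Foo..."]
local_title = "Foo..."  # page on wiki 1

matches = [t for t in wiki2_titles if normalize(t) == normalize(local_title)]
print(matches)  # both wiki 2 titles share the local title's normalized key
```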

Ideally, there would be a rule like "if you find an actual page to link to, ignore all the redirects to that page". But I currently do not see a way to do this efficiently, without asking each client database for redirect information. Cognate would have to track redirects in its own central database table - possible, but not trivial. And database changes need time.
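
A minimal sketch of such a rule, assuming Cognate had access to a central redirect table. The `redirect_targets` mapping is hypothetical; as noted above, no such table exists yet.

```python
def drop_redundant_redirects(candidates, redirect_targets):
    """Filter link candidates on one site that share a normalized key.

    candidates: titles on the target site.
    redirect_targets: hypothetical mapping of redirect title -> target title.
    A redirect is dropped only if its target is itself a candidate.
    """
    present = set(candidates)
    return [t for t in candidates if redirect_targets.get(t) not in present]


# The redirect "Foo..." points at "Foo…", which is also a candidate: drop it.
print(drop_redundant_redirects(["Foo\u2026", "Foo..."], {"Foo...": "Foo\u2026"}))

# "color" redirects to "colour", which is not a candidate: keep the redirect.
print(drop_redundant_redirects(["color"], {"color": "colour"}))
```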

I seem to recall that this issue was the original reason for ignoring redirects.

That brings us to another point. We (at English Wiktionary at least) do not want this normalization feature.

It was added on the specific request of (I think) the French Wiktionary community. Since the language links are supposed to be symmetrical and 1:1, you'll have to agree on doing it either way. It cannot be different per-wiki.

When discussing this, please keep in mind that any change to the normalization means we have to completely rebuild the Cognate database.

Can you explain a bit more why you do not want it?

It's the result of many community discussions, both past ones and present ones (see this one, for example). Cognate doesn't really do anything that bots couldn't do previously. It is not as though we previously couldn't have done this normalization by bots, but rather we specifically chose not to allow it. The fact that Cognate enables us to generate links automatically without running bots does not nullify our previous decision.

Now if you want more details about the actual reasoning, one reason is that on English Wiktionary (I don't know the details of French Wiktionary) we can have an entry for both "Foo..." (with three dots) and "Foo…" (with an ellipsis). And this probably applies to almost any character you might want to normalize. This is one of the reasons we want the redirects working in the first place, so we can control these links by creating redirects in proper places.

Thanks for your feedback. I'd like to start a discussion across the 3 main Wiktionaries (en, fr and de) to resolve these questions about redirects. I'm sure we can find a common solution that will allow Cognate to be efficient while fitting the rules and processes of the Wiktionaries.
As the Wikimedia hackathon and Wikicite are happening this week and I'll be quite busy during this period, I'll start this discussion on June 1st. In the meantime, no changes (other than bug fixes) will be made.

@Wikitiki89 Do you have examples of a Foo... <-> Foo… article (or with different apostrophes)? The only articles I can think of would be those about the character themselves.

If this is a problem for a relatively large number of articles, then we need to discuss it further. If not, then we can just override the interwikis manually in the handful of articles involved.

NB: fr did this normalization by bot.

Is there a list somewhere of the characters that would be normalized? That would help me find examples and assess their frequency.

This certainly needs to be discussed. It has been a longstanding policy on the English Wiktionary specifically not to allow such normalization and we do not appreciate having a feature like this forced down our throats without discussion.

Whether it's bad to link pages on one wiki with one character to pages on another wiki with another character depends on which characters would be treated as equivalent.

In T987, I mention some characters that are different: for geresh, he.Wikt standardizes apostrophes (ר') where en.Wikt standardizes geresh (ר׳); for palochkas, different wikis (sometimes internally-inconsistently) use upper- or lowercase palochka, I, І, 1, or other characters. I also mention an example of straight and curly apostrophes contrasting: there are separate pages on en.Wikt for Mopan Maya ka'an ("sky") vs Yucatec Maya ka’an ("sky"). However, I think based on discussion en.Wikt needs to fix the ka'ans by standardizing on one apostrophe. (And I think it's silly other wikis don't encode geresh as geresh and palochka as palochka.) I haven't seen a reason to think it'd cause problems for valid entries if the extension automatically linked straight-vs-curly apostrophes, and it'd save the trouble of continually needing to find which pages need redirects created for them. There may be more people on en.Wikt receptive to automatically linking different apostrophes than comments above suggest.

OTOH, hard redirects like vi.Wikt uses for xoá→xóa (mentioned in the above-linked ticket) would never be used by us on en.Wikt because they could be separate words in other languages, like Icelandic sóa and Hungarian soá are.

Here is the current normalization map from Cognate\StringNormalizer:

	private $replacements = [
		'’' => '\'',
		'…' => '...',
		' ' => '_',
	];

This maps:

  • right single quotation mark (U+2019) to the ASCII apostrophe
  • horizontal ellipsis (U+2026) to three dots
  • spaces to underscores, like MediaWiki always does.

According to our analysis of existing language links, these normalization rules seem to cover nearly all cases in which the link is between pages that don't have exactly the same title. The remaining handful of pages can be linked manually.

However, the question has now been raised whether these rules will lead to too many language links being inferred. This would happen if there are two (non-redirect) pages on the same wiki that would have the same title after applying these rules.

The following database query shows all pairs of page names on English Wiktionary that would conflict under these normalization rules:

mysql:wikiadmin@10.64.16.18 [cognate_wiktionary]> SELECT a.cgti_raw, b.cgti_raw
    -> FROM cognate_titles as a
    ->   JOIN cognate_titles as b ON a.cgti_normalized_key = b.cgti_normalized_key
    ->   JOIN cognate_pages as p ON p.cgpa_title = a.cgti_raw_key 
    ->     and p.cgpa_namespace = 0 and p.cgpa_site = 8711873510529828948
    ->   JOIN cognate_pages as q ON q.cgpa_title = b.cgti_raw_key 
    ->     and q.cgpa_namespace = 0 and q.cgpa_site = 8711873510529828948
    ->   WHERE a.cgti_raw_key < b.cgti_raw_key 
    -> LIMIT 10;
+---------------------------+-----------------------------+
| cgti_raw                  | cgti_raw                    |
+---------------------------+-----------------------------+
| дев'ятнадцять             | дев’ятнадцять               |
| ...                       | …                           |
| '_'                       | ’_’                         |
| ’                         | '                           |
| lu’um                     | lu'um                       |
| ni'                       | ni’                         |
+---------------------------+-----------------------------+
6 rows in set (39.45 sec)

I suppose it would be ok to manage the language links for 12 pages manually.
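
For wikis that want to run the same check on a local list of titles, the idea behind the query (group titles by normalized key, then list pairs within each group) can be sketched in Python. The `normalize` and `conflicting_pairs` functions are illustrative; they do not touch the Cognate tables and only assume the three replacement rules listed above.

```python
from collections import defaultdict
from itertools import combinations

REPLACEMENTS = {"\u2019": "'", "\u2026": "...", " ": "_"}


def normalize(title):
    for old, new in REPLACEMENTS.items():
        title = title.replace(old, new)
    return title


def conflicting_pairs(titles):
    """Return pairs of distinct titles on one wiki that share a normalized key."""
    by_key = defaultdict(list)
    for title in titles:
        by_key[normalize(title)].append(title)
    pairs = []
    for group in by_key.values():
        pairs.extend(combinations(sorted(group), 2))
    return pairs


sample = ["...", "\u2026", "'", "\u2019", "cat"]
print(conflicting_pairs(sample))
```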

I've almost finished standardizing the Maya entries I mentioned, consolidating ~4 pairs of entries prior to your post. :-) Now I just consolidated 3 of the pairs you mention. The other 6 entries, for the individual characters, will probably be kept separate.

I think no Latin-script entries on en.Wikt are supposed to use ’ except the entry ’ itself, so new pairs should not arise in Latin script. En.Wikt's Macedonian entries are currently standardized on ’ (though this could be changed) while Russian entries use ', so there is the potential for a few conflicting pairs to arise as our coverage of Macedonian and Russian increases, but the number of such pairs should always be small. Mg.Wikt and fr.Wikt may also have a few pairs of conflicting entries which you might alert them to fix.

Perhaps there should be a discussion/poll on Meta, advertised on all Wiktionaries, about whether or not to enable this feature, to ensure all wikis have their say?

Will the normalization/linking function keep track (accessibly) of cases where it encounters too many pages, e.g. encounters both curly ’ and straight ' on one wiki, so that wikis can know which pages they need to maintain manual interwiki links for?

The data for enwiki now looks as below:

+----------+----------+
| cgti_raw | cgti_raw |
+----------+----------+
| ...      | …        |
| '_'      | ’_’      |
| ’        | '        |
+----------+----------+
3 rows in set (35.95 sec)

Is this something that would be useful to have for all wikis?

Btw, [[’]] should only show interwiki links to [[’]], not links to [[’]] and [[']] like on https://fr.wiktionary.org/w/index.php?title=%E2%80%99&oldid=23047369

Same for [[...]], [[…]] and [[']].

Showing two links for the same language doesn't make any sense for the reader (see capture).

[Image: image.png]

I added manual interwiki links in the linked page: as a result all language links are unique (just like for [[...]]).

True. Cognate could detect this situation, and only show the link that exactly matches the local page's title.

However, this would hide a potential error. Maybe it's better to have this visible, so people can notice and fix it?
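
One way to reconcile both concerns is to prefer the exact match for display while still reporting that several candidates existed. This Python sketch is only an illustration; `pick_link` and its return shape are hypothetical and not part of Cognate.

```python
def pick_link(local_title, candidates):
    """Choose one link per target site and flag ambiguity.

    candidates: titles on the target site that share the local title's
    normalized key. Returns (chosen_title_or_None, ambiguous).
    """
    ambiguous = len(candidates) > 1
    if local_title in candidates:
        return local_title, ambiguous  # exact match wins
    return (candidates[0] if candidates else None), ambiguous


print(pick_link("\u2019", ["'", "\u2019"]))  # exact match chosen, flagged as ambiguous
print(pick_link("Foo...", ["Foo\u2026"]))    # single candidate, not ambiguous
```

The `ambiguous` flag could feed a tracking list so that editors can still notice and fix such pages even though only one link is shown.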

Hello all, just to mention that I created a discussion topic here, so we can find a consensus within the different communities. Feel free to summarize your point of view and your concerns there. Thanks a lot!

If some wikis need to change something, they should be notified individually.

@Nemo_bis whether or not they need to change something depends on their local conventions. We have no way to know and understand such local conventions for all Wiktionaries.

We did our best to make this non-disruptive, but there are always edge-cases. The best we can do is inform people of what we intend to do, and listen to their feedback. Quite often, people only notice issues once a feature has gone live.

In order to move forward, the description of the tasks needs to be improved, with a clearer description of what the need is and what we need to do.

Four years later, a solution is still needed for " https://sv.wiktionary.org/wiki/When_in_Rome,_do_as_the_Romans_do. " vs " https://en.wiktionary.org/wiki/when_in_Rome,_do_as_the_Romans_do ", either by allowing links to redirects or by some other trick. Most Wiktionaries have restrictive policies when it comes to redirects. For example, "colour" redirecting to "color" is prohibited on many Wiktionaries and thus probably not an issue. Those are NOT the same page. But "When_in_Rome,_do_as_the_Romans_do." and "when_in_Rome,_do_as_the_Romans_do" are the same.

There is no obvious solution for how to deal with proverbs on Wiktionaries. EN wikt has a policy to remove final punctuation and avoid capitalizing the first letter of a sentence, so the lemma form is " when_in_Rome,_do_as_the_Romans_do ". SV wikt has a policy resulting in " When_in_Rome,_do_as_the_Romans_do. " (uppercase "W" and a dot at the end). Those are the SAME LEMMA and should thus be linked to each other. Unfortunately, automatically adjusting the letter case is not possible, as it would result in a huge number of false positives.

Allowing automatic interwiki linking to redirects would make it possible to solve the problem manually by creating redirects on both sides. I do not see any better "magic" solution now. If there is an issue with false positives, then interwiki linking to redirects can be restricted to pagenames containing at least one space. This would rule out a mess of "color" and "colour" and "colow" as pointed out above. I consider "Cognate" highly preferable to manual explicit interwiki links (as used before 2017), but this issue with proverbs needs to be solved.
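
The suggested restriction amounts to a simple predicate on the page name. The Python sketch below is hypothetical (`redirect_link_allowed` is not a Cognate function) and treats underscores like spaces, since that is how the titles appear in the URLs above.

```python
def redirect_link_allowed(title):
    """Allow linking to a redirect only for multi-word page names."""
    return " " in title or "_" in title


print(redirect_link_allowed("when_in_Rome,_do_as_the_Romans_do"))
print(redirect_link_allowed("color"))
```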

Change 670495 had a related patch set uploaded (by Tonina Zhelyazkova; owner: Tonina Zhelyazkova):
[mediawiki/extensions/Cognate@master] README: Add Development section

https://gerrit.wikimedia.org/r/670495

Change 670855 had a related patch set uploaded (by Tonina Zhelyazkova; owner: Tonina Zhelyazkova):
[mediawiki/extensions/Cognate@master] Allow interlinks to be created to redirect pages

https://gerrit.wikimedia.org/r/670855

Change 670495 merged by jenkins-bot:
[mediawiki/extensions/Cognate@master] README: Add Development section

https://gerrit.wikimedia.org/r/670495

Change 670855 merged by jenkins-bot:
[mediawiki/extensions/Cognate@master] Allow interlinks to be created to redirect pages

https://gerrit.wikimedia.org/r/670855

This should be getting rolled out with the train this week

Amy and I tested it. It seems a dummy edit on the page that contains the redirect is needed to make it show up.

Change 888305 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Cognate@master] Fix maintenance/purgeDeletedCognatePages, add tests

https://gerrit.wikimedia.org/r/888305

Change 888305 merged by jenkins-bot:

[mediawiki/extensions/Cognate@master] Fix maintenance/purgeDeletedCognatePages, add tests

https://gerrit.wikimedia.org/r/888305