Fixing redirects for Cognate (step 1)
Closed, Resolved | Public | 5 Estimated Story Points
Assigned To

Authored By

	Lydia_Pintscher
	May 11 2017, 4:15 PM

Description

As an editor I want to have as many Wiktionary pages connected to each other across language versions as possible in order to make accessing them easy.

Problem:
Right now Cognate discards redirects and doesn't show interwiki links to redirect pages. This task is for the first part of fixing this issue. After this at least one direction of the linking should work. It is illustrated in red in the following sketch:

[Sketch: IMG_20170511_180716031.jpg]

BDD
GIVEN Wiktionary 1 with a page titled "color"
AND Wiktionary 2 with two pages titled "color" and "colour"
AND the page "color" redirecting to "colour" on Wiktionary 2
WHEN viewing the interwiki links in Wiktionary 1
THEN an interwiki link to the page for "color" on Wiktionary 2 is shown in the sidebar
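
The scenario above can be modelled with a small Python sketch. This is only an illustration of the intended behaviour, not Cognate's code: the `Page` class, the `interwiki_links` function and the `include_redirects` flag are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Page:
    title: str
    # None means an ordinary page; otherwise the title this page redirects to.
    redirect_target: Optional[str] = None


def interwiki_links(
    title: str,
    local_site: str,
    sites: Dict[str, Dict[str, Page]],
    include_redirects: bool,
) -> Dict[str, str]:
    """Return {site: title} for every other site that has a page with this title."""
    links = {}
    for site, pages in sites.items():
        if site == local_site:
            continue
        page = pages.get(title)
        if page is None:
            continue
        if page.redirect_target is not None and not include_redirects:
            continue  # behaviour described under "Problem": redirects are discarded
        links[site] = page.title  # link to the redirect itself, not to its target
    return links


# Hypothetical data mirroring the GIVEN/AND lines of the scenario.
sites = {
    "wiktionary1": {"color": Page("color")},
    "wiktionary2": {
        "color": Page("color", redirect_target="colour"),
        "colour": Page("colour"),
    },
}
```

With `include_redirects=False` the page "color" on Wiktionary 1 gets no link to Wiktionary 2; with `include_redirects=True` it links to "color" there, and the redirect to "colour" then happens as usual on Wiktionary 2.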

Acceptance criteria:

interwiki links to redirect pages are accepted by Cognate and shown in the sidebar

Notes:

We have an intentional check in the code to prevent this case. This task is about removing the check.
Feedback so far suggests that the second step is probably not wanted, so for now we will only do this first step of the two.

Details

Subject	Repo	Branch	Lines +/-
Fix maintenance/purgeDeletedCognatePages, add tests	mediawiki/extensions/Cognate	master	+100 -0
Allow interlinks to be created to redirect pages	mediawiki/extensions/Cognate	master	+6 -94
README: Add Development section	mediawiki/extensions/Cognate	master	+58 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Lydia_Pintscher	T163717 Cognate should show links to redirects
		Resolved		• Tonina_Zhelyazkova_WMDE	T165061 Fixing redirects for Cognate (step 1)

Event Timeline

Lydia_Pintscher created this task.May 11 2017, 4:15 PM

Just to clarify, we don't want "color" in Wiki 1 to link directly to "colour" in Wiki 2, but rather it should link to "color" in Wiki 2 and then the redirect should take place as usual. In other words we just want redirects to be treated as ordinary pages.

Yes that is the goal of this task. Sorry if the sketch doesn't make that clear.

There is one thing to be careful about here: The combination of redirects and normalization.

As far as I know, it's quite frequent to have redirects to the normalized version of a title. For instance, if wiki 2 follows the convention of using the ellipsis character ("…") in titles instead of three dots ("..."), they may have a redirect from "Foo..." (with three dots) to "Foo…" with an ellipsis.

Cognate will also recognize these two titles as equivalent (redirect or no) because of the normalization rules. So, if wiki 1 has a page called "Foo..." (with dots), Cognate will add language links to both the actual page on wiki 2 ("Foo…" with an ellipsis) and the redirect on wiki 2 ("Foo..." with dots). That's the consequence of Cognate applying normalization and at the same time treating redirects like normal pages.

Ideally, there would be a rule like "if you find an actual page to link to, ignore all the redirects to that page". But I currently do not see a way to do this efficiently, without asking each client database for redirect information. Cognate would have to track redirects in its own central database table - possible, but not trivial. And database changes need time.
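
To make the "ignore all the redirects to that page" rule concrete, here is a minimal Python sketch. It assumes the redirect target of every candidate title is already known, which is exactly the information Cognate would first have to track centrally; the function name and the shape of the input are hypothetical.

```python
from typing import Dict, List, Optional


def drop_redirects_to_linked_pages(candidates: Dict[str, Optional[str]]) -> List[str]:
    """Filter link candidates on one target wiki.

    `candidates` maps each candidate title to its redirect target,
    or to None when the title is an ordinary (non-redirect) page.
    A redirect is dropped only if its target is itself a candidate.
    """
    real_pages = {title for title, target in candidates.items() if target is None}
    return sorted(
        title
        for title, target in candidates.items()
        if target is None or target not in real_pages
    )
```

For the example above, `{"Foo...": "Foo…", "Foo…": None}` would yield only "Foo…", while a redirect whose target is not among the candidates would still be linked.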

I seem to recall that this issue was the original reason for ignoring redirects.

That brings us to another point. We (at English Wiktionary at least) do not want this normalization feature.

In T165061#3258596, @Wikitiki89 wrote:

That brings us to another point. We (at English Wiktionary at least) do not want this normalization feature.

It was added on the specific request of (I think) the French Wiktionary community. Since the language links are supposed to be symmetrical and 1:1, you'll have to agree on doing it either way. It cannot be different per-wiki.

When discussing this, please keep in mind that any change to the normalization means we have to completely rebuild the Cognate database.

Can you explain a bit more why you do not want it?

It's the result of many community discussions, both past ones and present ones (see this one, for example). Cognate doesn't really do anything that bots couldn't do previously. It is not as though we previously couldn't have done this normalization by bots, but rather we specifically chose not to allow it. The fact that Cognate enables us to generate links automatically without running bots does not nullify our previous decision.

Now if you want more details about the actual reasoning, one reason is that on English Wiktionary (I don't know the details of French Wiktionary) we can have an entry for both "Foo..." (with three dots) and "Foo…" (with an ellipsis). And this probably applies to almost any character you might want to normalize. This is one of the reasons we want the redirects working in the first place, so we can control these links by creating redirects in proper places.

Thanks for your feedback. I'd like to start a discussion across the 3 main Wiktionaries (en, fr and de) to solve these questions about redirects. I'm sure we can find a common solution that will allow Cognate to be efficient while fitting the rules and processes of the Wiktionaries.
As the Wikimedia hackathon and Wikicite are happening this week and I'll be quite busy during this period, I'll start this discussion on June 1st. In the meantime, no changes (other than bug fixes) will be made.

@Wikitiki89 Do you have examples of a Foo... <-> Foo… article (or with different apostrophes)? The only articles I can think of would be those about the characters themselves.

If this is a problem for a relatively large number of articles, then we need to discuss it further. If not, then we can just override the interwikis manually in the handful of articles involved.

NB: fr did this normalization by bot.

In T165061#3263807, @Darkdadaah wrote:

@Wikitiki89 Do you have examples of a Foo... <-> Foo… article (or with different apostrophes)? The only articles I can think of would be those about the characters themselves.

Is there a list somewhere of the characters that would be normalized? That would help me find examples and assess their frequency.

If this is a problem for a relatively large number of articles, then we need to discuss it further. If not, then we can just override the interwikis manually in the handful of articles involved.

NB: fr did this normalization by bot.

This certainly needs to be discussed. It has been a longstanding policy on the English Wiktionary specifically not to allow such normalization and we do not appreciate having a feature like this forced down our throats without discussion.

Whether it's bad to link pages on one wiki with one character to pages on another wiki with another character, depends on which characters would be treated as equivalent.

In T987, I mention some characters that are different: for geresh, he.Wikt standardizes apostrophes (ר') where en.Wikt standardizes geresh (ר׳); for palochkas, different wikis (sometimes internally-inconsistently) use upper- or lowercase palochka, I, І, 1, or other characters. I also mention an example of straight and curly apostrophes contrasting: there are separate pages on en.Wikt for Mopan Maya ka'an ("sky") vs Yucatec Maya ka’an ("sky"). However, I think based on discussion en.Wikt needs to fix the ka'ans by standardizing on one apostrophe. (And I think it's silly other wikis don't encode geresh as geresh and palochka as palochka.) I haven't seen a reason to think it'd cause problems for valid entries if the extension automatically linked straight-vs-curly apostrophes, and it'd save the trouble of continually needing to find which pages need redirects created for them. There may be more people on en.Wikt receptive to automatically linking different apostrophes than comments above suggest.

OTOH, hard redirects like vi.Wikt uses for xoá→xóa (mentioned in the above-linked ticket) would never be used by us on en.Wikt because they could be separate words in other languages, like Icelandic sóa and Hungarian soá are.

Here is the current normalization map from Cognate\StringNormalizer:

	private $replacements = [
		'’' => '\'',
		'…' => '...',
		' ' => '_',
	];

This maps:

right single quotation mark (U+2019) to the ASCII apostrophe
horizontal ellipsis (U+2026) to three dots
spaces to underscores, as MediaWiki always does.
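
For readers who want to try the rules outside of MediaWiki, here is a rough Python equivalent of the map. It is a sketch only: `normalize_title` is a hypothetical helper, and it assumes the replacements are plain substring substitutions; how Cognate\StringNormalizer actually applies the array is not shown in this task.

```python
# The three replacement rules from the map above, written with explicit escapes.
REPLACEMENTS = {
    "\u2019": "'",    # right single quotation mark -> ASCII apostrophe
    "\u2026": "...",  # horizontal ellipsis -> three dots
    " ": "_",         # space -> underscore
}


def normalize_title(title: str) -> str:
    """Apply every replacement rule to a page title."""
    for old, new in REPLACEMENTS.items():
        title = title.replace(old, new)
    return title
```

Two titles are treated as the same entry when their normalized forms are equal, for example "Foo…" and "Foo...".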

According to our analysis of existing language links, these normalization rules seem to cover nearly all cases in which the link is between pages that don't have exactly the same title. The remaining handful of pages can be linked manually.

However, the question has now been raised whether these rules will lead to too many language links being inferred. This would happen if there are two (non-redirect) pages on the same wiki that would have the same title after applying these rules.

Udo_T subscribed.May 19 2017, 9:00 AM

The following database query shows all pairs of page names on English wiktionary that would be conflicting based on these normalization rules:

mysql:wikiadmin@10.64.16.18 [cognate_wiktionary]> SELECT a.cgti_raw, b.cgti_raw
    -> FROM cognate_titles as a
    ->   JOIN cognate_titles as b ON a.cgti_normalized_key = b.cgti_normalized_key
    ->   JOIN cognate_pages as p ON p.cgpa_title = a.cgti_raw_key 
    ->     and p.cgpa_namespace = 0 and p.cgpa_site = 8711873510529828948
    ->   JOIN cognate_pages as q ON q.cgpa_title = b.cgti_raw_key 
    ->     and q.cgpa_namespace = 0 and q.cgpa_site = 8711873510529828948
    ->   WHERE a.cgti_raw_key < b.cgti_raw_key 
    -> LIMIT 10;
+---------------------------+-----------------------------+
| cgti_raw                  | cgti_raw                    |
+---------------------------+-----------------------------+
| дев'ятнадцять             | дев’ятнадцять               |
| ...                       | …                           |
| '_'                       | ’_’                         |
| ’                         | '                           |
| lu’um                     | lu'um                       |
| ni'                       | ni’                         |
+---------------------------+-----------------------------+
6 rows in set (39.45 sec)

I suppose it would be ok to manage the language links for 12 pages manually.
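
The same check can be run on any list of page titles without database access. The following Python sketch groups titles by their normalized form and reports every pair that collides; it re-declares the three normalization rules and assumes they are plain substring replacements. The helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, List, Tuple

REPLACEMENTS = {"\u2019": "'", "\u2026": "...", " ": "_"}


def normalize_title(title: str) -> str:
    for old, new in REPLACEMENTS.items():
        title = title.replace(old, new)
    return title


def conflicting_pairs(titles: Iterable[str]) -> List[Tuple[str, str]]:
    """Return all pairs of distinct titles that share a normalized form."""
    groups = defaultdict(set)
    for title in titles:
        groups[normalize_title(title)].add(title)
    pairs = []
    for group in groups.values():
        # Like the self-join above, each unordered pair is reported once.
        pairs.extend(combinations(sorted(group), 2))
    return sorted(pairs)
```

Unlike the query, which orders each pair by raw key, this sketch orders pairs by the title strings themselves.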

I've almost finished standardizing the Maya entries I mentioned, consolidating ~4 pairs of entries prior to your post. :-) Now I just consolidated 3 of the pairs you mention. The other 6 entries, for the individual characters, will probably be kept separate.

I think no Latin-script entries on en.Wikt are supposed to use ’ except the entry ’ itself, so new pairs should not arise in Latin script. En.Wikt's Macedonian entries are currently standardized on ’ (though this could be changed) while Russian entries use ', so there is the potential for a few conflicting pairs to arise as our coverage of Macedonian and Russian increases, but the number of such pairs should always be small. Mg.Wikt and fr.Wikt may also have a few pairs of conflicting entries which you might alert them to fix.

Perhaps there should be a discussion/poll on Meta, advertised on all Wiktionaries, about whether or not to enable this feature, to ensure all wikis have their say?

Will the normalization/linking function keep track (accessibly) of cases where it encounters too many pages, e.g. encounters both curly ’ and straight ' on one wiki, so that wikis can know which pages they need to maintain manual interwiki links for?

The data for enwiki now looks as below:

+----------+----------+
| cgti_raw | cgti_raw |
+----------+----------+
| ...      | …        |
| '_'      | ’_’      |
| ’        | '        |
+----------+----------+
3 rows in set (35.95 sec)

Is this something that would be useful to have for all wikis?

Btw, [[’]] should only show interwiki links to [[’]], not links to [[’]] and [[']] like on https://fr.wiktionary.org/w/index.php?title=%E2%80%99&oldid=23047369

Same for [[...]], [[…]] and [[']].

Showing two links for the same language doesn't make any sense for the reader (see capture).

In T165061#3281823, @Thibaut120094 wrote:

Btw, [[’]] should only show interwiki links to [[’]], not links to [[’]] and [[']] like on https://fr.wiktionary.org/w/index.php?title=%E2%80%99&oldid=23047369
[...]
Showing two links for the same language doesn't make any sense for the reader (see capture).

I added manual interwiki links in the linked page: as a result all language links are unique (just like for [[...]]).

In T165061#3281823, @Thibaut120094 wrote:

Showing two links for the same language doesn't make any sense for the reader (see capture).

True. Cognate could detect this situation, and only show the link that exactly matches the local page's title.

However, this would hide a potential error. Maybe it's better to have this visible, so people can notice and fix it?
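
As a sketch of what "only show the link that exactly matches the local page's title" could look like, assuming the candidate titles per target wiki are already known (the function name and data shape are hypothetical, not Cognate's API):

```python
from typing import Dict, List


def pick_links(local_title: str, candidates_by_site: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Prefer an exact title match per site; otherwise keep every candidate."""
    result = {}
    for site, titles in candidates_by_site.items():
        if local_title in titles:
            result[site] = [local_title]
        else:
            # No exact match: all candidates stay visible, so a duplicate
            # can still be noticed and fixed by editors.
            result[site] = sorted(titles)
    return result
```

This only hides the duplicate when an exact match exists, which is the trade-off raised above: the page looks cleaner, but the underlying conflict is no longer visible there.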

Hello all, just to mention that I created a discussion topic here, so we can find a consensus within the different communities. Feel free to summarize your point of view and your concerns there. Thanks a lot!

In T165061#3307061, @Lea_Lacroix_WMDE wrote:

Hello all, just to mention that I created a discussion topic here, so we can find a consensus within the different communities.

If some wikis need to change something, they should be notified individually.

@Nemo_bis whether or not they need to change something depends on their local conventions. We have no way to know and understand such local conventions for all Wiktionaries.

We did our best to make this non-disruptive, but there are always edge-cases. The best we can do is inform people of what we intend to do, and listen to their feedback. Quite often, people only notice issues once a feature has gone live.

Lydia_Pintscher removed a project: Wikidata-Former-Sprint-Board.Nov 21 2017, 8:59 AM

In order to move forward, the description of the tasks needs to be improved, with a clearer description of what the need is and what we need to do.

Four years later, a solution is needed for " https://sv.wiktionary.org/wiki/When_in_Rome,_do_as_the_Romans_do. " vs " https://en.wiktionary.org/wiki/when_in_Rome,_do_as_the_Romans_do ", either by allowing linking to redirects or by some other trick.
Most Wiktionaries have restrictive policies when it comes to redirects. For example, "colour" redirecting to "color" is prohibited on many Wiktionaries and is thus probably not an issue: those are NOT the same page. But "When_in_Rome,_do_as_the_Romans_do." and "when_in_Rome,_do_as_the_Romans_do" are the same.
There is no obvious solution for how to deal with proverbs on Wiktionaries. EN wikt has a policy of removing final punctuation and not capitalizing the first letter of a sentence, so the lemma form is " when_in_Rome,_do_as_the_Romans_do ". SV wikt has a policy resulting in " When_in_Rome,_do_as_the_Romans_do. " (uppercase "W" and a dot at the end). Those are the SAME LEMMA and should thus be linked to each other.
Unfortunately, automatically adjusting the letter case is not possible, as it would result in a huge number of false positives. Allowing automatic interwiki linking to redirects would make it possible to solve the problem manually by creating redirects on both sides. I do not see any better "magic" solution now. If there is an issue with false positives, then interwiki linking to redirects can be restricted to page names containing at least one space. This would rule out the mess of "color" and "colour" pointed out above.
I consider "Cognate" highly preferable to manual explicit interwiki links (as used before 2017), but this issue with proverbs needs to be solved.
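
The suggested restriction (only link to redirects whose page name contains at least one space) could be expressed roughly as follows. This is a Python sketch of the suggestion only; the function name is hypothetical, and treating underscores as spaces is an assumption based on how the titles are written in the URLs above.

```python
def redirect_link_allowed(title: str, is_redirect: bool) -> bool:
    """Decide whether a page may be used as an interwiki link target."""
    if not is_redirect:
        return True  # ordinary pages are always eligible
    # Multi-word entries such as proverbs contain a space, which appears
    # as an underscore in the URL form of the title.
    return " " in title or "_" in title
```

Under this rule a redirect at "When_in_Rome,_do_as_the_Romans_do." would be linked, while a single-word redirect such as "color" would still be ignored.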

Lydia_Pintscher updated the task description. (Show Details)Feb 22 2021, 10:56 AM

Lydia_Pintscher moved this task from Needs Wikidata PM Work to Unconnected Stories on the Wikidata-Campsite board.Feb 22 2021, 11:00 AM

• amy_rc subscribed.Feb 22 2021, 11:11 AM

darthmon_wmde set the point value for this task to 5.Feb 25 2021, 10:31 AM

darthmon_wmde moved this task from Unconnected Stories to Wikidata-Campsite-Iteration-∞ (On Hold) on the Wikidata-Campsite board.

darthmon_wmde edited projects, added Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)); removed Wikidata-Campsite.

Maintenance_bot moved this task from incoming to in progress on the Wikidata board.Feb 25 2021, 11:15 AM

• Tonina_Zhelyazkova_WMDE claimed this task.Mar 9 2021, 9:40 AM

• Tonina_Zhelyazkova_WMDE moved this task from To Do (prioritised from top to bottom) to Doing on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.

Change 670495 had a related patch set uploaded (by Tonina Zhelyazkova; owner: Tonina Zhelyazkova):
[mediawiki/extensions/Cognate@master] README: Add Development section

https://gerrit.wikimedia.org/r/670495

gerritbot added a project: Patch-For-Review.Mar 10 2021, 4:31 PM

• Tonina_Zhelyazkova_WMDE updated the task description. (Show Details)Mar 11 2021, 10:59 AM

Change 670855 had a related patch set uploaded (by Tonina Zhelyazkova; owner: Tonina Zhelyazkova):
[mediawiki/extensions/Cognate@master] Allow interlinks to be created to redirect pages

https://gerrit.wikimedia.org/r/670855

• Tonina_Zhelyazkova_WMDE moved this task from Doing to Peer Review on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.Mar 11 2021, 5:05 PM

Change 670495 merged by jenkins-bot:
[mediawiki/extensions/Cognate@master] README: Add Development section

https://gerrit.wikimedia.org/r/670495

• Tonina_Zhelyazkova_WMDE mentioned this in rECOG3dc5f49f1aec: README: Add Development section.Mar 12 2021, 2:29 PM

ReleaseTaggerBot added a project: MW-1.36-notes (1.36.0-wmf.35; 2021-03-16).Mar 12 2021, 3:00 PM

Silvan_WMDE moved this task from Peer Review to Test (Verification) on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.Mar 15 2021, 10:17 AM

Change 670855 merged by jenkins-bot:
[mediawiki/extensions/Cognate@master] Allow interlinks to be created to redirect pages

https://gerrit.wikimedia.org/r/670855

• Tonina_Zhelyazkova_WMDE mentioned this in rECOG9689368fec72: Allow interlinks to be created to redirect pages.Mar 15 2021, 10:30 AM

This should be getting rolled out with the train this week

Amy and I tested it. It seems a dummy edit on the page that contains the redirect is needed to make it show up.

Change 888305 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Cognate@master] Fix maintenance/purgeDeletedCognatePages, add tests

https://gerrit.wikimedia.org/r/888305

Change 888305 merged by jenkins-bot:

[mediawiki/extensions/Cognate@master] Fix maintenance/purgeDeletedCognatePages, add tests

https://gerrit.wikimedia.org/r/888305

Maintenance_bot removed a project: Patch-For-Review.Feb 13 2023, 10:31 AM

ReleaseTaggerBot added a project: MW-1.40-notes (1.40.0-wmf.23; 2023-02-13).Feb 13 2023, 11:00 AM

	F8145905: image.png
	May 21 2017, 11:24 AM
