Match cʼh with c'h
Closed, ResolvedPublic

Description

Based on a request on French Wiktionary.

Breton uses the character sequence "c'h". Some Wiktionaries write it with a curved apostrophe ("cʼh") and others with a straight apostrophe ("c'h").

They need "cʼh" to match "c'h" so that Cognate automatically interlinks these pages. Currently it does not, so they have to maintain manual links.

This specific string "c'h" is only used in Breton words.
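To illustrate the requested matching, here is a minimal sketch of apostrophe normalization before computing a lookup key. It assumes the curved form is U+02BC (as in the c%CA%BCh URL) and uses SHA-256 purely as a stand-in; the function names are hypothetical and this is not Cognate's actual code.

```python
import hashlib

# Assumption: the curved apostrophe is U+02BC (MODIFIER LETTER APOSTROPHE)
# and the canonical form is the straight apostrophe U+0027.
APOSTROPHE_VARIANTS = {"\u02bc": "'"}


def normalize_title(title: str) -> str:
    """Replace each known apostrophe variant with the straight apostrophe."""
    for variant, canonical in APOSTROPHE_VARIANTS.items():
        title = title.replace(variant, canonical)
    return title


def title_key(title: str) -> str:
    """Key used to match titles across wikis (SHA-256 is a stand-in hash)."""
    return hashlib.sha256(normalize_title(title).encode("utf-8")).hexdigest()


# Both spellings now produce the same key, so the pages can be matched.
same = title_key("c\u02bch") == title_key("c'h")
```

With a mapping like this, any title that differs only in the apostrophe variant ends up with the same key, which is the effect requested for "cʼh" and "c'h".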

Comment from Addshore:
"So I just looked into this a bit and created some test pages and it looks like the pages don't get linked currently.
This can be seen by going to https://en.wiktionary.beta.wmflabs.org/wiki/c%CA%BCh and https://de.wiktionary.beta.wmflabs.org/wiki/c%27h which use the titles that you used in your first email.

I have created https://gerrit.wikimedia.org/r/#/c/368205/ which will fix the issue / add the new normalization.
It would probably make sense for me to spend another hour or so working on the maintenance scripts that populate the tables, to speed them up for minor changes like this: rather than repopulating the whole table, they would only look at titles that contain this character."

Restricted Application added a subscriber: Aklapper. Aug 10 2017, 1:30 PM

@Lydia_Pintscher @daniel do we want to go ahead with this?

Good to go ahead if it was requested by the editors.

Yes, it was confirmed by the community, so we can go ahead.

Okay, I need to look into the maintenance scripts and see if I can alter them to avoid having to rebuild the entire table when we make this change to the normalization!
This will probably mean making the scripts run only for titles that contain a select character.
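A rough sketch of that idea: skip titles that do not contain the affected character, and only write back rows whose hash actually changes. Everything here is an assumption for illustration (the row shape, the function names, SHA-256 as the hash, and the previous normalization leaving apostrophes untouched); it is not the Cognate maintenance script.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional, Tuple


def sketch_hash(text: str) -> str:
    # Stand-in hash; the hash Cognate actually uses is not shown in this task.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def old_normalize(title: str) -> str:
    # Assumed previous behaviour for this sketch: apostrophes left untouched.
    return title


def new_normalize(title: str) -> str:
    # New behaviour: curved apostrophe (U+02BC) becomes a straight one.
    return title.replace("\u02bc", "'")


def rows_to_update(
    rows: Iterable[Tuple[str, str]],
    normalize: Callable[[str], str],
    must_contain: Optional[str] = None,
) -> Dict[str, str]:
    """Return {raw_title: new_hash} for rows whose stored hash is stale."""
    updates: Dict[str, str] = {}
    for raw_title, stored_hash in rows:
        # Cheap pre-filter: titles without the character cannot change.
        if must_contain is not None and must_contain not in raw_title:
            continue
        new_hash = sketch_hash(normalize(raw_title))
        if new_hash != stored_hash:
            updates[raw_title] = new_hash
    return updates


titles = ["c\u02bch", "c'h", "bara", "dour"]
stored = [(t, sketch_hash(old_normalize(t))) for t in titles]
updates = rows_to_update(stored, new_normalize, must_contain="\u02bc")
```

In this example only the "cʼh" row needs an update. The pre-filter does not change the result, it only avoids rehashing titles that cannot be affected.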

Addshore claimed this task. Aug 22 2017, 3:18 PM
Addshore added a project: User-Addshore.
Addshore moved this task from Backlog to In Progress on the User-Addshore board.

For reference, in production this is how many titles contain the characters in question.

mysql:wikiadmin@10.64.16.18 [cognate_wiktionary]> select count(*) from cognate_titles where cgti_raw LIKE '%ʼ%';
+----------+
| count(*) |
+----------+
|     6322 |
+----------+
1 row in set (5.88 sec)

mysql:wikiadmin@10.64.16.18 [cognate_wiktionary]> select count(*) from cognate_titles where cgti_raw LIKE '%\'%';
+----------+
| count(*) |
+----------+
|   102570 |
+----------+
1 row in set (6.01 sec)

The patch for this ticket can be found at https://gerrit.wikimedia.org/r/#/c/368205/
This has just been merged.
It will be deployed with the train next week; once deployed, the maintenance script will have to be run. A dry-run version can be seen below.

mwscript extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php --wiki enwiktionary --dry-run

This script only has to be run once, for all wiktionaries, as the table being touched is shared by all wiktionaries.
I will run the maintenance script at some point after the train and confirm everything is working before closing the ticket.

On beta:

addshore@deployment-tin:~$ mwscript extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php --wiki enwiktionary --batch-size 10000
Started processing...
Getting batch starting from -9208669415219601405
Calculating new hashes..
Performing 1 updates
1303 rows processed, 1 rows upserted
Getting batch starting from 9213466911507511073
Calculating new hashes..
Performing 0 updates
0 rows processed, 0 rows upserted
1 hashes recalculated
Done!
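The "Getting batch starting from ..." lines suggest the script walks the table in key order, one batch at a time, until a batch comes back empty. One way such a loop could look is sketched below; this is a guess at the pattern, not the script's code, and `fetch_batch` is a hypothetical in-memory stand-in for the database query.

```python
from typing import Iterator, List, Optional, Sequence


def fetch_batch(keys: Sequence[int], after: Optional[int], limit: int) -> List[int]:
    """Stand-in for: SELECT key ... WHERE key > after ORDER BY key LIMIT limit."""
    ordered = sorted(keys)
    if after is not None:
        ordered = [k for k in ordered if k > after]
    return ordered[:limit]


def iter_batches(keys: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive key-ordered batches until no rows remain."""
    after: Optional[int] = None
    while True:
        batch = fetch_batch(keys, after, batch_size)
        if not batch:
            return  # empty batch: everything has been processed
        yield batch
        after = batch[-1]  # continue just past the last key seen


batches = list(iter_batches([5, -3, 9, 0, 7], batch_size=2))
```

Resuming from the last key seen, instead of using an offset, keeps each query cheap no matter how far into the table the script has progressed.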

https://en.wiktionary.beta.wmflabs.org/wiki/c%CA%BCh & https://de.wiktionary.beta.wmflabs.org/wiki/c'h are now linked as expected.

Addshore triaged this task as Normal priority. Aug 25 2017, 12:15 PM
Restricted Application added a subscriber: jeblad. Aug 25 2017, 12:15 PM
jeblad removed a subscriber: jeblad. Aug 25 2017, 9:27 PM

Change 374956 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Cognate@master] Add waitForReplication to RecalculateCognateNormalizedHashes script

https://gerrit.wikimedia.org/r/374956

Change 374957 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Cognate@wmf/1.30.0-wmf.16] Add waitForReplication to RecalculateCognateNormalizedHashes script

https://gerrit.wikimedia.org/r/374957

Change 374957 merged by jenkins-bot:
[mediawiki/extensions/Cognate@wmf/1.30.0-wmf.16] Add waitForReplication to RecalculateCognateNormalizedHashes script

https://gerrit.wikimedia.org/r/374957

Mentioned in SAL (#wikimedia-operations) [2017-08-31T08:40:11Z] <addshore@tin> Synchronized php-1.30.0-wmf.16/extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php: T172987 [[gerrit:374957|Add waitForReplication to RecalculateCognateNormalizedHashes script]] (duration: 00m 47s)

Change 374956 merged by jenkins-bot:
[mediawiki/extensions/Cognate@master] Add waitForReplication to RecalculateCognateNormalizedHashes script

https://gerrit.wikimedia.org/r/374956

Mentioned in SAL (#wikimedia-operations) [2017-08-31T09:50:33Z] <addshore> addshore@terbium:~$ mwscript extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php --wiki enwiktionary --batch-size 1000000 # Should upsert 6326 rows T172987

Mentioned in SAL (#wikimedia-operations) [2017-08-31T09:57:33Z] <addshore> extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php run done, 6326 hashes recalculated, T172987

Addshore closed this task as Resolved. Aug 31 2017, 9:59 AM
Addshore moved this task from Doing to Done on the WMDE-QWERTY-Sprint-2017-08-22 board.
Addshore moved this task from In Progress to Done / Closed on the User-Addshore board.