U+00AD SOFT HYPHEN shouldn't be allowed in wiki article titles
Open, Low, Public

Description

I spotted the word techno­determinism:

00000000  74 65 63 68 6e 6f c2 ad  64 65 74 65 72 6d 69 6e  |techno..determin|
00000010  69 73 6d 0a                                       |ism.|
00000014

in Wiktionary. Both the URL of the article and the h1 title on the page contain it. The hyphen is not visible in either.

This shouldn't be allowed.
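A minimal sketch of how such a title can be detected, since the character is invisible when rendered. The helper name `find_soft_hyphens` is hypothetical and is not part of MediaWiki; the sample word is the one from the hex dump above.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, encoded as c2 ad in UTF-8


def find_soft_hyphens(title: str) -> list[int]:
    """Return the character offsets of every soft hyphen in the title."""
    return [i for i, ch in enumerate(title) if ch == SOFT_HYPHEN]


word = "techno\u00addeterminism"

# Offsets of the invisible characters.
print(find_soft_hyphens(word))

# The UTF-8 bytes, space separated, comparable to the hex dump
# (without the trailing newline byte 0a).
print(word.encode("utf-8").hex(" "))
```

The byte pair `c2 ad` at offsets 6 and 7 of the encoded string is the soft hyphen.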

Yurivict created this task. Dec 19 2015, 4:14 PM
Yurivict updated the task description.
Yurivict raised the priority of this task from to Needs Triage.
Yurivict added a subscriber: Yurivict.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 19 2015, 4:14 PM
Aklapper triaged this task as Low priority. Dec 19 2015, 6:21 PM
Aklapper set Security to None.

I thought this was fixed in T5696...
(For future reference, please associate a project with tasks. Thanks!)

This may have been fixed for new articles, but the one I spotted was created on 2015-06-29.

Soft hyphens in titles are bad. I remove them regularly. Blocking or at least creating a warning would be useful.

On the other hand, soft hyphens are useful for titles with very long words. In T66528 I suggest allowing soft hyphens to be inserted into the display title.

Change 393381 had a related patch set uploaded (by Fomafix; owner: Fomafix):
[mediawiki/core@master] [WIP] Strip soft hyphens (U+00AD) from title

https://gerrit.wikimedia.org/r/393381

Fomafix claimed this task. Nov 26 2017, 1:05 PM

On enwiki there are currently the following titles containing soft hyphens. They are all redirects:

  1. https://en.wikipedia.org/wiki/Baltimore­Washington_Parkway?redirect=no
  2. https://en.wikipedia.org/wiki/Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act?redirect=no
  3. https://en.wikipedia.org/wiki/Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act_of_1980?redirect=no
  4. https://en.wikipedia.org/wiki/Immuni­sation?redirect=no
  5. https://en.wikipedia.org/wiki/India­-Myanmar_relations?redirect=no
  6. https://en.wikipedia.org/wiki/Kendall,_Tay­lor_&_Com­pany?redirect=no
  7. https://en.wikipedia.org/wiki/Kendall,_Tay­lor_and_Com­pany?redirect=no
  8. https://en.wikipedia.org/wiki/Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao?redirect=no
  9. https://en.wikipedia.org/wiki/Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao­melito­katakechy­meno­kichl­epi­kossypho­phatto­perister­alektryon­opte­kephallio­kigklo­peleio­lagoio­siraio­baphe­tragano­pterygon?redirect=no
  10. https://en.wikipedia.org/wiki/Whip­lash_Shaken_Infant_Syndrome?redirect=no
  11. https://en.wikipedia.org/wiki/­?redirect=no
  12. https://en.wikipedia.org/wiki/Œu­v­re?redirect=no

After https://gerrit.wikimedia.org/r/393381 is deployed, these titles will be invalid and will get renamed by maintenance/cleanupTitles.php. The redirects can already be deleted before deployment: they are superfluous, because an article or a redirect with the same title without soft hyphens exists.
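As an illustration of the normalization being discussed, here is a minimal sketch that strips soft hyphens from a title. It is not the code from the Gerrit change; `strip_soft_hyphens` is a hypothetical helper, and the sample titles are taken from the list above.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(title: str) -> str:
    """Return the title with every U+00AD removed (hypothetical helper)."""
    return title.replace(SOFT_HYPHEN, "")


old = "Immuni\u00adsation"
new = strip_soft_hyphens(old)

# The two strings render identically but compare as different titles.
print(new, old != new)

# Edge case from the list: a title consisting only of a soft hyphen
# becomes the empty string, so it has no valid stripped form.
print(repr(strip_soft_hyphens("\u00ad")))
```

The empty-string case suggests why a simple strip is not enough on its own for existing pages: some titles need a different new name.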

There are actually 30 such titles on enwiki, when you count other namespaces. Some are usernames, which makes this a bit awkward, but luckily it seems they are all permanently banned, so while we'll probably need to do something special about them, we won't annoy anyone when they are renamed.

I am currently running this query across all wikis to see what we should expect.

select
  page_namespace, page_title, page_is_redirect, replace(a.page_title,'­','') as page_title_new,
  (select count(*) from page b where b.page_namespace=a.page_namespace and b.page_title=page_title_new) as conflicts
from page a where page_title like '%­%'
page_namespace | page_title | page_is_redirect | page_title_new | conflicts
2 | Impro­­v | 0 | Improv | 1
2 | Improv­ | 0 | Improv | 1
2 | Neutrality­ | 0 | Neutrality | 1
3 | Happy­Troll | 0 | HappyTroll | 0
2 | Erwin_Walsh­ | 0 | Erwin_Walsh | 1
2 | Erwin_Walsh­­ | 0 | Erwin_Walsh | 1
3 | Erwin_Walsh­­ | 0 | Erwin_Walsh | 1
3 | Marmotville­ | 0 | Marmotville | 0
2 | Uniting_Nations­ | 0 | Uniting_Nations | 0
3 | Uniting_Nations­ | 0 | Uniting_Nations | 0
2 | Love_Virus­ | 0 | Love_Virus | 1
3 | Love_Virus­ | 0 | Love_Virus | 1
3 | ­Friendly_AIDS | 0 | Friendly_AIDS | 0
2 | Nymph/­lol | 0 | Nymph/lol | 1
4 | Articles_for_deletion/Família_fotológica | 0 | Articles_for_deletion/FamÃlia_fotológica | 0
4 | Articles_for_deletion/Lip­smackin­thirst­quenchin­acetastin­motivatin­good­buzzin­cool­talkin­high­walkin­fast­livin­ever­givin­cool­fizzin | 0 | Articles_for_deletion/Lipsmackinthirstquenchinacetastinmotivatingoodbuzzincooltalkinhighwalkinfastlivinevergivincoolfizzin | 0
0 | ­ | 1 | (empty) | 0
0 | India­-Myanmar_relations | 1 | India-Myanmar_relations | 1
0 | Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao | 1 | Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparao | 0
0 | Baltimore­Washington_Parkway | 1 | BaltimoreWashington_Parkway | 0
0 | Kendall,_Tay­lor_and_Com­pany | 1 | Kendall,_Taylor_and_Company | 1
0 | Kendall,_Tay­lor_&_Com­pany | 1 | Kendall,_Taylor_&_Company | 1
0 | Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao­melito­katakechy­meno­kichl­epi­kossypho­phatto­perister­alektryon­opte­kephallio­kigklo­peleio­lagoio­siraio­baphe­tragano­pterygon | 1 | Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparaomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon | 1
0 | Œu­v­re | 1 | Œuvre | 1
0 | Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act_of_1980 | 1 | Comprehensive_Environmental_Response,_Compensation,_and_Liability_Act_of_1980 | 1
0 | Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act | 1 | Comprehensive_Environmental_Response,_Compensation,_and_Liability_Act | 1
3 | HIPPOPOTO­MONSTRO­SESQUIPED­AL­IAN~enwiki | 0 | HIPPOPOTOMONSTROSESQUIPEDALIAN~enwiki | 0
3 | ­~enwiki | 0 | ~enwiki | 0
0 | Immuni­sation | 1 | Immunisation | 1
0 | Whip­lash_Shaken_Infant_Syndrome | 1 | Whiplash_Shaken_Infant_Syndrome | 1

Results for all Wikimedia wikis. The page_title_new column has the title with the soft hyphens removed, and the conflicts column indicates whether a page with the "fixed" title already exists on the wiki.

There are 2322 such pages across all our wikis, including 913 on kuwiktionary and the rest spread across 153 other wikis.

798 of these have no conflicts, and we could just rename them to the version without the soft hyphen.

1275 are redirects with conflicts. Most of these are probably redirects to the soft-hyphen-less title, but we can't check that with just a SQL query.

249 are non-redirects that have a title conflict. Users will have to deal with these manually.
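The three groups above follow from two columns of the query result. A rough sketch of that bucketing, assuming each row is reduced to (page_title, page_is_redirect, conflicts); the function name `classify` and the bucket labels are hypothetical, and the sample rows are taken from the enwiki results.

```python
from collections import Counter


def classify(is_redirect: bool, conflicts: int) -> str:
    """Bucket a row the way the breakdown above does (labels are made up)."""
    if conflicts == 0:
        return "rename"  # no page with the stripped title exists
    if is_redirect:
        return "redirect-with-conflict"  # probably points at the clean title
    return "manual"  # two real pages compete for one title


# (page_title, page_is_redirect, conflicts) samples from the enwiki results
rows = [
    ("Happy\u00adTroll", False, 0),
    ("Baltimore\u00adWashington_Parkway", True, 0),
    ("Immuni\u00adsation", True, 1),
    ("Improv\u00ad", False, 1),
]

summary = Counter(classify(redirect, conflicts) for _, redirect, conflicts in rows)
print(summary)
```

Whether a conflicting redirect really targets the soft-hyphen-less title still has to be checked separately, as noted above.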

Change 393381 merged by jenkins-bot:
[mediawiki/core@master] Strip soft hyphens (U+00AD) from title

https://gerrit.wikimedia.org/r/393381

matmarex closed this task as Resolved. May 25 2018, 12:17 AM

This is done for MediaWiki. I filed T195546 ("Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages") about running cleanupTitles.php on Wikimedia wikis.

Kghbln added a subscriber: Kghbln. May 29 2018, 8:37 PM
matmarex reopened this task as Open. Jun 2 2018, 9:09 AM

Unfortunately, we reverted the change due to problems with the WMF deployment :( It turns out the maintenance script is not great; see T195546.