
Lists loader: set rich wikitext cleaner
Closed, Resolved | Public | 4 Estimated Story Points | Feature

Assigned To
Authored By
Yug
Dec 28 2018, 4:32 PM

Description

Local lists, being wiki pages, may contain rich wikitext, including compulsory licence templates (e.g. Unicode Licence), section titles, and other markup.
While useful on Wikimedia Commons and for its users, this noise needs to be cleaned out by the list loader system.
A series of non-greedy regexes can clean this up, one pass after another.

Rich list content

Example of rich wikitext, as found at https://lingualibre.org/wiki/List:Test/Rich_format:

<!-- Comment 1 -->
<noinclude>
{{draft}}
{{Unicode Licence|3.0}}
{{Lingualibre list|type=mixed|quality=C}}
{{Lingualibre list|type=frequency|quality=A}}
{{Lingualibre list|type=frequency|quality=A|series=Unilex}}
{{Convention|Meta-data of this list should follow the following conventions:
* <code>, </code>: L2 translations separator
* <code>(adj.)</code>: part of speech, values [ adj., n., art., conj., v., adv. ]
* ...
}}
</noinclude>
== Test ==
# Albus
# Bicos
# Craco !
# red neck parrots → péroquet à cou rouge
# yellow → jaune
# green → vert [pos:adjective, ipa: /vɜːt/]
<!-- Comment 2a
Comment 2b
Comment 2c -->
#	他 [simplified:他]	[pinyin:tā]	[IPA:tʰa˥˥]	[eng:he]
#	我們	[simplified:我们]	[pinyin:wǒmen]	[IPA:uɔ˨˩mən]	[eng:we]


Current

The list loader currently returns the list with obvious noise.


Wanted

The loaded list should be:

# Albus
# Bicos
# Craco !
# red neck parrots
# yellow
# green
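
The passes described above could be sketched in JavaScript roughly as follows. This is a minimal sketch, not the list loader's actual code: the function name `cleanRichList` and the exact regexes are assumptions made for illustration. It assumes that headwords always sit on list lines starting with `#` or `*`, and that everything after a `→`, a `[` or a tab is metadata.

```javascript
// Hypothetical helper: the name and the regexes are illustrative assumptions,
// not the Lingua Libre list loader source.
function cleanRichList(wikitext) {
  const stripped = wikitext
    // Pass 1: HTML comments, possibly multi-line (non-greedy).
    .replace(/<!--[\s\S]*?-->/g, "")
    // Pass 2: <noinclude> blocks, including the templates inside them.
    .replace(/<noinclude>[\s\S]*?<\/noinclude>/gi, "")
    // Pass 3: section titles such as "== Test ==".
    .replace(/^=+[^\n]*=+[ \t]*$/gm, "");

  return stripped
    .split("\n")
    // Keep only list items; stray templates and blank lines are dropped here.
    .filter((line) => /^[#*]/.test(line))
    .map((line) =>
      line
        .replace(/^[#*]+\s*/, "")        // drop the list marker
        .replace(/\s*(→|\[|\t).*$/, "")  // drop translation and [key:value] data
        .trim()
    )
    .filter((word) => word.length > 0);
}
```

Removing the `<noinclude>` block before filtering matters, because the convention notes inside it are themselves `*` list items. Under these assumptions, the tab-separated Chinese lines of the example would also be reduced to their headword.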

Create regex cleaners

Integrate regex into JS

Note 1: The metadata part could be parsed and saved into relevant variables. (Passing it downstream is a separate issue, see T196038.)
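
One way to keep the metadata rather than discard it is sketched below. The function name `parseListLine` and the returned field names (`headword`, `translation`, `meta`) are hypothetical. The sketch assumes `→` separates headword from translation, that metadata is written as `[key:value]`, and that values never contain commas.

```javascript
// Hypothetical parser: function and field names are assumptions for illustration.
function parseListLine(line) {
  // Drop the leading list marker ("#" or "*") and following whitespace.
  const body = line.replace(/^[#*]+\s*/, "");

  // Collect every [key:value] pair, e.g. [pinyin:tā].
  const meta = {};
  for (const match of body.matchAll(/\[([^\]]+)\]/g)) {
    // One bracket may hold several comma-separated pairs.
    for (const pair of match[1].split(",")) {
      const idx = pair.indexOf(":");
      if (idx === -1) continue; // not a key:value pair, skip it
      meta[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
    }
  }

  // What remains is "headword" or "headword → translation".
  const withoutMeta = body.replace(/\[[^\]]*\]/g, "");
  const [headword, translation = null] = withoutMeta
    .split("→")
    .map((part) => part.trim());

  return { headword, translation: translation || null, meta };
}
```

The recording studio would then only read `headword`, while `translation` and `meta` stay available for later use.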

To test

See also

Event Timeline

Yug updated the task description.
Pamputt changed the subtype of this task from "Task" to "Feature Request". Oct 6 2020, 8:29 PM
Yug triaged this task as Medium priority. Jul 6 2022, 10:42 AM
Yug renamed this task from "If <noinclude> element in list page, do not include content in record wizard display" to "List loader: remove <noinclude> element when loading list". Jul 7 2022, 10:12 AM
Yug renamed this task from "List loader: remove <noinclude> element when loading list" to "Lists loader: remove <noinclude> element when loading list". Jul 7 2022, 11:02 AM
Yug renamed this task from "Lists loader: remove <noinclude> element when loading list" to "Lists loader: handle more input types". Jul 19 2022, 5:21 PM
Yug updated the task description.
Yug updated the task description.
Yug raised the priority of this task from Medium to High. Jul 20 2022, 10:24 AM
Yug set the point value for this task to 4. Jul 21 2022, 6:21 PM
Yug updated the task description.
Yug updated the task description.
Yug updated the task description.
Yug updated the task description.
Yug renamed this task from "Lists loader: handle more input types" to "Lists loader: allow rich input types to be stored while keeping recording headwords alone". Aug 11 2023, 4:07 AM
Yug updated the task description.
Yug renamed this task from "Lists loader: allow rich input types to be stored while keeping recording headwords alone" to "Lists loader: set rich wikitext cleaner". Jul 29 2024, 2:20 PM
Yug updated the task description.

cc: @Pushkar7077 @Poslovitch > Task T212671#10055429 was presented at the Wikimedia 2024 hackathon showcase, 3pm, Saturday August 10th, 2024, Kyiv room, Katowice.

Lingua Libre + GSoC24

Minority-language speakers contribute to Lingua Libre audio recording.
Minority-language speakers who want to revitalize their language ask for a low-cost multimedia dictionary.


Lingua Libre has word lists: we want to support word lists AND minimal dictionaries.


Adding a data filter so that the Lingua Libre recording studio still sees a clean list of words.


It works! Lingua Libre recordings continue to work AND we can lead edit-a-thons for contributors with low computer skills to add thousands of translations.


A bilingual multimedia dictionary of 1500 words for a minority language can be created in a 3-hour edit-a-thon plus a 3-hour recording session.


Once reviewed, the data can be mass-imported into Wikidata Lexemes.