Local lists, being wiki pages, may contain rich wikitext, including compulsory licence templates (e.g. Unicode License), section titles, and other markup.
While useful on Wikimedia Commons and to its users, this noise needs to be cleaned out by the list loader system.
A series of non-greedy regexes can clean this up, wave after wave.
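Why non-greedy matching matters can be seen on a page holding two HTML comments: a greedy pattern runs from the first `<!--` to the last `-->` and deletes the list items in between. The sketch below is only an illustration; the sample string and variable names are assumptions, not taken from the list loader.

```javascript
// Two comments surrounding one list item (illustrative sample).
const twoComments = '<!-- a -->\n# Albus\n<!-- b -->';

// Greedy: [\s\S]* runs to the LAST "-->", so the list item is deleted too.
const greedy = twoComments.replace(/<!--[\s\S]*-->/g, '');

// Non-greedy: [\s\S]*? stops at the FIRST "-->", so each comment is removed on its own.
const nonGreedy = twoComments.replace(/<!--[\s\S]*?-->/g, '');

console.log(JSON.stringify(greedy));    // ""
console.log(JSON.stringify(nonGreedy)); // "\n# Albus\n"
```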
Rich list content
Example of rich wikitext, as found at https://lingualibre.org/wiki/List:Test/Rich_format :
<!-- Comment 1 -->
<noinclude>
{{draft}}
{{Unicode Licence|3.0}}
{{Lingualibre list|type=mixed|quality=C}}
{{Lingualibre list|type=frequency|quality=A}}
{{Lingualibre list|type=frequency|quality=A|series=Unilex}}
{{Convention|Meta-data of this list should follow the following conventions:
* <code>, </code>: L2 translations separator
* <code>(adj.)</code>: part of speech, values [ adj., n., art., conj., v., adv. ]
* ...
}}
</noinclude>
== Test ==
# Albus
# Bicos
# Craco !
# red neck parrots → péroquet à cou rouge
# yellow → jaune
# green → vert [pos:adjective, ipa: /vɜːt/]
<!-- Comment 2a
Comment 2b
Comment 2c -->
# 他 [simplified:他] [pinyin:tā] [IPA:tʰa˥˥] [eng:he]
# 我們 [simplified:我们] [pinyin:wǒmen] [IPA:uɔ˨˩mən] [eng:we]
Current
The list loader currently returns this page with obvious noise.
Wanted
Loaded list should be:
# Albus
# Bicos
# Craco !
# red neck parrots
# yellow
# green
Create regex cleaners
- Wiki-title remover: https://regex101.com/r/dth5ST/1
- HTML comment remover: https://regex101.com/r/biXQg5/1
- HTML and <noinclude> remover: https://regex101.com/r/WswD2Z/1
- Arrow-separated translation remover, such that 'red → rouge' sends 'rouge' to the RW: https://regex101.com/r/Rby7uR/1
- Metadata remover, such that 'rouge [pos:adjective,ipa:/ɹuːʒ/]' sends 'rouge' to the RW.[1] https://regex101.com/r/skKmhm/1
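The cleaners listed above could be chained roughly as follows. This is a minimal sketch under stated assumptions: `cleanList` is a hypothetical helper name, and the patterns are approximations written for this example, not copies of the regex101 links. The task is ambiguous about which side of the arrow is kept; the sketch follows the Wanted output and keeps the term before the arrow.

```javascript
// Hypothetical helper: the name and every pattern are illustrative assumptions.
function cleanList(wikitext) {
  return wikitext
    // 1. HTML comments, possibly spanning several lines.
    .replace(/<!--[\s\S]*?-->/g, '')
    // 2. <noinclude> blocks with everything inside them (templates included).
    .replace(/<noinclude>[\s\S]*?<\/noinclude>/gi, '')
    // 3. Any remaining HTML-like tag; its inner text is kept.
    .replace(/<\/?[a-z][^>]*>/gi, '')
    // 4. Wiki section titles such as "== Test ==".
    .replace(/^=+.*?=+[ \t]*$/gm, '')
    // 5. Metadata in square brackets: "vert [pos:adjective]" becomes "vert".
    .replace(/[ \t]*\[.*?\]/g, '')
    // 6. Arrow-separated translation: keep only the part before the arrow.
    .replace(/[ \t]*→.*$/gm, '')
    .split('\n')
    // Drop the leading "#" list marker and surrounding whitespace.
    .map((line) => line.replace(/^#[ \t]*/, '').trim())
    // Drop the lines left empty by the removals above.
    .filter((line) => line !== '');
}

// Shortened version of the rich page shown earlier.
const richPage = [
  '<!-- Comment 1 -->',
  '<noinclude>',
  '{{draft}}',
  '</noinclude>',
  '== Test ==',
  '# Albus',
  '# Craco !',
  '# red neck parrots → péroquet à cou rouge',
  '# green → vert [pos:adjective, ipa: /vɜːt/]',
  '<!-- Comment 2a',
  'Comment 2b -->',
  '# 他 [simplified:他] [pinyin:tā]',
].join('\n');

console.log(cleanList(richPage));
// -> [ 'Albus', 'Craco !', 'red neck parrots', 'green', '他' ]
```

Order matters here: comments and `<noinclude>` blocks go first so that their content is never mistaken for list items, and metadata is stripped before the arrow so that both steps can stay line-local. Templates placed outside `<noinclude>` are not handled by this sketch.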
Integrate regex into JS
- Integrate these regexes into the list loader code.
1: The metadata part could be parsed and saved into relevant variables. (Passing it downstream is another issue, see T196038.)
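Parsing the metadata instead of discarding it, as suggested in the note above, might look like the following. This is a sketch only: `parseItem` is a hypothetical name, and the assumed syntax (square brackets holding comma-separated `key:value` pairs) is inferred from the sample page rather than from any specification.

```javascript
// Hypothetical helper: splits "term [key:value] [key:value, key:value]"
// into the bare term and an object holding the metadata.
function parseItem(line) {
  const metadata = {};
  const term = line
    .replace(/\[(.*?)\]/g, (match, inner) => {
      // One bracket may hold several pairs: "pos:adjective, ipa: /vɜːt/".
      for (const pair of inner.split(',')) {
        const i = pair.indexOf(':');
        if (i === -1) continue; // not a key:value pair, ignore it
        metadata[pair.slice(0, i).trim()] = pair.slice(i + 1).trim();
      }
      return ''; // remove the bracket from the term
    })
    .trim();
  return { term, metadata };
}

console.log(parseItem('vert [pos:adjective, ipa: /vɜːt/]'));
console.log(parseItem('他 [simplified:他] [pinyin:tā] [IPA:tʰa˥˥] [eng:he]'));
```

Keys are stored as written, so `ipa` and `IPA` end up as distinct keys, and a value containing a comma would be split incorrectly; both points would need a convention before real use.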
To test
- Record Wizard > Details: local list > Pol/words-by-frequency-2001-to-4000. See the result at the end of the list, and the wikipage at the end of the page.
See also
- Query on wiki pages content : :wp:en:Mermaid
- Other one on Dragon
- Previous edit in same area https://github.com/lingua-libre/RecordWizard/commit/75a4d32158b28bfb24e53cf7e92437f9d9ade9e1







