Local lists are wiki pages, so they may contain rich wikitext: compulsory licence templates (e.g. [Unicode License](https://commons.wikimedia.org/wiki/Template:Unicode_License)), section titles, and so on.
While useful on Wikimedia Commons and for its users, this noise needs to be stripped out by the list loader.
A series of non-greedy regexes can clean it up, one pass after another.
### Rich list content
Example of rich wikitext (see https://lingualibre.org/wiki/List:User:Yug/test):
```
<!-- Comment 1 -->
<noinclude>
{{draft}}
{{Unicode Licence|3.0}}
{{Lingualibre list|type=mixed|quality=C}}
{{Lingualibre list|type=frequency|quality=A}}
{{Lingualibre list|type=frequency|quality=A|series=Unilex}}
{{Convention|Meta-data of this list should follow the following conventions:
* <code>, </code>: L2 translations separator
* <code>(adj.)</code>: part of speech, values [ adj., n., art., conj., v., adv. ]
* ...
}}
</noinclude>
== Test ==
# Albus
# Bicos
# Craco !
# red neck parrots → péroquet à cou rouge
# yellow → jaune
# green → vert [pos:adjective, ipa: /vɜːt/]
<!-- Comment 2a
Comment 2b
Comment 2c -->
# 他 [simplified:他] [pinyin:tā] [IPA:tʰa˥˥] [eng:he]
# 我們 [simplified:我们] [pinyin:wǒmen] [IPA:uɔ˨˩mən] [eng:we]
```
{F56766946}
{F56766954}
### Current
The loader currently returns the list with obvious noise:
{F56766947}
{F37101621}
### Wanted
The loaded list should be:
```
# Albus
# Bicos
# Craco !
# red neck parrots
# yellow
# green
```
### Create regex cleaners
- [x] **Wiki-titles remover:** https://regex101.com/r/dth5ST/1
- [x] **HTML comments remover:** https://regex101.com/r/biXQg5/1
- [x] **HTML and `<noinclude>` remover:** https://regex101.com/r/WswD2Z/1
- [x] **Arrow-separated translations remover:** for 'red → rouge', only 'red' is sent to the RW, as in the wanted output above: https://regex101.com/r/Rby7uR/1
- [x] **Metadata input remover:** turns 'rouge [pos:adjective,ipa:/ɹuːʒ/]' into 'rouge' before it is sent to the RW. [1] https://regex101.com/r/skKmhm/1
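The cleaners listed above could be chained as successive passes over the raw wikitext. The sketch below is only an illustration: the patterns are stand-ins written for this example, not the ones saved behind the regex101 links, and `CLEANERS` and `cleanList` are hypothetical names that do not exist in the RecordWizard code.

```javascript
// Hypothetical cleaning passes, applied in order.
// The patterns are illustrative stand-ins, not the regex101 ones.
const CLEANERS = [
  // HTML comments, possibly spanning several lines (non-greedy).
  { name: 'comments', pattern: /<!--[\s\S]*?-->/g },
  // <noinclude> blocks, together with everything inside them (non-greedy).
  { name: 'noinclude', pattern: /<noinclude>[\s\S]*?<\/noinclude>/g },
  // Any remaining HTML tag (the tag itself, not its content).
  { name: 'html', pattern: /<\/?[a-z][^>]*>/gi },
  // Wiki section titles such as "== Test ==".
  { name: 'titles', pattern: /^=+.*?=+[ \t]*$/gm },
  // Arrow-separated translations: drop the arrow and what follows it.
  { name: 'translations', pattern: /[ \t]*→.*$/gm },
  // Bracketed metadata such as "[pos:adjective, ipa: /vɜːt/]".
  { name: 'metadata', pattern: /[ \t]*\[[^\]\n]*?\]/g },
];

// Run every pass, then keep only the non-empty lines.
function cleanList(wikitext) {
  const cleaned = CLEANERS.reduce(
    (text, cleaner) => text.replace(cleaner.pattern, ''),
    wikitext
  );
  return cleaned
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```

Applied to the rich list shown earlier, this should leave the six Latin-script entries of the wanted output plus `# 他` and `# 我們`. Order matters: comments and `<noinclude>` blocks are removed first so that their content never reaches the line-based passes. The metadata pass would also damage `[[wikilinks]]`, which this sketch assumes are absent from list entries.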
### Integrate regex into JS
- [ ] **Integrate these regexes** into [the list loader code](https://github.com/lingua-libre/RecordWizard/blob/master/modules/generator/rw.generator.List.js)
1: The metadata part could be parsed and saved into relevant variables. (Passing it downstream is a separate issue, see T196038.)
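As a rough idea of what parsing the metadata into variables could look like, here is a minimal sketch. `parseListEntry` is a hypothetical helper, not part of the RecordWizard code, and it assumes metadata is written as `[key:value]` groups, optionally comma-separated inside one pair of brackets, as in the sample list.

```javascript
// Hypothetical helper: split one list line into word, translation and metadata.
function parseListEntry(line) {
  const metadata = {};
  // Collect every [key:value, key:value] group, then drop it from the line.
  const withoutMetadata = line.replace(/\[([^\]\n]*?)\]/g, (match, inner) => {
    for (const pair of inner.split(',')) {
      const separator = pair.indexOf(':');
      if (separator === -1) continue; // not a key:value pair, ignore it
      const key = pair.slice(0, separator).trim().toLowerCase();
      metadata[key] = pair.slice(separator + 1).trim();
    }
    return '';
  });
  // Strip the leading "#", then split on the arrow: word first, translation second.
  const [word, translation = null] = withoutMetadata
    .replace(/^#\s*/, '')
    .split('→')
    .map((part) => part.trim());
  return { word, translation, metadata };
}

const entry = parseListEntry('# green → vert [pos:adjective, ipa: /vɜːt/]');
console.log(entry.word, entry.translation, entry.metadata);
```

Keys are lower-cased so that `IPA` and `ipa` end up in the same field. A value that itself contains a comma would be split incorrectly, so a real implementation would need a stricter convention.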
### To test
- [[ https://lingualibre.fr/wiki/Special:RecordWizard | Record Wizard ]] > Details: local list > [[ https://lingualibre.fr/wiki/List:Pol/words-by-frequency-2001-to-4000 | Pol/words-by-frequency-2001-to-4000 ]]. Compare the end of the loaded list with the end of the wikipage.
### See also
- Query on wiki page content: [[ https://en.wikipedia.org/w/api.php?action=query&titles=Mermaid&prop=extracts&exintro=&explaintext=&exsentences=10&redirects=&converttitles=&format=json | :wp:en:Mermaid ]]
- Another one on [[ https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Dragon&prop=extracts&explaintext=&exsentences=100&redirects=&converttitles= | Dragon ]]
- Previous edit in the same area: https://github.com/lingua-libre/RecordWizard/commit/75a4d32158b28bfb24e53cf7e92437f9d9ade9e1
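For reference, a page-content query like the ones linked above could be built as follows. This is a sketch only: `buildContentQuery` and `fetchPageContent` are hypothetical names, the default endpoint is the English Wikipedia one from the links, and the parameters are a subset of those in the second query (`action=query`, `prop=revisions`, `rvprop=content`, `format=json`). It relies on the standard `URLSearchParams` and `fetch` APIs.

```javascript
// Hypothetical helper: build the URL of a page-content query.
function buildContentQuery(title, endpoint = 'https://en.wikipedia.org/w/api.php') {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
    format: 'json',
    titles: title,
  });
  return `${endpoint}?${params.toString()}`;
}

// Hypothetical helper: run the query and return the parsed JSON response.
async function fetchPageContent(title, endpoint) {
  const response = await fetch(buildContentQuery(title, endpoint));
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the URL builder separate from the network call makes the query easy to inspect and test without hitting the API.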