Page titles & section titles are not consistently formatted across data sources.
Page titles
- https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/21/ uses the Enterprise dumps, which carry the display title (including HTML markup)
- I've also seen titles formatted differently across other scripts (original vs lower case, spaces vs underscores), and this has already caused bugs
A page like https://en.wikipedia.org/wiki/Conditions_of_My_Parole could have titles in any of these formats:
- <i>Conditions of My Parole</i>
- Conditions of My Parole
- Conditions_of_My_Parole
- conditions of my parole
- conditions_of_my_parole
We need to standardize on a common format across our scripts. I would propose non-HTML, original case, spaces, but we first need to check all our data sources to find out which lowest-common-denominator format they can all provide.
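To make the proposed format concrete, here is a minimal sketch of a normalizer. The function name `normalize_page_title` and the naive tag pattern are assumptions for illustration, not code from our scripts. Note that it cannot restore the original case of titles that were already lowercased upstream.

```python
import html
import re

# Naive pattern for HTML tags such as <i> or </i> (assumption: display
# titles only carry simple inline tags).
_TAG_RE = re.compile(r"<[^>]+>")


def normalize_page_title(title: str) -> str:
    """Return the title as plain text with spaces, keeping its current case."""
    without_tags = _TAG_RE.sub("", title)
    # Decode entities such as &amp; that a display title may contain.
    decoded = html.unescape(without_tags)
    # Treat underscores as spaces, then collapse runs of whitespace.
    return " ".join(decoded.replace("_", " ").split())


print(normalize_page_title("<i>Conditions of My Parole</i>"))
print(normalize_page_title("Conditions_of_My_Parole"))
print(normalize_page_title("conditions_of_my_parole"))
```

The first two calls yield `Conditions of My Parole`; the third stays lowercase, which is why a lowercased source would drag the common format down.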
Or even better, change all our joins to use page_id wherever it is available. That also eliminates join failures caused by titles that differ between sources, which can happen because we're gluing together multiple sources whose snapshot times are not guaranteed to match (for instance, when a page is renamed between snapshots).
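A small sketch of why the join key matters. The rows, the page IDs 101 and 102, the `views` field, and the helper `join_on` are all made up for illustration; our real pipelines would do the equivalent join in their own framework.

```python
# Hypothetical rows from two sources that format titles differently.
source_a = [
    {"page_id": 101, "title": "<i>Conditions of My Parole</i>"},
    {"page_id": 102, "title": "Some Other Page"},
]
source_b = [
    {"page_id": 101, "title": "conditions_of_my_parole", "views": 7},
    {"page_id": 102, "title": "some_other_page", "views": 3},
]


def join_on(rows_a, rows_b, key):
    """Inner-join two lists of dicts on `key`, returning (a, b) pairs."""
    index = {}
    for row in rows_b:
        index.setdefault(row[key], []).append(row)
    return [(a, b) for a in rows_a for b in index.get(a[key], [])]


by_title = join_on(source_a, source_b, "title")
by_id = join_on(source_a, source_b, "page_id")

print(len(by_title))  # no rows match on the raw titles
print(len(by_id))     # both rows match on page_id
```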
Section titles
Similar to page titles, section titles seem to have inconsistent formatting.
Most of our scripts currently appear to use a lowercased variant of the title, but at least some titles end up ill-formatted when they contain (wikitext) markup.
E.g. https://en.wikipedia.org/?curid=1653045#As_USS_Hunt/USCGD_Hunt has been seen to show up as either:
- as uss hunt/uscgd hunt (in https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/21/)
- as uss ''hunt''/uscgd ''hunt (in the current section-level image suggestions output, probably other places as well)
The latter is clearly an improperly trimmed version that needs additional fixing.
I suspect it's already faulty in (some of) the datasets that our scripts ingest, so all of those need to be checked and fixed carefully (or at least need additional fixing on our end if it isn't realistic to fix the source)
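If we end up fixing this on our end, the cleanup could look roughly like the sketch below. It is a naive approximation under stated assumptions: the name `strip_section_markup` is hypothetical, and it only handles quote-based formatting, simple links, and HTML tags. Templates and nested markup would need a real wikitext parser.

```python
import re

# Runs of two or more apostrophes: ''italic'', '''bold''', or a dangling run.
_QUOTES_RE = re.compile(r"'{2,}")
# [[target|label]] or [[target]]; keeps the label (or the target if no label).
_LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# Naive HTML tag pattern.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_section_markup(title: str) -> str:
    """Remove simple wikitext/HTML markup from a section title."""
    text = _LINK_RE.sub(r"\1", title)
    text = _TAG_RE.sub("", text)
    text = _QUOTES_RE.sub("", text)
    # Collapse whitespace left behind by removed markup.
    return " ".join(text.split())


print(strip_section_markup("as uss ''hunt''/uscgd ''hunt"))
print(strip_section_markup("As USS ''Hunt''/USCGD ''Hunt''"))
```

Both the improperly trimmed variant and a fully marked-up heading reduce to the plain title, in whatever case they arrived in.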
Additionally, we need to be able to translate these section titles back to a correct link anchor for notifications. For this example, the anchor is #As_USS_Hunt/USCGD_Hunt, so section titles should ideally preserve their original case.
I believe that at least some of the datasets we use have already case-folded section titles, so that might be even less realistic to change/fix.
But if it isn't feasible to move the section titles to a more desirable format, we need to look into implementing a similar conversion in MediaWiki/PHP, so that we can match the available section titles (potentially casefolded/lowercased) to their original form (assuming that is even readily available), from which we can build a working link.
Note: back-conversion potentially affects Growth similarly - to be checked.
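To illustrate the back-conversion idea (in Python here, although the real thing would live in MediaWiki/PHP), here is a sketch under two assumptions: that the original-case section titles of a page are available, and that replacing spaces with underscores is enough for the anchor. MediaWiki's actual anchor encoding handles more characters than this. The names `build_lookup` and `to_anchor` are hypothetical.

```python
def to_anchor(section_title: str) -> str:
    """Approximate a link anchor: whitespace runs become single underscores."""
    return "_".join(section_title.split())


def build_lookup(original_titles):
    """Map case-folded titles to their original-case form.

    If two originals fold to the same key, the later one wins, so a real
    implementation would need to decide how to handle such collisions.
    """
    return {title.casefold(): title for title in original_titles}


# Original-case titles as they would come from the page itself (assumption).
originals = ["As USS Hunt/USCGD Hunt"]
lookup = build_lookup(originals)

# Case-folded title as it appears in our datasets.
restored = lookup.get("as uss hunt/uscgd hunt")
anchor = "#" + to_anchor(restored)
print(anchor)
```

For this example the result is `#As_USS_Hunt/USCGD_Hunt`, matching the anchor quoted above.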
Lastly, I've also seen both casefold & lower being used, but in some cases, they're functionally different (see https://stackoverflow.com/questions/45745661/lower-vs-casefold-in-string-matching-and-converting-to-lowercase)
We should carefully check all transformations already happening and settle on a standard format to use (assuming we're sticking with lowercase representations)
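A quick Python illustration of where the two differ. The German sharp s is the usual example: `lower()` leaves it alone, while `casefold()` expands it, so titles normalized with different methods may not match each other.

```python
word = "Straße"

lowered = word.lower()      # keeps the sharp s
folded = word.casefold()    # expands the sharp s to "ss"

print(lowered)
print(folded)

# An all-caps spelling only matches the case-folded variant.
print("STRASSE".lower() == lowered)
print("STRASSE".casefold() == folded)
```

For plain ASCII titles both methods give the same result, which is probably why mixing them has gone unnoticed so far.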
AC:
- Figure out a standardized page title format and change all non-compliant usage
- Ensure consistent use of either casefold or lower, if standardized format can't retain original case
- Join on page ids instead of page titles where possible
- Figure out a standardized section title format and change all non-compliant usage
- Ensure consistent use of either casefold or lower, if standardized format can't retain original case
- Figure out how to back-convert from standardized section title format to link anchor format in MediaWiki/PHP
Section topics report
Here are some counts of unique page and section titles containing wikitext and/or HTML tags, from the /user/mfossati/section_topics/2023-03-06 dataset (all Wikipedias):
| title type | ill | total | percentage |
| page | 208 | 36,768,482 | 0.0006 % |
| section | 143,686 | 5,424,725 | 2.6 % |
Observations
- ill titles are identified via the regexp ''|[\[\]\{\}<>], which looks for formatted text (two consecutive single quotes), links (opening or closing square brackets), templates (opening or closing curly brackets), or HTML tags (opening or closing angle brackets)
- lead sections are not included in the count
- ill page titles only contain formatted text
- ill section titles contain all symbols that were looked up
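For reference, the regexp from the first observation can be applied like this in Python. The helper name `is_ill` is made up for this sketch, and the sample titles are the ones quoted earlier in this task.

```python
import re

# Two consecutive single quotes, or any square/curly/angle bracket.
ILL_TITLE_RE = re.compile(r"''|[\[\]\{\}<>]")


def is_ill(title: str) -> bool:
    """Return True if the title contains wikitext or HTML markup symbols."""
    return ILL_TITLE_RE.search(title) is not None


for title in (
    "as uss ''hunt''/uscgd ''hunt",
    "as uss hunt/uscgd hunt",
    "<i>Conditions of My Parole</i>",
    "O'Brien",
):
    print(title, is_ill(title))
```

A single apostrophe, as in a possessive or a name, is not flagged; only a run of two or more is.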
Results
Snapshot: 2023-04-03
Section topics
| wiki_db | before | now | difference |
| arwiki | 3449447 | 3468695 | +1 % |
| bnwiki | 605170 | 601946 | -1 % |
| cswiki | 3807318 | 2916905 | -23 % |
| eswiki | 11372406 | 11602393 | +2 % |
| idwiki | 1659491 | 1624959 | -2 % |
| ptwiki | 2769186 | 2410816 | -13 % |
| ruwiki | 12136805 | 9570394 | -21 % |
| all Wikipedias | 215799906 | 229211197 | +6 % |
The decrease in cs, pt, and ru wikis is due to more matches against the section title denylist and the HTML table filter.
For instance, the Carreira, Vida, and Características section titles in ptwiki are now denylisted, thus removing 200k rows out of a 360k difference.
These titles seem like false positives by the way.
Section alignment
| wiki_db | before | now | difference |
| arwiki | 69007 | 69334 | +0.5 % |
| bnwiki | 30185 | 30851 | +2 % |
| cswiki | 129918 | 133557 | +3 % |
| eswiki | 216886 | 226715 | +5 % |
| idwiki | 56161 | 59746 | +6 % |
| ptwiki | 135539 | 143889 | +6 % |
| ruwiki | 237773 | 246301 | +4 % |
| all Wikipedias | 1967312 | 2011259 | +2 % |
SLIS
| wiki_db | before | now | difference |
| arwiki | 19364 | 59168 | +206 % |
| bnwiki | 15254 | 16643 | +9 % |
| cswiki | 41993 | 69849 | +66 % |
| eswiki | 62834 | 158581 | +152 % |
| idwiki | 19164 | 46082 | +140 % |
| ptwiki | 21083 | 61586 | +192 % |
| ruwiki | 126117 | 148203 | +18 % |
| all Wikipedias | 1936765 | 2875377 | +48 % |
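For anyone re-checking the tables: the difference columns appear to be the relative change from before to now, mostly rounded to a whole percent. That reading is an assumption on my part, and the helper names below are made up. A sketch to recompute a value:

```python
def relative_change(before: int, now: int) -> float:
    """Relative change from `before` to `now`, in percent."""
    return (now - before) / before * 100


def format_change(before: int, now: int) -> str:
    """Format the change with an explicit sign, rounded to a whole percent."""
    return f"{relative_change(before, now):+.0f} %"


print(format_change(3807318, 2916905))  # cswiki, section topics: -23 %
print(format_change(1936765, 2875377))  # all Wikipedias, SLIS: +48 %
```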