[M] Check title/section formatting consistency across datasets
Closed, Resolved · Public

Description

Page titles & section titles are not consistently formatted across data sources.

Page titles

A page like https://en.wikipedia.org/wiki/Conditions_of_My_Parole could have titles in any of these formats:

  • <i>Conditions of My Parole</i>
  • Conditions of My Parole
  • Conditions_of_My_Parole
  • conditions of my parole
  • conditions_of_my_parole

We need to standardize on a common format across our scripts. I would propose non-HTML, original case, with spaces, but we first need to check all our data sources to find the lowest-common-denominator format.
Even better, we could change all our joins to use page_id (where available), which eliminates join issues caused by title differences altogether. Such differences could also arise because we're gluing together multiple sources whose snapshot times are not guaranteed to match.
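
A minimal sketch of the proposed page title format (no HTML, original case, spaces), written in Python. The function name `normalize_page_title` is hypothetical and not taken from any of our scripts. Note that case-folded inputs can't be restored to original case, which is one more argument for joining on page_id.

```python
import re

# Matches any HTML tag, e.g. <i> or </i>
_TAG = re.compile(r"<[^>]+>")


def normalize_page_title(raw: str) -> str:
    """Strip HTML tags, turn underscores into spaces, and collapse whitespace.

    Case is left untouched: a lowercased input stays lowercased.
    """
    no_html = _TAG.sub("", raw)
    spaced = no_html.replace("_", " ")
    return " ".join(spaced.split())


variants = [
    "<i>Conditions of My Parole</i>",
    "Conditions of My Parole",
    "Conditions_of_My_Parole",
]
# The original-case variants collapse to a single form
normalized = {normalize_page_title(v) for v in variants}
print(normalized)
```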

Section titles

Similar to page titles, section titles seem to have inconsistent formatting.
Most of our scripts currently appear to use a lowercased variant of the title, but at least some of them are ill-formatted when they contain (wikitext) markup.

E.g. https://en.wikipedia.org/?curid=1653045#As_USS_Hunt/USCGD_Hunt has been seen to show up as either:

The latter is clearly an improperly trimmed version that needs additional fixing.
I suspect it's already faulty in (some of) the datasets that our scripts ingest, so all of those need to be checked and fixed carefully (or at least need additional fixing on our end if it isn't realistic to fix the source)

Additionally, we're going to have to be able to translate these section titles back to a correct link anchor format for notifications, which, for this example, looks like this: #As_USS_Hunt/USCGD_Hunt, which means that, ideally, section titles preserve their original case.
I believe that at least some of the datasets we use have already case-folded section titles, so that might be even less realistic to change/fix.
But if it's implausible to fix the section title format to a more desirable format, we need to look into implementing a similar conversion in MediaWiki/PHP so that we can match up the available section titles (potentially casefolded/lowercased) to a more original format (assuming that is even readily available) from which we can build a working link.
Note: back-conversion potentially affects Growth similarly - to be checked.
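
One way the back-conversion could work, sketched in Python rather than MediaWiki/PHP: keep the original-case headings around, key them by their case-folded form, and build the anchor from the original. The helper names are hypothetical, and MediaWiki's own anchor generation handles more cases (other markup, character escaping, duplicate headings) that this sketch ignores.

```python
import re

# Assumption: wikitext bold/italic quotes are the only markup stripped here
_QUOTES = re.compile(r"'{2,}")


def strip_markup(heading: str) -> str:
    """Drop quote markup and collapse whitespace, keeping the original case."""
    return " ".join(_QUOTES.sub("", heading).split())


def to_anchor(heading: str) -> str:
    """Approximate a link anchor: plain heading with spaces as underscores."""
    return strip_markup(heading).replace(" ", "_")


def build_anchor_lookup(original_headings):
    """Map each case-folded plain heading to the anchor of its original form."""
    return {strip_markup(h).casefold(): to_anchor(h) for h in original_headings}


lookup = build_anchor_lookup(["As USS Hunt/USCGD Hunt"])
anchor = "#" + lookup["as uss hunt/uscgd hunt"]
print(anchor)
```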

Lastly, I've also seen both casefold & lower being used, but in some cases, they're functionally different (see https://stackoverflow.com/questions/45745661/lower-vs-casefold-in-string-matching-and-converting-to-lowercase)
We should carefully check all transformations already happening and settle on a standard format to use (assuming we're sticking with lowercase representations)
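
A short Python illustration of the difference: the two methods agree on ASCII text but not on every character, so mixing them across scripts can silently break joins.

```python
# str.lower() and str.casefold() diverge on characters such as the German sharp s
a, b = "Straße", "STRASSE"

same_with_lower = a.lower() == b.lower()           # 'straße' vs 'strasse'
same_with_casefold = a.casefold() == b.casefold()  # both 'strasse'

print(same_with_lower, same_with_casefold)
```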

AC:

  • Figure out a standardized page title format and change all non-compliant usage
    • Ensure consistent use of either casefold or lower, if standardized format can't retain original case
  • Join on page ids instead of page titles where possible
  • Figure out a standardized section title format and change all non-compliant usage
    • Ensure consistent use of either casefold or lower, if standardized format can't retain original case
    • Figure out how to back-convert from standardized section title format to link anchor format in MediaWiki/PHP

Section topics report

Here are some counts of unique page and section titles containing wikitext and/or HTML tags, from the /user/mfossati/section_topics/2023-03-06 dataset (all Wikipedias):

| what | ill | total | percentage |
| --- | --- | --- | --- |
| page | 208 | 36,768,482 | 0.0005 % |
| section | 143,686 | 5,424,725 | 2.6 % |

Observations

  • ill titles are identified via the regexp ''|[\[\]\{\}<>], which looks for formatted text (two consecutive single quotes, i.e. wikitext italic/bold markup), links (opening or closing square brackets), templates (opening or closing curly brackets), or HTML tags (opening or closing angle brackets)
  • lead sections are not included in the count
  • ill page titles only contain formatted text
  • ill section titles contain all symbols that were looked up
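
For reference, the check described in the first observation can be reproduced in Python roughly as follows. The regexp is the one quoted above; the helper name `is_ill` and the sample titles are made up for illustration.

```python
import re

# Two consecutive single quotes, or any square/curly/angle bracket
ILL_TITLE = re.compile(r"''|[\[\]\{\}<>]")


def is_ill(title: str) -> bool:
    """Return True when the title still contains wikitext or HTML markup."""
    return ILL_TITLE.search(title) is not None


samples = [
    "''Conditions of My Parole''",  # formatted text
    "[[Some link]]",                # link
    "{{lang|en|Some title}}",       # template
    "<i>Some title</i>",            # HTML tag
    "Plain title",                  # clean
]
flags = [is_ill(s) for s in samples]
print(flags)
```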
NOTE: the only join on page titles is done in gather_section_topics and can't really be avoided: the extracted blue links are just titles, thus needing the same join on a different table to look page IDs up.
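
A rough sketch of that lookup, assuming titles are brought to the underscored format before matching. The page rows, IDs, and helper names are invented for illustration and do not come from gather_section_topics.

```python
# Made-up page table rows: (page_id, underscored title)
pages = [(101, "Conditions_of_My_Parole"), (202, "Example_page")]


def to_underscored(title: str) -> str:
    """Bring a title to underscored format, whatever separator it came with."""
    return "_".join(title.replace("_", " ").split())


# Title-to-ID lookup: this is the one unavoidable join on titles
title_to_id = {to_underscored(title): page_id for page_id, title in pages}


def resolve_links(link_titles):
    """Return (title, page_id) pairs; page_id is None when no page matches."""
    return [(t, title_to_id.get(to_underscored(t))) for t in link_titles]


resolved = resolve_links(["Conditions of My Parole", "Missing page"])
print(resolved)
```

Once the blue links carry page IDs, every later join can use the ID instead of the title.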

Results

Snapshot: 2023-04-03

NOTE: TL;DR massive SLIS increase, overall +48 % with a peak of +206 % in arwiki. Section topics saw a decrease in cs, pt, and ru wikis.

Section topics

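| wiki_db | before | now | difference |
| --- | --- | --- | --- |
| arwiki | 3449447 | 3468695 | +1 % |
| bnwiki | 605170 | 601946 | -1 % |
| cswiki | 3807318 | 2916905 | -23 % |
| eswiki | 11372406 | 11602393 | +2 % |
| idwiki | 1659491 | 1624959 | +2 % |
| ptwiki | 2769186 | 2410816 | -13 % |
| ruwiki | 12136805 | 9570394 | -21 % |
| all Wikipedias | 215799906 | 229211197 | +6 % |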

The decrease in cs, pt, and ru wikis is due to more matches against the section title denylist and the HTML table filter.
For instance, the Carreira, Vida, and Características section titles in ptwiki are now denylisted, thus removing 200k rows out of a 360k difference.
These titles seem like false positives by the way.
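
To make the effect concrete, here is a hedged Python sketch of how a denylist match drops rows. The denylist excerpt only contains the three titles named above, matching on case-folded titles is an assumption, and the sample rows are invented.

```python
# Excerpt only: the three ptwiki titles mentioned above, stored case-folded
DENYLIST = {t.casefold() for t in ("Carreira", "Vida", "Características")}


def keep_section(title: str) -> bool:
    """Keep a row unless its section title matches the denylist."""
    return title.strip().casefold() not in DENYLIST


# Invented sample rows: (wiki_db, section title)
rows = [("ptwiki", "Carreira"), ("ptwiki", "Discografia"), ("ptwiki", "vida")]
kept = [row for row in rows if keep_section(row[1])]
print(kept)
```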

Section alignment

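| wiki_db | before | now | difference |
| --- | --- | --- | --- |
| arwiki | 69007 | 69334 | +0.5 % |
| bnwiki | 30185 | 30851 | +2 % |
| cswiki | 129918 | 133557 | +3 % |
| eswiki | 216886 | 226715 | +5 % |
| idwiki | 56161 | 59746 | +6 % |
| ptwiki | 135539 | 143889 | +6 % |
| ruwiki | 237773 | 246301 | +4 % |
| all Wikipedias | 1967312 | 2011259 | +2 % |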

SLIS

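| wiki_db | before | now | difference |
| --- | --- | --- | --- |
| arwiki | 19364 | 59168 | +206 % |
| bnwiki | 15254 | 16643 | +9 % |
| cswiki | 41993 | 69849 | +66 % |
| eswiki | 62834 | 158581 | +152 % |
| idwiki | 19164 | 46082 | +140 % |
| ptwiki | 21083 | 61586 | +192 % |
| ruwiki | 126117 | 148203 | +18 % |
| all Wikipedias | 1936765 | 2875377 | +48 % |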
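
The difference column can be re-derived with a small Python helper, assuming it is the relative change rounded to a whole percent and that the SLIS counts split as 19364 to 59168 for arwiki and 1936765 to 2875377 overall. The function name `pct_change` is hypothetical.

```python
def pct_change(before: int, now: int) -> int:
    """Relative change from before to now, rounded to a whole percent."""
    return round((now - before) / before * 100)


arwiki_slis = pct_change(19364, 59168)
all_slis = pct_change(1936765, 2875377)
print(f"arwiki: {arwiki_slis:+d} %, all Wikipedias: {all_slis:+d} %")
```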

Details

| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| Standardize page title & section heading formats | repos/structured-data/image-suggestions!34 | mlitn | T333333 | main |
| Standardize page title & section heading formats | repos/structured-data/section-image-recs!8 | mlitn | T333333 | main |
| Standardize page title & section heading formats | repos/structured-data/section-topics!27 | mlitn | T333333 | main |

Event Timeline

3333333333333333333333333333! Congratulations on your task number, not at all bitter that I got T333334...

CBogen renamed this task from Check title/section formatting consistency across datasets to [M] Check title/section formatting consistency across datasets. Apr 5 2023, 7:11 PM

Quick update so far:

  • (Commons) file titles were almost consistently in original case, underscored format. The only exception was in imagerec/article_images.py, but that piece of data was not being used.
  • Wiki page titles are a mixed bag; some are underscored, others with spaces. These are all being standardized to underscored format, and joins are done with page_id where possible to circumvent the issue altogether.
    • This discrepancy caused missing alignment-based suggestions for multi-word pages due to differences in page title format.

Next up: sections.

mfossati updated the task description.

Ran all scripts and made data checks, see report in the task description.
All merged! Closing.