Key-value extraction misses on Wikipedia:WikiProject Council/Directory/WikiProject template invocations
Open, Normal, Public

Description

When whitespace is missing on either side of the equals sign in a key-value assignment within wikitext invocations of the Wikipedia:WikiProject Council/Directory/WikiProject template on enwiki, the key-value extraction routine may not pick up those subcategories for potential inclusion in the base WikiProjects data. This was observed at checkpoint 247a46aa8938ffeb40437a83dcf5887ed891f843 (that patch is not the offender, just a frame of reference for this ticket).

https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/fetch_wikiprojects.py#L56

wp_section_regex =\
    r'{{Wikipedia:WikiProject Council/Directory/WikiProject\n'\
    '\|project = ([a-zA-Z_: -]+)\n'\
    '\|shortname = ([a-zA-Z\(\) -]+)\n'\
    '\|active = (yes|no)\n([^}]*)}}'

While manually inspecting the code and stepping through it in a Python interpreter, I noticed that traversal into https://en.wikipedia.org/w/api.php?action=parse&page=Wikipedia:WikiProject_Council/Directory/History%20and%20society&prop=wikitext&section=2 doesn't seem to pick up the "Ageing and culture" leaf node by way of https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/fetch_wikiprojects.py#L348.

def get_wikiprojects_from_table(self, wikitext):
    """
    Takes a WikiProjects table listing, and returns individual WikiProjects
    """
    wp = {}
    matches = re.findall(wp_section_regex, wikitext)
    for match in matches:
        remaining = match[3]
        listed_in = re.search(wp_section_regex_listed, remaining)
        # Listed somewhere else, so skip
        if listed_in:
            continue
        wp[match[1]] = {'name': match[0], 'shortname': match[1], 'active':
                        match[2]}
    return wp

Rather, it seems to jump straight to the "Agriculture" node. The reason seems to be that the regex requires a space character on each side of the equals sign in each key-value assignment: the "Ageing and culture" template invocation lacks the space before the equals sign, whereas the "Agriculture" one has both (and is therefore the first regex hit). Here's the first bit of the JSON server response bearing the wikitext, in case someone's perusing this ticket later:

{
    "parse": {
        "title": "Wikipedia:WikiProject Council/Directory/History and society",
        "pageid": 6918422,
        "wikitext": {
            "*": "===General topics===\n{{Wikipedia:WikiProject Council/Directory/WikiProject header}}\n\n{{Wikipedia:WikiProject Council/Directory/WikiProject\n|project= Wikipedia:WikiProject Ageing and culture\n|shortname= Ageing and culture\n|active= yes\n|assessment= \n|peer-review= \n|collaboration= \n|portal= Society\n|notes= }}\n\n{{Wikipedia:WikiProject Council/Directory/WikiProject\n|project = Wikipedia:WikiProject Agriculture\n|shortname = Agriculture\n|active = yes\n|assessment = \n|peer-review = \n|collaboration = \n|portal = Agriculture\n|notes = \n|task-force = \n|listed-in = }}...

The regex-based parsing here probably doesn't need to go over the top accounting for every edge case, but missing spaces are likely a fairly common pattern in template invocations.
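To make the miss concrete, here is a minimal reproduction sketch. It assumes the pattern quoted earlier in this ticket (rewritten as raw strings throughout) and an abbreviated copy of the wikitext from the API response; the constant names are illustrative and not taken from the drafttopic repository.

```python
import re

# Pattern as quoted in this ticket, written with raw strings throughout.
STRICT_SECTION_REGEX = (
    r'{{Wikipedia:WikiProject Council/Directory/WikiProject\n'
    r'\|project = ([a-zA-Z_: -]+)\n'
    r'\|shortname = ([a-zA-Z\(\) -]+)\n'
    r'\|active = (yes|no)\n([^}]*)}}'
)

# Abbreviated from the wikitext in the API response quoted above.
SAMPLE_WIKITEXT = (
    "{{Wikipedia:WikiProject Council/Directory/WikiProject\n"
    "|project= Wikipedia:WikiProject Ageing and culture\n"
    "|shortname= Ageing and culture\n"
    "|active= yes\n"
    "|notes= }}\n"
    "\n"
    "{{Wikipedia:WikiProject Council/Directory/WikiProject\n"
    "|project = Wikipedia:WikiProject Agriculture\n"
    "|shortname = Agriculture\n"
    "|active = yes\n"
    "|listed-in = }}\n"
)

# Each findall hit is a tuple: (project, shortname, active, remaining).
strict_shortnames = [
    match[1] for match in re.findall(STRICT_SECTION_REGEX, SAMPLE_WIKITEXT)
]
print(strict_shortnames)
```

On this sample only the "Agriculture" invocation is returned, because "|project= " has no space before the equals sign.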

I'm happy to provide a patch, but first wanted someone to confirm my understanding and, if it's cheap, perhaps run a coverage diff pre- and post-patch. I suspect the practical consequence for the final model production as-is is that it all gets washed out anyway, but I digress. The thought was to simply allow a little extra flexibility in the regex without trying to build a full-on template parser.
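As a rough sketch of what "a little extra flexibility" could look like, the pattern below allows optional spaces or tabs around the pipe and the equals sign. This is an illustration under assumptions, not the repository's code: the name TOLERANT_SECTION_REGEX and the abbreviated sample are made up for this example, and parameter order is still assumed to be project, shortname, active.

```python
import re

# Hypothetical relaxed pattern: optional spaces/tabs before the pipe and on
# both sides of the equals sign. Values run to the end of the line.
TOLERANT_SECTION_REGEX = (
    r'{{Wikipedia:WikiProject Council/Directory/WikiProject\n'
    r'[ \t]*\|project[ \t]*=[ \t]*([^\n]*)\n'
    r'[ \t]*\|shortname[ \t]*=[ \t]*([^\n]*)\n'
    r'[ \t]*\|active[ \t]*=[ \t]*(yes|no)[ \t]*\n([^}]*)}}'
)

# Two invocations with different spacing around the equals sign.
SAMPLE = (
    "{{Wikipedia:WikiProject Council/Directory/WikiProject\n"
    "|project= Wikipedia:WikiProject Ageing and culture\n"
    "|shortname= Ageing and culture\n"
    "|active= yes\n"
    "|notes= }}\n"
    "\n"
    "{{Wikipedia:WikiProject Council/Directory/WikiProject\n"
    "|project = Wikipedia:WikiProject Agriculture\n"
    "|shortname = Agriculture\n"
    "|active = yes\n"
    "|listed-in = }}\n"
)

# Strip captured values, since trailing whitespace on a line is captured too.
tolerant_projects = {
    shortname.strip(): {"name": project.strip(), "active": active}
    for project, shortname, active, _remaining
    in re.findall(TOLERANT_SECTION_REGEX, SAMPLE)
}
print(sorted(tolerant_projects))
```

Both invocations are picked up here regardless of spacing. The value classes are deliberately loose, so anything up to the newline is accepted as a project name or shortname.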

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper. Jul 31 2019, 12:16 PM
dr0ptp4kt updated the task description. Jul 31 2019, 12:16 PM
dr0ptp4kt updated the task description.
dr0ptp4kt renamed this task from Key-value extraction misses on Wikipedia:WikiProject Council/Directory/WikiProject to Key-value extraction misses on Wikipedia:WikiProject Council/Directory/WikiProject template invocations. Jul 31 2019, 12:20 PM
dr0ptp4kt updated the task description.
dr0ptp4kt added a comment. Edited Aug 2 2019, 10:36 AM

Here's approximately what I had in mind. The mwparserfromhell library might streamline extraction of key-value pairs, but this handles the typical whitespace patterns, and editors seem to follow the conventional parameter ordering.

$ git diff
diff --git a/drafttopic/utilities/fetch_wikiprojects.py b/drafttopic/utilities/fetch_wikiprojects.py
index 7165ba8..1f6bec0 100644
--- a/drafttopic/utilities/fetch_wikiprojects.py
+++ b/drafttopic/utilities/fetch_wikiprojects.py
@@ -55,12 +55,12 @@ wp_section_nextheading_regex = r'(.+)[=]{2,}'
 
 wp_section_regex =\
     r'{{Wikipedia:WikiProject Council/Directory/WikiProject\n'\
-    '\|project = ([a-zA-Z_: -]+)\n'\
-    '\|shortname = ([a-zA-Z\(\) -]+)\n'\
-    '\|active = (yes|no)\n([^}]*)}}'
+    '[ \t]*\|project[ \t]*=[ \t]*([^\n/#]*)\n'\
+    '[ \t]*\|shortname[ \t]*=[ \t]*([^\n]*)\n'\
+    '[ \t]*\|active[ \t]*=[ \t]*(yes|no)\n([^}]*)}}'
 # To check listing in other wikiprojects
 wp_section_regex_listed =\
-    r'listed-in = ([A-Za-z#/:_ ]+)'
+    r'listed-in[ \t]*=[ \t]*([^\s][^\n]*)'
 
 wp_main_links_regex1 =\

That does produce some additional wikiprojects.

$ diff --new-line-format="" --unchanged-line-format="" outmid.newregex outmid | sort | uniq
            "Wikipedia:Raymond E. Feist series",
            "Wikipedia:Version 1.0 Editorial Team",
            "Wikipedia:WikiProject 24",
            "Wikipedia:WikiProject A1 Grand Prix",
            "Wikipedia:WikiProject AP Biology 2018"
            "Wikipedia:WikiProject African diaspora",
            "Wikipedia:WikiProject Ageing and culture",
            "Wikipedia:WikiProject Apple Inc.",
            "Wikipedia:WikiProject Athletics",
            "Wikipedia:WikiProject Babylon 5",
            "Wikipedia:WikiProject Bah\u00e1'\u00ed Faith",
            "Wikipedia:WikiProject Beyonc\u00e9 Knowles",
            "Wikipedia:WikiProject Big 12 Conference",
            "Wikipedia:WikiProject Bob Dylan",
            "Wikipedia:WikiProject Britney Spears",
            "Wikipedia:WikiProject C++",
            "Wikipedia:WikiProject Capital District",
            "Wikipedia:WikiProject Cardiff",
            "Wikipedia:WikiProject Children's literature",
            "Wikipedia:WikiProject Columbia, Missouri",
            "Wikipedia:WikiProject Cooperation",
            "Wikipedia:WikiProject Cote d'Ivoire",
            "Wikipedia:WikiProject Cultural Evolution",
            "Wikipedia:WikiProject Dams",
            "Wikipedia:WikiProject Death",
            "Wikipedia:WikiProject Dungeons & Dragons",
            "Wikipedia:WikiProject East Asia",
            "Wikipedia:WikiProject Education in Nepal",
            "Wikipedia:WikiProject Electrical engineering",
            "Wikipedia:WikiProject Elvis Presley",
            "Wikipedia:WikiProject Eurovision",
            "Wikipedia:WikiProject Finance & Investment",
            "Wikipedia:WikiProject Fisheries and Fishing",
            "Wikipedia:WikiProject Fraternities and Sororities",
            "Wikipedia:WikiProject G.I. Joe",
            "Wikipedia:WikiProject General Audience",
            "Wikipedia:WikiProject Georgia (country)",
            "Wikipedia:WikiProject Greater Boston Public Transit",
            "Wikipedia:WikiProject HHGTTG",
            "Wikipedia:WikiProject Horror",
            "Wikipedia:WikiProject Hospitals",
            "Wikipedia:WikiProject Insects",
            "Wikipedia:WikiProject Islands",
            "Wikipedia:WikiProject Java",
            "Wikipedia:WikiProject Jehovah's Witnesses",
            "Wikipedia:WikiProject Jennifer Lopez",
            "Wikipedia:WikiProject Kelly Clarkson",
            "Wikipedia:WikiProject Kylie Minogue",
            "Wikipedia:WikiProject Lady Gaga",
            "Wikipedia:WikiProject Latin music ",
            "Wikipedia:WikiProject Lepidoptera",
            "Wikipedia:WikiProject M*A*S*H",
            "Wikipedia:WikiProject Magic: The Gathering",
            "Wikipedia:WikiProject Mariah Carey",
            "Wikipedia:WikiProject Men's Issues",
            "Wikipedia:WikiProject Michael Jackson",
            "Wikipedia:WikiProject Motorcycling",
            "Wikipedia:WikiProject Multi-sport events",
            "Wikipedia:WikiProject Myanmar (Burma)",
            "Wikipedia:WikiProject New England Public Transit",
            "Wikipedia:WikiProject North America",
            "Wikipedia:WikiProject Occupations",
            "Wikipedia:WikiProject Orders, Decorations and Medals",
            "Wikipedia:WikiProject Parallel and Distributed Computing Systems",
            "Wikipedia:WikiProject Pharmacology",
            "Wikipedia:WikiProject Pok\u00e9mon",
            "Wikipedia:WikiProject Polyhedra",
            "Wikipedia:WikiProject Pop music ",
            "Wikipedia:WikiProject Psychedelics, Dissociatives and Deliriants"
            "Wikipedia:WikiProject Punjab (India)",
            "Wikipedia:WikiProject R&B and Soul Music",
            "Wikipedia:WikiProject RISC OS",
            "Wikipedia:WikiProject Ravidassia",
            "Wikipedia:WikiProject Retailing",
            "Wikipedia:WikiProject Rihanna",
            "Wikipedia:WikiProject Saskatchewan Communities & Neighbourhoods",
            "Wikipedia:WikiProject Signal Processing",
            "Wikipedia:WikiProject Skepticism",
            "Wikipedia:WikiProject Squash",
            "Wikipedia:WikiProject St. Louis Cardinals",
            "Wikipedia:WikiProject St. Louis",
            "Wikipedia:WikiProject Statistics",
            "Wikipedia:WikiProject Superfunds",
            "Wikipedia:WikiProject Table Tennis",
            "Wikipedia:WikiProject The 4400",
            "Wikipedia:WikiProject The Beatles",
            "Wikipedia:WikiProject The Clash",
            "Wikipedia:WikiProject The Simpsons",
            "Wikipedia:WikiProject The Supremes",
            "Wikipedia:WikiProject Timeline Tracer",
            "Wikipedia:WikiProject Trade"
            "Wikipedia:WikiProject Triathlon",
            "Wikipedia:WikiProject Typography",
            "Wikipedia:WikiProject U.S. Congress",
            "Wikipedia:WikiProject U.S. Presidents",
            "Wikipedia:WikiProject U.S. Roads",
            "Wikipedia:WikiProject U.S. Supreme Court cases",
            "Wikipedia:WikiProject U.S. counties",
            "Wikipedia:WikiProject U2",
            "Wikipedia:WikiProject US Governors",
            "Wikipedia:WikiProject Uniform Polytopes"
            "Wikipedia:WikiProject United States Constitution",
            "Wikipedia:WikiProject United States Government",
            "Wikipedia:WikiProject United States Public Policy",
            "Wikipedia:WikiProject University of Pittsburgh"
            "Wikipedia:WikiProject Warhammer 40,000",
            "Wikipedia:WikiProject Wikipedia-Books"
            "Wikipedia:WikiProject Women's History",
            "Wikipedia:WikiProject Women's sport",
            "Wikipedia:WikiProject Yoga"
            "Wikipedia:WikiProject YouTube",
            "Wikipedia:WikiProject \u00c5land Islands",
            "Wikipedia:WikiProject on open proxies",
        "Assistance.Classroom projects": [
        ]
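
The shell comparison above can be approximated in Python as a set difference over output lines. This is only a sketch: the function name lines_only_in and the two in-memory sample runs are hypothetical, and unlike diff it ignores line position and compares line contents only.

```python
def lines_only_in(new_lines, old_lines):
    """Return sorted, de-duplicated lines found in new_lines but not in old_lines."""
    old = {line.rstrip("\n") for line in old_lines}
    new = {line.rstrip("\n") for line in new_lines}
    return sorted(new - old)


# Hypothetical stand-ins for the outmid.newregex and outmid files.
new_run = [
    '"Wikipedia:WikiProject Ageing and culture",\n',
    '"Wikipedia:WikiProject Agriculture",\n',
]
old_run = [
    '"Wikipedia:WikiProject Agriculture",\n',
]

added = lines_only_in(new_run, old_run)
print(added)
```

With real files, the two lists could come from reading each file's lines, for example with open(path).readlines().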
Halfak added a subscriber: Halfak. Aug 8 2019, 9:58 PM

Nice work! Thank you for looking into this. We'd certainly be interested in a pull request if you were to file one.

Halfak triaged this task as Normal priority. Wed, Sep 11, 9:13 PM
Halfak moved this task from Untriaged to Maintenance/cleanup on the Scoring-platform-team board.