Change Details


It appears that when a concrete wikitext invocation of the `Wikipedia:WikiProject Council/Directory/WikiProject` template on enwiki has no whitespace between the equals sign and the key or value in a key-value assignment, the key-value extraction routine may not pick up that subcategory for potential inclusion in the base WikiProjects data. This is at least the case from the checkpoint of 247a46aa8938ffeb40437a83dcf5887ed891f843 (that patch is not the offender, just a frame of reference for this ticket).

https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/fetch_wikiprojects.py#L56

```
wp_section_regex =\
    r'{{Wikipedia:WikiProject Council/Directory/WikiProject\n'\
    '\|project = ([a-zA-Z_: -]+)\n'\
    '\|shortname = ([a-zA-Z\(\) -]+)\n'\
    '\|active = (yes|no)\n([^}]*)}}'
```

While manually inspecting the code and manually aping it into a Python interpreter, I noticed that traversal into https://en.wikipedia.org/w/api.php?action=parse&page=Wikipedia:WikiProject_Council/Directory/History%20and%20society&prop=wikitext&section=2 doesn't seem to pick up the "Ageing and culture" leaf node by way of https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/fetch_wikiprojects.py#L348.

```
def get_wikiprojects_from_table(self, wikitext):
    """
    Takes a WikiProjects table listing, and returns individual WikiProjects
    """
    wp = {}
    matches = re.findall(wp_section_regex, wikitext)
    for match in matches:
        remaining = match[3]
        listed_in = re.search(wp_section_regex_listed, remaining)
        # Listed somewhere else, so skip
        if listed_in:
            continue
        wp[match[1]] = {'name': match[0],
                        'shortname': match[1],
                        'active': match[2]}
    return wp
```

Rather, it seems to jump straight to the "Agriculture" node. The reason seems to be that the "Ageing and culture" template invocation doesn't have a space character on each side of the equals sign in its key-value assignments, which the regex requires for a match, whereas the "Agriculture" one does (and is therefore the first regex hit).

Here's the first bit of the JSON server response bearing the wikitext in case someone's perusing this ticket later:

```
{
    "parse": {
        "title": "Wikipedia:WikiProject Council/Directory/History and society",
        "pageid": 6918422,
        "wikitext": {
            "*": "===General topics===\n{{Wikipedia:WikiProject Council/Directory/WikiProject header}}\n\n{{Wikipedia:WikiProject Council/Directory/WikiProject\n|project= Wikipedia:WikiProject Ageing and culture\n|shortname= Ageing and culture\n|active= yes\n|assessment= \n|peer-review= \n|collaboration= \n|portal= Society\n|notes= }}\n\n{{Wikipedia:WikiProject Council/Directory/WikiProject\n|project = Wikipedia:WikiProject Agriculture\n|shortname = Agriculture\n|active = yes\n|assessment = \n|peer-review = \n|collaboration = \n|portal = Agriculture\n|notes = \n|task-force = \n|listed-in = }}...
```

It doesn't seem like the regex-based parsing here needs to go way over the top accounting for every edge case, but missing spaces are potentially a not-uncommon pattern in template invocations. The task filer here is happy to provide a patch, but was first hoping someone could confirm this understanding - and if it's cheap, perhaps provide a coverage diff pre- and post-patch; I suspect the practical consequence in the final model production as-is is that it all gets washed out anyway, but I digress. The thought was to simply allow a little extra flexibility in the regex without trying to build a full-on template parser.
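To make the "little extra flexibility" concrete, here is a minimal sketch rather than the actual patch. It assumes the only change is swapping each literal ` = ` for ` *= *` (zero or more spaces on either side of the equals sign). The names `STRICT_REGEX`, `RELAXED_REGEX`, `SAMPLE_WIKITEXT` and `shortnames` are hypothetical and exist only for this illustration; the sample is the wikitext excerpt quoted above, cut off after the second template. The `listed-in` skip is left out because `wp_section_regex_listed` isn't quoted in this ticket.

```python
import re

# The pattern quoted above from fetch_wikiprojects.py (every piece written as a
# raw string here, which yields the same regex).
STRICT_REGEX = (
    r'{{Wikipedia:WikiProject Council/Directory/WikiProject\n'
    r'\|project = ([a-zA-Z_: -]+)\n'
    r'\|shortname = ([a-zA-Z\(\) -]+)\n'
    r'\|active = (yes|no)\n([^}]*)}}'
)

# Hypothetical relaxed variant: ' *= *' tolerates missing (or extra) spaces
# around the equals sign. This is an assumption, not a reviewed patch.
RELAXED_REGEX = (
    r'{{Wikipedia:WikiProject Council/Directory/WikiProject\n'
    r'\|project *= *([a-zA-Z_: -]+)\n'
    r'\|shortname *= *([a-zA-Z\(\) -]+)\n'
    r'\|active *= *(yes|no)\n([^}]*)}}'
)

# Wikitext excerpt from the API response quoted in this ticket.
SAMPLE_WIKITEXT = (
    "===General topics===\n"
    "{{Wikipedia:WikiProject Council/Directory/WikiProject header}}\n\n"
    "{{Wikipedia:WikiProject Council/Directory/WikiProject\n"
    "|project= Wikipedia:WikiProject Ageing and culture\n"
    "|shortname= Ageing and culture\n"
    "|active= yes\n"
    "|assessment= \n|peer-review= \n|collaboration= \n"
    "|portal= Society\n|notes= }}\n\n"
    "{{Wikipedia:WikiProject Council/Directory/WikiProject\n"
    "|project = Wikipedia:WikiProject Agriculture\n"
    "|shortname = Agriculture\n"
    "|active = yes\n"
    "|assessment = \n|peer-review = \n|collaboration = \n"
    "|portal = Agriculture\n|notes = \n|task-force = \n|listed-in = }}"
)


def shortnames(pattern, wikitext):
    """Return the shortname capture group of every template the pattern matches."""
    return [match[1] for match in re.findall(pattern, wikitext)]


strict_hits = shortnames(STRICT_REGEX, SAMPLE_WIKITEXT)
relaxed_hits = shortnames(RELAXED_REGEX, SAMPLE_WIKITEXT)
# A tiny stand-in for the pre/post coverage diff: what the relaxed pattern adds.
newly_covered = [name for name in relaxed_hits if name not in strict_hits]

print(strict_hits)    # ['Agriculture']
print(relaxed_hits)   # ['Ageing and culture', 'Agriculture']
print(newly_covered)  # ['Ageing and culture']
```

Running the same comparison over every directory page would give the pre- and post-patch coverage diff mentioned above, though whether other spacing variants exist in the wild is something I haven't checked.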
