Migrate IABot parsing code to annotated HTML (Parsoid)
Closed, Declined · Public

Description

There are 30+ bugs related to IABot's interpretation of wiki syntax. IABot uses its own custom parsing code.

My idea is to migrate much of the parsing code to a reliable external parsing library. Parsoid is an obvious choice.

The ideal outcome for this task is the creation of an abstract wiki document object that IABot would interact with, rather than applying regular expressions directly to text.

Event Timeline

Have you started this effort and/or are you still working on this?

I believe that template parsing is a major issue with the bot (and with bots in general). I have written a very small template editor with a proper linter/parser that is roundtrip-safe, i.e., it preserves whitespace, unknown fields, etc. (template_source == create_string(parse(template_source))) and does not rely on regular expressions to extract data. In particular, it can handle nested templates, which IABot seems to struggle with a lot.
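
For illustration, here is a minimal Python sketch of that roundtrip property. The names (Template, parse, create_string) and the structure are hypothetical, not the actual tool's code, and it only handles balanced {{ }} pairs:

```python
# Hypothetical sketch of a roundtrip-safe template parser (not the actual
# tool described above). Every input character is kept in the parse tree,
# so serializing the tree reproduces the source byte-for-byte.
from dataclasses import dataclass, field

@dataclass
class Template:
    parts: list = field(default_factory=list)  # str and nested Template nodes

    def __str__(self):
        return "{{" + "".join(str(p) for p in self.parts) + "}}"

def parse(text):
    """Split wikitext into plain strings and (possibly nested) Templates."""
    root, stack, buf, i = [], [], [], 0
    nodes = root
    while i < len(text):
        if text.startswith("{{", i):
            if buf:
                nodes.append("".join(buf)); buf = []
            tpl = Template()
            nodes.append(tpl)
            stack.append(nodes)   # descend into the new template
            nodes = tpl.parts
            i += 2
        elif text.startswith("}}", i) and stack:
            if buf:
                nodes.append("".join(buf)); buf = []
            nodes = stack.pop()   # ascend back to the parent
            i += 2
        else:
            buf.append(text[i]); i += 1
    if buf:
        nodes.append("".join(buf))
    return root  # note: an unbalanced "{{" would break the roundtrip here

def create_string(nodes):
    return "".join(str(n) for n in nodes)

source = "x {{cite web | url=http://example.com {{!}} title = T }} y"
assert create_string(parse(source)) == source  # whitespace etc. preserved
```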

It's currently written in Python, because I intend to use it with my Python-based bot to clean up some IABot issues, but it should be straightforward to port to PHP. That might be less effort than integrating Parsoid. Do you have a good understanding of the object that IABot needs?

Harej triaged this task as Medium priority. Feb 4 2021, 2:11 AM

For something as critical as parsing, InternetArchiveBot should go with something that can be depended on for years to come. Using Wikipedia's canonical parser seems like the clear choice here.

Originally we explored including Parsoid as a Composer dependency, but this is apparently a bad route to take: much of Parsoid's functionality depends on the context of the wiki, so you can't really separate the parser from the wiki. We also hesitated to use an external service for parsing, as it would add latency from network I/O. However, since we operate the bot on Wikimedia Cloud Services, we can access Wikimedia's Parsoid through the internal network, drastically reducing the networking overhead compared to an external HTTP(S) service.

This is how InternetArchiveBot processes pages currently (a rough sketch in code follows the list):

  1. Wikitext is downloaded from wiki via api.php
  2. The wikitext is parsed and manipulated through InternetArchiveBot's own code
  3. The modified wikitext is submitted to the wiki via api.php
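
For concreteness, a minimal Python sketch of that flow using the requests library. The helper names are invented here, and step 2 merely stands in for IABot's own, much more involved manipulation code:

```python
# Rough sketch of the current api.php-based flow (hypothetical helper
# names; step 2 stands in for IABot's own regex-based parsing code).
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_wikitext(title):
    # 1. Download the page's wikitext via api.php
    r = requests.get(API, params={
        "action": "query", "prop": "revisions", "rvprop": "content",
        "rvslots": "main", "titles": title,
        "format": "json", "formatversion": "2",
    })
    page = r.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

def fix_dead_links(wikitext):
    # 2. Parse and manipulate with the bot's own code -- stubbed out here
    return wikitext

def save_wikitext(title, wikitext, csrf_token):
    # 3. Submit the modified wikitext via api.php (requires a CSRF token)
    requests.post(API, data={
        "action": "edit", "title": title, "text": wikitext,
        "token": csrf_token, "format": "json",
    })
```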

With a switch to Parsoid, or to an external parsing service generally, this would be the new workflow (a sketch in code follows the list):

  1. Annotated HTML is downloaded from the wiki
  2. The HTML is parsed using a standard HTML parser and manipulated as HTML
  3. The transformed HTML is sent to Parsoid to convert into wikitext
  4. The new wikitext is submitted to the wiki via api.php
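
A sketch of the proposed flow against the Wikimedia REST API endpoints (/page/html and /transform/html/to/wikitext); on Cloud Services the same service is reachable over the internal network. Authentication, ETag/revision matching, and error handling are omitted, and the function names are hypothetical:

```python
# Sketch of the proposed Parsoid-based flow via the Wikimedia REST API
# (authentication, ETag/revision matching, and error handling omitted).
import requests
from bs4 import BeautifulSoup  # any standard HTML parser would do

REST = "https://en.wikipedia.org/api/rest_v1"

def edit_via_parsoid(title, transform, save_wikitext):
    # 1. Download annotated (Parsoid) HTML from the wiki
    html = requests.get(f"{REST}/page/html/{title}").text

    # 2. Parse with a standard HTML parser and manipulate as HTML
    doc = BeautifulSoup(html, "html.parser")
    transform(doc)  # e.g. rewrite dead external links to archive URLs

    # 3. Have Parsoid convert the transformed HTML back into wikitext
    wikitext = requests.post(
        f"{REST}/transform/html/to/wikitext/{title}",
        data={"html": str(doc)},
    ).text

    # 4. Submit the new wikitext via api.php (see the earlier sketch)
    save_wikitext(title, wikitext)
```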

This means that each edit InternetArchiveBot makes will require three API round trips instead of two. While this is not ideal, I believe the gains in both edit accuracy and long-term maintainability make up for it. And as mentioned before, the traffic stays entirely on Wikimedia's own internal network, so the added latency will be minimal.

To implement this, I would first write unit tests for the existing functions. Then I would create a WikiPage object that retrieves and manipulates page data as annotated HTML, with a method to export back to wikitext. The existing functions would then be adapted to operate on WikiPage objects instead.
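
To make that concrete, a hypothetical shape for the WikiPage object (the class and method names are invented here, not existing IABot code):

```python
# Hypothetical shape of the proposed WikiPage abstraction (names invented
# here to make the plan concrete; not an existing IABot class).
import requests
from bs4 import BeautifulSoup

REST = "https://en.wikipedia.org/api/rest_v1"

class WikiPage:
    def __init__(self, title):
        self.title = title
        html = requests.get(f"{REST}/page/html/{title}").text
        self.doc = BeautifulSoup(html, "html.parser")

    def external_links(self):
        # Parsoid annotates external links with rel="mw:ExtLink"
        return self.doc.select('a[rel~="mw:ExtLink"]')

    def to_wikitext(self):
        # Export: ask Parsoid to convert the (possibly modified) HTML back
        return requests.post(
            f"{REST}/transform/html/to/wikitext/{self.title}",
            data={"html": str(self.doc)},
        ).text
```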

Harej renamed this task from "Migrate IABot parsing code to Parsoid" to "Migrate IABot parsing code to annotated HTML (Parsoid)". Jul 21 2021, 5:47 PM