
Support the Parsing team with the "Remove Tidy dependency from MediaWiki output" goal
Closed, Resolved · Public

Description

https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q2_Goals#Editing

I think the main need here will be communicating to the community what this change is and how it will affect them.

There is information here: https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy

Event Timeline

Elitre added a subscriber: ssastry.

@ssastry, am I correct? Should I sync up with you to discuss the kind of help you'll need from us? Thanks :)

@Elitre: Thanks for the proactive work here. :) Let us talk more next week?

Sure thing. Grab me when you want.

Here's the basic approach that we're expecting:

  • Tech-savvy users can look at https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy and start correcting pages now. The "new" syntax has always been correct.
    • The English Wikipedia has the best infrastructure for handling this. Other projects may have a smaller number of pages that will be affected, but they may have a much harder time with the transition due to a lack of tools and editors dedicated to gnoming tasks.
    • Help pages need to be reviewed, e.g., to make sure that https://meta.wikimedia.org/wiki/Help:List#Specifying_a_starting_value is still correct/possible (that might be used a lot at Wikisource).
  • Things to consider:
    • Gnomes usually focus on the mainspace, but this affects all pages (e.g., archived talk pages).
    • The team is particularly worried about templates: they'll break the "article", but the person editing the article won't be able to find the broken thing in the article's wikitext.
    • Parsoid can automatically fix some of the issues that Tidy cleans up, so if anyone edits the page in the visual editor, then the page (or parts of it) could be cleaned up. (Parsoid is already fully compliant, so new content created in the visual editor will be correct.) However, problems such as unclosed tables can't be fixed automatically, because deciding where the table ought to close requires human judgment.
  • In late 2016, the team hopes to release a tool that will identify problems. This will just be a "compare the preview" tool (it won't edit the page or try to fix anything). This will be available as a regular preference setting for all users (current plan: default off).
  • The removal of Tidy can be phased per wiki/language/project. However, we can't phase it by specific change (unclosed tables break this week, self-closing tags break next week, etc.)
    • Phased removals of Tidy will probably happen throughout 2017.
    • Something to consider: English Wikipedia first? (Heresy, I know, but they'll likely have done more clean-up work than any other project, they are the main community source of tech information, and they're the source of most of the templates. So if they convert first, then communities that copy them will find it easier to update.)
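For gnomes who want to start looking before the comparison tool exists, a rough pre-screen of raw wikitext might look like the sketch below. It is only a heuristic under stated assumptions: the function name `find_suspect_markup` and the tag list are hypothetical, it does not reproduce what Tidy or its replacement actually does, and it cannot see problems that come from transcluded templates. It flags two of the patterns mentioned above (tables that are opened but never closed, and self-closed tags that are not void elements) and leaves the fix to a human.

```python
import re

# Illustrative subset of non-void tags (an assumption, not an official list).
NON_VOID_TAGS = ("div", "span", "p", "b", "i", "small")

# Matches e.g. <span/> or <div class="x" />, but not <br /> (br is a void element).
SELF_CLOSED = re.compile(
    r"<(?:" + "|".join(NON_VOID_TAGS) + r")\b[^<>]*/>", re.IGNORECASE
)
# Wikitext table delimiters at the start of a line.
TABLE_OPEN = re.compile(r"^[ \t]*\{\|", re.MULTILINE)
# "|}" closes a table; "|}}" is more likely the end of a template call, so skip it.
TABLE_CLOSE = re.compile(r"^[ \t]*\|\}(?!\})", re.MULTILINE)


def find_suspect_markup(wikitext: str) -> dict:
    """Return counts and matches for markup worth a manual look (heuristic only)."""
    opens = len(TABLE_OPEN.findall(wikitext))
    closes = len(TABLE_CLOSE.findall(wikitext))
    return {
        # More openers than closers suggests at least this many unclosed tables.
        "unclosed_tables": max(opens - closes, 0),
        "self_closed_tags": [m.group(0) for m in SELF_CLOSED.finditer(wikitext)],
    }


if __name__ == "__main__":
    sample = "{|\n| cell\n\nSome text <span/> and a line break<br />."
    print(find_suspect_markup(sample))
```

A page that comes back clean here can still render differently once Tidy is removed, so this would complement the "compare the preview" tool rather than replace it.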

In terms of communication, I'm contacting a couple of editors now, e.g., https://en.wikipedia.org/w/index.php?oldid=740818345#Tidy_is_going_away

We have some good information at https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy about what needs to be fixed. We need to add a clearer explanation of why this is happening (HTML5 compliance + making the two parsers provide identical output).

This will need a User-notice when the comparison tool becomes available (December 2016?), MassMessages to all the village pumps, and then individual messages to each community as Tidy is scheduled for removal.

Thanks Sherry for capturing the notes from our conversation. Some clarifications:

  1. Note that it is not "Tidy.js" .. it is Tidy.
  2. There are two aspects to phased removal: (a) you recommended that Tidy be turned off in phases across wikis; (b) I mentioned (or I think I did, but in case I did not, here it is now) that we'll remove the Tidy compatibility passes gradually in 2017. We added the Tidy compatibility passes to reduce the potential breakages and cleanup required (even if they could be done automatically) when Tidy is turned off. Once things settle down after step (a), we'll work through the phased removal of the compatibility passes.
  3. Yes, the plan is for us to enable the Tidy replacement service and the parser migration tool (which lets editors preview output with Tidy and with the replacement service) by the end of 2016.
  4. While Parsoid could do some cleanups when a page is edited in VE, that is not something we are considering on our own. I just mentioned it as a possibility. I don't want anyone getting alarmed about Parsoid suddenly dirtying edits.

Thanks, I've corrected my notes.

I'm assuming that Parsoid will correct anything if the editor is changing the exact bit of content that is 'wrong', but I think you're right about people not necessarily wanting the rest of the page to be changed, for fear of dirty diffs (especially on long pages).

I'm also thinking that if Parsoid could auto-correct something, then a bot or AWB script could also correct that content. There will always be the things that require human attention.

Actually, I am not proposing using Parsoid as a fixup tool, but if editors want it, we can work on that. I mentioned it as a possibility in case editors want to use that option.

Elitre triaged this task as Medium priority. Sep 29 2016, 3:37 PM

Well, here's an interesting problem: Wikignomes often work from categories, and we (i.e., they) can put pages with bad HTML or wikitext code into a cleanup category, but the cats get populated only when someone edits the page, which means that they can't find the pages that need to be fixed.

See https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Pages_not_being_added_to_maintenance_categories_in_a_timely_fashion and T132467: Change to template or module resulting in category change can take days to update the categories.

The English Wikisource is aware of this; I've left informal notes for the English Wikibooks, the English Wiktionary, and the English Wikivoyage. These are central contact points for these projects, and I've asked them to share the news with other projects.

We should plan a MassMessage to all wikis when the comparison tool is released.

Two things:

  1. There should probably be a maintenance category created at the MediaWiki level for these errors, as was done for invalid self-closed HTML tags in July 2016. See https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Check_Wikipedia#Line_break_tags for more discussion, including the tidbit that there are over 75,000 individual pages containing "</br>" in en.WP, and that's just one version of one malformed element.
  2. As noted and linked above, these MediaWiki-level error categories can take many months to populate. The en.WP self-closed tag category was created three months ago, and I was able to increase its membership from 1,000 to 4,000 pages today (three months after the software change) with a series of null edits to a few thousand articles. If you want us gnomes to fix these errors in a timely fashion so that the Tidy features can be removed, you'll need to find a way to get the error categories to fill more quickly. Null edits to every page in the WP would do the trick, but there may be an easier way.
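One possible shape for that "easier way", sketched under assumptions: the MediaWiki Action API has a `purge` action with a `forcelinkupdate` option, which refreshes a page's links tables (and so its category membership) without saving an edit. The helper below only builds the request parameters in batches; the name `purge_batches` and the default batch size of 50 titles are illustrative choices, and whether this is acceptable at wiki scale is a question for the operations side, not something this sketch settles.

```python
from typing import Dict, Iterable, Iterator, List


def _purge_params(batch: List[str]) -> Dict[str, str]:
    """Build the parameters for one action=purge request."""
    return {
        "action": "purge",
        "format": "json",
        # Ask MediaWiki to update the links tables, which includes categories.
        "forcelinkupdate": "1",
        # Multiple titles are separated by a pipe character.
        "titles": "|".join(batch),
    }


def purge_batches(titles: Iterable[str], batch_size: int = 50) -> Iterator[Dict[str, str]]:
    """Yield one parameter dict per batch of page titles."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for title in titles:
        batch.append(title)
        if len(batch) == batch_size:
            yield _purge_params(batch)
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield _purge_params(batch)


if __name__ == "__main__":
    for params in purge_batches(["Page A", "Page B", "Page C"], batch_size=2):
        print(params)
```

Each dict would be sent as a POST to the wiki's `api.php` by whatever client the bot operator already uses; rate limiting and authentication are deliberately left out here.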

Two things:

  1. There should probably be a maintenance category created at the MediaWiki level for these errors, as was done for invalid self-closed HTML tags in July 2016.

We cannot automatically detect the error types for all rendering changes, which is why we are involving editors and deploying the ParserMigration extension to help editors find things to fix. That means we cannot add maintenance categories for all errors at this time. Separately, we are trying to deploy the Linter extension, which uses Parsoid to find markup errors. If there is a way to flag some of these errors there, we'll explore that option. But there will always be some markup errors that require intervention by editors.

  2. As noted and linked above, these MediaWiki-level error categories can take many months to populate. The en.WP self-closed tag category was created three months ago, and I was able to increase its membership from 1,000 to 4,000 pages today (three months after the software change) with a series of null edits to a few thousand articles. If you want us gnomes to fix these errors in a timely fashion so that the Tidy features can be removed, you'll need to find a way to get the error categories to fill more quickly. Null edits to every page in the WP would do the trick, but there may be an easier way.

We are not sure yet. We'll discuss this.

I made null edits to all hewiki pages this month. That added a lot of entries to the tracking categories, some of them pages last edited in 2011.

Thanks, Subbu and team. I'm going to work on stuff related to this task for the rest of the week, and I know Quiddity will also help here.

I will update this task with "next steps" for my collaboration with Parsing before the end of the year.

Work for this quarter is done, as the FAQ page is marked for translation.
We'll work more on the socialization of changes next year. I'll file a different task for next quarter's work.