Use HTML instead of wikitext for Revise Tone Task Generator in LiftWing
Closed, Resolved, Public

Description

We would like to shift to HTML instead of wikitext for Revise Tone Task Generator to better address the following issues with parsing wikitext content:

  • Identify paragraphs in a VE-compatible way (slack thread)
  • Avoid suggestions that appear within common quote structures on Wikipedia (T411892)
  • Exclude content from non-prose sections and suggestions within tables, infoboxes, data tables, image captions (T411897)

Event Timeline

Here's a starting notebook for understanding the plaintext functionality and some thoughts related to quotations: https://public-paws.wmcloud.org/User:Isaac_(WMF)/tone-check/tone-check-html.ipynb. Hopefully enough to get you started but let me know if you have questions!

KStoller-WMF subscribed.

Thanks for creating this task, @achou!

My understanding is that this task blocks T411892: Revise Tone: Exclude direct quotes from Tone Recommendations, correct?

@Sucheta-Salgaonkar-WMF Do we have a timeline for this task? Does it seem possible to complete this task and T411892 before January 15, 2026?
I'm asking because I think it would be ideal if we could filter out more direct quotes from suggestions before the Growth team starts our A/B test, which we have tentatively set for January 15th.

For the issues we want to address with the HTML parser:

Exclude direct quotes

I've added an update in T411892#11481085

Exclude certain sections

I've added an update in T411897#11481148

Identify paragraphs in a VE-compatible way

We loop through the HTML section by section and generate plaintext for each paragraph, so we still need to decide whether a single linebreak (\n) or two linebreaks (\n\n) should trigger the start of a new paragraph.

Using the same example Michael used, Grigoris_Rallatos:

  • Single linebreak (\n) starts a new paragraph

In the section “Club career”:

In 2009 he joined Polis Kallitheas. In the first season with his new team, managed to achieve a high finish, but failed to promote. In his second season (2010–11) with the Kallithea side, he managed to achieve a promotion once more, this time for the 4th tier, G' Ethniki. In 2011, he once more signed for a 5th tier club, Ermis Peraus in which he stayed for 1 year.

With a single linebreak as the separator, this text would be treated as three paragraphs:

  1. In 2009 he joined Polis Kallitheas. …
  2. In his second season (2010–11) with the Kallithea side, …
  3. In 2011, he once more signed …

In 2017, he made the breakthrough by signing Panionios B.C. in the Greek Basket League for the first time in his career. In 2018 he signed for Holargos B.C., in the club's 1st season in the highest national league. He was a crucial member of the outstanding season Holargos had, reaching the play-offs and defeating AEK in the second game after an away loss, before eventually losing and the third game and being eliminated from the play-offs' quarter finals. He was an important member due to the motivation his teammates was getting from him, being a passionate person on and off the court.

With a single linebreak as the separator, this text would be treated as two paragraphs:

  1. In 2017, he made the breakthrough …
  2. He was an important member due to …

  • Two linebreaks (\n\n) start a new paragraph

If we use two linebreaks (\n\n) to start a new paragraph, then both examples above remain single paragraphs, which matches how VE represents them.
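The difference between the two separators can be shown with a small sketch. The sample text and the helper name `split_paragraphs` are made up for illustration; this is not the task generator's actual code, only a picture of how the choice of separator changes the paragraph count.

```python
def split_paragraphs(text: str, separator: str) -> list[str]:
    """Split plaintext into paragraphs on the given separator, dropping empty pieces."""
    return [part.strip() for part in text.split(separator) if part.strip()]


# Hypothetical section text: single newlines inside the first paragraph
# (as an editor might have typed them in wikitext) and a blank line
# between the two real paragraphs.
section_text = (
    "In 2009 he joined a new club.\n"
    "In his second season he won promotion.\n"
    "In 2011 he signed for another club.\n"
    "\n"
    "In 2017 he moved to the top league."
)

single = split_paragraphs(section_text, "\n")    # every newline starts a paragraph
double = split_paragraphs(section_text, "\n\n")  # only blank lines start a paragraph

print(len(single))  # 4
print(len(double))  # 2
```

With the single-newline separator the first paragraph is broken into three pieces, which is the mismatch with VE described above.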

  • Known edge case

In the section “Pietermaritzburg Treason Trial” of the article "John_Milne_(judge)", the content inside <blockquote> is excluded. This causes the text above and below the blockquote to be merged into one paragraph:

Also in 1985–1986, Milne presided in the high-profile Pietermaritzburg Treason Trial, which ended in the acquittal of 16 prominent United Democratic Front activists. The state dropped the charges against the final defendants after Milne's ruling in S v Ramgobin and Others (1986), which held that videotape recordings (in this case, recordings of political speeches) were admissible evidence only if it was proven that they were original recordings and that there existed no reasonable possibility of interference with them; the judgement remains authoritative in South African law of evidence. Dhaya Pillay said of the judgement:A month after the judgement, in July 1986, Dugard told the New York Times that Milne's conduct in the Pietermaritzburg trial and other such cases was among several factors that was inspiring increasing judicial confidence and a willingness to challenge executive actions.

This merging happens because the blockquote content is removed and there is no linebreak between the surrounding text. Based on the examples I have seen so far, this kind of merge appears rare. In most cases, editors add a linebreak after a blockquote, which prevents this issue.
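A toy reproduction of the merge, using only Python's standard-library `html.parser`. This is not how mwparserfromhtml works internally; the class `TextWithoutBlockquotes`, the `break_after_quote` flag, and the sample HTML are all hypothetical. It only illustrates why dropping a blockquote without emitting a separator glues the surrounding text together.

```python
from html.parser import HTMLParser


class TextWithoutBlockquotes(HTMLParser):
    """Collect text outside <blockquote>, optionally adding a break where one was removed."""

    def __init__(self, break_after_quote: bool):
        super().__init__()
        self.break_after_quote = break_after_quote
        self.depth = 0  # current nesting depth inside <blockquote>
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1
            # Leaving the outermost blockquote: optionally mark a paragraph break.
            if self.depth == 0 and self.break_after_quote:
                self.parts.append("\n\n")

    def handle_data(self, data):
        if self.depth == 0:  # skip everything inside a blockquote
            self.parts.append(data)


def extract(html: str, break_after_quote: bool) -> str:
    parser = TextWithoutBlockquotes(break_after_quote)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical HTML with no whitespace around the blockquote.
html = (
    "<p>Text before the quote:</p>"
    "<blockquote><p>Quoted text.</p></blockquote>"
    "<p>Text after the quote.</p>"
)

merged = extract(html, break_after_quote=False)
separated = extract(html, break_after_quote=True)

print(repr(merged))     # 'Text before the quote:Text after the quote.'
print(repr(separated))  # 'Text before the quote:\n\nText after the quote.'
```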


@KStoller-WMF, regarding the timeline, I'd like to confirm whether these results meet the Growth team's requirements. If so, I'll proceed with the necessary changes to the Revise Tone Task Generator in LiftWing, which I expect to complete in the first week of January.

@Michael, I'd like to discuss how to release this new change to LW production/Data Gateway/Search weighted tags so that, on the Growth side, you can distinguish between data from the old and new versions. I'm thinking we may need to clean up the current weighted tags and/or redo the initial ingestion, and update the "model_version" to v1.1 or something else. We can have a chat in January! :)

In the section “Pietermaritzburg Treason Trial” of the article "John_Milne_(judge)", the content inside <blockquote> is excluded. This causes the text above and below the blockquote to be merged into one paragraph:

Thanks for catching this; it's definitely a bug. I looked into it quickly and, thankfully, I think it's an easy fix. Could you try re-running with https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/39 (pip install --upgrade git+https://gitlab.wikimedia.org/repos/research/html-dumps.git@144-plaintext-merged-paragraphs) as your mwparserfromhtml source instead and let me know if that fixes things? I checked one of your examples locally and that seemed to do the trick, but it would be good to have a second pair of eyes and more examples.

Also, my vote is for the double newline, though in practice I think you should never encounter one in the chunks of plaintext at this stage in the pipeline: they should already have been converted into new paragraph nodes, which the library yields each time it moves to a new direct child of a section node in the HTML. The single newlines are still in there because Parsoid preserves newlines that editors added in the wikitext even when they don't trigger a new paragraph node. I assume that's what VE expects too, but you could always do something like " ".join(chunk.split()) to convert everything to standard whitespace if desired. I realize the library's default behavior is to further split on single newlines, but I might take that out in light of these examples.

@KStoller-WMF, regarding the timeline, I'd like to confirm whether these results meet the Growth team's requirements. If so, I'll proceed with the necessary changes to the Revise Tone Task Generator in LiftWing, which I expect to complete in the first week of January.

@achou you're absolutely incredible, thank you so much!! Thank you for being on top of this, for working really quickly and diligently, and for communicating so proactively and clearly <3

Thanks, @achou! Ditto to Sucheta's comment, I appreciate all of your effort and communication around all of this!

@KStoller-WMF, regarding the timeline, I'd like to confirm whether these results meet the Growth team's requirements. If so, I'll proceed with the necessary changes to the Revise Tone Task Generator in LiftWing, which I expect to complete in the first week of January.

As far as I understand it, yes, I think these changes will meet our requirements. I understand that there will always be formatting that creates edge cases, but as long as we are able to exclude most quotes and most problematic sections, we should be OK. (Previously, Growth has always aimed to ensure edit suggestions are correct at least 70% of the time.)

Change #1223235 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revise-tone-task-generator: Switch from wikitext to HTML for content processing

https://gerrit.wikimedia.org/r/1223235

let me know if that fixes things? I checked one of your examples locally and that seemed to do the trick, but it would be good to have a second pair of eyes and more examples.

I tested it and it fixes the issue. Thanks for the quick fix :)

Also, my vote is for the double newline, though in practice I think you should never encounter one in the chunks of plaintext at this stage in the pipeline: they should already have been converted into new paragraph nodes, which the library yields each time it moves to a new direct child of a section node in the HTML. The single newlines are still in there because Parsoid preserves newlines that editors added in the wikitext even when they don't trigger a new paragraph node. I assume that's what VE expects too, but you could always do something like " ".join(chunk.split()) to convert everything to standard whitespace if desired. I realize the library's default behavior is to further split on single newlines, but I might take that out in light of these examples.

Thanks also for the explanation; after checking the source, it makes sense. We're using your suggestion, " ".join(chunk.split()), to turn those non-paragraph newlines into single spaces, which matches what VE shows.
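For reference, a minimal sketch of what that expression does to a single paragraph chunk. The wrapper name `normalize_whitespace` and the sample chunk are assumptions for illustration; only the " ".join(chunk.split()) expression itself comes from this thread.

```python
def normalize_whitespace(chunk: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace, so no empty strings are produced.
    return " ".join(chunk.split())


# Hypothetical paragraph chunk containing newlines that did not start a new paragraph.
chunk = "First sentence.\nSecond sentence.\n  Third sentence.  "

normalized = normalize_whitespace(chunk)
print(normalized)  # First sentence. Second sentence. Third sentence.
```

Because this also collapses double newlines, it is only safe to apply per chunk, after the text has already been divided into paragraphs.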

Change #1223235 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revise-tone-task-generator: Switch from wikitext to HTML for content processing

https://gerrit.wikimedia.org/r/1223235

I tested it and it fixes the issue. Thanks for the quick fix :)

@achou Great! Would it be helpful for me to merge that change then and release a new minor version of the library?

Change #1224117 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: Update image and config for revise-tone-task-generator on staging

https://gerrit.wikimedia.org/r/1224117

@Isaac Yes, that would be very helpful! I've +1'd the MR. :)

Yes, that would be very helpful! I've +1'd the MR. :)

Great, 2.0.3 is now available with the fix: https://pypi.org/project/mwparserfromhtml/

Change #1224117 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Update image and config for revise-tone-task-generator on staging

https://gerrit.wikimedia.org/r/1224117

Change #1224721 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: Update image for revise-tone-task-generator on prod

https://gerrit.wikimedia.org/r/1224721

Change #1224721 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Update image for revise-tone-task-generator on prod

https://gerrit.wikimedia.org/r/1224721

Change #1225510 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revise-tone-task-generator: Guard against empty html to mwparserfromhtml.Article

https://gerrit.wikimedia.org/r/1225510

Change #1225510 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revise-tone-task-generator: Guard against empty html to mwparserfromhtml.Article

https://gerrit.wikimedia.org/r/1225510

Change #1225529 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: Update image and add client timeout for revise-tone-task-generator

https://gerrit.wikimedia.org/r/1225529

Change #1225529 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Update image and add client timeout for revise-tone-task-generator

https://gerrit.wikimedia.org/r/1225529