
Create a dataset for training/evaluating models for summarizing (long) discussions
Open, Needs Triage, Public


Put together a cleaned dataset of discussions as well as their summaries from Wikipedia.

Context: One of the ML/AI/LLM use cases that came up at last year's hackathon was automatically generating summaries of very long discussions. One of the blockers, IMO, is the lack of a high-quality, readily available ground-truth dataset. Such a dataset is crucial not only for training a custom model, but, more importantly, for systematically investigating how well such a model performs in summarizing nuanced discussions.

Some ideas for how to get started:

Tagging @Htriedman as he indicated interest in also working on this.

Event Timeline

I didn't get to work much on this during the hackathon, as I was mostly focusing on T362805.

I did some exploration of the dataset in

  • the dataset contains a set of RfCs from English Wikipedia
  • relevant fields include:
    • URL: allows easy retrieval of the content for further parsing
    • first comment: the box at the top of the RfC, often containing the summary of the discussion
    • closed: marks whether the discussion was formally closed
  • one of the main limitations is that the dataset does not contain the text of the actual discussion, only aggregated statistics such as the number of comments
  • therefore, the dataset cannot readily be used to build a corpus of discussions AND summaries; that would require substantial additional parsing to extract the discussions
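Even with that limitation, the fields above are enough to pre-filter candidate (discussion, summary) pairs. A minimal sketch, assuming each dataset row is a dict keyed by the field names above, with `closed` exported as a string flag (both are assumptions about the export format):

```python
def candidate_pairs(rows):
    """Yield (url, summary) candidates from the RfC dataset.

    Keeps only formally closed RfCs whose first comment (the closing
    box) is non-empty; that first comment serves as the candidate
    summary. The discussion text itself still has to be fetched and
    parsed separately from each URL.
    """
    for row in rows:
        summary = (row.get("first comment") or "").strip()
        if row.get("closed") in ("True", "true", "1") and summary:
            yield row["url"], summary


rows = [
    {"url": "https://en.wikipedia.org/wiki/Talk:A",
     "first comment": "Consensus to merge.", "closed": "True"},
    {"url": "https://en.wikipedia.org/wiki/Talk:B",
     "first comment": "", "closed": "True"},
    {"url": "https://en.wikipedia.org/wiki/Talk:C",
     "first comment": "No consensus.", "closed": "False"},
]
pairs = list(candidate_pairs(rows))
```

Only the first row survives the filter here: the second has no closing summary, and the third was never formally closed.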

Unassigning myself, as I am not planning to work on this in the near future (next three months).

I've been doing a lot of wikitext parsing work for the SPARQL dataset, including parsing on-wiki conversations. If I can figure that out for (say) the request a query archive, I may take a crack at adapting the same script to parse RfCs for this task. Will keep everyone updated on this phab task!

I've been actively working on parsing on-wiki discussions in the context of the request a query archive, and I took a few hours to adapt that (hacky but mostly working) code to this task!

@Htriedman this is super useful, thanks for looking into this and sharing your results.
For future reference: the above code looks for all pages that use the template {{Closed_rfc_top}}, which is used to close a Request for comment (RfC) on a talk page or a noticeboard. This makes it easy to identify pages that contain relevant (i.e. closed) discussions. The code then goes a lot further, parsing the content of the page to extract the discussion in a tree-like structure (capturing individual comments, replies, etc.).
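For readers without the script at hand, here is a rough sketch of the two steps described above (my reconstruction, not the actual code): pages transcluding the template can be enumerated via the MediaWiki API's `list=embeddedin`, and each page's discussion can then be parsed into a reply tree from the leading `:`/`*`/`#` indentation markers. The namespace filter is an assumption.

```python
import re
from dataclasses import dataclass, field

# Query parameters for the MediaWiki Action API (list=embeddedin) to
# enumerate pages transcluding {{Closed_rfc_top}}, i.e. closed RfCs.
# einamespace is an assumption (talk + project); pagination via
# eicontinue is omitted for brevity.
EMBEDDEDIN_PARAMS = {
    "action": "query",
    "list": "embeddedin",
    "eititle": "Template:Closed rfc top",
    "einamespace": "1|4",
    "eilimit": "max",
    "format": "json",
}


@dataclass
class Comment:
    text: str
    replies: list["Comment"] = field(default_factory=list)


def parse_discussion(wikitext: str) -> Comment:
    """Build a reply tree from indentation-threaded wikitext.

    A comment's nesting depth is the run of leading ':', '*', '#'
    characters; a deeper line is treated as a reply to the nearest
    preceding shallower line. Returns a synthetic root node whose
    replies are the top-level comments.
    """
    root = Comment(text="")
    stack = [(-1, root)]  # (depth, node), deepest last
    for line in wikitext.splitlines():
        if not line.strip():
            continue
        m = re.match(r"([:*#]*)\s*(.*)", line)
        depth, text = len(m.group(1)), m.group(2)
        node = Comment(text=text)
        # pop back to the closest ancestor shallower than this line
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1].replies.append(node)
        stack.append((depth, node))
    return root


wt = "Support per nom.\n:Disagree strongly.\n::Why?\n:Another reply."
tree = parse_discussion(wt)
```

This deliberately ignores real-world complications (multi-line comments, `{{od}}` outdents, signatures), which the actual script presumably handles; it is only meant to show the tree-building idea.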