Create a dataset for training/evaluating models for summarizing (long) discussions
Open, Needs Triage, Public

Assigned To

None

Authored By

	MGerlach
	Apr 19 2024, 8:44 AM

Description

Put together a cleaned dataset of discussions as well as their summaries from Wikipedia.

Context: One of the use cases for ML/AI/LLMs that came up in last year's hackathon was automatically generating summaries of very long discussions. One of the blockers, IMO, is the lack of a high-quality and readily available ground-truth dataset. Such a dataset is crucial not only for training a custom model but, most importantly, for systematically investigating how well such a model performs in summarizing nuanced discussions.

Some ideas on how to get started:

Requests for comments (RfCs) contain discussions and, in some cases, a manually created summary by an experienced editor (example). I came across some previous work in a paper from a few years back, Deliberation and Resolution on Wikipedia: A Case Study of Requests for Comments, which already provides some resources:
- A published dataset of 7,316 RfCs: https://figshare.com/articles/dataset/rfc_sql/7038575 (though I am not sure what is contained in there; it probably needs filtering, etc.)
- Scripts for parsing RfCs: https://github.com/trusttri/rfc-analysis/tree/master/create_dataset
Alternatively, one could look at discussions around deletion of articles. One starting point could be the recent paper Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions which also published code/data that could be re-used for this case.

Tagging @Htriedman as he indicated interest in also working on this.

Related Objects

Mentioned In: T361778: [Session] 👋 Wikimedia Hackathon 2024 Opening Ceremony
T362419: Attend Wikimedia Hackathon 2024
Mentioned Here: T362805: Build a tool (or tools) to easily visualize differentially-private datasets

Event Timeline

MGerlach created this task. Apr 19 2024, 8:44 AM

Restricted Application added a subscriber: Aklapper. Apr 19 2024, 8:44 AM

MGerlach mentioned this in T362419: Attend Wikimedia Hackathon 2024. Apr 19 2024, 8:48 AM

Quiddity mentioned this in T361778: [Session] 👋 Wikimedia Hackathon 2024 Opening Ceremony. May 4 2024, 8:42 PM

I didn't get to work much on this during the hackathon, as I was mostly focusing on T362805.

I did some exploration of the dataset in https://figshare.com/articles/dataset/rfc_sql/7038575

The dataset contains a set of RfCs from English Wikipedia.
Relevant fields include:
- URL: allows for easy retrieval of the content and further parsing
- first comment: the box at the top of the RfC, which often contains the summary of the discussion
- closed: marks whether the discussion was formally closed
One of the main limitations is that the dataset does not contain the text of the actual discussion, only some aggregated statistics such as the number of comments.
Therefore, it cannot be readily used to build a dataset of discussions AND summaries. That would require substantial additional parsing to extract the discussions.
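
As a rough illustration of how the fields above could still be used as a starting point, here is a minimal sketch that keeps only closed RfCs with a non-empty first comment. The key names (`url`, `first_comment`, `closed`) and the sample rows are assumptions made for illustration; the actual column names in the figshare dump may differ, and the discussion text would still have to be fetched and parsed separately from each URL.

```python
from typing import Iterable


def candidate_summaries(rows: Iterable[dict]) -> list[dict]:
    """Keep closed RfCs that have a non-empty first comment.

    The keys 'url', 'first_comment' and 'closed' are hypothetical names,
    not confirmed against the actual dataset schema.
    """
    selected = []
    for row in rows:
        first_comment = (row.get("first_comment") or "").strip()
        if row.get("closed") and first_comment:
            selected.append({"url": row["url"], "summary": first_comment})
    return selected


# Made-up rows, for illustration only.
rows = [
    {"url": "https://example.org/rfc/1", "first_comment": "Consensus to keep.", "closed": True},
    {"url": "https://example.org/rfc/2", "first_comment": "", "closed": True},
    {"url": "https://example.org/rfc/3", "first_comment": "Still open.", "closed": False},
]

candidates = candidate_summaries(rows)
```

Each remaining entry pairs a URL with a possible summary; the matching discussion is the part that is still missing.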

Aklapper assigned this task to MGerlach. May 7 2024, 10:59 AM

leila added a project: Research-Freezer. May 7 2024, 1:42 PM

Unassigning myself as I am not planning to work on this in the near future (next 3 months)

I've been doing a lot of wikitext parsing work for the SPARQL dataset, including parsing on-wiki conversations. If I can figure that out for (say) the request a query archive, I may take a crack at this by adapting the same script to parse RfCs. Will keep everyone updated on this phab task!

I've been actively working on parsing on-wiki discussions in the context of the request a query archive, and I took a few hours to adapt that (hacky but mostly working) code to this task!

- Jupyter notebook exploring the extraction of RfC conversations + results + statuses (using the RfC closed top, Rfctop, and RfC top templates): https://gitlab.wikimedia.org/htriedman/rfc-parsing/-/blob/main/01_rfc_parsing.ipynb
- First-pass JSON-formatted dataset of extracted conversations (~2700 closed RfCs, 241 MB): https://drive.google.com/file/d/1hNbrYPoPqxMbX-y60KKMWVETFTK_uDBg/view?usp=sharing

@Htriedman this is super useful. Thanks for looking into this and sharing your results.
For future reference: the above code looks for all pages that use the template {{Closed_rfc_top}}. This template is used to close a Request for comment (RfC) on a talk page or a noticeboard, so we can easily identify pages that contain relevant (i.e. closed) discussions. The code then goes a lot further, parsing the content of the page to extract the discussion in a tree-like structure (capturing comments, etc.).
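
To make the two steps described above a bit more concrete, here is a minimal sketch. It is not the code from the linked notebook; it only assumes that a closed RfC can be detected by the presence of the {{Closed rfc top}} template and that reply depth can be read off the leading `:`, `*` or `#` characters of each line, which is a simplification of real talk-page wikitext. The names `has_closed_rfc`, `Comment` and `parse_thread`, as well as the sample text, are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Matches "{{Closed rfc top" or "{{Closed_rfc_top", case-insensitively.
CLOSED_TEMPLATE = re.compile(r"\{\{\s*closed[ _]rfc[ _]top\b", re.IGNORECASE)


def has_closed_rfc(wikitext: str) -> bool:
    """Return True if the page text uses the closing template."""
    return CLOSED_TEMPLATE.search(wikitext) is not None


@dataclass
class Comment:
    text: str
    depth: int
    replies: list["Comment"] = field(default_factory=list)


def parse_thread(wikitext: str) -> list[Comment]:
    """Build a reply tree from indentation markers (assumed convention)."""
    roots: list[Comment] = []
    stack: list[Comment] = []  # path from a root to the latest comment
    for line in wikitext.splitlines():
        # Skip blank lines and, for simplicity, lines that start a template.
        if not line.strip() or line.lstrip().startswith("{{"):
            continue
        match = re.match(r"^([:*#]*)\s*(.*)$", line)
        node = Comment(text=match.group(2), depth=len(match.group(1)))
        # Pop back to the nearest comment that is less indented.
        while stack and stack[-1].depth >= node.depth:
            stack.pop()
        (stack[-1].replies if stack else roots).append(node)
        stack.append(node)
    return roots


# Made-up sample, for illustration only.
sample = "\n".join([
    "{{Closed rfc top|result=Consensus to keep.}}",
    "Should the article mention X? -- UserA",
    ":Support, per sources. -- UserB",
    "::Which sources? -- UserC",
    ":Oppose, undue weight. -- UserD",
    "{{Closed rfc bottom}}",
])

is_closed = has_closed_rfc(sample)
thread = parse_thread(sample)
```

A real parser would also need to handle multi-line comments, signatures, outdents and the closing statement inside the template, all of which this sketch ignores.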
