Put together a cleaned dataset of Wikipedia discussions and their summaries.
Context: One of the use-cases of ML/AI/LLMs that came up in last year's hackathon was to automatically generate summaries of very long discussions. One of the blockers, IMO, is the lack of a high-quality and readily available ground-truth dataset. Such a dataset is crucial not only for training a custom model but, most importantly, for systematically investigating how well such a model performs in summarizing nuanced discussions.
Some ideas for how to get started:
- Requests for comment (RfCs) contain discussions and, in some (possibly many) cases, a summary written manually by an experienced editor (example); see the rough parsing sketch after this list. I came across some previous work from a few years back in the paper Deliberation and Resolution on Wikipedia: A Case Study of Requests for Comments, which already provides some resources.
- A published dataset of 7,316 RfCs: https://figshare.com/articles/dataset/rfc_sql/7038575 (though I'm not sure what exactly is contained in there; it probably needs filtering, etc.)
- Scripts for parsing RfCs: https://github.com/trusttri/rfc-analysis/tree/master/create_dataset
- Alternatively, one could look at discussions around the deletion of articles. One starting point could be the recent paper Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions, which also published code and data that could be reused for this case.
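To make the RfC idea above a bit more concrete, here is a minimal sketch of what collecting (discussion, summary) pairs could look like. It assumes the closed discussion is wrapped in {{archive top|<closing summary>}} … {{archive bottom}} (or {{atop}}/{{abot}}); closing conventions vary across pages, and the page title below is only a placeholder, so treat this as a starting heuristic rather than a working parser (a real template parser such as mwparserfromhell would be more robust than the regex used here).

```python
import re
import requests

API = "https://en.wikipedia.org/w/api.php"


def fetch_wikitext(title: str) -> str:
    """Fetch the raw wikitext of a page via the MediaWiki API."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": 2,
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["parse"]["wikitext"]


def split_closed_discussion(wikitext: str):
    """Very rough split of a closed discussion into (summary, body).

    Assumes the close is wrapped in {{archive top|...}} / {{archive bottom}}
    (or {{atop}} / {{abot}}). Real pages use several conventions and nested
    templates, so this is only a first heuristic, not a robust parser.
    """
    m = re.search(
        r"\{\{\s*(?:archive top|atop)\s*\|?(?P<summary>.*?)\}\}"
        r"(?P<body>.*?)"
        r"\{\{\s*(?:archive bottom|abot)\s*\}\}",
        wikitext,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if not m:
        return None
    return m.group("summary").strip(), m.group("body").strip()


if __name__ == "__main__":
    # Placeholder title; substitute a real closed RfC page or section.
    text = fetch_wikitext("Wikipedia:Requests for comment/Example")
    pair = split_closed_discussion(text)
    if pair:
        summary, body = pair
        print("SUMMARY:", summary[:200])
        print("DISCUSSION LENGTH (chars):", len(body))
```

From there one would still need to strip signatures and timestamps, handle nested templates, and decide which closes actually count as summaries, which is exactly the cleaning work this project is about.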
Tagging @Htriedman as he indicated interest in working on this as well.