Create a figshare entry with metadata and basic documentation on article quality score data (T135684) and announce it.
Description
| Status | Assigned | Task |
|---|---|---|
| Resolved | Halfak | T145332 Formal publication of article quality score dataset |
| Resolved | Ladsgroup | T135684 Generate recent article quality scores for English Wikipedia |
| Resolved | DarTar | T146708 Ask Figshare to remove file upload limit for Article Quality Score dataset |
| Resolved | Halfak | T146709 Mini blogpost for Article Quality Score dataset |
Event Timeline
I was thinking the AQ score dataset would be good material for a short blog post, to drive more attention to it (roping in @Nettrom with a couple of quotes maybe?). @Halfak, @Ladsgroup: what do you guys think? I don't want to add a lot more work (and I can help with this task), but I feel the data release deserves a more visible announcement than just lists+wikiresearch.
Forgive me if my question is a little bit stupid, but it's already on datasets.wikimedia.org (see T135684#2622793). So what exactly should we do to consider it published? I was thinking that having a table in labs (T106278) would be really nice. Do you mean this?
For all our major data releases we create a registry entry in figshare, which adds a few benefits over just a static dump on datasets.wikimedia.org:
- it assigns the dataset a DOI (making it citable)
- it stores metadata (which gets propagated), making it more easily discoverable
- it includes additional mirroring of the dataset for long-term preservation
See for example:
@Ladsgroup I realize I should document the process (and benefits) somewhere on wikitech.
I think there are two AQ datasets going around. One is the one @Ladsgroup pointed to, which I believe @Halfak gathered and which is used for ORES training and evaluation. The second is the one I used to do some additional training to improve the wikiclass library, and that's already on figshare (https://figshare.com/articles/English_Wikipedia_Quality_Asssessment_Dataset/1375406). This second dataset was gathered by following the process described in our 2015 CSCW paper, which is referenced in the figshare description.
@Nettrom agreed, we should definitely reference the 2015 one (maybe cross-link the two entries).