Would be cool! Quarry can use it! People can build tools on top of it!
|Resolved||Ladsgroup||T106278 Setup a db on labsdb for article quality that is publicly accessible|
|Resolved||Ladsgroup||T135684 Generate recent article quality scores for English Wikipedia|
Something along the lines of:
- Once a week / Month
- Look at all the pages in mainspace
- Compute the quality for them and update them in db
So the table would be structured with the following columns:
And possibly with different tables per project.
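The column list didn't make it into this comment, but a minimal sketch of what such a per-project table might look like (all names here — enwiki_quality, page_id, rev_id, prediction, score_timestamp — are illustrative assumptions, not the actual schema, which lives with the population script):

```python
import sqlite3

# Hypothetical schema for a per-project article quality table.
# Column names are illustrative; the real schema is defined in the linked script.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enwiki_quality (
        page_id INTEGER NOT NULL,      -- article page id
        rev_id INTEGER NOT NULL,       -- revision that was scored
        prediction TEXT NOT NULL,      -- e.g. 'Stub', 'Start', 'C', 'B', 'GA', 'FA'
        score_timestamp TEXT NOT NULL, -- when the score was computed
        PRIMARY KEY (page_id)
    )
""")
conn.execute(
    "INSERT INTO enwiki_quality VALUES (?, ?, ?, ?)",
    (12, 764138197, "B", "2016-08-01T00:00:00Z"),
)
row = conn.execute(
    "SELECT prediction FROM enwiki_quality WHERE page_id = 12"
).fetchone()
print(row[0])  # B
```

One row per article (keyed on page_id) keeps the table at "number of mainspace articles" height rather than "number of revisions" height.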
For article quality, we can work from the dumps directly since the feature set is 100% text-based. The scores won't make much sense for revisions of pages that are not intended to be articles.
We only have a model for enwiki, so we'll be looking at making a table that is roughly the same height as the 'revision' table. If we generate it based on the dumps, it will get out of date and have weirdness for pages that are restored (undeleted) after the dump was cut, but I expect it will be mostly complete for most purposes.
That makes sense for a "what quality is the wiki?" use-case, but not the "how has the quality been changing?" use-case. It turns out that a lot of the research on quality has looked at quality changes, so I imagine that use-case is big enough that we should consider it right away.
- Arazy, O., Nov, O., Patterson, R., & Yeo, L. (2011). Information quality in Wikipedia: The effects of group composition and task conflict. Journal of Management Information Systems, 27(4), 71-98. (Couldn't find a PDF)
We talked a little about this in our meetings. Here's the result (and some of them are my opinions):
- We only get results for the top revision of each article in the main namespace (around 5 million rows)
- We do it once a month, but each month gets a new table, so people can track article quality changes by joining the tables.
- I'll start by adding a database in tools-dexbot. We can move it to another place later on.
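With one table per monthly snapshot, tracking quality changes becomes a join across the snapshot tables. A sketch in SQLite (table and column names assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical monthly snapshot tables, one row per article.
for table in ("enwiki_201607", "enwiki_201608"):
    conn.execute(
        f"CREATE TABLE {table} (page_id INTEGER PRIMARY KEY, prediction TEXT)"
    )
conn.executemany("INSERT INTO enwiki_201607 VALUES (?, ?)", [(1, "Start"), (2, "C")])
conn.executemany("INSERT INTO enwiki_201608 VALUES (?, ?)", [(1, "C"), (2, "C")])

# Articles whose predicted class changed between the two snapshots.
changed = conn.execute("""
    SELECT a.page_id, a.prediction AS before, b.prediction AS after
    FROM enwiki_201607 a
    JOIN enwiki_201608 b USING (page_id)
    WHERE a.prediction != b.prediction
""").fetchall()
print(changed)  # [(1, 'Start', 'C')]
```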
I wrote a script that populates the database tables based on the generated data. It's here (you can find the table schema, indexes, and constraints there).
I already populated the enwiki_201608 table in the s51100__wp10_p db in Tools. Unfortunately, we can't use it in Quarry. It would be great to let Quarry read this db, or to move this db to a place that Quarry can read.
Otherwise, if you have a Tools account, you can log in, do mysql -h tools.labsdb s51100__wp10_p, and run queries.
I think it's a bad idea to have each month get its own table. I'd like to load a historic monthly dataset, and that would result in about 180 tables (12 months * 15 years that English Wikipedia has been around).
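An alternative to ~180 separate tables (sketched here as an assumption, not what was actually deployed) is a single table with a snapshot column, so history queries become a WHERE clause instead of joins across many table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-table design: (snapshot, page_id) identifies a score.
conn.execute("""
    CREATE TABLE enwiki_quality (
        snapshot TEXT NOT NULL,  -- e.g. '2016-08'
        page_id INTEGER NOT NULL,
        prediction TEXT NOT NULL,
        PRIMARY KEY (snapshot, page_id)
    )
""")
conn.executemany("INSERT INTO enwiki_quality VALUES (?, ?, ?)", [
    ("2016-07", 1, "Start"),
    ("2016-08", 1, "C"),
])
# Full quality history of one article, across all snapshots.
history = conn.execute(
    "SELECT snapshot, prediction FROM enwiki_quality "
    "WHERE page_id = ? ORDER BY snapshot", (1,)
).fetchall()
print(history)  # [('2016-07', 'Start'), ('2016-08', 'C')]
```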
As discussed in the meeting: review https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database and make a user-db on the main labsDB instance (the one that Quarry uses). You can connect from labs with sql enwiki.