
Set up a db on labsdb for article quality that is publicly accessible
Closed, Resolved · Public

Description

Would be cool! Quarry can use it! People can build tools on top of it!

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda subscribed.

Something along the lines of:

  1. Once a week / month
  2. Look at all the pages in mainspace
  3. Compute the quality for each and update it in the db

So the table would be structured with the following columns:

  1. page_name
  2. scored_revision
  3. quality

And possibly with different tables per project.
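
Something like this, as a rough sketch (the table name and column types are just placeholders, not a final schema):

  CREATE TABLE enwiki_page_quality (
    page_name       VARBINARY(255) NOT NULL,  -- title of the page in the main namespace
    scored_revision INT UNSIGNED   NOT NULL,  -- rev_id the score was computed from
    quality         VARCHAR(8)     NOT NULL,  -- predicted class, e.g. Stub/Start/C/B/GA/FA
    PRIMARY KEY (page_name)
  );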

11845146 pages on enwiki in mainspace, whoops, this isn't going to run per week.

@Halfak do you think we have the resources to run this on 11,845,146 page revisions? Would we be fetching the text of each of those from the API? Should we just instead use the dumps to start with?

For article quality, we can work from the dumps directly since the feature set is 100% text-based. The scores won't make much sense for revisions of pages that are not intended to be articles.

We only have a model for enwiki, so we'll be looking at making a table that is roughly the same height as the 'revision' table. If we generate it based on the dumps, it will get out of date and have weirdness for pages that are restored (undeleted) after the dump was cut, but I expect it will be mostly complete for most purposes.

I was mostly thinking of doing it only for the 'current' set of revisions, and hence it need only be as tall as the page table (which is itself huge).

That makes sense for a "what quality is the wiki?" use-case, but not the "how has the quality been changing?" use-case. It turns out that a lot of the research on quality has looked at quality changes, so I imagine that use-case is big enough that we should consider it right away.

Examples:

  1. http://repository.cmu.edu/cgi/viewcontent.cgi?article=1098&context=hcii
  2. Arazy, O., Nov, O., Patterson, R., & Yeo, L. (2011). Information quality in Wikipedia: The effects of group composition and task conflict. Journal of Management Information Systems, 27(4), 71-98. (Couldn't find a PDF)
  3. http://research.microsoft.com/en-us/um/redmond/groups/connect/CSCW_10/docs/p233.pdf

Hmm, possibly, but since this is a MySQL table there is an order of magnitude difference in how easy those two are to do, I think.

We talked a little about this in our meetings. Here's the result (some of these are my opinions):

  • We only get results for the top revision of each article in the main namespace (around 5 million rows).
  • We do it once a month, but each month gets a new table, so people can track article quality changes by joining the tables (see the sketch after this list).
  • I'll start by adding a database in tools-dexbot. We can move it to another place later on.
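
For example, tracking class changes between two monthly snapshots could be a join along these lines (table names follow the enwiki_YYYYMM pattern; enwiki_201607 is hypothetical):

  SELECT cur.page_name,
         prev.quality AS quality_201607,
         cur.quality  AS quality_201608
  FROM enwiki_201608 AS cur
  JOIN enwiki_201607 AS prev
    ON prev.page_name = cur.page_name
  WHERE prev.quality <> cur.quality;  -- pages whose predicted class changed between snapshots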

I wrote a script that populates the database tables based on the generated data. It's here (you can find the table schema, indexes, and constraints there).
I already populated the enwiki_201608 table in the s51100__wp10_p db in Tools. Unfortunately we can't use it in Quarry. It would be great to let Quarry read this db, or to move this db to a place that Quarry can read.
Otherwise, if you have a Tools account, you can log in, do mysql -h tools.labsdb s51100__wp10_p, and run queries.
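
For example, once connected (the column names here are assumed from the schema proposed above; check the script for the actual ones):

  USE s51100__wp10_p;
  -- how many pages fall into each predicted quality class in the August 2016 snapshot
  SELECT quality, COUNT(*) AS pages
  FROM enwiki_201608
  GROUP BY quality
  ORDER BY pages DESC;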

I think it's a bad idea to have each month get its own table. I'd like to load a historic monthly dataset and that would result in about 180 tables (12 months * 15 years that English Wikipedia has been around)

I mean monthly from now on. For historic data I guess we can use much bigger intervals the further back we go (6 snapshots a year for 2015 and 2014, 3 a year for 2013 and 2012, and annual ones before that).

As discussed in the meeting: review https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database and make a user db on the main labsDB instance (the one Quarry uses). You can connect from labs with sql enwiki.
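
Roughly, that would look something like this (reusing the s51100 prefix from above as an example; the actual prefix comes from the credentials in the tool's replica.my.cnf):

  -- after connecting from labs, e.g. with `sql enwiki`:
  CREATE DATABASE IF NOT EXISTS s51100__wp10_p;  -- <credential user>__<name>, per the Help page
  USE s51100__wp10_p;
  -- the trailing _p conventionally marks the database as publicly readable,
  -- which is what would let Quarry users query it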