
Set up a db on labsdb for article quality that is publicly accessible
Closed, Resolved · Public

Description

Would be cool! Quarry can use it! People can build tools on top of it!

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda subscribed.

Something along the lines of:

  1. Once a week / month
  2. Look at all the pages in mainspace
  3. Compute the quality for each and update it in the db

So the table would be structured with the following columns:

  1. page_name
  2. scored_revision
  3. quality

And possibly with different tables per project.
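
Something like this, as a rough sketch (the table name and column types are just placeholders, not a final schema):

  CREATE TABLE enwiki_page_quality (
    page_name       VARBINARY(255) NOT NULL,  -- title of the page in the main namespace
    scored_revision INT UNSIGNED   NOT NULL,  -- rev_id the score was computed from
    quality         VARCHAR(8)     NOT NULL,  -- predicted class, e.g. Stub/Start/C/B/GA/FA
    PRIMARY KEY (page_name)
  );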

11845146 pages on enwiki in mainspace, whoops, this isn't going to run per week.

@Halfak do you think we have the resources to run this on 11,845,146 page revisions? Would we be fetching the text of each of those from the API? Should we just instead use the dumps to start with?

For article quality, we can work from the dumps directly since the feature set is 100% text-based. The scores won't make much sense for revisions of pages that are not intended to be articles.

We only have a model for enwiki, so we'll be looking at making a table that is roughly the same height as the 'revision' table. If we generate it based on the dumps, it will get out of date and have weirdness for pages that are restored (undeleted) after the dump was cut, but I expect it will be mostly complete for most purposes.

I was mostly thinking of doing it only for the 'current' set of revisions, and hence it need only be as tall as the page table (which is itself huge).

That makes sense for a "what quality is the wiki?" use-case, but not the "how has the quality been changing?" use-case. It turns out that a lot of the research on quality has looked at quality changes, so I imagine that use-case is big enough that we should consider it right away.

Examples:

  1. http://repository.cmu.edu/cgi/viewcontent.cgi?article=1098&context=hcii
  2. Arazy, O., Nov, O., Patterson, R., & Yeo, L. (2011). Information quality in Wikipedia: The effects of group composition and task conflict. Journal of Management Information Systems, 27(4), 71-98. (Couldn't find a PDF)
  3. http://research.microsoft.com/en-us/um/redmond/groups/connect/CSCW_10/docs/p233.pdf

Hmm, possibly, but since this is a MySQL table there is an order of magnitude difference in how easy those two are to do, I think.

We talked a little about this in our meetings. Here's the result (some of these are my opinions):

  • We only get results for the top revision of each article in the main namespace (around 5 million rows).
  • We do it once a month, but each month gets a new table, so people can track article quality changes by joining the tables (see the sketch after this list).
  • I'll start by adding a database in tools-dexbot. We can move it to another place later on.
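
For example, tracking class changes between two monthly snapshots could be a join along these lines (table names follow the enwiki_YYYYMM pattern; enwiki_201607 is hypothetical):

  SELECT cur.page_name,
         prev.quality AS quality_201607,
         cur.quality  AS quality_201608
  FROM enwiki_201608 AS cur
  JOIN enwiki_201607 AS prev
    ON prev.page_name = cur.page_name
  WHERE prev.quality <> cur.quality;  -- pages whose predicted class changed between snapshots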

I wrote a script that populates the database tables based on the generated data. It's here (you can find the table schema, indexes, and constraints there).
I already populated the enwiki_201608 table in the s51100__wp10_p db in Tools. Unfortunately we can't use it in Quarry. It would be great to let Quarry read this db, or to move this db to a place that Quarry can read.
Otherwise, if you have a Tools account, you can log in, do mysql -h tools.labsdb s51100__wp10_p, and run queries.
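
For example, once connected (the column names here are assumed from the schema proposed above; check the script for the actual ones):

  USE s51100__wp10_p;
  -- how many pages fall into each predicted quality class in the August 2016 snapshot
  SELECT quality, COUNT(*) AS pages
  FROM enwiki_201608
  GROUP BY quality
  ORDER BY pages DESC;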

I think it's a bad idea to have each month get its own table. I'd like to load a historic monthly dataset and that would result in about 180 tables (12 months * 15 years that English Wikipedia has been around)

I mean monthly from now on. For historic data I guess we can use much bigger intervals the further back we go (6 snapshots a year for 2015 and 2014, 3 a year for 2013 and 2012, and annual ones before that).

As discussed in the meeting: review https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database and make a user db on the main labsDB instance (the one Quarry uses). You can connect from labs with sql enwiki.
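
Roughly, that would look something like this (reusing the s51100 prefix from above as an example; the actual prefix comes from the credentials in the tool's replica.my.cnf):

  -- after connecting from labs, e.g. with `sql enwiki`:
  CREATE DATABASE IF NOT EXISTS s51100__wp10_p;  -- <credential user>__<name>, per the Help page
  USE s51100__wp10_p;
  -- the trailing _p conventionally marks the database as publicly readable,
  -- which is what would let Quarry users query it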