Page MenuHomePhabricator

[Discuss] Hosting the monthly article quality dataset on labsDB
Closed, ResolvedPublic

Description

We're looking to host a big dataset on labs. We think it will be very useful to analysts who are working with PAWS and the tool developer community. See T145655

This dataset contains a row for every month that an article was visible in English Wikipedia. 5 million articles * ~100 months since 2002 = ~500 million rows. The output TSV file is 55GB uncompressed.

So now the question for discussion: Can we load this into user DB on labs without making @yuvipanda or @jcrespo sad?

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 26 2016, 11:24 PM

Would people need to join this with replica databases or would that be not
particularly useful?

That's a good question. I imagine that joins would be useful, but we could probably get a lot of value out of this without supporting joins.

If we don't have to support joins, we can try putting this up as a
usertable on toolsdb (which is accessible to everyone, although you can't
query it from quarry yet). I'll talk to jcrespo to see if we can put it up
on the replicas.

The things is, with the changes coming to the replicas T140788, I would not add any new data there until we have clear how we are going to handle those.

Toolsdb would be the right place, but the size is large enough to potentially require its own hardware. Please @Halfak, get in contact with me with details about the dataset and how it will be more useful for users and its refresh cycle to see how to handle that (better if next week rather than this one). We will figure out something.

Halfak added a comment.EditedSep 29 2016, 3:25 PM

@jcrespo, we have 6 columns

  • page_id (UNSIGNED INT) -- The page identifier
  • page_title (VARBINARY(255)) -- The title of the article
  • rev_id (UNSIGNED INT) -- The most recent revision ID at the time of assessment
  • timestamp (VARBINARY(14)) -- The timestamp when the assessment was taken (YYYYMMDDHHMMSS)
  • prediction (VARBINARY(10))-- The predicted quality class
  • weighted_sum (FLOAT(4,3)) -- The sum of prediction weights assuming indexed class ordering

We have 441,683,818 rows. I expect that we'll add about 5m rows once per month. I'd like to run the process to generate those new rows in along the release of new XML database dumps (pages-articles).

Each row should take an INT (4 bytes) + VARBINARY(255) (256 bytes -- max) + INT (4 bytes) + VARBINARY(14) (15 bytes) + VARBINARY(10) (11 bytes) + FLOAT(4,3) (4 bytes) = 4 + 256 + 4 + 15 + 11 + 4 = 294 bytes. We could drop the title field to get down to 38 bytes.

I wanted to see how big titles are on average, so I ran this query against enwiki:

SELECT AVG(len), MIN(len), MAX(len) FROM (SELECT CHAR_LENGTH(page_title) AS len FROM page WHERE page_namespace = 0 ORDER BY RAND() LIMIT 10000) AS foo;
+----------+----------+----------+
| AVG(len) | MIN(len) | MAX(len) |
+----------+----------+----------+
|  19.9795 |        2 |      120 |
+----------+----------+----------+
1 row in set (22.51 sec)

So, assuming that titles are generally 20 chars, we'll need 21 bytes to store it. That means 4 + 21 + 4 + 15 + 11 + 4 = 59 bytes per row.

If we keep the title, we should expect 59 bytes * 442m = 24.3GB. If we drop the title, we should expect 38 bytes * 442m = 15.6GB.

Each month, we'll add ~5m rows. If we keep the title, we'll need 59 bytes * 5m = +281MB. If we drop the title, we'll need 38 bytes * 5m = +182MB. That's 3.3GB or 2.1GB per year respectively.

Now that I think about it, I'd really like to drop the title field since it's not a stable identifier for a page.

@Halfak We can talk further, but a single 500 million-row mysql table should either get its own hardware or be split on smaller tables (isn't PAWS going to have its own dedicated hardware?).

Not because it wouldn't work, but because things like schema changes and table maintenance would affect greatly other users (the database is shared with many other users, serving dozens of other tools and requests.

You should also normalize both the title and the class, as it is a waste of space. Also consider compression.

If you are going to perform analytics-like analysis, have you thought about using alternative storage models (column-based?) rather than traditional OLTP dbs? (you do not clarify how it will be used).

Does splitting 500m rows into 5 x 100m row tables help somehow?

Indeed, I expect that schema changes and table maintenance would have performance concerns.

I'm not sure what you are referring to re "class", but I've already discussed dropping "title". What do you mean by "consider compression"?

The goal of getting this table on labs is to make it available to tool developers (e.g. exposing the quality change over time on history page) and to expose it via Quarry/PAWS for analytics purposes. Are you asking me to draw up some user scenarios?

Does splitting 500m rows into 5 x 100m row tables help somehow?
Indeed, I expect that schema changes and table maintenance would have performance concerns.

Yes. The idea is that if later the schema changes, we only create a new table instead; plus writing to a smaller table helps a bit. For example, it can be a table per year or month (whatever makes sense in the context, I do not know what makes sense), so it is append-only and most people use only a subset of the data (latest table, for example). Again, that depends on how that is used.

I'm not sure what you are referring to re "class", but I've already discussed dropping "title". What do you mean by "consider compression"?

prediction (VARBINARY(10))-- The predicted quality class If it is a "class", usually it is a subset of values, and normally that is implemented with a foreign key or an enum, reducing the storage size from 40 bytes to only 4 or 8.

The goal of getting this table on labs is to make it available to tool developers (e.g. exposing the quality change over time on history page) and to expose it via Quarry/PAWS for analytics purposes. Are you asking me to draw up some user scenarios?

Yes, the idea is if it is how popular do you think it is going to be? And if it is, if people can perform huge SELECTs like aggregations, averages, max, etc. calculations. Point selects (single row selections) work nicely on mysql; for analytics-like rows, different technologies other than mysql are generaly faster. I was asking if you would predict those other uses to be common.

Another question is, what is the relationship between these tables and ores_classification and ores_model from production? Because if they are the same, they are already on labs* (just not visible for users).

What do you mean by "consider compression"?

Both InnoDB has a compressed row format; or alternatively, tokudb has high compression properties. The first can save a 50% on size; toku up to 5x. https://dev.mysql.com/doc/refman/5.6/en/innodb-compression.html https://en.wikipedia.org/wiki/TokuDB Both are available for usage on labs, although they can create issues, that is why I asked what would it be the typical usage.

I don't believe we have anything available to Quarry that isn't a MySQL replica, right?

@yuvipanda, what do you think about the timing for new database machines?

Another question is, what is the relationship between these tables and ores_classification and ores_model from production?

I missed this. The dataset has a full history. ores_classification only has scores for recent edits and it is purged when we rebuild models.

Note that if you drop page_title, in order to have page_id mean something, you would have to join it (and even if including page_title, shouldn't there also be a page_namespace?).

This is only articles, so page_namespace == 0.

You're right that joining would be necessary to get back to a title, but one could do lookups based on page_id without joining. E.g. if you build a gadget that runs on a page in the browser, it has access to wgArticleId or however the cool mw.config kids are accessing that value these days ;)

@Halfak So my suggestion would be, if this can wait 3 months, wait for the labsdb pending work, were we will have a more stable environment.

If you need it done before that, I would suggest you to follow the recommendations I said about smaller tables and using compression on toolsdb.

I would suggest to contact some potential consumers of the data to share how they would use it to better suggest solutions on how to host it.

Halfak added a subscriber: chasemp.

I talked to @chasemp in IRC yesterday and he confirmed @jcrespo's recommendation. I think we should wait until the new DB machines are online.

Halfak closed this task as Resolved.Oct 24 2016, 7:33 PM
Halfak reopened this task as Open.Jan 13 2017, 7:30 PM

Talked to @jcrespo. New DB machines are online.

@jcrespo per your request, here's a cleaned up MySQL dialect TSV file. https://datasets.wikimedia.org/public-datasets/all/wp10/20160801/enwiki-20160801.monthly_scores.trimmed.tsv.bz2

The 6 columns correspond to the following table creation:

CREATE TABLE monthly_wp10_enwiki (
page_id (UNSIGNED INT),
rev_id (UNSIGNED INT),
timestamp (BINARY(14)),
prediction (VARBINARY(10)),
weighted_sum (FLOAT(4,3)),
PRIMARY KEY(page_id, timestamp)
);

Thanks for your help.

@yuvipanda, FYI, I'm hoping that we can have this available for the CSCW workshop in 1.5 weeks.

Here's my plan of action:

  1. We don't need to update this db, it'll be a one time operation
  2. I'll co-ordinate with DBAs to load this into one of the newer labsdbs. I did this last year for CSCW as well on the older labsdbs, and should be able to do this this time too without too much involvement from the DBAs
  3. I'll then make this available from PAWS in a nice way. Not sure what this looks like yet.

This all sounds great to me :)

Rejigged the schema to work on mysql:

CREATE TABLE monthly_wp10_enwiki (
    page_id int unsigned,
    rev_id int unsigned,
    timestamp VARBINARY(14),
    prediction VARBINARY(10),
    weighted_sum FLOAT(4,3),
    PRIMARY KEY(page_id, timestamp)
);

CREATE INDEX prediction ON monthly_wp10_enwiki (prediction);

@Halfak I think it'll be useful if you figure out what kind of indexes you'll need on this dataset.

Whatever I'm doing now is purely for CSCW, we'll have to rethink it afterwards.

Thanks. That primary key should serve us well for the uses that I intend.

CSCW is now over, and I think I can just delete the database now.

CSCW is now over, and I think I can just delete the database now.

For the record, this db was deleted on Monday evening from labsdb1009 after having a chat with @yuvipanda
Thanks!

yuvipanda removed yuvipanda as the assignee of this task.Mar 22 2017, 10:31 PM

@yuvipanda, thanks for helping with the workshop. o/

I just came to check on this task because it's been sitting for almost a month and I've been seeing overwhelming demand. It turns out all the WikiProjects and Programs want to measure their gaps/coverage after they saw https://blog.wikimedia.org/2017/03/07/the-keilana-effect/.

This task was miscategorized. I've now added the DBA tag. @Marostegui, is there anything I can do beyond what's already been done here to help get this data available in a public database.

CC @Siko, @Brianhe, @Keilana, @MNeisler. There's a few other people who I've talked to about making this data easier to access who I don't have a phab contact for.

For all of those who I just CC'd, completing this task will allow anyone to take "Keilana Effect" style measurements. I'm looking forward to writing some sample queries for doing this for arbitrary cross-sections of Wikipedia (like WikiProject-tagged pages) and enabling tool devs doing some interesting stuff with WikiProject worklists and that kind of thing.

Hi @Halfak

What are the requirements then? The original 55GB initial dataset (split in different tables) or is there anything that has changed?
The new labsdb servers are indeed online but there is still work to do.
Would the users need to join this data with production data?
Keep in mind that the current new labsdb servers are running a few shards only: s1,s3 and s4 (s5 to come soonish) .

The old labsdb servers will be soon decommissioned (and they do not have enough disk space to place this there anyways).
If you do not need to join this with production data, as suggested a few posts earlier, toolsdb servers might be the best option here.

Hi @Marostegui.

  • Requirements are 15.6GB with an additional 2GB per year. See T146718#2677286 for the analysis
  • This table needs to be joined against the production database in order to make cross-sections based on WikiProjects & categories possible.
  • Currently, there's high demand for English Wikipedia (s1), but I already have requests for new article quality models in several wikis. We can cross that bridge when we get there.
Seeeko added a subscriber: Seeeko.Apr 17 2017, 9:56 PM

Thanks for the CC @Halfak, I'm watching this from my active volunteer account now too. We're hoping to use this for a few projects on English Wikipedia now, and several other languages in coming months, so looking forward to seeing progress soon!

Hi @Marostegui.

  • Requirements are 15.6GB with an additional 2GB per year. See T146718#2677286 for the analysis
  • This table needs to be joined against the production database in order to make cross-sections based on WikiProjects & categories possible.
  • Currently, there's high demand for English Wikipedia (s1), but I already have requests for new article quality models in several wikis. We can cross that bridge when we get there.

Thanks for the comments.
I guess then the place for this is the new labs infraestructure, that is labsdb1009, 1010 and 1011? Toolsdb does not contain production data so that makes it impossible to fulfill the requirements.
However, as Jaime pointed out on T146718#2677658 we need to carefully see what is going to be the usage. High complex analytics queries will not get a great performance there, as, as you know, labsdb servers are shared with hundreds of users, databases etc.

@Marostegui, we've already experimented heavily with usage of this table by researchers at the above mentioned workshops. A simple index on page_id has proven effective for making most of the relevant queries tractable. A secondary index (or compound index) on timestamp could be useful for people who would like to query for a specific date (or just the most recent quality predictions).

Generally, it seems that these databases already receive complex analytics queries. See https://quarry.wmflabs.org/query/runs/all for examples of the types of queries I'd like to be able to intersect with the data in this table.

Restricted Application added a project: artificial-intelligence. · View Herald TranscriptMay 27 2017, 12:21 PM

We just released a model for Wikidata, so I added T166427: Measure maturity/quality of Wikidata items in a way it can be queried as a subtask here. Seems there a lot of demand already.

Also, see https://meta.wikimedia.org/wiki/Research:Interpolating_quality_dynamics_in_Wikipedia_and_demonstrating_the_Keilana_Effect and the related Wikimedia Metrics presentation https://www.youtube.com/watch?v=rAGwQdLyFb4#t=28m39s for the kind of analyses this dataset enables. My inbox is overflowing with people who want me to repeat this analysis in their content space. I'd much rather enable them to do the analysis themselves -- which is the whole point of this task. :)

Halfak added a comment.Jun 3 2017, 4:22 PM

@Marostegui just wanted to check in with you here. I'm getting a lot of pressure in my inbox to make this data publicly available since the Metrics presentation. It's going to have an awesome impact when we do. Do you think that we're getting close to being able to get the dataset online and accessible via quarry/PAWS?

@Halfak You can always put this on toolsdb, but as per T146718#3186337 you'd still need to join your data with production data? If so, toolsdb obviously wouldn't work for you.
The new labs servers are still on read only (meaning they only replicate production data and clean private data up, but they do not contain user generated dbs)

Obviously the old (and hopefully-soon-to-be-decommissioned hosts are still up and with user generated data). I rather not place a new service there, as it would imply migrating it somewhere else soon.

I will discuss this with Jaime during the week and see what we can come up with.

We believe that if you want the service to be up soon, the best option is to place it on the old labs servers.
You need to keep in mind that those hosts will be decommissioned once we have the new labs infra totally finished, so at some point a migration will be needed. It is up to you if you want and can afford that.

Halfak added a comment.Jun 8 2017, 5:09 PM

A late assessment is better than never! Woot! I'll get it hosted on the old labsdbs ASAP. Thanks for taking a look and making an assessment. I'll ping here with the database/table/quarry sample query and then we can resolve.

Sounds good thank you!

Halfak claimed this task.Jun 15 2017, 2:44 PM
Halfak added a subscriber: yuvipanda.

Somehow I accidentally added yuvi. :S

So I've attempted to load the table once and I had a connection failure. I'm now trying again with a screen. Will report back tomorrow as it should be ready to use.

Again a connection failure. So I'm going to try splitting the file into parts and loading it 50m rows at a time.

Halfak closed this task as Resolved.Jul 10 2017, 7:16 PM
Halfak added a subscriber: yuvipanda.
Halfak reopened this task as Open.Jul 31 2017, 3:15 PM

Looks like I'm failing to create one important index on the table.

MariaDB [u2041__ores_p]> CREATE INDEX page_idx ON monthly_wp10_enwiki (page_id);
ERROR 2013 (HY000): Lost connection to MySQL server during query

The index building process seems to take slightly too long for the query killer. @Marostegui, can you help me by running the index building command outside of the query killer's perview?

Looks like I'm failing to create one important index on the table.

MariaDB [u2041__ores_p]> CREATE INDEX page_idx ON monthly_wp10_enwiki (page_id);
ERROR 2013 (HY000): Lost connection to MySQL server during query

The index building process seems to take slightly too long for the query killer. @Marostegui, can you help me by running the index building command outside of the query killer's perview?

on labsdb1001, right?

Maybe?

$ host enwiki.labsdb
enwiki.labsdb has address 10.64.4.11

Maybe?

$ host enwiki.labsdb
enwiki.labsdb has address 10.64.4.11

Yep!

Mentioned in SAL (#wikimedia-operations) [2017-07-31T15:33:40Z] <marostegui> Create index on u2041__ores_p.monthly_wp10_enwiki - T146718

Marostegui closed this task as Resolved.Jul 31 2017, 4:43 PM
mysql:root@localhost [u2041__ores_p]> CREATE INDEX page_idx ON monthly_wp10_enwiki (page_id);
Query OK, 441683817 rows affected (1 hour 8 min 35.51 sec)
Records: 441683817  Duplicates: 0  Warnings: 0

mysql:root@localhost [u2041__ores_p]> show create table monthly_wp10_enwiki\G
*************************** 1. row ***************************
       Table: monthly_wp10_enwiki
Create Table: CREATE TABLE `monthly_wp10_enwiki` (
  `page_id` int(10) unsigned NOT NULL DEFAULT '0',
  `rev_id` int(10) unsigned DEFAULT NULL,
  `timestamp` binary(14) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
  `prediction` varbinary(10) DEFAULT NULL,
  `weighted_sum` float(4,3) DEFAULT NULL,
  PRIMARY KEY (`page_id`,`timestamp`),
  KEY `page_idx` (`page_id`)
) ENGINE=Aria DEFAULT CHARSET=binary PAGE_CHECKSUM=1
1 row in set (0.01 sec)