
Reload monthly article quality dataset into wikireplica "datasets_p"
Closed, DeclinedPublic

Description

Now that Quarry is using the new DB hosts (T176694: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host), the article quality dataset used to measure the Keilana Effect is no longer available via Quarry.

I'd like to get T173513: Create a database on the wikireplica servers called "datasets_p" resolved, and this dataset reloaded, before the workshop.

Creation statements

USE datasets_p;
CREATE TABLE monthly_wp10_enwiki (
  page_id       INT UNSIGNED,
  rev_id        INT UNSIGNED,
  timestamp     VARBINARY(14),
  prediction    VARCHAR(200),
  weighted_sum  FLOAT(4, 3),
  PRIMARY KEY(page_id, timestamp)
);
-- Note: this index is redundant, since page_id is already the leading column of the PRIMARY KEY.
CREATE INDEX page_idx ON monthly_wp10_enwiki (page_id);

Data to be loaded:

This task is done when: a Quarry user can join the monthly article quality dataset with the replica tables for enwiki.
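As a sketch of that acceptance criterion, a Quarry query along these lines could join the dataset against the enwiki replica's page table (the article title, namespace filter, and row limit here are illustrative, not part of the task):

```sql
-- Hypothetical Quarry query: recent predicted quality classes for one article.
SELECT p.page_title,
       q.timestamp,
       q.prediction,
       q.weighted_sum
FROM enwiki_p.page AS p
JOIN datasets_p.monthly_wp10_enwiki AS q
  ON q.page_id = p.page_id
WHERE p.page_namespace = 0
  AND p.page_title = 'Ada_Lovelace'
ORDER BY q.timestamp DESC
LIMIT 12;
```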

Event Timeline

Halfak created this task. · Oct 27 2017, 2:07 PM
Restricted Application added a subscriber: Aklapper. · Oct 27 2017, 2:07 PM

See discussion of "datasets_p" and what we'd like to do with datasets like this one in T173511. My goal in filing this task is that this dataset will serve as an example for the general process for loading curated dataset tables onto these replicas.

Halfak updated the task description. · Oct 27 2017, 2:11 PM
Halfak added a subscriber: diego.

@diego, this is the task for loading that article quality dataset I showed you through Quarry a few days ago. I'd like to use this in our workshop at IWSC.

Not sure what is needed from us (DBAs) as we do not handle create statements (https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change). I will remain subscribed just in case I can help and/or advise :-)

This is something to be done after the discussion on T173511: Implement technical details and process for "datasets_p" on wikireplica hosts has found a way for what I have been calling "curated datasets" to be replicated across the *.{analytics,web}.db.svc.eqiad.wmflabs cluster (labsdb10{09,10,11}). The main new info for cloud-services-team and the DBA team that @Halfak is presenting here is that he would like this problem to be solved close to immediately (~1.5 weeks). I'm not entirely sure that this is actually possible, but we can try. I would really rather not hack this solution together quickly and create tech debt that we will have to undo later.

This comment was removed by Halfak.

I see what you're saying, @bd808. We're currently in a reduced capacity and have been for a while. I was under the impression that you were aware that this has been "immediately" needed since I'd originally filed T173511. After IWSC, there will be more workshops, more analyses, and more wiki tools that want to take advantage of this data, so it will continue to be "immediately" needed. I'd really appreciate a stop-gap for this workshop, but I understand if it is not possible/desirable to prioritize this in time for IWSC. If that is the case, we can edit the description to drop the desired date to make it clear.

Halfak updated the task description. · Oct 27 2017, 4:18 PM

Removed from description but pasted here as a conversation point:

I'm planning a workshop for IWSC that will happen some time between Nov. 8th and Nov. 10th where I'd like to highlight this dataset through Quarry and PAWS.

@Halfak is the datasets_p name important, or would it be OK for your curated tables to live in a database named with the normal tool/user-owned naming convention of ${MYSQLUSER}__{SOMETHING}_p? I'm thinking that the easiest way for us to solve T173511 will be to have the primary databases on tools.db.svc.eqiad.wmflabs and then replicate those databases to the Wiki Replica cluster. Using a normal database that you can already create on tools.db.svc.eqiad.wmflabs using Toolforge account credentials removes one blocker and lets you get the data all set up. Then we just need to work out the process to control which databases are replicated with the DBA team.
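For illustration, the ${MYSQLUSER}__{SOMETHING}_p convention mentioned above might look like the following (the credential name here is a placeholder, not a real Toolforge user):

```sql
-- Hypothetical ToolsDB database, owned by the tool whose MySQL user is 's12345'.
-- The standard grants allow that user to create any database with this prefix.
CREATE DATABASE s12345__datasets_p;
```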

Sure! That could work if that's easier to set up.

If possible, I'd love to have a more intuitive DB name since we're targeting Quarry users, but if that's difficult, then let's not worry about it now.

@Halfak I think that having multiple users owning tables in a single db will end up being problematic from the server management side. You could request a custom database name on tools.db.svc.eqiad.wmflabs associated with some Tool credentials you already have (or a new tool just to manage this data). datasets_p seems very generic. Maybe something like oresdata_p instead? I don't personally have the mysql rights needed to set that up for you, so you'll need to open a ticket, and the DBAs can set up the grant needed to let a user you specify control a database other than the standard naming grant.
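A DBA-side grant for such a non-standard database name could be sketched as follows (both the database name and the user here are hypothetical placeholders):

```sql
-- Hypothetical grant letting a specified user control a custom-named database.
GRANT ALL PRIVILEGES ON `oresdata_p`.* TO 's12345'@'%';
```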

I'm OK with owning my own DB, but note that I volunteered to manage curation of datasets for a shared DB. That has the benefit of being a strategy that might not result in a big mess.

Then again, maybe we can do curation at the step where we enable replication. E.g. nothing gets replicated unless it has adequate documentation and an owner.

Yeah, that's my thinking. Replication will be selective and there will be some sort of checklist that needs to be met. Hopefully we can figure out that rubric with a little help from the DBAs and then establish the process in T173511: Implement technical details and process for "datasets_p" on wikireplica hosts.

Halfak claimed this task. · Oct 30 2017, 4:54 PM
Harej closed this task as Declined. · Mar 26 2019, 9:29 PM
Harej added a subscriber: Harej.

This project is no longer feasible, as it originally relied on colocating our data table with wiki replicas.