Implement technical details and process for "datasets_p" on wikireplica hosts
Open, MediumPublicFeature
Actions

Assigned To

None

Authored By

	Halfak
	Aug 17 2017, 3:18 PM

Description

Now that the new replica servers are available (T172704) and userdb creation is disabled, we need a solution to make common datasets available.

Straw dog proposals

Proposal A: single database, many table owners

Single database named datasets_p created hosting N tables of curated data.
Single authoritative primary server (tools.db.svc.eqiad.wmflabs?) where datasets are loaded/updated into the datasets_p database.
Each table must have an 'owner' in the form of one or more individuals who are responsible for support and maintenance of the data. Complete details of this responsibility to be determined.
Each table subject to the replication criteria described below.

Proposal B: many databases

Public *_p user-owned databases created on tools.db.svc.eqiad.wmflabs following normal processes.
Database owner files Phabricator task to request replication to Wiki Replicas hosts.
Each table subject to the replication criteria described below.

Replication criteria

Table schema, content appropriateness for public distribution, and data set size must be approved by cloud-services-team , DBA, and Security-Team/WMF-Legal teams before initial load/replication.
Data stored as InnoDB compressed tables.
Each table must have unique primary key for each row. This primary key may be an auto increment integer.
Authoritative database replicated to labsdb100{9,10,11} using:
- Global Transaction Identifiers (GTIDs)
- Row based replication (RBR)

Example datasets:

Related Objects
Search...

Status	Subtype	Assigned	Task
Resolved		jcrespo	T140788 Labs databases rearchitecture (tracking)
Resolved		bd808	T166402 Program 7 Outcome 3: data services
Resolved		bd808	T142807 Migrate all users to new Wiki Replica cluster and decommission old hardware
Declined		None	T180636 Make Dispenser's principle_links table accessible in new Wiki replica cluster
Open	Feature	None	T173511 Implement technical details and process for "datasets_p" on wikireplica hosts
Declined		None	T173512 Create a phabricator project called "wikireplica-datasets"
Duplicate		None	T173513 Create a database on the wikireplica servers called "datasets_p"
Duplicate		None	T173514 Document the process for importing a new "datasets_p" table

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 17 2017, 3:18 PM

Halfak updated the task description. (Show Details)Aug 17 2017, 3:22 PM

Halfak created subtask T173512: Create a phabricator project called "wikireplica-datasets".

Halfak updated the task description. (Show Details)

Halfak created subtask T173513: Create a database on the wikireplica servers called "datasets_p".Aug 17 2017, 3:25 PM

Halfak updated the task description. (Show Details)

Halfak created subtask T173514: Document the process for importing a new "datasets_p" table.Aug 17 2017, 3:30 PM

Halfak updated the task description. (Show Details)

Halfak updated the task description. (Show Details)Aug 17 2017, 3:32 PM

zhuyifei1999 subscribed.Aug 17 2017, 3:32 PM

• chasemp edited projects, added cloud-services-team (Kanban); removed cloud-services-team.Aug 17 2017, 3:42 PM

• chasemp moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.

• chasemp triaged this task as Medium priority.Aug 17 2017, 3:44 PM

FYI, we put this on the Analytics 'Radar' column, as I don't think there is active work for the Analytics team. Let us know if this is incorrect.

@Halfak and I spoke a bit about this this morning. We talked about a SIG for wikireplica things and how this could relate with him being an interested party. I was planning on bringing it up at the cloud-services-team meeting tomorrow as a general point of interest since the near-term work and future of the wikireplica setup has a lot of moving parts and then talking with the DBA crew about their thoughts. Two kind of things here: a one-off consideration for a specific dataset, and then a more general thinking on datasets living alongside the replica data in general.

@Halfak can we close the 3 sub-tasks you forked from this and have the discussion before we jump to a specific solution? I'm worried that discussion will be spread over 4 tickets making things very hard to follow.

Krenair subscribed.Aug 21 2017, 8:14 PM

One thought- after setting up the new labsdbs, we said we were going to consider the DBA part of the project as mostly done (note we stressed the need for user databases early on the refactoring process, as not fully served on the current design, as it was a difficult thing to do well). Like the above comment on analytics, I want to stress that we are more than happy to support this goal and both do DBA stuff and help taking decision, but we do cannot lead this project -both in terms of resources- our short term focus is s8 and backups, and in scope (I personally consider this architecture/application level work ("new" features), not storage maintenance, which is our aim for labsdbs (make replication work from production, first-pass filtering, and make sure the databases are up, running and available to users). In other words, for us labsdb is now in "maintenance mode", but we welcome and will support other people doing further refactoring.

This is not intended to pass the responsibility to cloud, again we are happy to meet and help and do any maintenance work at mysql level, as we always did, but the leading and programing work resources (which I can infer they will not be trivial here) cannot come from us ("application" level, be it analytics, research, cloud or mediawiki has to come from those groups) . The other thing is that, no matter the solution, our requirements is that high availability with automatic failover for the replica dataset and blocking of replication cannot be compromised.

jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.Aug 22 2017, 10:56 AM

bd808 added a project: Data-Services.Aug 25 2017, 4:39 PM

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.

Consolidating discussion:

In T173513#3538175, @bd808 wrote:

The ask here as I understand is to provide a database that is co-located with the wiki replicas available via the wikireplica-web.eqiad.wmnet and wikireplica-analytics.eqiad.wmnet service names that can contain N tables of curated data produced by ORES, some analytics project, or other ETL/aggregation methods. These are similar to adhoc tool databases in that they are not canonical MediaWiki metadata, but different in that each table would have a clear owner and a well documented process for how the data it contains is produced and replicated across the pool of hosts that comprise the logical cluster.

I can see the value in this, but I want the DBA team (@jcrespo, @Marostegui) to help design a system that they see as scalable and maintainable. I don't want to see the hard work of T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts accidentally undone by adding in new complexity without careful consideration of how it will effect:

adding/removing hosts from the pool of servers backing cluster

replication lag

replication drift

Straw dog proposal

Single database named datasets_p hosting N tables of curated data.
Single authoritative primary server where datasets are loaded/updated into the datasets_p database.
Each table must have an 'owner' in the form of one or more individuals who are responsible for support and maintenance of the data. Complete details of this responsibility to be determined.
Table schema, content appropriateness for public distribution, and data set size must be approved by WMCS, DBA, and Security/Legal teams before initial load.
Data stored as InnoDB compressed tables.
Each table must have unique primary key for each row. This primary key may be an auto increment integer.
Authoritative database replicated to labsdb100{9,10,11} using:
- Global Transaction Identifiers (GTIDs)
- Row based replication (RBR)

@jcrespo would it be fundamentally sound to design this feature such that one of labsdb100{9,10,11} is the primary? Or would it be better to host the primary separately from the current cluster?

Duplicate of T156869? We should fix it for all users, not only datasets_p?

In T173511#3558277, @jcrespo wrote:

Duplicate of T156869? We should fix it for all users, not only datasets_p?

It's certainly related. My thinking from the bits and pieces of discussion that I have heard is that keeping arbitrary tables in sync is "too hard", but that we may be able to determine a way to handle tables which have the characteristics listed in T173511#3556395. Most importantly the schema/size vetting, configuration that is compatible with GTID + RBR replication, and clear ownership.

This is all a bit of hand waving on my part however. My skills/knowledge in the realm of MySQL/MariaDB replication could be recreated with a few Google searches and an afternoon of reading. The urgency for datasets_p is that finding some solution for it is a blocker to converting Quarry to the new db cluster. I'm sure other tools will be blocked on similar issues, but Quarry is one were the WMCS team has more agency to implement changes.

bd808 mentioned this in T156869: Design a method for keeping user-created tables in sync across labsDBs.Aug 28 2017, 7:27 PM

bd808 mentioned this in T172704: Promote initial use of new Wiki Replica servers.Aug 30 2017, 3:43 PM

• chasemp mentioned this in T168584: Labsdb* servers need to be rebooted.Aug 30 2017, 4:42 PM

bd808 mentioned this in Blog Post: New Wiki Replica servers ready for use.Sep 1 2017, 12:58 AM

Halfak closed subtask T173512: Create a phabricator project called "wikireplica-datasets" as Declined.Sep 5 2017, 4:49 PM

bd808 moved this task from Wiki replicas to Datasets on the Data-Services board.Sep 5 2017, 4:53 PM

bd808 mentioned this in T173513: Create a database on the wikireplica servers called "datasets_p".Sep 5 2017, 4:56 PM

bd808 changed the status of subtask T173513: Create a database on the wikireplica servers called "datasets_p" from Open to Stalled.

bd808 mentioned this in T173514: Document the process for importing a new "datasets_p" table.

bd808 changed the status of subtask T173514: Document the process for importing a new "datasets_p" table from Open to Stalled.

bd808 mentioned this in T175096: Identify tools hosting databases on labsdb100[13] and notify maintainers.Sep 5 2017, 11:08 PM

• Niharika subscribed.Sep 11 2017, 11:29 PM

bd808 added a parent task: T142807: Migrate all users to new Wiki Replica cluster and decommission old hardware.Sep 18 2017, 8:52 PM

bd808 mentioned this in T176694: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host.Sep 26 2017, 1:19 AM

Ricordisamoa subscribed.Sep 27 2017, 1:28 AM

Superyetkin subscribed.Sep 27 2017, 7:08 PM

In T173511#3558849, @bd808 wrote:

The urgency for datasets_p is that finding some solution for it is a blocker to converting Quarry to the new db cluster. I'm sure other tools will be blocked on similar issues, but Quarry is one were the WMCS team has more agency to implement changes.

Just a note for people from the future (hi people!): I talked with @Halfak about the Quarry/u2041__ores_p issue and we decided that this data is valuable, but can be recreated and is not in wide use currently. This unblocked converting Quarry to the new database servers which was done in T176694: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host.

Nettrom subscribed.Oct 26 2017, 8:40 PM

Halfak mentioned this in T179187: Reload monthly article quality dataset into wikireplica "datasets_p".Oct 27 2017, 2:09 PM

bd808 updated the task description. (Show Details)Nov 6 2017, 10:38 PM

This task is potentially of interest to me. I previously ran a dumb report with the scripts here: https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_article_count/Configuration. I was joining s51334__enwiki_first_page_revisions_p.page against enwiki_p.revision_userindex and it worked pretty well. This report is now broken and users have noticed: https://en.wikipedia.org/w/index.php?title=User_talk:BernsteinBot&oldid=808955710#Not_updated_for_five_days:_List_of_Wikipedians_by_article_count.2C_9k_-_10k.

For this particular report, we could add this feature to MediaWiki core, which I think would be the best solution. Or I could continue to maintain this database of page IDs and their first revision IDs. If this task were implemented, under "Proposal B," would it be possible to get this s51334__enwiki_first_page_revisions_p.page database table replicated to the same hosts where enwiki_p is available in order to re-allow database joins?

In T173511#3739381, @MZMcBride wrote:

If this task were implemented, under "Proposal B," would it be possible to get this s51334__enwiki_first_page_revisions_p.page database table replicated to the same hosts where enwiki_p is available in order to re-allow database joins?

Yes, the intent for both Proposal A and B would be to allow curated datasets like the one that you were updating daily in s51334__enwiki_first_page_revisions_p.page to be replicated to the new labsdb100{9,10,11} servers. The main difference is purely if there are multiple databases that are replicated or if there is a single database of multiple tables replicated, and how a new database/table would be proposed and vetted for inclusion.

Dispenser subscribed.Nov 7 2017, 3:20 AM

Proposal C:

Maintenance of datasets is puppetized (so it is both the source and the documentation of data stored there) and added to sanitarium. This would be an issue to do quick changes, but after all, it seems the proposal includes "Table schema, content appropriateness for public distribution, and data set size must be approved by cloud-services-team , DBA, and Security-Team/WMF-Legal teams before initial load/replication."- git would track that and review would happen using gerrit/phabricator. However, by adding it puppetized to production, it would make it available to all users and shared between them if it is really useful (like T59617). This would work for both external data AND regularly pre-computed sumaries of existing data sets. The puppetization would be key to long-term maintenance for usages such as the ones that @MZMcBride proposes (which would be carried on on next hardware maintenance). If in some cases there is not need of formally maintaining it on the repo, one could still maintain summaries cross-hosts (mysql -e "SELECT" analytics | mysql -e "load data" toolsdb).

Note it doesn't have to be sanitarium itself, but one host where one would not accidentally copy private/garbage data, plus it would allow sanitarium and labsdb hosts to keep being equal in content. There can be (as it happens how) more than one "sanitarium" host.

In T173511#3755154, @jcrespo wrote:

Proposal C:

Maintenance of datasets is puppetized (so it is both the source and the documentation of data stored there) and added to sanitarium.

To be sure I'm understanding your proposal I'd like to try and restate it. Proposal C would be that we select/create some server to act as the single source of truth for curated datasets. Then we would create a Puppet module(s) that provisions cronjobs on this server which are responsible for populating the curated tables. These data import jobs could gather data from external sources (e.g. T179187: Reload monthly article quality dataset into wikireplica "datasets_p") and/or compute data based on information that exists in the Wiki Replicas (e.g. @MZMcBride's page_first_revision_id data). The intent of using Puppet to manage this system is to "productionize" it and make it trivially easy to ensure that some server is always able to provide the information for replication. Is that a fair summary of your proposal @jcrespo?

Yes, although not sure it is really a separate proposal; it would just be a small modification of the existing ones- the idea of being in production is that it would allow for certain queries that on labsdb-only access would not be possible (e.g. counting number of article watchers without revealing which ones) or less performant (due to views access).

I agree that Proposal C is not really opposing A or B. It is instead proposing an implementation for either of them. If the content curation is consolidated like this it would seem to favor the consolidated replication source of proposal A.

One side effect of Proposal C would be that all code, debugging, and deployment of the datasets would be functionally forced to be managed by production root key holders or a least a new sudoers group. Technical contributors could certainly propose patches for this system, but an ops/puppet.git +2 would be needed to deploy them. If the scripts are run on a host with access to data that is omitted from the public Wiki Replicas then we are also restricting debugging to those with NDAs and production host access. This takes control out of the hands of most volunteer technical contributors. That in turn slows the pace of innovation in favor of stability and repeatability. I can see pros and cons for this approach. I'm not currently sure if the pros outweigh the cons or not.

As a pro- I have run many times into people doing the certain summary queries many times independently- instead of those being pregenerated.

Do we need a full blown ETL pipeline system for Proposal C or can we just start with a Puppet module that manages some cron jobs and static scripts?

@Halfak, what are your thoughts on the manageability of the Puppet driven approach for the data that you know you want to expose?

• MZMcBride mentioned this in T179628: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser.Nov 14 2017, 4:23 AM

In T173511#3755509, @jcrespo wrote:

As a pro- I have run many times into people doing the certain summary queries many times independently- instead of those being pregenerated.

Right... for example, MediaWiki could just store page.page_first (as a "sibling" column of page.page_latest) with the revision ID of the first revision of a page. Then I wouldn't need to keep track of (page ID, first revision ID) pairs and they would be automatically replicated to all of the production hosts. I'm not sure what the balance is when deciding whether to set up and maintain additional infrastructure to support custom indices like this or to have them "built in" to MediaWiki core.

Technical contributors could certainly propose patches for this system, but an ops/puppet.git +2 would be needed to deploy them.

Actually, only the infrastructure would be controlled at root level- the queries run could be on a separate repository, but IF it runs in production, only NDA-signed people should be able to deploy, as the wrong query would expose private data. There could be 2 levels- one at sanitarium for access to private data (for performance or access needs), and another, once private data has been filtered with more relaxed permissions, from toolsdb.

I'm not sure what the balance is when deciding whether to set up and maintain additional infrastructure to support custom indices like this or to have them "built in" to MediaWiki core.

@MZMcBride As we are all here at the infrastructure side, we are looking at infrastructure solutions. That doesn't mean you couldn't propose such changes to mediawiki hackers. I still believe things like meta_p database should exist in production, and not (only) on the replicas.

zhuyifei1999 added a parent task: T180636: Make Dispenser's principle_links table accessible in new Wiki replica cluster.Nov 15 2017, 9:23 PM

Betacommand subscribed.Dec 23 2017, 7:22 PM

Bstorm moved this task from Needs discussion to Inbox on the cloud-services-team (Kanban) board.Oct 29 2018, 1:12 PM

Bstorm subscribed.

leila removed a project: Research.Jul 11 2019, 12:24 AM

zhuyifei1999 mentioned this in T230758: Drop database s54020__wsexporttest_p and grant access to Quarry.Aug 29 2019, 3:40 PM

Bstorm moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board.Dec 18 2019, 4:41 PM

bd808 merged tasks: T173514: Document the process for importing a new "datasets_p" table, T173513: Create a database on the wikireplica servers called "datasets_p".Feb 10 2020, 6:08 PM

bd808 added a subscriber: Chicocvenancio.

Is there any estimate as to when we will start using this database on replica servers?

In T173511#5866295, @Superyetkin wrote:

Is there any estimate as to when we will start using this database on replica servers?

No. There has been no substantive work or discussion of this topic for more than 2 years.

Marostegui removed a project: DBA.Mar 5 2020, 8:44 AM

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:33 AM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban).Jan 18 2023, 6:53 PM

fnegri moved this task from Kanban to Graveyard on the cloud-services-team board.

fnegri moved this task to Inbox on the cloud-services-team board.Jan 18 2023, 9:58 PM

Aklapper added a project: Data-Engineering-Icebox.Feb 10 2023, 5:44 PM

With no useful activity, including activism for implementation, in 7+ years let's let this feature request go. It certainly can be revisited in the future if there is new demand.

This needs to stay open.

What do we need to do in order to create the "datasets_p" schema on replica servers? I have a "candidate" dataset to serve on this schema.

In T173511#9603083, @Superyetkin wrote:

This needs to stay open.

I find your one drive-by question 4 years ago to be uncompelling evidence that this needs to stay open, but I won't wheel war on the status at this time.

What do we need to do in order to create the "datasets_p" schema on replica servers? I have a "candidate" dataset to serve on this schema.

You, or someone wishing to move this idea forward, needs to present use cases and have a reasonable discussion about technical limitations and workflows with the "owners" of the MariaDB level of the Wiki Replica servers. This is also where things get a bit hand wavy because I do not think that today there is a clear agreement among the DBA, Data-Engineering, and cloud-services-team teams who is most responsible for a decision in this area.

Implement technical details and process for "datasets_p" on wikireplica hostsOpen, MediumPublicFeatureActions

Description

Straw dog proposals

Proposal A: single database, many table owners

Proposal B: many databases

Replication criteria

Related ObjectsSearch...

Event Timeline

Straw dog proposal

Implement technical details and process for "datasets_p" on wikireplica hosts
Open, MediumPublicFeature
Actions

Related Objects
Search...