
Implement technical details and process for "datasets_p" on wikireplica hosts
Open, Normal, Public

Description

Now that the new replica servers are available (T172704) and userdb creation is disabled, we need a solution to make common datasets available.

Straw dog proposals

Proposal A: single database, many table owners

  • Single database named datasets_p created, hosting N tables of curated data.
  • Single authoritative primary server (tools.db.svc.eqiad.wmflabs?) where datasets are loaded/updated into the datasets_p database.
  • Each table must have an 'owner' in the form of one or more individuals who are responsible for support and maintenance of the data. Complete details of this responsibility to be determined.
  • Each table subject to the replication criteria described below.

Proposal B: many databases

  • Public *_p user-owned databases created on tools.db.svc.eqiad.wmflabs following normal processes.
  • Database owner files Phabricator task to request replication to Wiki Replicas hosts.
  • Each table subject to the replication criteria described below.

Replication criteria


Example datasets:

Related Objects

Event Timeline

Halfak created this task. Aug 17 2017, 3:18 PM
Restricted Application added a subscriber: Aklapper. Aug 17 2017, 3:18 PM
Halfak updated the task description. Aug 17 2017, 3:32 PM
chasemp triaged this task as Normal priority. Aug 17 2017, 3:44 PM
Ottomata moved this task from Incoming to Radar on the Analytics board. Aug 21 2017, 3:34 PM
Ottomata added a subscriber: Ottomata.

FYI, we put this on the Analytics 'Radar' column, as I don't think there is active work for the Analytics team. Let us know if this is incorrect.

@Halfak and I spoke a bit about this this morning. We talked about a SIG for wikireplica things and how this could relate to him as an interested party. I was planning to bring it up at the cloud-services-team meeting tomorrow as a general point of interest, since the near-term work and the future of the wikireplica setup have a lot of moving parts, and then to talk with the DBA crew about their thoughts. There are two kinds of things here: a one-off consideration for a specific dataset, and more general thinking about datasets living alongside the replica data.

bd808 added a subscriber: bd808. Aug 21 2017, 8:08 PM

@Halfak can we close the 3 sub-tasks you forked from this and have the discussion here before we jump to a specific solution? I'm worried that discussion spread over 4 tickets will make things very hard to follow.

jcrespo added a subscriber: jcrespo.

One thought: after setting up the new labsdbs, we said we were going to consider the DBA part of the project as mostly done (note we stressed early in the refactoring process that the need for user databases was not fully served by the current design, as it was a difficult thing to do well). Like the above comment from Analytics, I want to stress that we are more than happy to support this goal, both doing DBA work and helping take decisions, but we cannot lead this project, either in terms of resources (our short-term focus is s8 and backups) or in scope (I personally consider this architecture/application-level work, "new" features, not storage maintenance, which is our aim for labsdbs: make replication work from production, do first-pass filtering, and make sure the databases are up, running and available to users). In other words, for us labsdb is now in "maintenance mode", but we welcome and will support other people doing further refactoring.

This is not intended to pass the responsibility to cloud; again, we are happy to meet and help and do any maintenance work at the mysql level, as we always did, but the leadership and programming resources (which I infer will not be trivial here) cannot come from us; the "application" level work, be it analytics, research, cloud or mediawiki, has to come from those groups. The other thing is that, no matter the solution, our requirement is that high availability with automatic failover for the replica dataset and blocking of replication cannot be compromised.

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.

Consolidating discussion:

The ask here, as I understand it, is to provide a database that is co-located with the wiki replicas, available via the wikireplica-web.eqiad.wmnet and wikireplica-analytics.eqiad.wmnet service names, that can contain N tables of curated data produced by ORES, some analytics project, or other ETL/aggregation methods. These are similar to ad hoc tool databases in that they are not canonical MediaWiki metadata, but different in that each table would have a clear owner and a well documented process for how the data it contains is produced and replicated across the pool of hosts that comprise the logical cluster.

I can see the value in this, but I want the DBA team (@jcrespo, @Marostegui) to help design a system that they see as scalable and maintainable. I don't want to see the hard work of T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts accidentally undone by adding new complexity without careful consideration of how it will affect:

  • adding/removing hosts from the pool of servers backing the cluster
  • replication lag
  • replication drift
bd808 added a comment. Aug 27 2017, 1:09 AM

Straw dog proposal

  • Single database named datasets_p hosting N tables of curated data.
  • Single authoritative primary server where datasets are loaded/updated into the datasets_p database.
  • Each table must have an 'owner' in the form of one or more individuals who are responsible for support and maintenance of the data. Complete details of this responsibility to be determined.
  • Table schema, content appropriateness for public distribution, and data set size must be approved by WMCS, DBA, and Security/Legal teams before initial load.
  • Data stored as InnoDB compressed tables.
  • Each table must have a unique primary key for each row. This primary key may be an auto-increment integer. (A minimal example follows below.)
  • Authoritative database replicated to labsdb100{9,10,11} using:

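As a rough illustration of the storage and key requirements above, a qualifying table might look like the following sketch (the datasets_p database name comes from the proposal; the example table and its columns are invented):

```sql
-- Hypothetical curated table meeting the straw dog criteria:
-- compressed InnoDB storage plus a unique primary key per row.
CREATE TABLE datasets_p.example_dataset (
  ed_id    int unsigned NOT NULL AUTO_INCREMENT,  -- auto-increment surrogate key
  ed_wiki  varbinary(32) NOT NULL,
  ed_value varbinary(255) NOT NULL,
  PRIMARY KEY (ed_id)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;
```
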
@jcrespo would it be fundamentally sound to design this feature such that one of labsdb100{9,10,11} is the primary? Or would it be better to host the primary separately from the current cluster?

Duplicate of T156869? We should fix it for all users, not only datasets_p?

bd808 added a comment. Aug 28 2017, 3:28 PM

Duplicate of T156869? We should fix it for all users, not only datasets_p?

It's certainly related. My thinking, from the bits and pieces of discussion that I have heard, is that keeping arbitrary tables in sync is "too hard", but that we may be able to determine a way to handle tables which have the characteristics listed in T173511#3556395: most importantly the schema/size vetting, configuration that is compatible with GTID + RBR replication, and clear ownership.

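As a sketch of what "compatible with GTID + RBR replication" might mean in practice (standard MySQL/MariaDB variables, shown for illustration rather than as a vetted configuration):

```sql
-- Row-based replication (RBR): replicate row images, not statements.
SET GLOBAL binlog_format = 'ROW';

-- Check that GTID-based replication is in effect:
SHOW VARIABLES LIKE 'gtid%';
SHOW SLAVE STATUS\G   -- on MariaDB replicas, Using_Gtid should not be 'No'
```

This is also one reason each table needs an explicit primary key: RBR row lookups on replicas are far cheaper when rows can be located by key.
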
This is all a bit of hand waving on my part, however. My skills/knowledge in the realm of MySQL/MariaDB replication could be recreated with a few Google searches and an afternoon of reading. The urgency for datasets_p is that finding some solution for it is a blocker to converting Quarry to the new db cluster. I'm sure other tools will be blocked on similar issues, but Quarry is one where the WMCS team has more agency to implement changes.

bd808 moved this task from Wiki replicas to Datasets on the Data-Services board. Sep 5 2017, 4:53 PM
bd808 added a comment. Oct 26 2017, 5:29 PM

The urgency for datasets_p is that finding some solution for it is a blocker to converting Quarry to the new db cluster. I'm sure other tools will be blocked on similar issues, but Quarry is one where the WMCS team has more agency to implement changes.

Just a note for people from the future (hi people!): I talked with @Halfak about the Quarry/u2041__ores_p issue and we decided that this data is valuable, but can be recreated and is not in wide use currently. This unblocked converting Quarry to the new database servers which was done in T176694: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host.

bd808 updated the task description. Nov 6 2017, 10:38 PM

This task is potentially of interest to me. I previously ran a dumb report with the scripts here: https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_article_count/Configuration. I was joining s51334__enwiki_first_page_revisions_p.page against enwiki_p.revision_userindex and it worked pretty well. This report is now broken and users have noticed: https://en.wikipedia.org/w/index.php?title=User_talk:BernsteinBot&oldid=808955710#Not_updated_for_five_days:_List_of_Wikipedians_by_article_count.2C_9k_-_10k.

For this particular report, we could add this feature to MediaWiki core, which I think would be the best solution. Or I could continue to maintain this database of page IDs and their first revision IDs. If this task were implemented, under "Proposal B," would it be possible to get this s51334__enwiki_first_page_revisions_p.page database table replicated to the same hosts where enwiki_p is available in order to re-allow database joins?
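For reference, the report relied on a same-server cross-database join along these lines (a reconstruction; the first_rev_id column name is a guess at the actual schema):

```sql
-- A user-curated table joined against a replica view; this only works
-- when both databases live on the same database server.
SELECT r.rev_user_text, COUNT(*) AS pages_created
FROM s51334__enwiki_first_page_revisions_p.page AS p
JOIN enwiki_p.revision_userindex AS r ON r.rev_id = p.first_rev_id
GROUP BY r.rev_user_text
ORDER BY pages_created DESC;
```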

bd808 added a comment. Nov 7 2017, 12:53 AM

If this task were implemented, under "Proposal B," would it be possible to get this s51334__enwiki_first_page_revisions_p.page database table replicated to the same hosts where enwiki_p is available in order to re-allow database joins?

Yes, the intent for both Proposal A and Proposal B would be to allow curated datasets like the one that you were updating daily in s51334__enwiki_first_page_revisions_p.page to be replicated to the new labsdb100{9,10,11} servers. The main difference is purely whether multiple databases are replicated or a single database with multiple tables is replicated, and how a new database/table would be proposed and vetted for inclusion.
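
One way to picture that difference at the replica level, assuming MariaDB's dynamic replication filter variables were the chosen mechanism (nothing here has been decided):

```sql
-- Proposal A: a single curated database to filter on.
-- (Dynamic filter changes require the slave SQL thread to be stopped.)
SET GLOBAL replicate_do_db = 'datasets_p';

-- Proposal B: the filter list grows with every approved user database.
SET GLOBAL replicate_do_db = 's51334__enwiki_first_page_revisions_p,u2041__ores_p';
```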

Proposal C:

  • Maintenance of datasets is puppetized (so it is both the source and the documentation of the data stored there) and added to sanitarium. This would make quick changes harder, but after all, the proposal already includes "Table schema, content appropriateness for public distribution, and data set size must be approved by cloud-services-team, DBA, and Security-Team/WMF-Legal teams before initial load/replication"; git would track that, and review would happen using gerrit/phabricator. However, by adding it puppetized to production, it would be made available to all users and shared between them if it is really useful (like T59617). This would work both for external data AND for regularly pre-computed summaries of existing data sets. The puppetization would be key to long-term maintenance for usages such as the ones that @MZMcBride proposes (which would be carried over on the next hardware maintenance). If in some cases there is no need to formally maintain it in the repo, one could still maintain summaries cross-hosts (mysql -e "SELECT" analytics | mysql -e "load data" toolsdb).

Note it doesn't have to be sanitarium itself, just a host where one would not accidentally copy private/garbage data; it would also allow the sanitarium and labsdb hosts to keep being equal in content. There can be (as happens now) more than one "sanitarium" host.

bd808 added a comment. Nov 13 2017, 4:17 PM

Proposal C:

  • Maintenance of datasets is puppetized (so it is both the source and the documentation of data stored there) and added to sanitarium.

To be sure I'm understanding your proposal I'd like to try and restate it. Proposal C would be that we select/create some server to act as the single source of truth for curated datasets. Then we would create a Puppet module(s) that provisions cronjobs on this server which are responsible for populating the curated tables. These data import jobs could gather data from external sources (e.g. T179187: Reload monthly article quality dataset into wikireplica "datasets_p") and/or compute data based on information that exists in the Wiki Replicas (e.g. @MZMcBride's page_first_revision_id data). The intent of using Puppet to manage this system is to "productionize" it and make it trivially easy to ensure that some server is always able to provide the information for replication. Is that a fair summary of your proposal @jcrespo?
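
As a concrete (hypothetical) example, the puppetized cronjob for a dataset like T179187 might boil down to a reload script along these lines; the table name, columns, and file path are all invented for illustration:

```sql
-- Hypothetical reload job for a curated dataset.
CREATE TABLE IF NOT EXISTS datasets_p.article_quality_monthly (
  aq_page_id    int unsigned NOT NULL,
  aq_month      date NOT NULL,
  aq_prediction varbinary(32) NOT NULL,
  PRIMARY KEY (aq_page_id, aq_month)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;

LOAD DATA LOCAL INFILE '/srv/datasets/article_quality_monthly.tsv'
REPLACE INTO TABLE datasets_p.article_quality_monthly
FIELDS TERMINATED BY '\t';
```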

Yes, although I am not sure it is really a separate proposal; it would just be a small modification of the existing ones. The idea of running in production is that it would allow certain queries that would not be possible with labsdb-only access (e.g. counting the number of article watchers without revealing which ones) or would be less performant there (due to access through views).
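
The watcher-count example might look something like this sketch: an aggregate computed where the private watchlist table is readable, publishing only the safe summary (the target table and threshold are invented; datasets_p.page_watchers is assumed to exist with matching columns):

```sql
-- Hypothetical summary that requires production-side access, since the
-- watchlist table is private; the aggregate hides who the watchers are.
INSERT INTO datasets_p.page_watchers (pw_namespace, pw_title, pw_watchers)
SELECT wl_namespace, wl_title, COUNT(*)
FROM enwiki.watchlist
GROUP BY wl_namespace, wl_title
HAVING COUNT(*) >= 30;   -- invented cutoff so low watcher counts stay hidden
```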

bd808 added a comment. Nov 13 2017, 4:35 PM

I agree that Proposal C is not really in opposition to A or B; it instead proposes an implementation for either of them. If the content curation is consolidated like this, it would seem to favor the consolidated replication source of Proposal A.

One side effect of Proposal C would be that all code, debugging, and deployment of the datasets would be functionally forced to be managed by production root key holders, or at least a new sudoers group. Technical contributors could certainly propose patches for this system, but an ops/puppet.git +2 would be needed to deploy them. If the scripts are run on a host with access to data that is omitted from the public Wiki Replicas, then we are also restricting debugging to those with NDAs and production host access. This takes control out of the hands of most volunteer technical contributors. That in turn slows the pace of innovation in favor of stability and repeatability. I can see pros and cons for this approach. I'm not currently sure whether the pros outweigh the cons.

As a pro: I have run many times into people doing the same summary queries independently, instead of those being pregenerated.

bd808 added a comment. Nov 13 2017, 9:14 PM

Do we need a full-blown ETL pipeline system for Proposal C, or can we just start with a Puppet module that manages some cron jobs and static scripts?

@Halfak, what are your thoughts on the manageability of the Puppet driven approach for the data that you know you want to expose?

As a pro: I have run many times into people doing the same summary queries independently, instead of those being pregenerated.

Right... for example, MediaWiki could just store page.page_first (as a "sibling" column of page.page_latest) with the revision ID of the first revision of a page. Then I wouldn't need to keep track of (page ID, first revision ID) pairs and they would be automatically replicated to all of the production hosts. I'm not sure what the balance is when deciding whether to set up and maintain additional infrastructure to support custom indices like this or to have them "built in" to MediaWiki core.
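
A sketch of what that core change might look like (this is not an actual MediaWiki schema change, just an illustration of the idea; it assumes the lowest rev_id for a page is its earliest revision):

```sql
-- Hypothetical 'sibling' of page.page_latest holding the first revision ID.
ALTER TABLE page
  ADD COLUMN page_first int unsigned NOT NULL DEFAULT 0;

-- One-off backfill; MediaWiki itself would maintain the column afterwards.
UPDATE page
  SET page_first = (SELECT MIN(rev_id) FROM revision WHERE rev_page = page_id);
```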

Technical contributors could certainly propose patches for this system, but an ops/puppet.git +2 would be needed to deploy them.

Actually, only the infrastructure would be controlled at root level; the queries run could live in a separate repository. But IF it runs in production, only NDA-signed people should be able to deploy, as the wrong query would expose private data. There could be 2 levels: one at sanitarium, for access to private data (for performance or access needs), and another, with more relaxed permissions, from toolsdb, once private data has been filtered.

I'm not sure what the balance is when deciding whether to set up and maintain additional infrastructure to support custom indices like this or to have them "built in" to MediaWiki core.

@MZMcBride As we are all here on the infrastructure side, we are looking at infrastructure solutions. That doesn't mean you couldn't propose such changes to the mediawiki hackers. I still believe things like the meta_p database should exist in production, and not (only) on the replicas.

Bstorm added a subscriber: Bstorm.