Provide tools for querying MediaWiki replica databases without having to specify the shard
Closed, ResolvedPublic13 Estimated Story Points
Actions

Description

The main thing Product-Analytics needs to adapt to T172410: Replace the current multisource analytics-store setup is a tool that lets us run queries against particular wiki replicas without having to think about which shard they are on.

Based on a discussion with @elukey, the best way to accomplish this would probably be a library (owned by Analytics) that handles routing the query to the specific shard, although we're open to any other method.

This library would have a signature something like: run(sql_query, wiki). It wouldn't need to provide for cross-wiki joins or a single staging database accessible from all shards; we are willing to handle those use cases in our own code as long as we have the core ability to run shard-agnostic queries.

Ideally, we would have this functionality available in both Python and R, but R poses an issue for the Analytics team because they have no experience with it. Possible solutions include:

Someone from Product Analytics writes the R client and hands it over to Analytics for maintenance
Analytics writes a command line utility which can be easily wrapped in both Python and R
We make do with just the Python

Details

Subject	Repo	Branch	Lines +/-
Add the 'centralauth' db use case in refinery utils.py	analytics/refinery	master	+6 -0
profile::analytics::refinery: ensure log dir only for 'hdfs'	operations/puppet	production	+15 -13
Add notebook100[3,4] and stat1006 to the scap targets	analytics/refinery/scap	master	+3 -0
role::swap: add analytics-admins admin group	operations/puppet	production	+1 -0
Add profile::analytics::refinery to notebook100[3,4] and stat1006	operations/puppet	production	+21 -7
profile::analytics::refinery: add a wrapper for analytics-mysql	operations/puppet	production	+8 -0
Add support for querying x1	wikimedia/discovery/wmf	master	+95 -26
analytics-mysql: add print-target parameter	analytics/refinery	master	+10 -10
Update for sharding setup	wikimedia/discovery/wmf	master	+191 -26
Introduce analytics-mysql	analytics/refinery	master	+76 -0
profile::analytics::refinery: require python3-dnspthon	operations/puppet	production	+5 -0
profile::analytics::cluster::repo::statistics: fix directory (again)	operations/puppet	production	+1 -11
profile::analytics::cluster::repositories::statistics: fix mw directory	operations/puppet	production	+11 -1
Add common statistics repositories to stat/notebook hosts	operations/puppet	production	+22 -0
Add analytics dbstore SRV records	operations/dns	master	+12 -0

Related Objects
Search...

Status	Assigned	Task
Resolved	elukey	T172410 Replace the current multisource analytics-store setup
Resolved	Jclark-ctr	T216491 Decommission dbstore1002
Resolved	Marostegui	T210478 Migrate dbstore1002 to a multi instance setup on dbstore100[3-5]
Resolved	elukey	T215589 Migrate users to dbstore100[3-5]
Resolved	elukey	T212386 Provide tools for querying MediaWiki replica databases without having to specify the shard
Resolved	nshahquinn-wmf	T216634 Update wmfdata package to support sharded wiki replicas

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

In T212386#4862292, @jcrespo wrote:

There is already an 'sql' tool that developers that query production use without having to know the underlying mediawiki topology (100 servers)- probably could be adapted for analytics hosts?

Interesting! Is there a good link where I can learn more?

I haven't found much on wikitech, so:

marostegui@tools-bastion-03:~$ sql --help
usage: sql [-h] [-v] [-N] [--cluster {analytics,web}] DATABASE ...

Connect to or run SQL query on replica or Tools database server

positional arguments:
  DATABASE              for example commonswiki_p or enwiki
  ...                   SQL query; multiple arguments will be joined by spaces

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase logging verbosity
  -N, --skip-column-names
                        do not write column names in results
  --cluster {analytics,web}
                        cluster to connect to

Report bugs to Phabricator: https://phabricator.wikimedia.org

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_the_database_replicas

@Marostegui I am a bit lost, I though jaime was talking about "prod"databases, but your post is about the replicas on cloud correct?

@Nuria There is apparently 2 tools (or the same, reused), one on production, too:

jynus@deploy1001:/srv/mediawiki-staging$ sql

Execute the MySQL client binary. Non-option arguments will be passed through to
mysql.

Usage: php mysql.php [--cluster|--conf|--dbgroupdefault|--dbpass|--dbuser|--globals|--group|--help|--host|--list-hosts|--memory-limit|--mwdebug|--profiler|--quiet|--server|--wiki|--wikidb|--write] [-- mysql_option ...]

Generic maintenance parameters:
    --help (-h): Display this help message
    --quiet (-q): Whether to suppress non-error output
    --conf: Location of LocalSettings.php, if not default
    --wiki: For specifying the wiki ID
    --globals: Output globals at the end of processing for debugging
    --memory-limit: Set a specific memory limit for the script, "max"
        for no limit or "default" to avoid changing it
    --server: The protocol and server name to use in URLs, e.g.
        http://en.wikipedia.org. This is sometimes necessary because server name
        detection may fail in command line scripts.
    --profiler: Profiler output format (usually "text")
    --mwdebug: Enable built-in MediaWiki development settings

Script dependant parameters:
    --dbuser: The DB user to use for this script
    --dbpass: The password to use for this script
    --dbgroupdefault: The default DB group to use.

Script specific parameters:
    --cluster: Use an external cluster by name
    --group: Specify query group
    --host: Connect to a specific MySQL server
    --list-hosts: List the available DB hosts
    --wikidb: The database wiki ID to use if not the current one
    --write: Connect to the master database

Arguments:
    [-- mysql_option ...]: Options to pass to mysql

Cloud will know more about the first, mediawiki developers (platform, audiences, releng) about the second. Any of the 2 models may be helpful for your case so you do not have to develop one from zero.

@jcrespo it seems we should be able to deploy (out of the box with a new config) the tool existing in prod to the new and upcoming analytics replicas right? Am I missing something why this would not be possible?

Even us roots have one for mysql administration:

root@cumin1001:~$ mysql.py --version
/usr/local/sbin/mysql.py  Ver 15.1 Distrib 10.1.36-MariaDB, for Linux (x86_64) using readline 5.2

I think the point is to reuse some of the existing tools, the one you see more fit, no to reinvent the wheel, while getting "free" support.

In T212386#4872811, @Nuria wrote:

@jcrespo it seems we should be able to deploy (out of the box with a new config) the tool existing in prod to the new and upcoming analytics replicas right? Am I missing something why this would not be possible?

I agree with you and I don't see why you couldn't do that. :-) My only petition at T212386#4862319 was to make maintenance easy for us, be it with a puppetized configuration or any other method.

In T212386#4872834, @jcrespo wrote:

In T212386#4872811, @Nuria wrote:

@jcrespo it seems we should be able to deploy (out of the box with a new config) the tool existing in prod to the new and upcoming analytics replicas right? Am I missing something why this would not be possible?

I agree with you and I don't see why you couldn't do that. :-) My only petition at T212386#4862319 was to make maintenance easy for us, be it with a puppetized configuration or any other method.

Maintaining those CNAMEs is fine for us, I thought about something similar in the beginning but then I dropped the idea because of the mysql ports (namely I wasn't clear about how to map dbstore100X:port with DNS records). Can we discuss about how to implement these? I am really curious :)

In T212386#4872808, @jcrespo wrote:

Cloud will know more about the first, mediawiki developers (platform, audiences, releng) about the second. Any of the 2 models may be helpful for your case so you do not have to develop one from zero.

Very interesting. Is there a preferred one that would fit better with the new dbstore100X architecture? I mean if there is one that you guys would prefer to see re-used. Otherwise I'll check both and come up with one that fits best for Analytics :)

elukey moved this task from Backlog to In Progress on the User-Elukey board.Jan 11 2019, 3:01 PM

Can we discuss about how to implement these?

We use "standard" ports (decided before I joined): s1-sX go to 331X, x1 goes to 3320, m1-mX misc go to 332X. I have the code to implement that. My biggest concern is for us to have easy to change ips and move sections from host to host for maintenance reasons, dns or configuration is not an issue as long as it is an obvious way, without requiring complex deployments.

Otherwise I'll check both and come up with one that fits best for Analytics :)

Please do, we aren't really familiar with those really, or ask the team member I mentioned.

nshahquinn-wmf mentioned this in T172410: Replace the current multisource analytics-store setup.Jan 16 2019, 1:48 AM

Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.Jan 16 2019, 3:15 PM

Product Analytics met with @elukey and @Milimetric on Monday and had a good discussion about this project, which they are planning to work on (full notes).

There were a couple of action items for me:

Write [Python] pseudo-code demonstrating how I would invoke and use a hypothetical library used to make the queries.
Test the existing tools discussed here and determine whether that level of functionality works for us (it likely does)

In T212386#4886726, @Neil_P._Quinn_WMF wrote:

Product Analytics met with @elukey and @Milimetric on Monday and had a good discussion about this project, which they are planning to work on (full notes).

There were a couple of action items for me:

Write [Python] pseudo-code demonstrating how I would invoke and use a hypothetical library used to make the queries.

Test the existing tools discussed here and determine whether that level of functionality works for us (it likely does)

I continued this discussion with @Nuria today, and she clarified that the team is going to try installing the production SQL script first (on the stat/notebook machines, I assume?) and see how that works. Given that, for now I'll focus for now on evaluating that tool and leave aside thinking about pseudocode/data library integration.

Thanks to Jaime's suggestions and Manuel's patient explanations about sections/etc.. I came up with the following prototype (very quick and dirty):

import argparse

MEDIAWIKI_CONFIG_PATH = '/Users/ltoscano/WikimediaSource/mediawiki-config/dblists/{}.dblist'

def parse_db_config(db_sections):
    db_mapping = {}
    for section in db_sections:
        if section == 'x1':
            dblist_path = MEDIAWIKI_CONFIG_PATH.format('all')
        else:
            dblist_path = MEDIAWIKI_CONFIG_PATH.format(section)

        with open(dblist_path, 'r') as f:
            for db in f.readlines():
                db_mapping[db.strip()] = section

    return db_mapping


def main(database_name, sections):
    db_mapping = parse_db_config(sections)
    try:
        section = db_mapping[database_name]
    except KeyError:
        raise RuntimeError("The database {} is not listed among the dblist files of the supported sections {}"
                            .format(database_name, str(sections)))
    if sections == ['x1']:
        port = '3320'
        host = 'x1.replica.analytics.svc.eqiad.wmnet'
        mysql_connect_string = "mysql -h {} -p {} {}".format(host, port, database_name)
    else:
        port = "331{}".format(section[1])
        host = "s{}.replica.analytics.svc.eqiad.wmnet".format(section[1])
        mysql_connect_string = "mysql -h {} -p {} {}".format(host, port, database_name)
    print(mysql_connect_string)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Test script to generate mysql conn string')
    parser.add_argument('database', type=str, help='Database name to connect to (like "enwiki")')
    parser.add_argument('--use-x1', dest='x1', action='store_true', default=False,
                        help='Use the x1 section (all dbs) or the sX sections.')
    args = parser.parse_args()

    if args.x1:
        sections = ['x1']
    else:
        sections = ['s1', 's2', 's3', 's4', 's5', 's6']
    main(args.database, sections)

The idea is basically to have the mediawiki-config repo checked out somewhere, read the .dblist configurations and then figure out the hostname to connect to. Examples:

#~ python3 mysql.py enwiki
mysql -h s1.replica.analytics.svc.eqiad.wmnet -p 3311 enwiki

#~ python3 mysql.py --use-x1 enwiki
mysql -h x1.replica.analytics.svc.eqiad.wmnet -p 3320 enwiki

#~ python3 mysql.py itwiki
mysql -h s2.replica.analytics.svc.eqiad.wmnet -p 3312 itwiki

All the .svc.eqiad.wmnet domains will be created later on, they will be CNAMEs to dbstore100X.eqiad.wmnet. The script basically implements the convention outlined in T212386#4873022 and prints the mysql connect string. It shouldn't be difficult to use a library like pymysql to send a query to the host, together with basic username/password auth (via TLS?). My idea would be to create a python script to deploy with puppet on the stat boxes.

I am going to review the other tools listed since it would be great to re-use them, but at the same time I am a bit scared about the time needed to 1) learn their code 2) modify them 3) comply with deployment/testing/etc.. standards. For example, I'd wouldn't love to write PHP and risk after each deployment to break other mediawiki scripts/users...

I don't have comments about the script itself, but:
['s1', 's2', 's3', 's4', 's5', 's6']

is hardcoded, however, it misses s7 and s8. Consider in any case not hardcoding sections, as they may change in the future (s0 or s9).

Thanks for the suggestion! This is a very quick and dirty prototype to show up the main idea for the moment :)

Another version that is inspired from mysql.py in the puppet repo:

import argparse
import getpass
import glob
import re
import sys


def parse_db_config(mw_config_path, use_x1):
    db_mapping = {}
    if use_x1:
        dblist_section_paths = [mw_config_path.rstrip('/') + '/all.dblist']
    else:
        dblist_section_paths = glob.glob(mw_config_path.rstrip('/') + '/s[0-9]*.dblist')

    for dblist_section_path in dblist_section_paths:
        with open(dblist_section_path, 'r') as f:
            for db in f.readlines():
                db_mapping[db.strip()] = dblist_section_path.strip().rstrip('.dblist').split('/')[-1]

    return db_mapping


def main(database_name, mw_config_path, use_x1, mysql_args):
    db_mapping = parse_db_config(mw_config_path, use_x1)
    try:
        section = db_mapping[database_name]
    except KeyError:
        raise RuntimeError("The database {} is not listed among the dblist files of the supported sections."
                            .format(database_name))

    args = []

    if use_x1:
        mysql_args.extend(['-h', 'x1-analytics.eqiad.wmnet', '-P', '3320'])
    else:
        port = "331{}".format(section[1])
        host = "s{}-analytics.eqiad.wmnet".format(section[1])
        mysql_args.extend(['-h', host, '-P', port])

    args.extend(mysql_args)
    args.append(database_name)

    print(args)
    # sys.exit(os.execvp('mysql', mysql_args))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Test script to generate mysql conn string')
    parser.add_argument('database', type=str, help='Database name to connect to (like "enwiki")')
    parser.add_argument('--use-x1', dest='x1', action='store_true', default=False,
                        help='Use the x1 section (all dbs) or the sX sections.')
    parser.add_argument('--mw-config-path', dest='mw_config_path', default='/srv/deployment/mediawiki-config',
                        help='Path of the MediaWiki config repository. Default: /srv/deployment/mediawiki-config/dblists')
    
    args, unknown = parser.parse_known_args()

    main(args.database, args.mw_config_path, args.x1, unknown)

#~ python3 mysql.py itwiki --mw-config-path /Users/ltoscano/WikimediaSource/mediawiki-config/dblists/ -u research -p
['-u', 'research', '-p', '-h', 's2.replica.analytics.svc.eqiad.wmnet', '-P', '3312', 'itwiki']

#~  python3 mysql.py itwiki --mw-config-path /Users/ltoscano/WikimediaSource/mediawiki-config/dblists/ -u research -p --use-x1
['-u', 'research', '-p', '-h', 'x1.replica.analytics.svc.eqiad.wmnet', '-P', '3320', 'itwiki']

So at the moment it only prints a list of arguments, but eventually this should exec a mysql command. In this way we basically create a light wrapper that builds the correct hostname/port to connect to and then passes it to the mysql command. The benefit is that all the mysql client options would be available and probably existing scripts wouldn't need a lot of changes.

@Neil_P._Quinn_WMF thoughts? This is still a prototype of course, just to kick off the conversation :)

@Neil_P._Quinn_WMF:

You will use this tool something like:

>python mysql-access.py itwiki --mw-config-path ./mediawiki-config/dblists --use-x1 -e "select.sql"

and it will figure out the shard in which itwiki is and run your query ('select.sql') there. This is similar how would you run a query in hive (hive -f some-sql.sql) and it will return "raw" results.

Now, the sql running will be done via python so you can also use this tool to build your connection string like:

mysql --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf -h s2.replica.analytics.svc.eqiad.wmnet:3312

Although this latest was not working for me when I tried but I am not sure if password is the same

In T212386#4910158, @Nuria wrote:
Now, the sql running will be done via python so you can also use this tool to build your connection string like:
mysql --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf -h s2.replica.analytics.svc.eqiad.wmnet:3312
Although this latest was not working for me when I tried but I am not sure if password is the same

The DNS records are still pending creation (together with the research user credentials), I'll create them after all hands :)

Addshore moved this task from Incoming to Watching / External on the WMDE-Analytics-Engineering board.Jan 29 2019, 9:22 AM

I had the chance to have a chat with a lot of people during all hands, this is a summary:

Mikhail/Chelsy mostly use R for their work, and it seems that the move to a multi dbstore hosts setup should not impact their work too much.
Tilman/Morten/Neil are using mostly python ad-hoc scripts for their daily work, so as starting point we agreed to 1) add snippets of Python code to generate a host:port mysql combination (to create a valid connection) from a dbname 2) have a tool like the one described above to wrap mysql commands for quick and interactive exploratory use cases.

I already started to modify https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas :)

Milimetric mentioned this in T215290: update mw scooping to be able to scoop from new db cluster .Feb 5 2019, 9:46 PM

Change 488473 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] [WIP] Introduce analytics-mysql

https://gerrit.wikimedia.org/r/488473

gerritbot added a project: Patch-For-Review.Feb 6 2019, 3:27 PM

Change 489170 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add analytics dbstore SRV records

https://gerrit.wikimedia.org/r/489170

If https://gerrit.wikimedia.org/r/489170 is approved by SRE (should be), the logic to find the hostname/port of the right mysql instance will be reduced. The idea is to add DNS SRV records to help:

import dns.resolver

answers = dns.resolver.query('_s1-analytics._tcp.eqiad.wmnet', 'SRV')
host, port = answers[0].target, answers[0].port
print(host,port)

Basically the SRV record itself (that looks weird I know, but it needs to follow some guidelines, see https://en.wikipedia.org/wiki/SRV_record) will contain the host/port combination, so the logic to establish the correct port is not needed anymore.

I like it, thanks @elukey !

In T212386#4927285, @elukey wrote:

I had the chance to have a chat with a lot of people during all hands, this is a summary:

Mikhail/Chelsy mostly use R for their work, and it seems that the move to a multi dbstore hosts setup should not impact their work too much.

Tilman/Morten/Neil are using mostly python ad-hoc scripts for their daily work, so as starting point we agreed to 1) add snippets of Python code to generate a host:port mysql combination (to create a valid connection) from a dbname 2) have a tool like the one described above to wrap mysql commands for quick and interactive exploratory use cases.

I already started to modify https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas :)

Thank you for taking the time to track all of us down and learn about our workflows! These plans definitely seem like the right track to me. Also, using SRV records is a quite an elegant trick 😁

nshahquinn-wmf added a project: Product-Analytics.Feb 10 2019, 3:25 PM

nshahquinn-wmf moved this task from Triage to Tracking on the Product-Analytics board.

Change 489170 merged by Elukey:
[operations/dns@master] Add analytics dbstore SRV records

https://gerrit.wikimedia.org/r/489170

SRV records deployed and documentation updated! If anybody could give it a try, I'd be really glad :)

https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas

In the meantime, I am also trying to complete the python wrapper for the mysql client.

Change 489660 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add common statistics repositories to stat/notebook hosts

https://gerrit.wikimedia.org/r/489660

In T212386#4942636, @elukey wrote:

SRV records deployed and documentation updated! If anybody could give it a try, I'd be really glad :)

I just tried it out on notebook1003, and it seems to work!

import dns.resolver
import mysql.connector as mysql


query = "select count(*) from recentchanges"
ans = dns.resolver.query('_s7-analytics._tcp.eqiad.wmnet', 'SRV')[0]
conn = mysql.connect(
    host=str(ans.target), 
    port=ans.port, 
    database="hewiki",
    option_files='/etc/mysql/conf.d/research-client.cnf',
    charset='binary',
    autocommit=True
)
cursor = conn.cursor()
cursor.execute(query)
cursor.fetchall()

returned

[(436448,)]

@Neil_P._Quinn_WMF Thanks a lot! Mind if I add your snipped to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas as example?

EDIT: I went ahead and added it, it is very informative/cool, hope you don't mind!

Change 489660 merged by Elukey:
[operations/puppet@production] Add common statistics repositories to stat/notebook hosts

https://gerrit.wikimedia.org/r/489660

As FYI to everybody, puppet is currently deploying a copy of mediawiki-config on all stat/notebook hosts under /srv/mediawiki-config.

Change 489723 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::repositories::statistics: fix mw directory

https://gerrit.wikimedia.org/r/489723

Change 489723 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::repositories::statistics: fix mw directory

https://gerrit.wikimedia.org/r/489723

Change 489724 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::repo::statistics: fix directory (again)

https://gerrit.wikimedia.org/r/489724

Change 489724 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::repo::statistics: fix directory (again)

https://gerrit.wikimedia.org/r/489724

In T212386#4943549, @elukey wrote:

@Neil_P._Quinn_WMF Thanks a lot! Mind if I add your snipped to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas as example?

Definitely no problem 😁

I think this ticket can be closed. Work on migrating cluster continues this week but docs should be up to date, firewall rules were updated yesterday: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas

• Nuria closed this task as Resolved.Feb 12 2019, 8:47 PM

Actually, I would like to request for https://github.com/wikimedia/operations-mediawiki-config/tree/master/dblists to have a single file I can download which has a mapping.

As it is, there's no way to know how many s*.dblist files there are without doing some querying of the GitHub API first. I started looking into that and it is not trivial. The reason I can't just hardcode it in to fetch the latest versions of s1-s8 is because that won't help if more shards are added or if shards are removed. Something like a s-dblist.yaml (JSON is also fine) in that directory would be tremendously helpful in my efforts to make the new sharding solution work with our R package for querying MySQL. For example:

- s: 1
  dblist:
  - enwiki
- s: 2
  dblist:
  - bgwiki
  - bgwiktionary
  …
…
- s: 8
  dblist:
  - wikidatawiki

Then I can just fetch the latest mapping from that file and use the ruleset described in https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas to set the ports (AFAIK there's no way for me to resolve DNS SRV records dynamically in R).

@elukey: would that be possible?

Important, dblists are not the canonical place for database distribution (while it tries to be in sync). The canonical method is the array at the sectionsByDB array:

https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/db-eqiad.php;c178d2ccf57d28434c8acf06b5b793125ff25e1b$28?as=source&blame=off

I have a mediawiki-config to json parser at: https://gerrit.wikimedia.org/r/#/c/operations/software/+/280947/1/dbhell/db-config-json.php if it is helpful for anyone

In T212386#4951229, @jcrespo wrote:

Important, dblists are not the canonical place for database distribution (while it tries to be in sync). The canonical method is the array at the sectionsByDB array:

https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/db-eqiad.php;c178d2ccf57d28434c8acf06b5b793125ff25e1b$28?as=source&blame=off

Thanks for the pointer! I should be able to extract & parse the relevant parts of https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/db-eqiad.php?as=source&blame=off&view=raw

@jcrespo: is it safe to assume that the current config of s3 (default) will stay that way? and if not, can I assume that the shard which is designated as the default one will have "(default)" in the comment?

Yes, the most likely changes in the future are an s0 (closed wikis with minimal resources) and an s9, but none of those are planned ATM. The most likely changes that will happen soon is rebalancing s3 by moving wikis away from them to other sections. Don't relay on code comments though. it is ok to not do it perfectly- but there is always a limit on the abstraction.

We are working right now on a metadata database (tendril/dbtree replacement) with may be a better option (API) than mediawiki-config at some point in the future (no ETA, though).

kzimmerman mentioned this in T216055: Move backend for current search dashboard to pull data from Hadoop.Feb 13 2019, 5:59 PM

Change 490604 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery: require python3-dnspthon

https://gerrit.wikimedia.org/r/490604

Change 490604 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery: require python3-dnspthon

https://gerrit.wikimedia.org/r/490604

Change 490618 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/wmf@master] Update for sharding setup

https://gerrit.wikimedia.org/r/490618

Change 488473 merged by Milimetric:
[analytics/refinery@master] Introduce analytics-mysql

https://gerrit.wikimedia.org/r/488473

mpopov mentioned this in R1821:fa6def7b9934: Update for sharding setup.Feb 14 2019, 3:41 PM

mpopov mentioned this in R1821:a85102e48e68: Update for sharding setup.Feb 14 2019, 8:25 PM

@elukey: is there a recommendation for how to sqoop with the shards? since a shell command would look like:

sqoop import --connect jdbc:mysql://analytics-store.eqiad.wmnet/wikidatawiki:3306 -m 4 \
  --password-file … \
  --username research \
  --query 'SELECT … WHERE $CONDITIONS' \
  --split-by … \
  --as-avrodatafile \
  --target-dir /tmp/… \
  --delete-target-dir

does it need to be wrapped in a Python/R script which inserts the appropriate hostname, dbname, and port details in the --connect option?

@mpopov yes exactly, instead of using analytics-store.eqiad.wmnet and the 3306 port you'll have to find the proper host:port combination following https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas

Change 490618 merged by Bearloga:
[wikimedia/discovery/wmf@master] Update for sharding setup

https://gerrit.wikimedia.org/r/490618

As FYI to everybody, with Thursday's Analytics Refinery deployment we'll deploy a wrapper to the mysql client command on stat1007/4 to ease the job of data exploration. This is an example from a local checkout:

elukey@stat1007:~/refinery$ PYTHONPATH=${PYTHONPATH}:/home/elukey/refinery/python python3 bin/analytics-mysql enwiki -e "show tables"
Executing: ['/usr/bin/mysql', '-e', 'show tables', '-h', 'dbstore1003.eqiad.wmnet.', '-P', '3311', 'enwiki']
+---------------------------+
| Tables_in_enwiki          |
+---------------------------+
[..]

The tool will probably have some bugs in the beginning, but I am pretty confident that we'll iron them out quickly. So I'd ask you to report any weirdness found in your daily usage as soon as possible :)

@nettrom_WMF --^ :)

@elukey, to help with @mpopov's question, could the wrapper have a mode where it just outputs the proper hostname? Then he could do something like:

sqoop import --connect jdbc:mysql://$(analytics-mysql --output wikidatawiki)
...

BTW, it should be easier to run the wrapper script if you export PYTHONPATH and put refinery/bin in your PATH in your ~/.bash_profile:

export PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python
export PATH=${PATH}:/srv/deployment/analytics/refinery/bin

Then you can just use it like

analytics-mysql ...

In T212386#4964864, @Ottomata wrote:
@elukey, to help with @mpopov's question, could the wrapper have a mode where it just outputs the proper hostname? Then he could do something like:
sqoop import --connect jdbc:mysql://$(analytics-mysql --output wikidatawiki)
...

Sure, we can definitely work on a shared sqoop wrapper, I am now concentrating on the first tools that we agreed upon a while ago, then we'll do the rest. I am not working on the sqoop porting to the new environment, so it might be something that we'll release as part of our migration to the new set of dbstore nodes. I am not trying to avoid providing tools, on the contrary, there is too many things atm so I am trying to prioritize :)

BTW, it should be easier to run the wrapper script if you export PYTHONPATH and put refinery/bin in your PATH in your ~/.bash_profile:
export PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python
export PATH=${PATH}:/srv/deployment/analytics/refinery/bin
Then you can just use it like
analytics-mysql ...

Yes this is the idea, I was waiting the first deployment of the refinery to be able to create a bash wrapper in puppet :) The example above was only to give an idea :)

Sure, we can definitely work on a shared sqoop wrapper

I don't even mean a sqoop wrapper! Just I think the script should be able to output the proper hostname:port, maybe in a 'dry-run'n mode, rather than always connecting via mysql CLI.

analytics-mysql --output (or --dry-run?) enwiki

Would just output the hostname:port for easy use with other tools.

In T212386#4964927, @Ottomata wrote:
Sure, we can definitely work on a shared sqoop wrapper

I don't even mean a sqoop wrapper! Just I think the script should be able to output the proper hostname:port, maybe in a 'dry-run'n mode, rather than always connecting via mysql CLI.
analytics-mysql --output (or --dry-run?) enwiki
Would just output the hostname:port for easy use with other tools.

Didn't read it carefully, this seems a great idea, going to work on it asap!

In T212386#4964982, @elukey wrote:
In T212386#4964927, @Ottomata wrote:
Sure, we can definitely work on a shared sqoop wrapper

I don't even mean a sqoop wrapper! Just I think the script should be able to output the proper hostname:port, maybe in a 'dry-run'n mode, rather than always connecting via mysql CLI.
analytics-mysql --output (or --dry-run?) enwiki
Would just output the hostname:port for easy use with other tools.
Didn't read it carefully, this seems a great idea, going to work on it asap!

+1! Just wanted to voice my support for this as it would be a really neat solution! pinging @Neil_P._Quinn_WMF as this development may be of interest to him

Change 491520 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] analytics-mysql: add print-target parameter

https://gerrit.wikimedia.org/r/491520

Change 491528 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery: add a wrapper for analytics-mysql

https://gerrit.wikimedia.org/r/491528

Change 491557 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/wmf@master] Add support for querying x1

https://gerrit.wikimedia.org/r/491557

mpopov mentioned this in R1821:153bd282525d: Add support for querying x1.Feb 19 2019, 7:16 PM

Change 491520 merged by Joal:
[analytics/refinery@master] analytics-mysql: add print-target parameter

https://gerrit.wikimedia.org/r/491520

Tried using analytics-mysql on stat1007 and got "permission denied". Follow-up question: will it be made available on SWAP?

@mpopov still not ready sorry, the last patch will fix it after the next refinery deployment (should happen tomorrow EU time). It will be available where refinery is deployed, so currently not on notebooks, but we can thing about it if you guys need it :)

Ya, let's deploy refinery to notebooks, it'd be nice to have there too.

Change 491557 merged by Bearloga:
[wikimedia/discovery/wmf@master] Add support for querying x1

https://gerrit.wikimedia.org/r/491557

Change 491528 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery: add a wrapper for analytics-mysql

https://gerrit.wikimedia.org/r/491528

@mpopov you should be able to test analytics-mysql from stat1007 :) (it is already in your PATH so you can use it without PYTHONPATH etc.. just 'analytics-mysql' and parameters).

Change 491824 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add profile::analytics::refinery to notebook100[3,4] and stat1006

https://gerrit.wikimedia.org/r/491824

Change 491827 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery/scap@master] Add notebook100[3,4] and stat1006 to the scap targets

https://gerrit.wikimedia.org/r/491827

Change 491824 merged by Elukey:
[operations/puppet@production] Add profile::analytics::refinery to notebook100[3,4] and stat1006

https://gerrit.wikimedia.org/r/491824

Change 491908 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::swap: add analytics-admins admin group

https://gerrit.wikimedia.org/r/491908

Change 491908 merged by Elukey:
[operations/puppet@production] role::swap: add analytics-admins admin group

https://gerrit.wikimedia.org/r/491908

Change 491909 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery: ensure log dir only for 'hdfs'

https://gerrit.wikimedia.org/r/491909

Change 491909 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery: ensure log dir only for 'hdfs'

https://gerrit.wikimedia.org/r/491909

@mpopov: analytics-mysql deployed on notebooks and stat1006, I have also updated https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas :)

elukey moved this task from In Progress to Done on the User-Elukey board.Feb 21 2019, 8:16 AM

Change 491827 merged by Elukey:
[analytics/refinery/scap@master] Add notebook100[3,4] and stat1006 to the scap targets

https://gerrit.wikimedia.org/r/491827

nshahquinn-wmf edited parent tasks, added: T215589: Migrate users to dbstore100[3-5]; removed: T172410: Replace the current multisource analytics-store setup.Feb 21 2019, 6:12 PM

All good if we close this task? Or is there anything else pending?

Change 494496 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] Add the 'centralauth' db use case in refinery utils.py

https://gerrit.wikimedia.org/r/494496

Change 494496 merged by Elukey:
[analytics/refinery@master] Add the 'centralauth' db use case in refinery utils.py

https://gerrit.wikimedia.org/r/494496

In T212386#4974798, @elukey wrote:

All good if we close this task? Or is there anything else pending?

I consider silence as green light for closing, let me know otherwise :)

elukey closed this task as Resolved.Mar 6 2019, 8:41 AM

nshahquinn-wmf closed subtask T216634: Update wmfdata package to support sharded wiki replicas as Resolved.Jul 29 2019, 6:24 PM

	nshahquinn-wmf
	Dec 20 2018, 12:27 AM

Provide tools for querying MediaWiki replica databases without having to specify the shardClosed, ResolvedPublic13 Estimated Story PointsActions

Description

Details

Related ObjectsSearch...

Event Timeline

Provide tools for querying MediaWiki replica databases without having to specify the shard
Closed, ResolvedPublic13 Estimated Story Points
Actions

Related Objects
Search...