Retrieve host & port info when connecting to MariaDB replicas on the cluster
Open, Needs Triage, Public

Description

As seen in wmfdata.mariadb, the connection code relies on the analytics-mysql utility (source), which prints the host and port info that is then parsed.

This requires that utility to be available at the system level. One idea is to factor that functionality out of refinery into a separate small package that both refinery and wmfdata-python can use, but changing the deployment strategy for refinery to accommodate that is not trivial.

Either way, wmfdata-py needs to get that information somehow when it's running on the cluster and used in an Airflow data pipeline.

Event Timeline

There are some example Python scripts here, which show how to use DNS queries to look up the host and port number for a particular database section.

https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB#Database_setup

Specifically, you can look up a SRV record for them. Here's an example using the dig utility.

btullis@stat1004:~$ dig +short -t srv _s1-analytics._tcp.eqiad.wmnet
0 1 3311 dbstore1003.eqiad.wmnet.
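
For reference, the same lookup could be sketched in Python by shelling out to dig and parsing the four SRV fields (priority, weight, port, target). This is only a rough sketch: it assumes dig is installed, and that other sections follow the same `_<section>-analytics._tcp.eqiad.wmnet` naming as the s1 example above. The function names are made up for illustration and are not part of wmfdata.

```python
import subprocess
from typing import Tuple


def srv_name_for_section(section: str) -> str:
    # Assumption: every section follows the naming pattern of the s1 example.
    return f"_{section}-analytics._tcp.eqiad.wmnet"


def parse_srv_line(line: str) -> Tuple[str, int]:
    # A "dig +short" SRV answer has four fields: priority, weight, port, target.
    _priority, _weight, port, target = line.split()
    # Drop the trailing dot of the fully qualified target name.
    return target.rstrip("."), int(port)


def lookup_host_port(section: str) -> Tuple[str, int]:
    # Runs the same query as the dig command above and uses the first answer.
    result = subprocess.run(
        ["dig", "+short", "-t", "srv", srv_name_for_section(section)],
        capture_output=True,
        text=True,
        check=True,
    )
    answers = [line for line in result.stdout.splitlines() if line.strip()]
    if not answers:
        raise LookupError(f"No SRV record found for section {section!r}")
    return parse_srv_line(answers[0])
```

Calling `lookup_host_port("s1")` from a host that can resolve these records should give the same host and port as the dig output above. A DNS library could replace the subprocess call, at the cost of an extra dependency.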

Xabriel published https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/6743db0e987a4567352eec4277e5a7f4092de423/notebooks/Access%20MariaDB%20From%20Cluster.ipynb with some very cool code and examples!

@xcollazo get_mariadb_host_port_for_wikidb() would be an excellent addition to wmfdata.utils; wmfdata could then fall back on it if analytics-mysql isn't available.
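
Something like the following could express that fallback. It is a hedged sketch, not existing wmfdata code: `resolve_host_port` and its parameters are hypothetical, and the two resolver callables stand in for the current analytics-mysql parsing and for a DNS-based function such as get_mariadb_host_port_for_wikidb(), whose actual signature is assumed here to take a wiki database name and return a (host, port) pair.

```python
import shutil
from typing import Callable, Optional, Tuple

HostPort = Tuple[str, int]


def resolve_host_port(
    wikidb: str,
    via_utility: Callable[[str], HostPort],
    via_dns: Callable[[str], HostPort],
    utility: str = "analytics-mysql",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> HostPort:
    # Prefer the system utility when it is on PATH (for example on stat hosts).
    if which(utility) is not None:
        return via_utility(wikidb)
    # Otherwise (for example inside an Airflow task) fall back to DNS lookups.
    return via_dns(wikidb)
```

Passing `which` as a parameter is only there to make the PATH check easy to stub in tests.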