
Import recommendations into production database
Closed, ResolvedPublic

Description

Research would like to load recommendations into a production MySQL database. @bmansurov would like to pair with someone from SRE to work on this task:

  • Data to be imported is at stat1007:/home/bmansurov/tp9/article-recommender-deploy/predictions-06032018-11302018. This data was generated using the Wikidata dumps of 10/01/2018 and page views up to 11/30/2018. Language pairs included are: en-es, en-fa, and ru-uz.
  • The import script is here.
  • Database schema
  • Script that inserts data into the database

DB

host to connect: m2-master.eqiad.wmnet
db: recommendationapi
TLS: disabled
user: recommendationapi and recommendationapiservice (for read only)
credentials: on private puppet (class passwords::recommendationapi::mysql, $recommendationapi_pass & $recommendationapiservice_pass)

A/C

  • Create a Puppet script that loads data from a git repository.
  • Set up the script so that a configuration change will load the data automatically.
  • Make sure the data is versioned. We'll periodically remove old versions of the data once we make sure the new version is better than the old ones.
  • Actually load 'en-es', 'ru-uz', and 'en-fa' into the database.

Event Timeline

colewhite triaged this task as Medium priority.
jcrespo subscribed.

Sorry, I don't have bandwidth to work on this; please search for another op.

Hey @bmansurov -- stepping in for @mark while he's on vacation this week.

We'll discuss internally and let you know! How soon do you need this and is there a timeline you're working on? The DBA team is short on capacity until next week due to PTOs and emergencies, so most likely we'll have a response for you late next week or so; does that work? Thanks!

@faidon They don't need a DBA, they are searching for someone to support them with puppetization. We already attended the DBA tasks at T205294; the rest can be handled by anyone with production access or knowledge. They can wait for me to have the time, of course, but if they do, they may be waiting forever :-)

@faidon thanks! Next week sounds good. We're having our offsite this week, so there's no rush.

@bmansurov I can help with puppetization. Just need to know a bit more what you actually need. Let's chat about that after the holidays or feel free to update with more details here, whatever seems easier.

@Dzahn, thanks! OK, Iʼll ping you on Monday.

@Dzahn I've updated the description with acceptance criteria. Let me know if it doesn't make sense. Also, how about we pair tomorrow at your convenience? I was thinking of a Hangout call or an IRC chat.

Yep, we had an IRC chat and i already know more and we will continue tomorrow.

Change 476098 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] create profile::article_recommender

https://gerrit.wikimedia.org/r/476098

summary from IRC:

  • 2 new Gerrit repos have been requested, one regular one and one deploy repo; code will move from GitHub over there
  • instead of using pip, the Debian packages 'python-numpy' and 'python-pandas' will be used; these are already installed by puppet, so no action is needed there
  • once the deploy repo exists we will let puppet git clone from it

These things will ensure stat1007 can be used for creating dumps without having to manually install software.

Banyek subscribed.

I'll check what's needed to achieve these goals from the DBA perspective

@Banyek thanks for helping. While we port repositories to Gerrit, here's the database related part of the task. We can discuss it on IRC (my nick is bmansurov and I'm on Research or you can /query me directly) and I'll explain what I'm trying to achieve.

as we were talking with @bmansurov I learned that we need to keep the old data around in case a new import turns out not to be working.
my recommendation is to have tables with a date suffix, import data into those, and point a view at the version in use. After we've confirmed the new data is good we can drop the old table
eg.

(pseudo)
[import data into article_recommendation_20181128]
CREATE OR REPLACE VIEW article_recommendation AS SELECT * FROM article_recommendation_20181128
[use article_recommendation until the next import]
[import data into article_recommendation_20181221]
CREATE OR REPLACE VIEW article_recommendation AS SELECT * FROM article_recommendation_20181221
[use article_recommendation until the next import]
[when the new data is confirmed good, DROP TABLE article_recommendation_20181128]
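
For illustration, a minimal Python sketch of this table-swap pattern (pymysql, the column names, and the password handling are assumptions; the actual import is done by the deploy script linked in the description):

# Sketch only: load into a dated table, then repoint the stable view.
# Column names (source_id, target_id, score) are hypothetical.
import pymysql

def import_with_view_swap(rows, version="20181128"):
    table = "article_recommendation_%s" % version  # one dated table per import
    conn = pymysql.connect(host="m2-master.eqiad.wmnet", port=3306,
                           user="recommendationapi",
                           password="<from private puppet>",
                           database="recommendationapi")
    try:
        with conn.cursor() as cur:
            # 1. Load the new data into the version-suffixed table.
            cur.executemany(
                "INSERT INTO %s (source_id, target_id, score) VALUES (%%s, %%s, %%s)" % table,
                rows)
            # 2. Repoint the stable view at the new table.
            cur.execute(
                "CREATE OR REPLACE VIEW article_recommendation AS SELECT * FROM %s" % table)
        conn.commit()
    finally:
        conn.close()
    # 3. Older article_recommendation_YYYYMMDD tables are dropped later,
    #    once the new data is confirmed good.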

Also talked with bmansurov about the puppetization part. There is my pending Gerrit change for the git clone part (pending the repos existing on Gerrit) and creating the local dirs.

Then more specifically it was about how to puppetize the MySQL password. We have precedent for that in puppet: we make it write a .my.cnf and then allow the admin group "researchers" to read that file. That's how we've been doing it before, so i was going to just copy that code, and it makes sense since bmansurov is of course in 'researchers'.

But then we realized that the researchers admin group is only applied on hosts with role statistics::cruncher (e.g. stat1006) and not on hosts with role statistics::private (e.g. stat1007), and the reason he already has shell access there is via other groups (analytics-privatedata / statistics-privatedata).

Using these groups wouldn't be ideal though, as the members are not researchers and doing completely other things. And using the other hosts that already have researchers access isn't feasible because they don't have required software installed.

From there i recommended creating an access request to add researchers to "hosts that have requirements to create recommendation api dumps", which is now:

T210757

and i tried to explain all the details there.

With that access request we would then be able to copy existing code like:

# = Define: statistics::mysql_credentials
# Sets up mysql credentials for a given user group to access
# the research dbs
define statistics::mysql_credentials(

..
   # This file will render at
    # /etc/mysql/conf.d/research-client.cnf.

and add this password in the same manner.
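
For what it's worth, a minimal sketch of how an import script could consume such a rendered snippet (configparser/pymysql and the recommendationapi-client.cnf filename are assumptions; only research-client.cnf exists today):

# Sketch: read the puppet-rendered client config and connect through the proxy.
# The exact filename for the recommendationapi credentials is hypothetical.
import configparser
import pymysql

def connect_from_cnf(path="/etc/mysql/conf.d/recommendationapi-client.cnf"):
    cnf = configparser.ConfigParser(allow_no_value=True)
    cnf.read(path)
    client = cnf["client"]
    return pymysql.connect(host="m2-master.eqiad.wmnet", port=3306,
                           user=client["user"], password=client["password"],
                           database="recommendationapi")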

Change 476098 abandoned by Dzahn:
create profile::research::article_recommender

Reason:
https://phabricator.wikimedia.org/T210757#4787737

https://gerrit.wikimedia.org/r/476098

See T210757#4786370 for the latest status. Things have changed since Nuria pointed out Hadoop should be used instead.

@Banyek, I was about to help @bmansurov do a manual import of his data into the recommendationapi database on m2-master, while the productionization stuff is figured out.

I think we need a larger understanding of how this import should actually work, and where it should run from. Do we have anything else like this, where a non-MediaWiki app needs to write into a production MySQL instance? Would it be better for this service to have its own MySQL instance or data store somehow? How does ORES work? Does it have its own MySQL datastore?

@Banyek here's the repository that I'm planning on using to import data. The data lives in stat1007.

Ah, nm, my question is answered on this ticket:
https://phabricator.wikimedia.org/T203039#4574768

This db is in a misc MySQL instance.

Ok, @Banyek then the Q is: May I copy @bmansurov's data and import script over to a place that has access to m2-master:3306 (stat1007 does not have this) and then run the import script to import the data into the recommendationapi database?

@Ottomata Yes, I don't see anything against this. Just make sure that the data is copied over a secure channel and gets removed from both the export and import servers after the import has finished.

ehm. i spent time on puppetizing this to make sure bmansurov's import script gets installed in a way that doesn't conflict with server access policies. (L3)

And then i abandoned it because i was told that Hadoop should be used instead. Now it sounds like we are copying the same software manually to another place.

All we'd have to do is change which server uses the role then? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476098/11/modules/profile/manifests/research/article_recommender.pp

Yah thanks for that @Dzahn! The problem is a larger one: how should people get data out of analytics systems for production usage. The search team has this problem too.

The computation of data should be done in Hadoop, and it actually already is. But how should data get from Hadoop to the MySQL servers, or the ElasticSearch servers? We don't have a good answer to that yet. I think it will involve custom and locked down rsync modules, but we need to puppetize that somehow. Once we have that, we'd need a place to apply the puppet you wrote. stat1007 is probably not the place. Not yet sure where that is.

Ok, thanks @Ottomata feel free to just hit "restore" on that and apply it on another host once we get there.

I think it will involve custom and locked down rsync modules, but we need to puppetize that somehow

Yea, makes sense. I actually expected some puppetized rsync to be involved as well and talked about that to bmansurov at first. I just also did not know from where to where exactly.

I think rsync::quickdatacopy might do the job.

You just configure a source, a dest and a path and it will set up rsyncd and ferm on one side and create a sync script to pull from it on the other side.

Since just recently it also has "auto_ferm" parameters to add/remove the ferm rules (IPv4 and IPv6) and additionally uses hosts_allow in rsyncd to further lock it down. Combined with "read_only = yes" it is as locked as it gets i think. With "autosync = true" you get a cronjob automating it. And still easy to use (now). Examples in profile::mediawiki::maintenance, profile::releases::reprepro or other places.

@Banyek another Q: Can we add permissions to the recommendationapi user on m2-master to be able to connect from stat1007? This might not be the final place where this import runs from, but it will at least allow @bmansurov to do this on his own next time.

For posterity, I did the following from neodymium for Baho:

python3 deploy.py import_languages 20181130 m2-master.eqiad.wmnet 3306 recommendationapi recommendationapi recommendation_api_db_pass.txt --language_file 06032018-11302018/languages.tsv

python3 deploy.py import_scores 20181130 m2-master.eqiad.wmnet 3306 recommendationapi recommendationapi recommendation_api_db_pass.txt --scores_file 06032018-11302018/ru-uz.tsv --source_language ru --target_language uz

python3 deploy.py import_scores 20181130 m2-master.eqiad.wmnet 3306 recommendationapi recommendationapi recommendation_api_db_pass.txt --scores_file 06032018-11302018/en-es.tsv --source_language en --target_language es

python3 deploy.py import_scores 20181130 m2-master.eqiad.wmnet 3306 recommendationapi recommendationapi recommendation_api_db_pass.txt --scores_file 06032018-11302018/en-fa.tsv --source_language en --target_language fa

python3 deploy.py create_views 20181130 m2-master.eqiad.wmnet 3306 recommendationapi recommendationapi recommendation_api_db_pass.txt


CREATE OR REPLACE VIEW language AS SELECT * FROM language_20181130;
CREATE OR REPLACE VIEW article_recommendation AS SELECT * FROM article_recommendation_20181130;

SHOW FULL TABLES IN recommendationapi WHERE TABLE_TYPE LIKE 'VIEW';

+-----------------------------+------------+
| Tables_in_recommendationapi | Table_type |
+-----------------------------+------------+
| article_recommendation      | VIEW       |
| language                    | VIEW       |
+-----------------------------+------------+
2 rows in set (0.00 sec)

@Banyek another Q: Can we add permissions to the recommendationapi user on m2-master to be able to connect from stat1007? This might not be the final place where this import runs from, but it will at least allow @bmansurov to do this on his own next time.

Actually, as far as I understand the system, I'd say 'yes', but now I think I need to summon @Marostegui here for the final word

Is this going to be a one time import?
Why does it need to be done from stat1007? They should connect to m2-master, which will route them through the proxy.
How much data will be imported?

I propose a quick talk with @bmansurov and @Ottomata to clarify a few questions on Monday

@Marostegui

Is this going to be a one time import?

Maybe a 4-5 time import, maybe less. For now we have all the data needed in MySQL (thanks to Andrew).

Why does it need to be done from stat1007? They should connect to m2-master, which will route them through the proxy.

Because I have access to stat1007 and not neodymium, for example. The data is also on stat1007 (although we can copy it to other machines too). I can try m2-master, but I'm not sure if I have access to it either.

How much data will be imported?

Yesterday we imported about 300MB, but in the future we may import up to 3-4 GB of data.

@Banyek OK, Monday sounds good. Feel free to send a calendar invitation or let me know your preferred time and I'll send it myself.

@Marostegui

Is this going to be a one time import?

Maybe a 4-5 time import, maybe less. For now we have all the data needed in MySQL (thanks to Andrew).

Why does it need to be done from stat1007? They should connect to m2-master, which will route them through the proxy.

Because I have access to stat1007 and not neodymium, for example. The data is also on stat1007 (although we can copy it to other machines too). I can try m2-master, but I'm not sure if I have access to it either.

stat1007 should always use m2-master so you get routed via dbproxy, which has grants to connect.
You need to use m2-master.eqiad.wmnet; however, it looks like there must be a firewall in between, as it doesn't reach it.

root@stat1007:~# telnet m2-master.eqiad.wmnet 3306
Trying 10.64.0.166...

So that needs to be fixed before, and once that is fixed you should be able to connect to the database.
However, if there is a firewall in between it must be for a reason. So probably this needs good research before dropping that firewall rule.
Maybe @Ottomata or @Dzahn can help out with that.

However, if there is a firewall in between it must be for a reason. So probably this needs good research before dropping that firewall rule.
Maybe @Ottomata or @Dzahn can help out with that.

I checked on stat1007 for iptables / ferm rules, expecting it to be puppetized. But to my surprise i found none at all. I thought everything already had base::firewall nowadays. (cc: @Muehlenhoff )

That means it should be about ACLs on routers and i don't have access to those, so i can't help with that i'm afraid. Please ping network-ops about that if needed.

Why not use mwmaint1002 for the import?
I just checked and it can reach m2-master fine.

I checked on stat1007 for iptables / ferm rules, expecting it to be puppetized. But to my surprise i found none at all. I thought everything already had base::firewall nowadays. (cc: @Muehlenhoff )

This is known. Spark spawns distributed workers listening on random(ish) ephemeral ports, and we haven't found a way to predict them well enough to lock anything down.

You need to use m2-master.eqiad.wmnet; however, it looks like there must be a firewall in between, as it doesn't reach it.

Ah ha, this is the analytics VLAN ACLs. If we want to use stat1007, we'll need to open this up.

Why not use mwmaint1002 for the import?

This could work, but we'll need a way to get the data over to mwmaint1002. We could open up an rsync pull from mwmaint1002 to stat1007? We could even do that with rsync::quickdatacopy?

@MoritzMuehlenhoff wherever Baho's import job runs, the user running it (either him or some system user) will need access to a file with the recommendationapi MySQL password in it. Any suggestions for accomplishing that?

A quick recap on today's meeting:

  • we'll have an import every month or every quarter; this is TBD, but it will happen continuously, just not too often
  • the data to import is about 3-4 gigabytes. Not too much, but a definite volume.

As we talked about the process:

  1. the data is created on stat1007
  2. it is planned to move to mwmaint1002
  3. data load would happen from there with the system user (we also need to figure out how the script will access the password)

Maybe it's worth thinking about not moving the data to mwmaint1002, but if we stick with stat1007 we have to solve the firewalling, and the password issue too.

and the password issue too.

We have a previous case where we solved "give access to a mysql db to a puppet admin group". It was the original reason to create the admin group "researchers". So i think that can be copied.

example:

modules/profile/manifests/statistics/cruncher.pp

# This file will render at
# /etc/mysql/conf.d/researchers-client.cnf.
# This is so that users in the researchers
# group can access the research slave dbs.
statistics::mysql_credentials { 'research':
    group => 'researchers',
}
# = Define: statistics::mysql_credentials
# Sets up mysql credentials for a given user group to access
# the research dbs
define statistics::mysql_credentials(
    $group,
) {
    mysql::config::client { $title:
        user  => $::passwords::mysql::research::user,
        pass  => $::passwords::mysql::research::pass,
        group => $group,
    }
}

So this puts snippets into /etc/mysql/conf.d/ and the admin group can read it.

The actual secrets user and password are in the private repo matched by fake passwords in labs/private so that puppet doesn't break in labs.

First step would be to add the credentials on the puppetmaster in /srv/private/modules/passwords/manifests/init.pp somewhere alongside class passwords::mysql::research_prod

Afaict the issue is basically that the existing group "researchers" doesn't match the people we want to give this new access to, so we need something new like "research-team"?

So it seems we need a new admin group and then copy the researchers setup for a new mysql snippet, confirm it works, then add people to the new group.

P.S. One issue with my suggestion above: don't use mysql::config::client but mariadb::config or something else in mariadb::. We want to get rid of the mysql module per T162070.

Right, the 'researchers' group has no association with the Research team. It's just a bunch of people who get access to 'research' MySQL instances from stat1006.

So yes, we'd need a new group if we wanted to do this, hence my Q at the SRE meeting today. In addition to a new group, we'll also likely need a new system user in that group that will run the cron jobs to import the data.

Perhaps a good group name would be research-admins?

Perhaps a good group name would be research-admins?

I was thinking research-team to make the point that it matches the WMF Research team, assuming that's what we want here. Yours sounds good too. Actually it's even better, because not all members of the Research team might be admins. Yea, let's do research-admins :)

Ok, @bmansurov, I can help you manually load your data one more time! Just one more time though so let's get it right.

There is broader discussion we need to highlight here. This isn't the first time we have this problem to solve (export data from Analytics cluster for production use) and it won't be the last. In our Analytics standup today, we discussed this and came up with the following conclusion.

It's on the Analytics team to figure out and build the infrastructure for data exports like this. We need to make sure we don't introduce fragility, especially around production services that will depend on this data. So, we really don't want to be in the habit of using a single compute-purposed machine (like stat1007) to export data. We've solved this problem for Cassandra, for graphite, etc. Search has a similar problem with exporting Elasticsearch indexes. Here, we need to solve the problem of Hadoop -> MySQL.

There is a pretty generic tool for Hadoop+MySQL integration: Sqoop. We use this mostly for importing from MySQL into Hadoop, but it should work the other way as well. Spark itself also has JDBC DataFrame integration. We use Oozie to schedule Sqoop jobs in Hadoop, which is likely what Baho should do when productionizing his job.
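
As a hedged sketch of the Spark JDBC route (the HDFS path, table name, and credentials handling are assumptions; the MySQL JDBC driver has to be on the Spark classpath):

# Sketch: write a DataFrame of recommendations from Hadoop into the misc MySQL instance.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("article-recommendations-export").getOrCreate()
# Hypothetical HDFS location of the computed recommendations.
df = spark.read.parquet("/user/bmansurov/article_recommendations")

(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://m2-master.eqiad.wmnet:3306/recommendationapi")
   .option("dbtable", "article_recommendation_20181130")
   .option("user", "recommendationapi")
   .option("password", "<from a protected credentials file>")
   .mode("append")
   .save())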

So, the Analytics team needs to take time to figure out how to properly productionize Hadoop -> MySQL. We recognize the urgency here, but it isn't going to happen this quarter. We're considering putting this into next quarter's goals. When doing this work, we'd work with Baho to hammer out all the details.

I'm sorry if this delays the deployment of a productionized recommendationapi service. Here's some general advice when deploying a new service though: don't go it alone! Ask for help early on! It takes time for teams to incorporate other teams' needs into their quarterly schedules, so it's good to ask preemptively. Don't just work on your own and expect teams to deploy stuff without extra work or review. In this case, ideally Research would have consulted with Analytics (and possibly DBAs or SRE even) before work on this service began. That way we all could have collaborated on the data flow and deployment architecture ahead of time, and anticipated this work. (Since I came to this task, I recognize that there was a lot of back and forth between

Thanks, @Ottomata, for helping to import data, update on the status, and general advice.

I think we can do a better job of informing the interested parties on such work. I was under the impression that Analytics was aware of what Research was trying to achieve starting this task: T203039: Storage of data for recommendation API. I was also talking to Services, DBAs, and SRE among others about this work starting some time ago. That said, I appreciate your help in moving this task forward, both by importing data manually and helping with setting up a production pipeline.

Oops, i didn't finish my sentence in that last comment. Finishing...

(Since I came to this task late, I recognize that there was a lot of back and forth between parties that I don't have the context of. I've tried to read tickets, but there's a lot. So apologies if I'm missing some context!)

I was under the impression that Analytics was aware of what Research was trying to achieve starting this task: T203039: Storage of data for recommendation API. I was also talking to Services, DBAs, and SRE among others about this work starting some time ago.

Ya, ok cool, just read that task (was looking for it, perhaps it should be linked to this one somehow as a parent task?). From that task I gather that a lot of diligence was done with SRE and Services to figure out where the data would be finally stored and how it would be accessed from production. Nuria did chime in early on, but I don't think she realized at that time that the data was coming from Hadoop.

Maybe next time add the Analytics tag so we can triage it and discuss and reach out to you if we need to?

Maybe next time add the Analytics tag so we can triage it and discuss and reach out to you if we need to?

OK, will do.

I still see significant items that I hope we are thinking about how to address:

  1. we need to remove the stats machines from the data path to production
  2. we need to set up Oozie jobs that do the data gathering using other data sources (Wikidata) and handle creation of the files to be consumed by MySQL
  3. the jobs need a check step (a guard) that asserts the data is good, so we do not end up loading bad data into the production service by mistake (a bug in the pyspark code, say)

Both 2 and 3 above can start being worked on as soon as @bmansurov has any bandwidth.

@Nuria, what you mention makes sense. I created this task in order to get the current recommendations into MySQL. I think we should treat T210844: Generate article recommendations in Hadoop for use in production as the main task for setting up a production pipeline. I'm short on time to address your comments here or over that task this week, but after the holidays I should be able to work on them.

bmansurov claimed this task.
bmansurov moved this task from In Progress to Done (current quarter) on the Research board.

@Ottomata helped import the data we need for now. I'll follow up on open questions in other relevant tasks. Thanks everyone for helping.