
Transferring data from Hadoop to production MySQL database
Closed, ResolvedPublic

Description

In T210844: Generate article recommendations in Hadoop for use in production we'll be generating TSV files that live in HDFS. We want to take these TSV files and import them into the m2-master database.

Analytics is working on a solution to take data from the analytics cluster to the production servers at T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

We have import scripts that take TSVs and import them into MySQL. Where should these scripts be executed from?

Summary of discussion so far

How to import article recommendation scores to MySQL from Hadoop?

  • Transfer data to a production node that has access to the MySQL database and run the import script ...
    1. From the recommendation-api service node. Services, is it viable?
    2. From some production virtual machine. Which one? SRE, which one is a good fit for this use case? Should we request a dedicated VM (Ganeti instance)?
  • Write directly from Hadoop to MySQL. SRE what are the issues here?
  • Running import scripts from stats machines or database host machines has been ruled out.

Links

A/C

  • Decide where and how to execute the import scripts.
  • Implement the agreed-upon solution.

Event Timeline

bmansurov updated the task description.

The first thing we need to do is to oozie-fy the data creation workflow that produces the files you would be loading into MySQL (likely TSVs). Let's work on that while Analytics works on designing a workflow for you to transfer data into MySQL.

I think this might involve manipulation of Wikidata dumps; since those are not productionized, a frozen copy can be dumped to HDFS.

@bmansurov sounds good, let's start working on the code for the Oozie jobs; that work has to be done before we can transfer data into MySQL anyway.

@bmansurov I think we can eventually figure out a way to get your dump files out of analytics to somewhere that can access mysql. Where and how to run your importer script is a different question that we probably need to talk with SRE about. Could we automate running it on the boxes where your service actually runs? Not sure.

Unless...unless that is we can actually write to the MySQL db from Hadoop. I suspect we won't be allowed to, but it might be worth asking. This is a DBA/SRE question too.

Unless...unless that is we can actually write to the MySQL db from Hadoop.

I do not think we should consider this an option. We should have a clear separation of concerns: while the Hadoop cluster is in charge of computing the data, the task of updating the DB needs to live on the service side.

bmansurov updated the task description.

@Ottomata thanks! I've updated the task description and pinged the groups you mentioned.

bmansurov added subscribers: Banyek, Dzahn.

@Banyek and @Dzahn I'd appreciate your input on this task. Thank you.

We should have a clear separation of concerns: while the Hadoop cluster is in charge of computing the data, the task of updating the DB needs to live on the service side.

Maybe! I wouldn't rule it out. We use Hadoop to update AQS databases (Cassandra, public-druid). We also read directly from MySQL into Hadoop. This might not be the right approach in this case, but I don't think an (offline) push from Hadoop into a datastore is out of the question.

How to install the importer scripts is what I started once in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476098/

It was abandoned after Analytics said not to use the stat hosts and to use Hadoop instead.

The Puppet code could of course be restored, but I don't know where to apply it instead. For that I would have to defer to the DBAs and seniors in SRE. If you need a dedicated VM in production, it is possible to request one (Ganeti).

It was abandoned after Analytics said not to use the stat hosts and to use Hadoop instead.

To clarify: stats machines should not be in the path to update a production database; that has not changed. The current debate (on this ticket) is whether it makes sense to do it from Hadoop (yours truly disagrees) or rather (my preferred option) to have a daemon on the MySQL hosts that gets notified that data is available for consumption, fetches a file from an accessible mount, and does whatever needs doing: inserting data, dropping data, creating databases...

We want to find a solution that works for all consumers of data produced in Hadoop; in this case it might be MySQL, but the consumer might very well be Elasticsearch, which is another current use case.
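For illustration only, here is a minimal sketch of the notify-and-load pattern described above: a small daemon watches a shared mount for a completed dataset (signalled by a .done marker) and hands it to an import routine. The paths, names, and polling approach are assumptions, not an agreed-upon design, and, as noted below, such a script would most likely not run on the MySQL hosts themselves.

#!/usr/bin/env python3
"""Illustrative sketch only: watch a mount for finished datasets
(signalled by a .done marker) and hand them to an import routine."""
import time
from pathlib import Path

WATCH_DIR = Path("/srv/recommendations/incoming")  # hypothetical mount
SEEN = set()

def load_into_mysql(tsv_path: Path) -> None:
    # Placeholder: the real work (LOAD DATA / inserts) would be done by
    # the import script discussed in this task.
    print(f"would import {tsv_path}")

def main() -> None:
    while True:
        for marker in WATCH_DIR.glob("*.done"):
            tsv = marker.with_suffix(".tsv")
            if tsv.exists() and tsv not in SEEN:
                load_into_mysql(tsv)
                SEEN.add(tsv)
        time.sleep(60)  # simple polling; inotify would also work

if __name__ == "__main__":
    main()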

to have a daemon on the MySQL hosts

To clarify, it is unlikely these scripts would run on the mysql servers themselves. If not Hadoop, they'd run somewhere else, either a Ganeti instance as Dzahn suggests, or perhaps colocated with the recommendation-api service.

I think one telling use case that illustrates why we want to decouple data loading from Hadoop is a rollback. Say you have RecommendationsData1.tsv and RecommendationsData2.tsv, and you currently have RecommendationsData1.tsv loaded in prod. You produce a new dataset in the cluster and load it (RecommendationsData2.tsv). This dataset, despite your data guards, has bad data. You want an immediate rollback, and that process should not involve the cluster in any case, since you already have that data (it might need reloading, as it might no longer be in MySQL). You would go to the mount where the data exists and reload it as needed using service code, not code that runs inside an Oozie job or similar.

Rollback is already taken care of at the script level. We'll have different versions of data in MySQL and can roll back anytime we want. No need to bother Hadoop, mounted partitions, or Oozie. Please refer to the Pipeline documentation, specifically this diagram. If you have concerns about it I can work on improving it, but this concern has already been addressed there.
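As a concrete illustration of the versioned-data rollback described above (a sketch only; the table and column names and the pymysql client are assumptions, not the actual article-recommender schema or deploy.py), rolling back amounts to repointing the service at an older, already-imported version:

#!/usr/bin/env python3
"""Sketch of script-level rollback with versioned data: each import
lands in its own version, and a small metadata table records which
version is active. Schema and credentials are hypothetical."""
import pymysql

def set_active_version(conn, version: str) -> None:
    """Point the service at a previously imported dataset version."""
    with conn.cursor() as cur:
        cur.execute("UPDATE recommendation_meta SET active_version = %s", (version,))
    conn.commit()

if __name__ == "__main__":
    conn = pymysql.connect(
        host="m2-master.eqiad.wmnet",  # via the proxy, never a specific DB host
        user="research",               # hypothetical credentials
        password="CHANGEME",
        db="recommendations",
    )
    # Rolling back is just flipping the pointer to an older version;
    # Hadoop, mounts, and Oozie are not involved.
    set_active_version(conn, "20181130")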

@bmansurov how do you handle deleting data in your storage when you have reached capacity or when a dataset is bad? There must be a daemon that takes care of that work, right? Older versions of data are no longer useful and should not occupy space. That process, which should be automated, is not dissimilar to the one that would load the data, right? One is concerned with the deletion of older versions (language pairs might have been added or removed), the other with the loading of newer versions.

@Nuria

how do you handle deleting data in your storage when you have reached capacity or when a dataset is bad? There must be a daemon that takes care of that work, right?

Many ways: a systemd daemon, a cron job, (maybe) Oozie, or something else. Without knowing where and how these scripts will be executed, I cannot tell you how. It will all become apparent once we agree on an approach in this task.

That process that should be automated is not dissimilar to the one that would load the data, right?

Yes and no. If the system is unable to check data quality (despite the automatic tests, T213761) and lets bad data go all the way to production, then I think we need human intervention to check the data quality. Once the data is deemed OK, and after a certain time, we can deploy a config change that would clean up old data. I'm not too worried about these details because they are straightforward (and the script is ready to clean up the data). The bigger question is where and how to run these scripts.
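For what the cleanup step might look like (a sketch under the assumption of date-versioned tables named recommendations_<YYYYMMDD> and a pymysql client; none of this is the actual script), a small job runnable from cron or a systemd timer could drop everything but the newest few versions:

#!/usr/bin/env python3
"""Sketch of a retention job: drop versioned recommendation tables,
keeping only the most recent KEEP versions. Naming convention and
retention count are assumptions for illustration."""
import re
import pymysql

KEEP = 3  # hypothetical retention: keep the three newest versions

def cleanup(conn) -> None:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES LIKE 'recommendations\\_%'")
        tables = sorted(row[0] for row in cur.fetchall())
        versioned = [t for t in tables if re.fullmatch(r"recommendations_\d{8}", t)]
        for table in versioned[:-KEEP]:
            cur.execute(f"DROP TABLE `{table}`")
    conn.commit()

if __name__ == "__main__":
    conn = pymysql.connect(host="m2-master.eqiad.wmnet", user="research",
                           password="CHANGEME", db="recommendations")
    cleanup(conn)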

Marostegui subscribed.

I don't think writing from Hadoop directly to M2 master is a good idea. But it is not really my call.

From wherever you decide it is suitable, the database needs to be accessed through the proxy, which means m2-master.eqiad.wmnet.
No local scripts on the MySQL hosts will be allowed, as the master can change at any time and the proxy will point to the new one. If the script executes things locally, you will have a split brain sooner or later.

Lastly, I don't know what numbers we are looking at, but if there is a big amount of data being inserted, we might need to ask you to throttle it (i.e. to avoid lag on the slaves).
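To make the two constraints above concrete (connect only through the proxy name, and throttle large imports so replication can keep up), here is a rough sketch; the batch size, sleep interval, table layout, and pymysql client are illustrative assumptions:

#!/usr/bin/env python3
"""Sketch of a throttled import that always goes through the proxy
(m2-master.eqiad.wmnet) rather than any individual DB host. Schema,
credentials, and throttle values are hypothetical."""
import time
import pymysql

def throttled_insert(conn, rows, batch_size=5000, pause=1.0):
    """Insert rows in batches, sleeping between batches to limit replica lag."""
    with conn.cursor() as cur:
        for i in range(0, len(rows), batch_size):
            cur.executemany(
                "INSERT INTO recommendations (wikidata_id, score) VALUES (%s, %s)",
                rows[i:i + batch_size],
            )
            conn.commit()
            time.sleep(pause)  # crude throttle; checking replica lag would be better

if __name__ == "__main__":
    conn = pymysql.connect(host="m2-master.eqiad.wmnet",  # the proxy, not a DB host
                           user="research", password="CHANGEME", db="recommendations")
    throttled_insert(conn, [("Q42", 0.97), ("Q1", 0.42)])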

I don't think writing from Hadoop directly to M2 master is a good idea. But it is not really my call.

It is partly your call! :p Writing from Hadoop directly likely wouldn't be much different than writing from some other node somewhere. But still, there might be details about MySQL passwords and firewall holes that we don't want to touch. Finding a separate place to run the import scripts is probably the path of least resistance.

Finding a separate place to run the import scripts is probably the path of least resistance.

+1

Thanks everyone for the discussion. I've added a summary to the task description.

@Pchelolo @Marostegui @Dzahn @Nuria @Ottomata please take a look at the updated task description and help us choose one of the options (or a new one).

@Dzahn please feel free to invite a senior SRE to the discussion.

Thank you!

@bmansurov I am guessing that option 2 is the most likely one. In any case, I want to stress that we should really be working on producing those TSVs in the cluster, which is something that needs to happen before they can be imported into MySQL. Has that work started?

@Nuria, thanks for the input. I suppose you mean option 2 of the first point.

Has that work started?

I'm currently working on the subtask/blocker (T211981) of the task you mention (T210844). Once it's finished, I'll work on the Oozie job.

I agree that the most likely solution to work here is option (2), i.e. getting a host to execute it from. Perhaps it could be done from [mwmaint1002](https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Run_a_maintenance_script_on_a_wiki), but that would need to be checked with SREs.

Option (1) (executing it from the hosts serving the recommendation API) is not an option because those are production machines that host other services, and we cannot allow any kind of disruption that could potentially be caused during the import.

@Dzahn please feel free to invite a senior SRE to the discussion.

Hi @bmansurov, I wanted to let you know I did bring it up in my sub-team meeting; discussion has started, and it has been said we need to continue it in the general SRE meeting. So just to let you know: this is ongoing and has been triggered, and more people are involved now.

Any further thoughts on this? I think we agree that the best solution is to run and deploy these scripts from some virtual machine. Once we are settled on that, Analytics would work on how to best move the generated TSVs to the machine in question; for this, given our tools and ecosystem, I cannot think of solutions other than rsync.

@Nuria I agree it seems the most likely solution is using a Ganeti VM, but due to All Hands we still have not had an SRE meeting since my last comment, and I have been told to bring it up there. I will follow up again Thursday.

We could start the VM request process though; that would mean filling out the (short) form at https://wikitech.wikimedia.org/wiki/SRE_Team_requests#Virtual_machine_requests_(Production)
@bmansurov, maybe you could do that?

Labs Project Tested: <project_name>    (skip, unless you have a Cloud VPS for testing import scripts)
Site/Location: <EQIAD|CODFW>   (EQIAD)
Number of systems: <# of VMs>   (1)
Service: <service name>   (have an idea for a name?)
Networking Requirements: <internal|external IP>, <specific networking access needed>   (internal)
Processor Requirements: <Number of Virtual CPUs>   (I don't know how much CPU the scripts will need) ?
Memory:   ?
Disks: <Capacity only>   ?
Other Requirements:   ?

Marco's suggestion of using mwmaint1002 is not a bad idea...

How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level (aka network ACLs), to whichever machine is chosen for this? Has this already been worked out (because I see no mention of it)?

For what it's worth, mwmaint1002 is also the place for a lot of other maintenance tasks. Depending on the amount of resources required during the import/update phase, it might make sense to keep that workload separated from other mwmaint1002 workloads. Is it just a LOAD DATA INFILE "something.tsv" or is it something more complex?

How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level (aka network ACLs), to whichever machine is chosen for this? Has this already been worked out (because I see no mention of it)?

@akosiaris: T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production

It hasn't been totally worked out, but one solution is an rsync pull. The network firewalls don't allow the analytics VLAN to initiate connections, but they will accept incoming ones.

Is it just a LOAD DATA INFILE "something.tsv" or is it something more complex?

Yes, more or less: https://github.com/wikimedia/research-article-recommender-deploy/blob/master/deploy.py
I plan on adding chunked imports later.
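As a rough illustration of what "chunked imports" could mean here (this is not the linked deploy.py; the column layout and the pymysql client are assumptions), the TSV can be read and inserted in fixed-size batches so a large file never has to be handled in one statement:

#!/usr/bin/env python3
"""Sketch of chunked TSV imports: stream the file in fixed-size batches
and insert each batch separately. Not the actual deploy.py; table and
column names are assumed for the example."""
import csv
from itertools import islice
import pymysql

def tsv_batches(path, batch_size=10000):
    """Yield lists of rows from a TSV file, batch_size rows at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch

def import_tsv(conn, path):
    with conn.cursor() as cur:
        for batch in tsv_batches(path):
            cur.executemany(
                "INSERT INTO recommendations (wikidata_id, score) VALUES (%s, %s)",
                batch,
            )
            conn.commit()  # committing per chunk keeps transactions small

if __name__ == "__main__":
    conn = pymysql.connect(host="m2-master.eqiad.wmnet", user="research",
                           password="CHANGEME", db="recommendations")
    import_tsv(conn, "recommendations_20181130.tsv")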

How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level (aka network ACLs), to whichever machine is chosen for this? Has this already been worked out (because I see no mention of it)?

@akosiaris: T213976

It hasn't been totally worked out, but one solution is an rsync pull. The network firewalls don't allow the analytics VLAN to initiate connections, but they will accept incoming ones.

The network firewalls are not stateful, but rather stateless. That is, while they indeed don't allow hosts in the analytics VLAN to initiate a TCP connection (aka a TCP SYN packet), they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake, despite the fact that they will be allowed to receive the SYN packet. Hence rsync pull will not work.

Is it just a LOAD DATA INFILE "something.tsv" or is it something more complex?

Yes, more or less: https://github.com/wikimedia/research-article-recommender-deploy/blob/master/deploy.py
I plan on adding chunked imports later.

That does look simple enough and not resource-expensive on mwmaint1002. I guess it can fit in there as well? But a VM is also a fine solution, and separation of concerns does suggest we go that way.

That does look simple enough and not resource-expensive on mwmaint1002. I guess it can fit in there as well? But a VM is also a fine solution, and separation of concerns does suggest we go that way.

My preference is also toward a VM because we're just getting started with recommendation APIs. We'll have section recommendations in the future and possibly other kinds of article recommendations.

they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake,

? Is that true? I'm pretty sure it isn't...unless we have some special rules I'm forgetting.

Just checked. We have rsync server modules on stat1007 that only certain hosts are allowed to rsync from. I live-edited it to add mwmaint1002.

Before adding mwmaint1002

[@mwmaint1002:/home/otto] $ rsync -rvn stat1007.eqiad.wmnet::srv/ ./
@ERROR: access denied to srv from mwmaint1002.eqiad.wmnet (10.64.16.77)

After adding mwmaint1002 to rsyncd.conf hosts allow

[@mwmaint1002:/home/otto] $ rsync -rvn stat1007.eqiad.wmnet::srv/published-datasets/one-off/ ./
receiving incremental file list
recSheetsTSVforMartin.tar.gz
article-recommender/
article-recommender/06032018-11302018.tar.gz
article-recommender/20181130.tar.gz
...

Ideally I would prefer that stats machines are completely out of the workflow of pushing data to machines like mwmaint1002.eqiad.wmnet, because stats machines are meant for experiments and one-off computations; they are not a "bridge" for data to get to our production hosts. We use them to put data out for the whole wide world to consume, and it should be fine to put data on dumps, but that is as far as I think we should go. This ticket (oh my!) might require us to consider other options and think of solutions we have not used before. Is Swift a viable alternative? Could we push those files there (à la S3) and have mwmaint1002.eqiad.wmnet pull them?

As @Ottomata pointed out more generic discussion about this topic can be found here: https://phabricator.wikimedia.org/T213976

they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake,

? Is that true? I'm pretty sure it isn't...unless we have some special rules I'm forgetting.

analytics-in4 seems to be one of the filters that has a tcp-established match term, a bitfield mask of the TCP flags "(ack | rst)", so indeed the handshake will proceed fine, as per your example. I was wrong in that comment.

For the record, just pointing out that the question of a new VM versus mwmaint1002 is probably irrelevant here. We can do both with what looks like minimal repercussions.

The discussion in T213976 is probably more important. Should we btw stall this on T213976?

Should we btw stall this on T213976?

Yes, we need to resolve first where/how binary/data files are going to be moved to the prod network.

akosiaris changed the task status from Open to Stalled. Feb 20 2019, 2:11 PM

Per comment above.

Ottomata claimed this task.

This is now supported via Kafka, Swift and an Oozie workflow.
T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production

Resolving.