
Transferring data from Hadoop to production MySQL database
Closed, ResolvedPublic

Description

In T210844: Generate article recommendations in Hadoop for use in production we'll be generating TSV files that live in HDFS. We want to take these TSV files and import them into the m2-master database.

Analytics is working on a solution to take data from the analytics cluster to the production servers at T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

We have import scripts that take TSVs and import them into MySQL. Where should these scripts be executed from?

Summary of discussion so far

How to import article recommendation scores to MySQL from Hadoop?

  • Transfer data to a production node that has access to the MySQL database and run the import script ...
    1. From the recommendation-api service node. Services, is it viable?
    2. From some production virtual machine. Which one? SRE, which one is a good fit for this use case? Should we request a dedicated VM (Ganeti instance)?
  • Write directly from Hadoop to MySQL. SRE what are the issues here?
  • Running import scripts from stats machines or database host machines has been ruled out.

Links

A/C

  • Decide where and how to execute the import scripts.
  • Implement the agreed-upon solution.

Event Timeline

bmansurov updated the task description.

The first thing we need to do is to oozie-fy the data creation workflow that produces the files you would be loading into MySQL (likely TSVs). Let's work on that while Analytics works on designing a workflow for you to transfer data into MySQL.

I think this might involve manipulation of Wikidata dumps; since those are not productionized, a frozen copy can be dumped to HDFS.

@bmansurov sounds good, let's start working on the code for the Oozie jobs; that work has to be done before we can transfer data into MySQL anyway.

@bmansurov I think we can eventually figure out a way to get your dump files out of analytics to somewhere that can access mysql. Where and how to run your importer script is a different question that we probably need to talk with SRE about. Could we automate running it on the boxes where your service actually runs? Not sure.

Unless...unless that is we can actually write to the MySQL db from Hadoop. I suspect we won't be allowed to, but it might be worth asking. This is a DBA/SRE question too.

Unless...unless that is we can actually write to the MySQL db from Hadoop.

I do not think we should consider this an option. We should have a clear separation of concerns: while the Hadoop cluster is in charge of computing the data, the task of updating the DB needs to live on the service side.

bmansurov updated the task description.

@Ottomata thanks! I've updated the task description and pinged the groups you mentioned.

bmansurov added subscribers: Banyek, Dzahn.

@Banyek and @Dzahn I'd appreciate your input on this task. Thank you.

We should have a clear separation of concerns: while the Hadoop cluster is in charge of computing the data, the task of updating the DB needs to live on the service side.

Maybe! I wouldn't rule it out. We use Hadoop to update AQS databases (Cassandra, public-druid). We also read directly from MySQL into Hadoop. This might not be the right approach in this case, but I don't think an (offline) push from Hadoop into a datastore is out of the question.

How to install the importer scripts is what I started once in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476098/

It was abandoned after Analytics said not to use the stat hosts and to use Hadoop instead.

The Puppet code could of course be restored, but I don't know where to apply it instead. For that I would have to defer to the DBAs and seniors in SRE. If you need a dedicated VM in production, it is possible to request one (Ganeti).

It was abandoned after Analytics said not to use the stat hosts and to use Hadoop instead.

To clarify: stats machines should not be in the path to update a production database; that has not changed. The current debate (on this ticket) is whether it makes sense to do it from Hadoop (yours truly disagrees) or rather (my preferred option) to have a daemon on the MySQL hosts that gets notified that data is available for consumption, fetches a file from an accessible mount, and does whatever needs doing: inserting data, dropping data, creating databases...

We want to find a solution that works for all consumers of data produced in Hadoop; in this case it might be MySQL, but the consumer might very well be Elasticsearch, which is another current use case.
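For illustration only, here is a minimal sketch of the notify-and-load pattern described above: a small daemon watches a shared mount for a completed dataset (signalled by a .done marker) and hands it to an import routine. The paths, names, and polling approach are assumptions, not an agreed-upon design, and, as noted below, such a script would most likely not run on the MySQL hosts themselves.

#!/usr/bin/env python3
"""Illustrative sketch only: watch a mount for finished datasets
(signalled by a .done marker) and hand them to an import routine."""
import time
from pathlib import Path

WATCH_DIR = Path("/srv/recommendations/incoming")  # hypothetical mount
SEEN = set()

def load_into_mysql(tsv_path: Path) -> None:
    # Placeholder: the real work (LOAD DATA / inserts) would be done by
    # the import script discussed in this task.
    print(f"would import {tsv_path}")

def main() -> None:
    while True:
        for marker in WATCH_DIR.glob("*.done"):
            tsv = marker.with_suffix(".tsv")
            if tsv.exists() and tsv not in SEEN:
                load_into_mysql(tsv)
                SEEN.add(tsv)
        time.sleep(60)  # simple polling; inotify would also work

if __name__ == "__main__":
    main()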

to have a daemon on the MySQL hosts

To clarify, it is unlikely these scripts would run on the mysql servers themselves. If not Hadoop, they'd run somewhere else, either a Ganeti instance as Dzahn suggests, or perhaps colocated with the recommendation-api service.

I think one telling use case that illustrates why we want to decouple data loading from Hadoop is a rollback. Say you have RecommendationsData1.tsv and RecommendationsData2.tsv, and you currently have RecommendationsData1.tsv loaded in prod. You produce a new dataset in the cluster and load it (RecommendationsData2.tsv). This dataset, despite your data guards, has bad data. You want an immediate rollback, and that process should not involve the cluster in any case, since you already have that data (it might need reloading, as it might no longer be in MySQL). You would go to the mount where the data exists and reload it as needed using service code, not code that runs inside an Oozie job or similar.

Rollback is already taken care of at the script level. We'll have different versions of data in MySQL and can roll back anytime we want. No need to bother Hadoop, mounted partitions, or Oozie. Please refer to the Pipeline documentation, specifically this diagram. If you have concerns about it I can work on improving it, but this concern has already been addressed there.
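As a concrete illustration of the versioned-data rollback described above (a sketch only; the table and column names and the pymysql client are assumptions, not the actual article-recommender schema or deploy.py), rolling back amounts to repointing the service at an older, already-imported version:

#!/usr/bin/env python3
"""Sketch of script-level rollback with versioned data: each import
lands in its own version, and a small metadata table records which
version is active. Schema and credentials are hypothetical."""
import pymysql

def set_active_version(conn, version: str) -> None:
    """Point the service at a previously imported dataset version."""
    with conn.cursor() as cur:
        cur.execute("UPDATE recommendation_meta SET active_version = %s", (version,))
    conn.commit()

if __name__ == "__main__":
    conn = pymysql.connect(
        host="m2-master.eqiad.wmnet",  # via the proxy, never a specific DB host
        user="research",               # hypothetical credentials
        password="CHANGEME",
        db="recommendations",
    )
    # Rolling back is just flipping the pointer to an older version;
    # Hadoop, mounts, and Oozie are not involved.
    set_active_version(conn, "20181130")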

@bmansurov how do you handle deleting data in your storage when you have reached capacity or when a dataset is bad? There must be a daemon that takes care of that work, right? Older versions of data are no longer useful and should not occupy space. That process, which should be automated, is not dissimilar to the one that would load the data, right? One is concerned with the deletion of older versions (language pairs might have been added or removed), the other with the loading of newer versions.

@Nuria

how do you handle deleting data in your storage when you have reached capacity or when a dataset is bad? There must be a daemon that takes care of that work, right?

Many ways: a systemd daemon, a cron job, (maybe) Oozie, or something else. Without knowing where and how these scripts will be executed, I cannot tell you how. It will all become apparent once we agree on an approach in this task.

That process that should be automated is not dissimilar to the one that would load the data, right?

Yes and no. If the system is unable to check data quality (despite the automatic tests, T213761) and lets bad data go all the way to production, then I think we need human intervention to check the data quality. Once the data is deemed OK, and after a certain time, we can deploy a config change that would clean up old data. I'm not too worried about these details because they are straightforward (and the script is ready to clean up the data). The bigger question is where and how to run these scripts.
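For what the cleanup step might look like (a sketch under the assumption of date-versioned tables named recommendations_<YYYYMMDD> and a pymysql client; none of this is the actual script), a small job runnable from cron or a systemd timer could drop everything but the newest few versions:

#!/usr/bin/env python3
"""Sketch of a retention job: drop versioned recommendation tables,
keeping only the most recent KEEP versions. Naming convention and
retention count are assumptions for illustration."""
import re
import pymysql

KEEP = 3  # hypothetical retention: keep the three newest versions

def cleanup(conn) -> None:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES LIKE 'recommendations\\_%'")
        tables = sorted(row[0] for row in cur.fetchall())
        versioned = [t for t in tables if re.fullmatch(r"recommendations_\d{8}", t)]
        for table in versioned[:-KEEP]:
            cur.execute(f"DROP TABLE `{table}`")
    conn.commit()

if __name__ == "__main__":
    conn = pymysql.connect(host="m2-master.eqiad.wmnet", user="research",
                           password="CHANGEME", db="recommendations")
    cleanup(conn)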

Marostegui subscribed.

I don't think writing from Hadoop directly to M2 master is a good idea. But it is not really my call.

From wherever you decide it is suitable, the database needs to be accessed through the proxy, which means m2-master.eqiad.wmnet.
No local scripts on the MySQL hosts will be allowed, as the master can change at any time and the proxy will point to the new one. If the script executes things locally, you will have a split brain sooner or later.

Lastly, I don't know what numbers we are looking at, but if there is a big amount of data being inserted, we might need to ask you to throttle it (i.e. to avoid lag on the slaves).
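To make the two constraints above concrete (connect only through the proxy name, and throttle large imports so replication can keep up), here is a rough sketch; the batch size, sleep interval, table layout, and pymysql client are illustrative assumptions:

#!/usr/bin/env python3
"""Sketch of a throttled import that always goes through the proxy
(m2-master.eqiad.wmnet) rather than any individual DB host. Schema,
credentials, and throttle values are hypothetical."""
import time
import pymysql

def throttled_insert(conn, rows, batch_size=5000, pause=1.0):
    """Insert rows in batches, sleeping between batches to limit replica lag."""
    with conn.cursor() as cur:
        for i in range(0, len(rows), batch_size):
            cur.executemany(
                "INSERT INTO recommendations (wikidata_id, score) VALUES (%s, %s)",
                rows[i:i + batch_size],
            )
            conn.commit()
            time.sleep(pause)  # crude throttle; checking replica lag would be better

if __name__ == "__main__":
    conn = pymysql.connect(host="m2-master.eqiad.wmnet",  # the proxy, not a DB host
                           user="research", password="CHANGEME", db="recommendations")
    throttled_insert(conn, [("Q42", 0.97), ("Q1", 0.42)])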

I don't think writing from Hadoop directly to M2 master is a good idea. But it is not really my call.

It is partly your call! :p Writing from Hadoop directly likely wouldn't be much different than writing from some other node somewhere. But still, there might be details about MySQL passwords and firewall holes that we don't want to touch. Finding a separate place to run the import scripts is probably the path of least resistance.

Finding a separate place to run the import scripts is probably the path of least resistance.

+1

Thanks everyone for the discussion. I've added a summary to the task description.

@Pchelolo @Marostegui @Dzahn @Nuria @Ottomata please take a look at the updated task description and help us choose one of the options (or a new one).

@Dzahn please feel free to invite a senior SRE to the discussion.

Thank you!

@bmansurov I am guessing that option 2 is the most likely one. In any case, I want to stress that we should really be working on producing those TSVs in the cluster, which is something that needs to happen before they can be imported into MySQL. Has that work started?

@Nuria, thanks for the input. I suppose you mean option 2 of the first point.

Has that work started?

I'm currently working on the subtask/blocker (T211981) of the task you mention (T210844). Once it's finished, I'll work on the Oozie job.

I agree that the most likely solution to work here is option (2), i.e. getting a host to execute it from. Perhaps it could be done from [mwmaint1002](https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Run_a_maintenance_script_on_a_wiki), but that would need to be checked with SREs.

Option (1) (executing it from the hosts serving the recommendation API) is not an option because those are production machines that host other services, and we cannot allow any kind of disruption that could potentially be caused during the import.

@Dzahn please feel free to invite a senior SRE to the discussion.

Hi @bmansurov, I wanted to let you know I did bring it up in my sub-team meeting; discussion has started, and it has been said we need to continue it in the general SRE meeting. So just to let you know: this is ongoing and has been triggered, and more people are involved now.

Any further thoughts on this? I think we agree that the best solution is to run and deploy these scripts from some virtual machine. Once we are settled on that, Analytics would work on how to best move the generated TSVs to the machine in question; for this, given our tools and ecosystem, I cannot think of solutions other than rsync.

@Nuria I agree it seems the most likely solution is using a Ganeti VM, but due to All Hands we still have not had an SRE meeting since my last comment, and I have been told to bring it up there. I will follow up again Thursday.

We could start the VM request process though; that would mean filling out the (short) form at https://wikitech.wikimedia.org/wiki/SRE_Team_requests#Virtual_machine_requests_(Production)
@bmansurov, maybe you could do that?

Labs Project Tested: <project_name>    (skip, unless you have a Cloud VPS for testing import scripts)
Site/Location: <EQIAD|CODFW>   (EQIAD)
Number of systems: <# of VMs>   (1)
Service: <service name>   (have an idea for a name?)
Networking Requirements: <internal|external IP>, <specific networking access needed>   (internal)
Processor Requirements: <Number of Virtual CPUs>   (I don't know how much CPU the scripts will need) ?
Memory:   ?
Disks: <Capacity only>   ?
Other Requirements:   ?

Marco's suggestion of using mwmaint1002 is not a bad idea...

How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level (aka network ACLs), to whichever machine is chosen for this? Has this already been worked out (because I see no mention of it)?

For what it's worth, mwmaint1002 is also the place for a lot of other maintenance tasks. Depending on the amount of resources required during the import/update phase, it might make sense to keep that workload separated from other mwmaint1002 workloads. Is it just a LOAD DATA INFILE "something.tsv" or is it something more complex?

How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level (aka network ACLs), to whichever machine is chosen for this? Has this already been worked out (because I see no mention of it)?

@akosiaris: T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production

It hasn't been totally worked out, but one solution is an rsync pull. The network firewalls don't allow the analytics VLAN to initiate connections, but they will accept incoming ones.

Is it just a LOAD DATA INFILE "something.tsv" or is it something more complex?

Yes, more or less: https://github.com/wikimedia/research-article-recommender-deploy/blob/master/deploy.py
I plan on adding chunked imports later.
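As a rough illustration of what "chunked imports" could mean here (this is not the linked deploy.py; the column layout and the pymysql client are assumptions), the TSV can be read and inserted in fixed-size batches so a large file never has to be handled in one statement:

#!/usr/bin/env python3
"""Sketch of chunked TSV imports: stream the file in fixed-size batches
and insert each batch separately. Not the actual deploy.py; table and
column names are assumed for the example."""
import csv
from itertools import islice
import pymysql

def tsv_batches(path, batch_size=10000):
    """Yield lists of rows from a TSV file, batch_size rows at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch

def import_tsv(conn, path):
    with conn.cursor() as cur:
        for batch in tsv_batches(path):
            cur.executemany(
                "INSERT INTO recommendations (wikidata_id, score) VALUES (%s, %s)",
                batch,
            )
            conn.commit()  # committing per chunk keeps transactions small

if __name__ == "__main__":
    conn = pymysql.connect(host="m2-master.eqiad.wmnet", user="research",
                           password="CHANGEME", db="recommendations")
    import_tsv(conn, "recommendations_20181130.tsv")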

How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level (aka network ACLs), to whichever machine is chosen for this? Has this already been worked out (because I see no mention of it)?

@akosiaris: T213976

It hasn't been totally worked out, but one solution is an rsync pull. The network firewalls don't allow the analytics VLAN to initiate connections, but they will accept incoming ones.

The network firewalls are not stateful, but rather stateless. That is, while they indeed don't allow hosts in the analytics VLAN to initiate a TCP connection (aka a TCP SYN packet), they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake, despite the fact that they will be allowed to receive the SYN packet. Hence rsync pull will not work.

Is it just a LOAD DATA INFILE "something.tsv" or is it something more complex?

Yes, more or less: https://github.com/wikimedia/research-article-recommender-deploy/blob/master/deploy.py
I plan on adding chunked imports later.

That does look simple enough and not resource-expensive on mwmaint1002. I guess it can fit in there as well? But a VM is also a fine solution, and separation of concerns does suggest we go that way.

That does look simple enough and not resource-expensive on mwmaint1002. I guess it can fit in there as well? But a VM is also a fine solution, and separation of concerns does suggest we go that way.

My preference is also toward a VM because we're just getting started with recommendation APIs. We'll have section recommendations in the future and possibly other kinds of article recommendations.

they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake,

? Is that true? I'm pretty sure it isn't...unless we have some special rules I'm forgetting.

Just checked. We have rsync server modules on stat1007 that only certain hosts are allowed to rsync from. I live-edited it to add mwmaint1002.

Before adding mwmaint1002

[@mwmaint1002:/home/otto] $ rsync -rvn stat1007.eqiad.wmnet::srv/ ./
@ERROR: access denied to srv from mwmaint1002.eqiad.wmnet (10.64.16.77)

After adding mwmaint1002 to rsyncd.conf hosts allow

[@mwmaint1002:/home/otto] $ rsync -rvn stat1007.eqiad.wmnet::srv/published-datasets/one-off/ ./
receiving incremental file list
recSheetsTSVforMartin.tar.gz
article-recommender/
article-recommender/06032018-11302018.tar.gz
article-recommender/20181130.tar.gz
...

Ideally I would prefer that stats machines are completely out of the workflow of pushing data to machines like mwmaint1002.eqiad.wmnet, because stats machines are meant for experiments and one-off computations; they are not a "bridge" for data to get to our production hosts. We use them to put data out for the whole wide world to consume, and it should be fine to put data on dumps, but that is as far as I think we should go. This ticket (oh my!) might require us to consider other options and think of solutions we have not used before. Is Swift a viable alternative? Could we push those files there (à la S3) and have mwmaint1002.eqiad.wmnet pull them?

As @Ottomata pointed out more generic discussion about this topic can be found here: https://phabricator.wikimedia.org/T213976

they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake,

? Is that true? I'm pretty sure it isn't...unless we have some special rules I'm forgetting.

analytics-in4 seems to be one of the filters that has a tcp-established match term, a bitfield mask of the TCP flags "(ack | rst)", so indeed the handshake will proceed fine, as per your example. I was wrong in that comment.

For the record, just pointing out that the question of a new VM versus mwmaint1002 is probably irrelevant here. We can do both with what looks like minimal repercussions.

The discussion in T213976 is probably more important. Should we btw stall this on T213976?

Should we btw stall this on T213976?

Yes, we need to resolve first where/how binary/data files are going to be moved to the prod network.

akosiaris changed the task status from Open to Stalled. Feb 20 2019, 2:11 PM

Per comment above.

Ottomata claimed this task.

This is now supported via Kafka, Swift and an Oozie workflow.
T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production

Resolving.