
Workflow to be able to move data files computed in jobs from analytics cluster to production
Closed, Resolved, Public

Description

We need an easy workflow to move data files from the analytics cluster to production. Those files could be anything, but more often than not they will be the results of models that are shipped to some production system that makes use of them. While we use kafka as a data bridge in many instances, kafka is not an option for this use case as most of the files we will produce/are producing are larger than 10 MB.

For an example of a process that currently produces files requiring a manual push to prod, see the Search Platform team's workflow for shipping results of models calculated in the cluster to elasticsearch: https://wikitech.wikimedia.org/wiki/Search/MLR_Pipeline

A possible way to do this would be using rsync. Analytics would need to provide a workflow, usable by oozie jobs or similar, in which files, once produced, are stored in HDFS and from there pushed via an rsync mount to a known location in the Analytics VLAN (this could be done on a box similar to a stats box, dedicated to this purpose). The files would then be pulled by production systems once ready. Ideally the workflow includes a message in a known kafka topic (or similar) that notifies a consumer when a file is ready to be pulled; that way the daemons running on, for example, the elasticsearch fleet could pull data when available and upload it, along the lines of the sketch below.
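As a rough illustration of the notification half of that workflow, here is a minimal sketch of the "data ready" message using kafka-python; the broker address, topic name, and message fields are all assumptions rather than an agreed schema:

```python
# Sketch only: announce that a freshly rsynced file is ready to be pulled.
# Broker, topic and field names are placeholders, not an agreed convention.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka.example.org:9092',  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

event = {
    'dataset': 'mlr-model',                                    # what was produced (placeholder)
    'path': '/srv/published-datasets/mlr/model-20190117.bin',  # where rsync put it (placeholder)
    'sha1': '<checksum>',                                      # lets the consumer verify the pull
}
producer.send('data-ready', value=event)  # hypothetical topic name
producer.flush()
```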

Another possible way would be using swift as a generic repository of big blobs (S3-style).

Event Timeline

In our discussion yesterday, we mentioned both git-lfs and swift as options for this. I turned that idea down, but it seems it has been partially explored by RelEng. We should at least look into it.

T180627: Support git-lfs in scap
T182085: Connect Phabricator to swift for storage of git-lfs and file uploads.

@fgiunchedi @mmodell got a sec sometime to discuss this? Here's a quick summary:

Folks generate large-ish binary artifacts (e.g. ML models?) in the analytics hadoop cluster and then ship them elsewhere for input into some production system (elasticsearch, article recommendation api, etc.). Skimming the tickets I linked to in my last comment, it looks like this use case might be similar to what y'all were previously exploring with git-lfs and swift? Do you think it would be worth our time exploring this option too?

mforns triaged this task as Medium priority. Jan 17 2019, 5:50 PM
mforns moved this task from Incoming to Operational Excellence on the Analytics board.

FYI, I believe @Halfak and the ORES folks have a use case for this too. They build some models on stat1007 and use git lfs to push them. We should consider that too as we figure this out. See: T214089

@fgiunchedi ping on this...or can you ping someone who might know more about using Swift for this kind of thing? Should we consider this option or stick with rsync pull?

@fgiunchedi ping on this...or can you ping someone who might know more about using Swift for this kind of thing? Should we consider this option or stick with rsync pull?

That'd be me, yes :) Swift can certainly be used for storing binary blobs; some observations and questions, though. The single object size limit is 5GB; when exceeding that size it is possible to split the object into chunks (by the swift client automatically, or by managing the chunks explicitly), see also large object support.
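For objects above that 5GB limit, the splitting can be delegated to the client. A minimal sketch using python-swiftclient's SwiftService with automatic segmentation; the auth endpoint, credentials, container and object names are placeholders:

```python
# Sketch only: upload a large file, letting the client split it into 1GB segments
# so each stored object stays well under the 5GB per-object cap.
from swiftclient.service import SwiftService, SwiftUploadObject

opts = {
    'auth_version': '1.0',
    'auth': 'https://swift.example.org/auth/v1.0',  # placeholder auth endpoint
    'user': 'analytics:uploader',                   # placeholder account
    'key': 'secret',                                # placeholder key
}

with SwiftService(options=opts) as swift:
    uploads = [SwiftUploadObject('model.bin', object_name='mlr/model.bin')]
    for result in swift.upload('ml-models', uploads,
                               options={'segment_size': 1024 * 1024 * 1024}):
        if not result['success']:
            raise result['error']
```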

How big is the dataset and how fast is it going to grow? How often will clients download/upload data? Will all access (download/upload) be authenticated?

The reason I'm asking is that the production swift cluster is shared and its main use case is commons media, and I'd like to avoid non-production use cases overloading it. Having said that, we do have protections in place too (quotas, rate limits).

In terms of expectations, I am the official point of contact / maintainer for swift, though it isn't my primary focus these days as monitoring and logging goals/duties take precedence. With all of that said I'd love to see more use cases for swift being implemented as the hardware is there and it has been working well for us!

HTH!

How big is the dataset and how fast is it going to grow?

In the hundreds of megabytes I believe. @Halfak, @EBernhardson, @Miriam, @bmansurov, is this right? Will ML models be about this size for the foreseeable future?

How often will clients download/upload data?

I believe for now, not that often. It'll either be manual or around once a month.

Will all access (download/upload) be authenticated?

No? Everything will be accessed internally only. I suppose if we can easily authenticate, that'd be fine!

How big is the dataset and how fast is it going to grow?

In the hundreds of megabytes I believe. @Halfak, @EBernhardson, @Miriam, @bmansurov, is this right? Will ML models be about this size for the foreseeable future?

It really depends on the particular use case. For search we want to ship models from analytics to production. We generate maybe 20 models a month at a size of 10-50MB each. For my immediate needs something like that would work.

How often will clients download/upload data?

I believe for now, not that often. It'll either be manual or around once a month.

Search would generally be monthly for the automated uploads, and perhaps more when we are doing active model development/test deploys.

Will all access (download/upload) be authenticated?

No? Everything will be accessed internally only. I suppose if we can easily authenticate, that'd be fine!

For the search use case, write-only unauthenticated binary storage would be fine. Basically, if something in analytics can write to it, then produce a message to kafka saying 'find X at location Y', and have the other side get back the whole file, that would be perfect.
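A minimal sketch of what the consuming side of that contract could look like, assuming swift as the storage and kafka-python on the consumer; topic, container, field names and credentials are placeholders:

```python
# Sketch only: wait for 'find X at location Y' messages and pull the object from swift.
import json
from kafka import KafkaConsumer
from swiftclient.client import Connection

swift = Connection(
    authurl='https://swift.example.org/auth/v1.0',  # placeholder auth endpoint
    user='search:loader', key='secret',             # placeholder credentials
)

consumer = KafkaConsumer(
    'data-ready',                                   # hypothetical topic
    bootstrap_servers='kafka.example.org:9092',     # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    container = message.value['container']          # 'Y' in 'find X at location Y'
    obj = message.value['object']                   # 'X'
    headers, body = swift.get_object(container, obj)
    with open('/srv/models/' + obj.replace('/', '_'), 'wb') as f:  # placeholder destination
        f.write(body)
```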

Longer term, search will potentially want to generate some significantly larger datasets to ship to production, but we don't yet have a concrete implementation plan, so everything is a bit hand-wavy. As one example though, we have looked into turning all sentences from articles on a wiki into vectors. These vectors are 4kB each, and a previous estimate was 250M sentences on en.wiki and 50M sentences on fr.wiki, declining from there, for an overall data size on the order of 2-10TB. This is fairly far off though; something closer-term would be more like a 1kB vector per article, which is a much more reasonable ~5GB for enwiki, declining from there.
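For a rough sense of where those figures come from, a back-of-the-envelope calculation (the enwiki article count used here is an assumption, on the order of a few million):

```python
# Back-of-the-envelope sizes for the estimates above, in bytes.
sentence_vectors_enwiki = 250e6 * 4e3   # 250M sentences x 4kB = ~1 TB
sentence_vectors_frwiki = 50e6 * 4e3    # 50M sentences x 4kB  = ~0.2 TB
# summed over the remaining wikis, this lands in the quoted 2-10 TB range

article_vectors_enwiki = 5.5e6 * 1e3    # ~5.5M articles (assumed) x 1kB = ~5.5 GB, i.e. "~5GB for enwiki"
```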

I think our biggest models are around 100MB. I don't expect to have a model larger than 1GB any time soon.

We change models around once per month. Often we'll just be adding a model here and there. Periodically, we'll re-train every model. Currently we have 93 models. When we retrain, we might want to upload ~2-5GB of models and other assets for deployment.

We do have one very large asset file at 1.9GB (word2vec embedding). I don't need that to be much bigger right now, but we're starting to discuss using embeddings more generally in the mid term and I don't have a good sense for how large they can become. @diego might have a better sense for how big these embeddings can be.

We do have one very large asset file at 1.9GB (word2vec embedding). I don't need that to be much bigger right now, but we're starting to discuss using embeddings more generally in the mid term and I don't have a good sense for how large they can become. @diego might have a better sense for how big these embeddings can be.

In this ongoing project, we used around 5GB to have one [1,300] vector per article (namespace 0) for the full English Wikipedia.

@fgiunchedi thoughts on this? Looks like we are talking about 10-100 GB files, not quite terabytes.

@fgiunchedi thoughts on this? Looks like we are talking about 10-100 GB files, not quite terabytes.

Looks indeed doable to me! We'd need to think a little about how to "productionize" this use case, but worth a try for sure.

In the hundreds of megabytes I believe. @Halfak, @EBernhardson, @Miriam, @bmansurov, is this right? Will ML models be about this size for the foreseeable future?

@Ottomata yes, the size of the best image classification models is <1GB

Ok, I leave it up to @fgiunchedi and @Ottomata to think about how to productionize the "deployment" of models and data to swift. Deployment should include versioning, so that we are working with swift as if it were a package manager (this last sentence probably conveys how little I know about how swift manages binary blobs, but just to emphasize that versioning is important for this use case).
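One low-tech way to get that versioning, without any extra machinery on the swift side, would be to encode the version in the object name and list by prefix; a sketch under that assumption (all names and credentials are placeholders):

```python
# Sketch only: treat object names as "<model>/<version>/<file>" so every upload is
# immutable and consumers can list versions and pick one, package-manager-style.
from swiftclient.client import Connection

swift = Connection(authurl='https://swift.example.org/auth/v1.0',  # placeholder
                   user='analytics:uploader', key='secret')        # placeholder

with open('model.bin', 'rb') as f:
    swift.put_object('ml-models', 'mlr/2019-03-28/model.bin', contents=f)

# List the published versions of the 'mlr' model and pick the newest.
_, objects = swift.get_container('ml-models', prefix='mlr/')
versions = sorted({o['name'].split('/')[1] for o in objects})
print(versions[-1])
```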

We use transfer.py to transfer up to 12TB of data for database provisioning and backups while being able to saturate the link. It would be nice to work on a single system rather than each team inventing its own. It is based on cumin remote execution.

@jcrespo: keep in mind that this is not only for data destined for mysql (although this is the particular case this ticket is concerned with). We would like to find a generic system by which we can put large blobs (that might be binaries) calculated in the cluster into production.

transfer.py works for:

  • Plain files from filesystem to filesystem
  • Online mysql/mariaDB databases

Would it be trivial to make it work for hadoop blobs?

@jcrespo that seems worth considering; I leave it up to @fgiunchedi, @Ottomata and @akosiaris to see if transfer.py + swift could be a path to get this done.

One note about hadoop blobs: HDFS stores files split into chunks, and those chunks are not co-located. If we use transfer.py on local files after having brought them back from HDFS to the local machine, then yes, I assume it would work. I actually think the best option would be to make transfer.py stream from standard input, so that files don't need to be moved from HDFS to the local filesystem and then to the remote host. The most efficient way to transfer data "hadoop-style" would be to avoid the hop to the local machine entirely and stream directly to the destination from the various datanodes; the main concern I can see there is that access would be needed from all hadoop workers.
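Whichever tool ends up doing the transfer, the "skip the local hop" idea can be prototyped today by piping `hdfs dfs -cat` straight into an upload; a sketch assuming swift as the destination (paths, names and credentials are placeholders):

```python
# Sketch only: stream a file out of HDFS directly into swift, with no local copy.
import subprocess
from swiftclient.client import Connection

swift = Connection(authurl='https://swift.example.org/auth/v1.0',  # placeholder
                   user='analytics:uploader', key='secret')        # placeholder

hdfs_path = '/wmf/data/example/model.bin'  # placeholder HDFS path
cat = subprocess.Popen(['hdfs', 'dfs', '-cat', hdfs_path], stdout=subprocess.PIPE)

# put_object accepts a file-like object; with no content_length given, the body
# is streamed using chunked transfer encoding.
swift.put_object('ml-models', 'mlr/model.bin', contents=cat.stdout)
cat.wait()
```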

FWIW I found it fairly easy to work with swift from a development point of view but getting that experiment (phabricator large file support) into production hasn't been a priority so the project has been stalled for a long time.

I think transferring data could be taken care of with hadoop's copyToLocal, right? The issue is that we want to transfer to a "filesystem lookalike" that other prod systems can pull from.

Alright, I'm not familiar with Swift, but if we were to do this, here is what I think we'd need:

  • Network access Analytics <-> Swift
  • A Swift 'account'. (I don't think 'analytics' is the proper account name here, let's discuss.)
  • Standard 'data ready' event
  • Tooling to make it easy to upload to Swift and emit 'data ready' event

@mmodell This is kind of a 'deployment' process thing, is this something we can work with RelEng to figure out? @fgiunchedi should we work with you to get the Swift parts available/working? If not, let us know who to ping.

Alright, I'm not familiar with Swift, but if we were to do this, here is what I think we'd need:

  • Network access Analytics <-> Swift
  • A Swift 'account'. (I don't think 'analytics' is the proper account name here, let's discuss.)

Calling it out explicitly JIC: in my mind this also means provisioning of credentials to read/write clients.

  • Standard 'data ready' event
  • Tooling to make it easy to upload to Swift and emit 'data ready' event

There will also be some functionality needed to create the required containers. If (and I suspect that's the case) containers are not created frequently, we can provision them via puppet; otherwise an external script should take care of that.
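If containers do end up being created outside puppet, the call itself is small; a sketch (container name, ACL and credentials are assumptions):

```python
# Sketch only: create a container, optionally with a read ACL so consumers can
# pull without write credentials. '.r:*' allows unauthenticated reads; tighten as needed.
from swiftclient.client import Connection

swift = Connection(authurl='https://swift.example.org/auth/v1.0',  # placeholder
                   user='analytics:admin', key='secret')           # placeholder

swift.put_container('ml-models', headers={'X-Container-Read': '.r:*'})
```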

@mmodell This is kind of a 'deployment' process thing, is this something we can work with RelEng to figure out? @fgiunchedi should we work with you to get the Swift parts available/working? If not, let us know who to ping.

Yes, that'd still be the case for the swift part; however, in terms of priorities my time is mostly dedicated to infra foundations goals. What timeline were you looking at?

In general it would be great if the storage were decoupled from the analytics cluster through an API, so that in the longer term it can be used in other places; for example, I had the idea of (re)building models through jobs in the kubernetes cluster (production). Feel free to discard the idea if it's too much work.

it would be great if the storage were decoupled from the analytics cluster through an API

Well, APIish? Probably just some generic tooling that works inside and outside of the analytics cluster.

What timeline were you looking at?

This is blocking some work for T214074: Productionize article recommender systems, but we might be able to hack together an rsync pipeline for that in the short term if we have to. @Nuria, should we make this swift stuff a goal for next quarter?

but we might be able to hack together an rsync pipeline for that in the short term if we have to

I'd rather keep the technical debt to a minimum; would @fgiunchedi have time to work on this next quarter too? If so, let's make a joint goal to solve this problem not just for this particular use case but for several others.

but we might be able to hack together an rsync pipeline for that in the short term if we have to

I'd rather keep the technical debt to a minimum; would @fgiunchedi have time to work on this next quarter too? If so, let's make a joint goal to solve this problem not just for this particular use case but for several others.

I don't know yet whether I'll have time but for sure that seems a worthwhile goal to me!

Ok!

@bmansurov it sounds like we might be blocking you for a bit more while we solve this problem. I'm sorry about that. In the meantime, I can help you manually import new data if you need.

In general it would be great if the storage were decoupled from the analytics cluster through an API, so that in the longer term it can be used in other places; for example, I had the idea of (re)building models through jobs in the kubernetes cluster (production). Feel free to discard the idea if it's too much work.

The idea is fine (I've been having the same one for over a year now), but one important point: not in the production kubernetes clusters, please, as those are dedicated to service traffic. But I can see a kubernetes cluster existing for easing batch processing jobs, although the specifics of such an installation should be discussed broadly.

The idea is fine (I've been having the same one for over a year now), but one important point: not in the production kubernetes clusters, please, as those are dedicated to service traffic. But I can see a kubernetes cluster existing for easing batch processing jobs, although the specifics of such an installation should be discussed broadly.

Yes, you're right. Maybe turn mwmaint1002 into a minikube and apply jobs there (not just ores jobs but also mediawiki manual maintenance jobs, once mediawiki has moved to kubernetes). Just an idea.

Ping @fgiunchedi about putting this up as a common goal next quarter

Ping @fgiunchedi about putting this up as a common goal next quarter

I'm generally in favor and would be happy to help. As goals (and their commitments) are coming together now, maybe we can put together some goal language to get a better idea of what would be needed. What do you think?

@fgiunchedi sounds good, shall I try to set up a short 30 min meeting?

@fgiunchedi sounds good, shall I try to set up a short 30 min meeting?

Sounds good! (FTR, said meeting is already scheduled for Thu Mar 28, 2019 16:30 CET)

@EBernhardson mentioned to me yesterday a feature he wanted: a way to delete the swift objects once they are no longer needed. Kafka does this for them automatically now, since their messages expire after 7 days.

We could use swift's expiring objects support, although that is something we'd have to deploy first (mostly puppetization of the configuration), cf. https://docs.openstack.org/swift/2.10.2/overview_expiring_objects.html

The expiration for objects can be specified at the time of upload, so it needs to be added to our current workflow.

@EBernhardson @Ottomata re: swift expiring objects, see the link above too; the tl;dr is:

The X-Delete-After header takes an integer number of seconds. The proxy server that receives the request will convert this header into an X-Delete-At header using its current time plus the value given.
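Concretely, that means the uploader only has to set one extra header at upload time; a minimal sketch (the 7-day TTL, container, object name and credentials are assumptions):

```python
# Sketch only: upload an object that swift will expire after 7 days,
# mirroring Kafka's retention behaviour mentioned above.
from swiftclient.client import Connection

swift = Connection(authurl='https://swift.example.org/auth/v1.0',  # placeholder
                   user='analytics:uploader', key='secret')        # placeholder

with open('model.bin', 'rb') as f:
    swift.put_object(
        'ml-models', 'mlr/model.bin',
        contents=f,
        headers={'X-Delete-After': str(7 * 24 * 3600)},  # seconds; the proxy converts it to X-Delete-At
    )
```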

Great! @fgiunchedi you said 'that is something we'd have to deploy first'. Can I use this now?

Great! @fgiunchedi you said 'that is something we'd have to deploy first'. Can I use this now?

Indeed, I was talking about the object expirer (i.e. the process that actually deletes the objects). However, setting the header should work even now, and after the time has passed the object will start returning 404. It'd be great if you could help test that this is actually the case, thanks!
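One possible way to check that, once the header has been set and the TTL has passed (container, object name and credentials are placeholders):

```python
# Sketch only: verify that an expired object now returns 404.
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

swift = Connection(authurl='https://swift.example.org/auth/v1.0',  # placeholder
                   user='analytics:uploader', key='secret')        # placeholder

try:
    swift.head_object('ml-models', 'mlr/model.bin')
    print('object still present')
except ClientException as e:
    assert e.http_status == 404, e
    print('object expired as expected')
```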

@mmodell This is kind of a 'deployment' process thing, is this something we can work with RelEng to figure out? @fgiunchedi should we work with you to get the Swift parts available/working? If not, let us know who to ping.

I can't really commit to anything on behalf of the Release-Engineering-Team; for that we'd need to talk to @greg. That said, it's certainly something I would like to help out with. I don't think I have enough familiarity with swift, nor enough access to the cluster, to really get this done though.

Yes, you're right. Maybe turn mwmaint1002 into a minikube and apply jobs there (not just ores jobs but also mediawiki manual maintenance jobs, once mediawiki has moved to kubernetes). Just an idea.

I recently installed k3s locally. It's sort of an alternative to minikube. It works on bare metal instead of putting the cluster inside of virtualbox. So far I like what I've seen.

Change 522503 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] oozie swift/upload - add support for X-Delete-After header

https://gerrit.wikimedia.org/r/522503

Change 522503 merged by Ottomata:
[analytics/refinery@master] oozie swift/upload - add support for X-Delete-After header

https://gerrit.wikimedia.org/r/522503

Ottomata claimed this task.

This was finished back in July. Resolving.