In ticket T170878 I originally asked for my account on the stats boxes to expire on 2018-01-01. The process I was managing there has recently gone haywire, so I'd like to push back my expiration so I'll have time to fix it. Is it possible to get my account extended until, say, 2018-04-01 to make sure I have time to wrap things up?
I have a home directory on stat1005, so whatever accounts are necessary to access that.
This is for this WMF research project. The report is done and published, but part of that project is ongoing daily reports monitoring Wikipedia's accessibility around the world. That report generation process broke on Dec 1, and we've still got an outstanding issue of how we can publicly release them. Once those two things are resolved, my account can be disabled.
Can you explain a bit more why you need the account extended? I understand that you need to run some reports and that the process broke on December 1st; given that, it seems like access until the end of January gives you time to fix the issues. Are there any other concerns?
The other concern is that the output from these reports is supposed to be made publicly available by WMF. That's been agreed to in principle, but the process has not been worked out yet. As that's on WMF's side, I don't know how long figuring that out is going to take. If possible, I'd prefer to not lose access to the reports until after they've been made public, so I made a conservative estimate of April.
We are discussing the data publication over email, and Justin is still helping resolve a bug in the report that started on December 1. There is still a need for data access, and April is a reasonable estimate.
@Jdcc-berkman I just looked at the tools you are using to generate the report (@stat1005:/home/jdc), and they are fine and dandy for a prototype. In order to produce recurring reports, however, we would need to change the setup heavily: the current setup is quite brittle, it relies on technology that is no longer updated or maintained (it uses https://github.com/berkmancenter/hekaanom ; see https://mail.mozilla.org/pipermail/heka/2016-May/001059.html), and lastly, it does not take advantage of the Hadoop cluster, instead doing all processing in Python scripts against a user database on Hive.
I would not invest much time fixing the current setup; for it to be solid enough to produce public reports reliably, it will need quite a bit of work. Thanks for documenting everything with READMEs and such.
@Nuria OK, I understand those issues. Thank you for taking the time to look through all that. There are a lot of groups in our research community (and the WMF community/staff, I believe) that would still find this data useful, and we have engineering time at our disposal, so productionizing this is not out of the question for us. The guts of this whole thing is really just a parameterized Hive query and some linear algebra - the remaining 90% was just flexibility for experimentation.
Stated simply, our goal is to do some math on your time series data and dump a CSV somewhere like https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/.
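To make the shape of that pipeline concrete, here is a minimal, self-contained sketch in Python. The country series are hard-coded stand-ins for what a parameterized Hive query would return, and the z-score "math" is only a placeholder for the real anomaly scoring; all names and values here are illustrative, not from the actual project.

```python
import csv
import statistics

# Hypothetical input: daily request counts per country, standing in for
# the result of a parameterized Hive query (hard-coded for illustration).
series = {
    "FR": [100, 102, 98, 101, 99, 100, 25],   # last day looks anomalous
    "DE": [200, 198, 202, 201, 199, 200, 203],
}

def zscore_last(values):
    """Z-score of the most recent observation against its history."""
    history, latest = values[:-1], values[-1]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (latest - mean) / stdev if stdev else 0.0

# "Dump a CSV somewhere": write one scored row per country.
with open("anomaly_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["country", "latest_zscore"])
    for country, values in sorted(series.items()):
        writer.writerow([country, round(zscore_last(values), 2)])
```

In production the `series` dict would be replaced by the Hive query results and the output path by a public datasets location, but the query → math → CSV structure is the whole job.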
If you tell me what a production system looks like that queries from Hive, does some math, and spits the results into a CSV, I'll happily look for the resources to get it done. Is there a document somewhere that you could point me to outlining what it takes to get something into production? If not, from your previous comment and the Analytics Systems wiki, these look like top priorities - please let me know what I've missed:
- Generally good engineering principles (robust to failure, well scoped, maintained, etc.)
- Is a Hadoop job (what is the preferred client? Java API? Pig?)
- Managed as an Oozie workflow (just a guess)
@Jdcc-berkman Thanks for taking the time. See, for example, what a productionized job looks like here (this is Oozie/Spark): https://gerrit.wikimedia.org/r/#/c/383761/ In your case you are using the heka plugin to do anomaly detection; the algorithm that its Go code uses is called RPCA. I could not find an RPCA implementation in Spark, so you will need to do some research there. Other, simpler anomaly-detection methods like Holt-Winters are available, but those are likely not of use.
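For reference, RPCA here means Robust PCA: decomposing a data matrix into a low-rank part (normal traffic structure) plus a sparse part (anomalies). Below is a minimal NumPy sketch of the standard augmented-Lagrangian formulation with a fixed step size; it is an illustration of the algorithm only, not the hekaanom Go implementation and not Spark code, and the matrix shapes and parameters are assumptions.

```python
import numpy as np

def rpca(M, max_iter=500, tol=1e-7):
    """Robust PCA via an augmented-Lagrangian / ADMM iteration:
    split M into a low-rank component L and a sparse component S.
    In a days-by-countries traffic matrix, anomalies land in S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))        # standard sparsity weight
    mu = 0.25 * m * n / np.abs(M).sum()   # common step-size heuristic
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                  # dual variable

    def shrink(X, tau):                   # soft-thresholding operator
        return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    for _ in range(max_iter):
        # Singular-value thresholding step for the low-rank part.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt
        # Soft-thresholding step for the sparse (anomaly) part.
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y += mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S
```

A usage sketch: build a rank-1 "normal traffic" matrix, inject a spike, run `rpca`, and read the spike back out of `S` where its magnitude is largest. A Spark port would need a distributed SVD (or an approximation), which is the research gap mentioned above.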
I haven't been able to secure the developer time yet for this. It's unlikely I'll have anything to show for at least two months, during which time I won't need access.
But assuming I can get the resources to productionize this, how difficult is it to open access back up? If difficult, is there anything we can do to minimize that (locking/disabling account rather than deleting, etc.)?
We discourage access for collaborations that are not active. If you think that you will be able to collaborate on this two months from now, we can maintain access for a few more months (no big deal). If you are not sure whether you can commit to the collaboration, I'd rather disable the account.
Just to close this out, our developer put together something that could work as a good starting place for integrating this work into the production stack: https://github.com/berkmancenter/mw-anomaly-detection. I don't see us doing more of this work right now, but if productionizing this becomes a priority for either us or WMF, we're not starting from scratch.