In ticket T170878 I originally asked for my account on the stats boxes to expire on 2018-01-01. The process I was managing there has recently gone haywire, so I'd like to push back my expiration so I'll have time to fix it. Is it possible to get my account extended until, say, 2018-04-01 to make sure I have time to wrap things up?
I have a home directory on stat1005, so whatever accounts are necessary to access that.
This is for this WMF research project. The report is done and published, but part of that project is ongoing daily reports monitoring Wikipedia's accessibility around the world. That report generation process broke on Dec 1, and we've still got an outstanding issue of how we can publicly release them. Once those two things are resolved, my account can be disabled.
Can you explain a bit more why you need the account extended? I understand that you need to run some reports and that the process broke on December 1st; given that, it seems like access until the end of January gives you time to fix the issues. Are there any other concerns?
The other concern is that the output from these reports is supposed to be made publicly available by WMF. That's been agreed to in principle, but the process has not been worked out yet. As that's on WMF's side, I don't know how long figuring that out is going to take. If possible, I'd prefer to not lose access to the reports until after they've been made public, so I made a conservative estimate of April.
We are discussing the data publication over email, and Justin is still helping resolve a bug in the report that started on December 1. There is still a need for data access, and April is a reasonable estimate.
@Jdcc-berkman I just looked at the tools you are using to generate the report (@stat1005:/home/jdc), and they are fine and dandy for a prototype. In order to produce recurring reports, however, we would need to change the setup heavily: the current setup is quite brittle, it relies on technology that is no longer updated or maintained (it uses https://github.com/berkmancenter/hekaanom ; see https://mail.mozilla.org/pipermail/heka/2016-May/001059.html), and lastly, it does not take advantage of the Hadoop cluster, instead doing all processing in Python scripts against a user database on Hive.
I would not invest much time fixing the current setup; for it to be solid enough to produce public reports reliably, it will need quite a bit of work. Thanks for documenting everything with READMEs and such.
@Nuria OK, I understand those issues. Thank you for taking the time to look through all that. There are a lot of groups in our research community (and the WMF community/staff, I believe) that would still find this data useful, and we have engineering time at our disposal, so productionizing this is not out of the question for us. The guts of this whole thing is really just a parameterized Hive query and some linear algebra - the remaining 90% was just flexibility for experimentation.
Stated simply, our goal is to do some math on your time series data and dump a CSV somewhere like https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/.
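To make the shape of that pipeline concrete, here is a minimal, self-contained sketch in Python. The country series are hard-coded stand-ins for what a parameterized Hive query would return, and the z-score "math" is only a placeholder for the real anomaly scoring; all names and values here are illustrative, not from the actual project.

```python
import csv
import statistics

# Hypothetical input: daily request counts per country, standing in for
# the result of a parameterized Hive query (hard-coded for illustration).
series = {
    "FR": [100, 102, 98, 101, 99, 100, 25],   # last day looks anomalous
    "DE": [200, 198, 202, 201, 199, 200, 203],
}

def zscore_last(values):
    """Z-score of the most recent observation against its history."""
    history, latest = values[:-1], values[-1]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (latest - mean) / stdev if stdev else 0.0

# "Dump a CSV somewhere": write one scored row per country.
with open("anomaly_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["country", "latest_zscore"])
    for country, values in sorted(series.items()):
        writer.writerow([country, round(zscore_last(values), 2)])
```

In production the `series` dict would be replaced by the Hive query results and the output path by a public datasets location, but the query → math → CSV structure is the whole job.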
If you tell me what a production system looks like that queries from Hive, does some math, and spits the results into a CSV, I'll happily look for the resources to get it done. Is there a document somewhere that you could point me to outlining what it takes to get something into production? If not, from your previous comment and the Analytics Systems wiki, these look like top priorities - please let me know what I've missed:
- Generally good engineering principles (robust to failure, well scoped, maintained, etc.)
- Is a Hadoop job (what is the preferred client? Java API? Pig?)
- Managed as an Oozie workflow (just a guess)
@Jdcc-berkman Thanks for taking the time. See, for example, what a productionized job looks like here (this is Oozie/Spark): https://gerrit.wikimedia.org/r/#/c/383761/ In your case you are using the heka plugin to do anomaly detection; the algorithm that its Go code uses is called RPCA. I could not find an RPCA implementation in Spark, so you will need to do some research there. Other, simpler anomaly-detection methods like Holt-Winters are available, but those are likely not of use.
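For reference, RPCA here means Robust PCA: decomposing a data matrix into a low-rank part (normal traffic structure) plus a sparse part (anomalies). Below is a minimal NumPy sketch of the standard augmented-Lagrangian formulation with a fixed step size; it is an illustration of the algorithm only, not the hekaanom Go implementation and not Spark code, and the matrix shapes and parameters are assumptions.

```python
import numpy as np

def rpca(M, max_iter=500, tol=1e-7):
    """Robust PCA via an augmented-Lagrangian / ADMM iteration:
    split M into a low-rank component L and a sparse component S.
    In a days-by-countries traffic matrix, anomalies land in S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))        # standard sparsity weight
    mu = 0.25 * m * n / np.abs(M).sum()   # common step-size heuristic
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                  # dual variable

    def shrink(X, tau):                   # soft-thresholding operator
        return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    for _ in range(max_iter):
        # Singular-value thresholding step for the low-rank part.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt
        # Soft-thresholding step for the sparse (anomaly) part.
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y += mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S
```

A usage sketch: build a rank-1 "normal traffic" matrix, inject a spike, run `rpca`, and read the spike back out of `S` where its magnitude is largest. A Spark port would need a distributed SVD (or an approximation), which is the research gap mentioned above.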
I haven't been able to secure the developer time yet for this. It's unlikely I'll have anything to show for at least two months, during which time I won't need access.
But assuming I can get the resources to productionize this, how difficult is it to open access back up? If difficult, is there anything we can do to minimize that (locking/disabling account rather than deleting, etc.)?
We discourage access for collaborations that are not active. If you think that you will be able to collaborate on this two months from now, we can maintain access for a few more months (no big deal). If you are not sure whether you can commit to the collaboration, I'd rather disable the account.
Just to close this out, our developer put together something that could work as a good starting place for integrating this work into the production stack: https://github.com/berkmancenter/mw-anomaly-detection. I don't see us doing more of this work right now, but if productionizing this becomes a priority for either us or WMF, we're not starting from scratch.