Page MenuHomePhabricator

add whether an edit happened on cloud VPS to geoeditors-daily dataset
Open, HighPublic

Description

@bd808 and @srishakatux can probably expand my explanation a bit but the cloud team is interested in having a public dashboard in which you can visualize edits that happen on cloud vps versus global edits (see parent task). Spoke with @mforns on this and (pending @JAllemandou's feedback) we think it will be easy to add a is_cloud_vps edit column to the geoeditors-daily dataset. From there we can run a reportupdater script that calculates total edits versus edits in cloud vps per namespace (while the namespace was not on the original request i think w/o it the metric of "edits on cloud vps' does not have the same value)

Event Timeline

Can someone from cloud-services-team confirm we wnat the dashboard with this data to be public? cause if the request is for a private dashboard after adding a column we could set up one in superset in 5 minutes

Change 538607 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update subnet lists for IpUtil

https://gerrit.wikimedia.org/r/538607

Change 538613 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add network-origin to the geoeditors-daily table

https://gerrit.wikimedia.org/r/538613

Change 538607 merged by Ottomata:
[analytics/refinery/source@master] Update subnet lists for IpUtil

https://gerrit.wikimedia.org/r/538607

Provided 2 patches based on the request above as examples. More discussion is probably needed based on needed retention.
Example results for enwiki for 2019-08:

dateFrom Internet - RegisteredFrom Internet - AnonymousFrom labs - Registered
2019-08-0110225925611145
2019-08-0210475024815113
2019-08-03975382438085
2019-08-0410291326112129
2019-08-051052122614095
2019-08-0610978226593302
2019-08-071055942598683
2019-08-0810353026282196
2019-08-0910289226052233
2019-08-109669523680155
2019-08-119464524505169
2019-08-1210511127146197
2019-08-1310192126847185
2019-08-1410037126126229
2019-08-159821425493300
2019-08-1610648824801256
2019-08-17937232398574
2019-08-189653624398379
2019-08-1910299925800113
2019-08-2010678726775100
2019-08-2110460926028186
2019-08-221022982549577
2019-08-2310921324802109
2019-08-2410444023408126
2019-08-259781924346319
2019-08-261053932557365
2019-08-2710356526259194
2019-08-281029972627569
2019-08-2910360425670632
2019-08-3010315426541179
2019-08-3111366024695156
fdans triaged this task as High priority.Mon, Sep 23, 3:24 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Not sure about the is_cloud_vps name...can the dashboard just examine the ip network and differentiate directly, rather than us adding a new field? Not sure.

bd808 added a comment.Mon, Sep 23, 4:43 PM

Can someone from cloud-services-team confirm we wnat the dashboard with this data to be public? cause if the request is for a private dashboard after adding a column we could set up one in superset in 5 minutes

Adding @Bmueller as definitive person to answer this, but my understanding is that we would like to create a public dashboard/report that is updated at least monthly showing month over month/year over year trends in Cloud VPS/Toolforge edit counts.

Not sure about the is_cloud_vps name...can the dashboard just examine the ip network and differentiate directly, rather than us adding a new field? Not sure.

Yes, this can be done without a precomputed field in the hadoop tables by using either the UDF manually in an HQL query automated with Reportupdater or a custom script like the one in T226663#5287195 that works from the mysql replicas of the checkuser tables. Percomputing probably makes the most sense if folks can find other reasons they want to correlate events with the Cloud Services network origin.

Nuria moved this task from Next Up to In Code Review on the Analytics-Kanban board.Wed, Oct 9, 4:22 PM

Change 538613 merged by Milimetric:
[analytics/refinery@master] Add network-origin to the geoeditors-daily table

https://gerrit.wikimedia.org/r/538613

The column has been added and I'm restarting the job so it will be filled going forward. Should we backfill this data as far back as we have the raw source (90 days)?

Nuria added a comment.Wed, Oct 9, 7:10 PM

I do not think that is needed as Cloud-Services team has that data from their , ahem "legacy" script that @bd808 was running since old times, I think @srishakatux will be using this data going forward to add some reportupdater jobs

ok, I restarted the monthly job and this column will be populated going forward. The first time it will be inserted is November 1st, 2019, when it runs the month of October.