@bd808 and @srishakatux can probably expand on my explanation, but the cloud team is interested in a public dashboard that visualizes edits made from Cloud VPS versus global edits (see parent task). I spoke with @mforns about this and, pending @JAllemandou's feedback, we think it will be easy to add an is_cloud_vps column to the geoeditors-daily dataset. From there we can run a reportupdater script that calculates total edits versus Cloud VPS edits per namespace. (Namespace was not in the original request, but I think that without it the "edits on Cloud VPS" metric does not have the same value.)
Description
Details
Status | Assigned | Task
---|---|---
Resolved | srishakatux | T226663 Create a WMCS edits dashboard via Dashiki
Resolved | JAllemandou | T233504 add whether an edit happened on cloud VPS to geoeditors-daily dataset
Event Timeline
We have a UDF that @bd808 worked on a while back that can classify IPs as coming from cloud: https://github.com/wikimedia/analytics-refinery-source/blob/06e6c0a1d63d31236638934129c5d5d0344dc677/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IpUtil.java
Can someone from cloud-services-team confirm that we want the dashboard with this data to be public? If the request is for a private dashboard, then after adding the column we could set one up in Superset in 5 minutes.
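To make the classification idea concrete, here is a minimal Python sketch of subnet-based origin detection. It is not the IpUtil implementation: the subnet list uses RFC 5737 documentation ranges as placeholders rather than the real Cloud VPS ranges, and the function name and the "cloud" / "internet" / "unknown" labels are assumptions made for illustration.

```python
import ipaddress

# Placeholder ranges taken from RFC 5737 documentation space.
# These are NOT the real Cloud VPS subnets; substitute the maintained list.
CLOUD_SUBNETS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def network_origin(ip: str) -> str:
    """Return a hypothetical origin label for an IP address string."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input is reported explicitly instead of raising.
        return "unknown"
    # The membership test is simply False when the address and network
    # have different IP versions, so IPv6 input needs no special casing.
    if any(addr in net for net in CLOUD_SUBNETS):
        return "cloud"
    return "internet"
```

A precomputed column would store the result of a call like `network_origin(client_ip)` for each edit, so that downstream queries can group by it without having to re-examine raw IPs.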
Change 538607 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update subnet lists for IpUtil
Change 538613 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add network-origin to the geoeditors-daily table
Change 538607 merged by Ottomata:
[analytics/refinery/source@master] Update subnet lists for IpUtil
I provided two patches based on the request above as examples. More discussion is probably needed about how long the data needs to be retained.
Example results for enwiki for 2019-08:
date | From Internet - Registered | From Internet - Anonymous | From labs - Registered |
---|---|---|---|
2019-08-01 | 102259 | 25611 | 145 |
2019-08-02 | 104750 | 24815 | 113 |
2019-08-03 | 97538 | 24380 | 85 |
2019-08-04 | 102913 | 26112 | 129 |
2019-08-05 | 105212 | 26140 | 95 |
2019-08-06 | 109782 | 26593 | 302 |
2019-08-07 | 105594 | 25986 | 83 |
2019-08-08 | 103530 | 26282 | 196 |
2019-08-09 | 102892 | 26052 | 233 |
2019-08-10 | 96695 | 23680 | 155 |
2019-08-11 | 94645 | 24505 | 169 |
2019-08-12 | 105111 | 27146 | 197 |
2019-08-13 | 101921 | 26847 | 185 |
2019-08-14 | 100371 | 26126 | 229 |
2019-08-15 | 98214 | 25493 | 300 |
2019-08-16 | 106488 | 24801 | 256 |
2019-08-17 | 93723 | 23985 | 74 |
2019-08-18 | 96536 | 24398 | 379 |
2019-08-19 | 102999 | 25800 | 113 |
2019-08-20 | 106787 | 26775 | 100 |
2019-08-21 | 104609 | 26028 | 186 |
2019-08-22 | 102298 | 25495 | 77 |
2019-08-23 | 109213 | 24802 | 109 |
2019-08-24 | 104440 | 23408 | 126 |
2019-08-25 | 97819 | 24346 | 319 |
2019-08-26 | 105393 | 25573 | 65 |
2019-08-27 | 103565 | 26259 | 194 |
2019-08-28 | 102997 | 26275 | 69 |
2019-08-29 | 103604 | 25670 | 632 |
2019-08-30 | 103154 | 26541 | 179 |
2019-08-31 | 113660 | 24695 | 156 |
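For a sense of scale, the share of edits coming from labs on a given day can be derived from the three columns above. The sketch below uses three rows copied from the table; the function name is an assumption, and the total is taken to be the sum of the three listed columns only.

```python
# Rows copied from the enwiki 2019-08 table:
# (date, internet registered, internet anonymous, labs registered)
rows = [
    ("2019-08-17", 93723, 23985, 74),
    ("2019-08-18", 96536, 24398, 379),
    ("2019-08-29", 103604, 25670, 632),
]


def labs_share(internet_registered, internet_anonymous, labs_registered):
    """Fraction of the day's listed edits that came from labs."""
    total = internet_registered + internet_anonymous + labs_registered
    # Guard against an empty day so the function never divides by zero.
    return labs_registered / total if total else 0.0


for date, registered, anonymous, labs in rows:
    print(f"{date}: {labs_share(registered, anonymous, labs):.3%} of edits from labs")
```

Even on the busiest labs day in the table (2019-08-29), the share stays below one percent, which is why a dedicated breakdown is more useful than reading the figure off global totals.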
Not sure about the is_cloud_vps name... Can the dashboard just examine the IP network and differentiate directly, rather than us adding a new field? Not sure.
Adding @Bmueller as definitive person to answer this, but my understanding is that we would like to create a public dashboard/report that is updated at least monthly showing month over month/year over year trends in Cloud VPS/Toolforge edit counts.
Yes, this can be done without a precomputed field in the Hadoop tables, either by using the UDF manually in an HQL query automated with Reportupdater, or with a custom script like the one in T226663#5287195 that works from the MySQL replicas of the checkuser tables. Precomputing probably makes the most sense if folks can find other reasons to correlate events with the Cloud Services network origin.
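Whichever source is used, the reporting step amounts to rolling daily counts up to months and comparing consecutive months. The following is a rough Python sketch of that aggregation, not a Reportupdater job: the row shape `(date, origin, count)`, the origin labels, and the sample numbers are all made up for illustration.

```python
from collections import defaultdict


def monthly_totals(rows):
    """Sum edit counts per (YYYY-MM, origin); rows are (YYYY-MM-DD, origin, count)."""
    totals = defaultdict(int)
    for day, origin, count in rows:
        totals[(day[:7], origin)] += count
    return dict(totals)


def month_over_month(totals, origin, prev_month, month):
    """Relative change between two months, or None if the earlier month is empty."""
    previous = totals.get((prev_month, origin), 0)
    current = totals.get((month, origin), 0)
    return (current - previous) / previous if previous else None


# Toy input with invented numbers, only to show the shape of the data.
toy_rows = [
    ("2019-09-01", "cloud", 100),
    ("2019-09-02", "cloud", 100),
    ("2019-10-01", "cloud", 250),
    ("2019-10-01", "internet", 1000),
]
totals = monthly_totals(toy_rows)
change = month_over_month(totals, "cloud", "2019-09", "2019-10")
```

The same grouping could be written as a `GROUP BY` in HQL; the Python form is used here only to keep the example self-contained.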
Change 538613 merged by Milimetric:
[analytics/refinery@master] Add network-origin to the geoeditors-daily table
The column has been added and I'm restarting the job so it will be filled going forward. Should we backfill this data as far back as we have the raw source (90 days)?
I do not think that is needed, as the Cloud-Services team has that data from their, ahem, "legacy" script that @bd808 has been running since old times. I think @srishakatux will be using this data going forward to add some reportupdater jobs.
OK, I restarted the monthly job and this column will be populated going forward. The first time it will be inserted is November 1st, 2019, when the job runs for the month of October.