
Figure out usage reporting for toolhub.wikimedia.org
Open, Needs Triage, Public

Description

We would like to have some tracking of Toolhub usage. By the magic of the Wikimedia CDN, the webrequest Hadoop table already has information on Toolhub traffic.

$ ssh stat1007
hive (wmf)> select count(distinct ip) as hits from webrequest where year = 2021 and month = 9 and day = 29 and uri_host = "toolhub.wikimedia.org";
... lots of hive progress reporting ...
MapReduce Total cumulative CPU time: 1 days 5 hours 57 minutes 45 seconds 740 msec
Ended Job = job_1632476005296_21353
MapReduce Jobs Launched:
Stage-Stage-1: Map: 5206  Reduce: 1   Cumulative CPU: 107865.74 sec   HDFS Read: 176232805069 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 1 days 5 hours 57 minutes 45 seconds 740 msec
OK
hits
14
Time taken: 106.498 seconds, Fetched: 1 row(s)

Figure out what we want to mine from here and whether it is worth setting up a dashboard somewhere.
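As a hypothetical starting point for "what to mine", a query like the one below would break traffic down by day and by the webrequest table's agent_type classification (so bot/spider traffic can be separated from human traffic). The column names follow the wmf.webrequest schema; the partition values are placeholders and would need to be adjusted before running.

```sql
-- Sketch, not a tested query: daily request counts and unique client
-- IPs for Toolhub, split by agent classification (user vs. spider).
SELECT year, month, day,
       agent_type,
       COUNT(1) AS requests,
       COUNT(DISTINCT ip) AS unique_ips
FROM wmf.webrequest
WHERE year = 2021 AND month = 9          -- partition pruning keeps the scan bounded
  AND uri_host = 'toolhub.wikimedia.org'
GROUP BY year, month, day, agent_type;
```

Grouping by uri_path instead of (or in addition to) agent_type would be a similar-shaped query if per-page or per-API-endpoint metrics turn out to be interesting.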

Event Timeline

Any plans to begin work on this, @bd808? So far we don't have much on the side of a "UI for viewing metrics".

Someone needs to figure out what metrics are desired and what level of public transparency is needed for viewing them. Possible solutions range from lightweight to heavyweight:

- a notebook hosted on https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter
- ETL into https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid with a https://wikitech.wikimedia.org/wiki/Analytics/Systems/Turnilo frontend
- canned reports published to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dashiki
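Whichever frontend is chosen, the ETL step would boil down to the same kind of rollup: turning raw webrequest rows into per-day metrics. A minimal sketch in Python, assuming hypothetical (day, ip, uri_path) tuples as input rather than the real webrequest schema:

```python
from collections import defaultdict

def daily_metrics(requests):
    """Roll up (day, ip, uri_path) request tuples into per-day
    hit counts and unique-client counts, i.e. the numbers a
    Turnilo or Dashiki dashboard would display."""
    hits = defaultdict(int)     # day -> total requests
    ips = defaultdict(set)      # day -> distinct client IPs
    for day, ip, _path in requests:
        hits[day] += 1
        ips[day].add(ip)
    return {day: {"hits": hits[day], "unique_ips": len(ips[day])}
            for day in hits}

# Tiny made-up sample standing in for webrequest rows.
sample = [
    ("2021-09-29", "10.0.0.1", "/"),
    ("2021-09-29", "10.0.0.1", "/api/tools/"),
    ("2021-09-29", "10.0.0.2", "/"),
    ("2021-09-30", "10.0.0.3", "/"),
]
print(daily_metrics(sample))
# → {'2021-09-29': {'hits': 3, 'unique_ips': 2},
#    '2021-09-30': {'hits': 1, 'unique_ips': 1}}
```

The distinction between "hits" and "unique_ips" matters here: the 14 reported by the query in the task description is distinct IPs, not raw request volume.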