Provide basic page view metrics for individual tools on toolforge
Open, In Progress, Lowest, Public

Description

Popularity metrics such as these are important, I think.

Should just be a thing somewhere that provides daily total 2xx response counts.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Toolforge.
yuvipanda added subscribers: yuvipanda, Ironholds, Halfak.
Ironholds triaged this task as Lowest priority. Jan 16 2015, 4:29 PM

Oh, just whatever any UA classifier considers as browsers vs bots UA.

Our pageviews definition is based, in large part, on MediaWiki's
structure. Applying it to the tool labs structure would require an
entirely new setup - one with variable accuracy depending on things as
idiosyncratic as how individual users decide to structure their tools.

Metrics are important; metrics are important for tools. I'm not
convinced metrics for tools are important enough to justify that kind
of effort (and the ongoing effort required to /keep/ it useful). Is
there a reason not to simply use a raw request count?

What I specifically have in mind is:

  1. Parse the nginx logs directly once an hour, and update the count publicly somewhere. This would be good enough for tools, since they don't get as much traffic as prod does.
  2. Use the same counting methodology that prod uses (not the same pipeline), so it is consistent.

I'm also not foisting this on analytics :D I would consider this a Tool-Labs feature and put time into building this out myself. I only want guidance from analytics as to how the prod system works so I can replicate it here.

The counting methodology will /not/ be consistent. It's based on
things like MediaWiki directory names and specific hosts ;p.

Ok, so 'as consistent as possible'? Which I suppose boils down to just deciding which UAs to bucket as 'humans' and which as 'bots' and nothing more.

Or, say, MIME type filtering. But yes. So, you want ua-parser ;p

Cool. Can you tell me what exactly mime filtering is used for?

Filtering out calls to JS/image/css assets

yuvipanda set Security to None.
bd808 renamed this task from Provide page view metrics for individual tools on toollabs to Provide basic page view metrics for individual tools on toollabs. May 16 2018, 12:37 AM
bd808 removed a project: Cloud-Services.
bd808 updated the task description.
bd808 subscribed.

I have been working on this a bit weekends/evenings and I think I have a viable basic process worked out. I'm dropping the original ideas of hourly data (moderately interesting, but makes the data set 24 times larger) and bot/non-bot classification (also moderately interesting, but a big pain to keep up with parsing and categorizing User-Agent data). I am also defining the metric as "any 200 to 299 status code returned by the tool" rather than "page" as categorizing content type is very tool specific. This metric will be useful for rough orders of magnitude comparisons in tool usage, which is much better than having no usage data at all which is our current state.
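
For illustration, the metric described above ("any 200 to 299 status code returned by the tool", one count per tool per day) could be sketched roughly as follows. This is not the script from the patches on this task. The log line layout (a "combined"-style access log) and the rule that the first path segment is the tool name are assumptions, and `LINE_RE` and `count_hits` are hypothetical names.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed "combined"-style access log line; the real front proxy format may differ.
# Assumption: the first path segment of the request is the tool name.
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:[A-Z]+) /(?P<tool>[^/?\s"]+)[^"]*" (?P<status>\d{3}) '
)


def count_hits(lines):
    """Return a Counter keyed by (tool, ISO date) counting 2xx responses."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # unparseable line, or a request for "/" with no tool segment
        status = int(match.group("status"))
        if not 200 <= status <= 299:
            continue  # only 2xx responses count as a hit
        day = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z").date()
        hits[(match.group("tool"), day.isoformat())] += 1
    return hits
```

Each `(tool, day)` pair maps naturally onto one stored row per tool per day, which keeps the data set small compared with hourly buckets.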

Change 482237 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: process dynamicproxy access logs

https://gerrit.wikimedia.org/r/482237

Change 482238 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[labs/private@master] toolforge: profile::toolforge::toolviews::mysql_password

https://gerrit.wikimedia.org/r/482238

Change 482238 merged by Andrew Bogott:
[labs/private@master] toolforge: profile::toolforge::toolviews::mysql_password

https://gerrit.wikimedia.org/r/482238

Change 482237 merged by Andrew Bogott:
[operations/puppet@production] toolforge: process dynamicproxy access logs

https://gerrit.wikimedia.org/r/482237

Change 486822 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: fix script naming for run-parts

https://gerrit.wikimedia.org/r/486822

Change 486822 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: fix script naming for run-parts

https://gerrit.wikimedia.org/r/486822

@MusikAnimal Hey! This is the project I was talking to you about at the Prague Hackathon. There is currently a web interface at https://tools.wmflabs.org/toolviews/api/v1/day/2019-05-31 that returns a json dump of each day's traffic stats. The web service is a really simple flask app with no consumers yet, so I can tweak the response any way you'd like to make it easier for you to put a pretty UI on it. It would be pretty awesome if we could generate topviews and siteviews style visualizations of this raw data. Let me know your thoughts about how we might accomplish that without me learning a whole lot about modern javascript UIs. :)

As a tool provider it would (also) be nice to have the data transposed - so to speak: Eg. https://tools.wmflabs.org/toolviews/api/v1/tool/scholia where the returned data is across multiple days.

@MusikAnimal Hey! This is the project I was talking to you about at the Prague Hackathon. There is currently a web interface at https://tools.wmflabs.org/toolviews/api/v1/day/2019-05-31 that returns a json dump of each day's traffic stats. The web service is a really simple flask app with no consumers yet, so I can tweak the response any way you'd like to make it easier for you to put a pretty UI on it. It would be pretty awesome if we could generate topviews and siteviews style visualizations of this raw data. Let me know your thoughts about how we might accomplish that without me learning a whole lot about modern javascript UIs. :)

I am interested! I can't remember the term from the xkcd about getting easily persuaded into building an app for something, but that applies here :)

The Topviews-style visualization makes the most sense with the format of the API response. I think this would be fairly easy to build. A Siteviews-style app (where you can selectively enter in specific tools, or "all") would be awesome, but ideally we'd have endpoints similar to https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Cat/daily/2019051200/2019060100 , where it gives us per-day data for a given tool and date range (which I think is what Fnielsen is talking about). So let's start off with just the Topviews variant and go from there.

Would it be possible for me to use the toolviews account? I see https://tools.wmflabs.org/toolviews/ says "coming soon"; and frankly, "Toolviews" is the most fitting name :)

Yes, but... you would either need to do your work in the existing Python3 Flask webservice that is running there to provide the API, or I would need to move the API to another tool account. Either is possible, so let me know if the language+framework constraint is something you can work with or not.

As a tool provider it would (also) be nice to have the data transposed - so to speak: Eg. https://tools.wmflabs.org/toolviews/api/v1/tool/scholia where the returned data is across multiple days.

This is entirely possible, yes. I think we would want to do something similar to what @MusikAnimal mentioned in T87001#5229051 and actually give this per-tool endpoint some way to specify the desired date range instead of dumping out all data known for the specified tool. The backing database is only storing one row per tool per day, but even that will become an unwieldy result set over time.

Maybe something like /toolviews/api/v1/tool/<toolname>?start=<ISO 8601 date>&end=<ISO 8601 date> with both start and end defaulting to the prior day? We could mirror the pageviews URL format too, but these criteria really seem more like query string parameters than path components to me.
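
As a rough sketch of the proposed query string handling (both `start` and `end` optional, each defaulting to the prior day), something like the following could work. The function name `resolve_range` and the rejection of reversed ranges are assumptions for illustration, not part of the proposal above.

```python
from datetime import date, timedelta


def resolve_range(start=None, end=None, today=None):
    """Parse optional ISO 8601 dates; each defaults to the day before `today`."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    start_day = date.fromisoformat(start) if start else yesterday
    end_day = date.fromisoformat(end) if end else yesterday
    if start_day > end_day:
        # Assumption: a reversed range is treated as a client error.
        raise ValueError("start must not be after end")
    return start_day, end_day
```

Passing `today` explicitly makes the defaulting testable, e.g. `resolve_range(today=date(2019, 6, 1))` resolves both ends to 2019-05-31.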

You can also use your ToolsDB credentials ($HOME/replica.my.cnf) to access the s53734__toolviews_p database directly until an API is available (NOTE: any data from before 2019-02-01 should be treated as a guess at best):

$ sql toolsdb
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 142786401
Server version: 10.1.38-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

(u3518@tools.db.svc.eqiad.wmflabs) [(none)]> use s53734__toolviews_p;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
(u3518@tools.db.svc.eqiad.wmflabs) [s53734__toolviews_p]> show tables;
+-------------------------------+
| Tables_in_s53734__toolviews_p |
+-------------------------------+
| daily_raw_views               |
+-------------------------------+
1 row in set (0.00 sec)

(u3518@tools.db.svc.eqiad.wmflabs) [s53734__toolviews_p]> describe daily_raw_views;
+-------------+------------------+------+-----+---------+-------+
| Field       | Type             | Null | Key | Default | Extra |
+-------------+------------------+------+-----+---------+-------+
| tool        | varchar(128)     | NO   | PRI | NULL    |       |
| request_day | date             | NO   | PRI | NULL    |       |
| hits        | int(11) unsigned | NO   |     | 0       |       |
+-------------+------------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

(u3518@tools.db.svc.eqiad.wmflabs) [s53734__toolviews_p]> select sum(hits) from daily_raw_views where tool = 'scholia';
+-----------+
| sum(hits) |
+-----------+
|   5026749 |
+-----------+
1 row in set (0.01 sec)
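
For a per-tool, per-day series rather than a single sum, a query along these lines might help. This is only a sketch: it uses an in-memory SQLite table as a stand-in that mirrors the `daily_raw_views` columns shown above, the sample rows are made up, and `daily_hits` is a hypothetical helper. Against ToolsDB you would connect with a MariaDB driver instead, where the placeholder syntax is likely to differ.

```python
import sqlite3

# In-memory stand-in for s53734__toolviews_p; not the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE daily_raw_views (
           tool TEXT NOT NULL,
           request_day TEXT NOT NULL,
           hits INTEGER NOT NULL DEFAULT 0,
           PRIMARY KEY (tool, request_day))"""
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO daily_raw_views VALUES (?, ?, ?)",
    [
        ("scholia", "2019-05-30", 100),
        ("scholia", "2019-05-31", 150),
        ("scholia", "2019-06-01", 120),
        ("toolviews", "2019-05-31", 7),
    ],
)


def daily_hits(conn, tool, start, end):
    """Return [(request_day, hits), ...] for one tool over an inclusive date range."""
    cursor = conn.execute(
        "SELECT request_day, hits FROM daily_raw_views "
        "WHERE tool = ? AND request_day BETWEEN ? AND ? "
        "ORDER BY request_day",
        (tool, start, end),
    )
    return cursor.fetchall()
```

Bounding the range in the query keeps the result set small even as one row per tool per day accumulates.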

Would it be possible for me to use the toolviews account? I see https://tools.wmflabs.org/toolviews/ says "coming soon"; and frankly, "Toolviews" is the most fitting name :)

Yes, but... you would either need to do your work in the existing Python3 Flask webservice that is running there to provide the API, or I would need to move the API to another tool account. Either is possible, so let me know if the language+framework constraint is something you can work with or not.

Unfortunately there is a PHP 7.2 dependency. This is only to use Krinkle's Intuition i18n framework. Everything else is just JS/CSS.

That said, https://tools.wmflabs.org, https://toolsadmin.wikimedia.org/, etc. don't appear to be localized (that's not a complaint), so maybe we don't need to localize Toolviews either? This way the frontend and the API could live on the same tool account, assuming it's trivial for the Python3 Flask webservice to serve static assets. Basically I could end up giving you three files: the HTML, JS and CSS.

I think we can live without i18n of the small number of UI strings for a bit. Maybe this would motivate me to figure out a Flask integration for Intuition. :)

@bd808 Could we extend the /toolviews/api/v1/day/YYYY-MM-DD endpoint to accept an end date too, as with /toolviews/api/v1/day/YYYY-MM-DD/YYYY-MM-DD? This way we can do a Topviews-style visualization but allow any arbitrary date range. This I think is different than T227120, which is about getting timeline data for a specific tool (Pageviews-style visualization).

Also, is there any easy way to tell if a tool has a webservice, other than checking the response of tools.wmflabs.org/toolname? It'd be neat to link to the tools in the interface, similar to how https://tools.wmflabs.org/admin/tools only links if there is a webservice.
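
To make the date-range request concrete: a Topviews-style ranking over a range boils down to summing each tool's daily hits and sorting. A minimal sketch, assuming the per-day data is available as a mapping of day to `{tool: hits}` (the actual API response shape is not shown in this thread, and `top_tools` is a hypothetical name):

```python
from collections import Counter


def top_tools(per_day, limit=10):
    """Sum hits per tool across all days and return the `limit` busiest tools."""
    totals = Counter()
    for day_hits in per_day.values():
        totals.update(day_hits)  # adds each tool's hits to its running total
    return totals.most_common(limit)
```

Doing this server-side for a start/end pair would save the client one request per day in the range.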

@MusikAnimal I think I got all the things you asked for in the API implemented. I fell into a gold plating rabbit hole too and ended up adding an OpenAPI spec and UI to the app too: https://tools.wmflabs.org/toolviews/api/

Yay! This is awesome :) It will be much easier to put something together. I should have enough free time this week to make a prototype.

bd808 removed bd808 as the assignee of this task. Mar 25 2020, 5:53 PM

Unlicking this cookie. https://tools.wmflabs.org/toolviews/api/ is working, but we still need a UI to draw pretty graphs of the data. @MusikAnimal may or may not be able to help get someone started in the right direction on doing that. I would be happy to add co-maintainers to toolviews as needed to make deploying that UI possible.

MusikAnimal changed the task status from Open to In Progress. May 20 2023, 5:25 PM
MusikAnimal claimed this task.

@bd808 I never saw you that final night at the Hackathon, but I did get a demo working just 30 mins or so after the presentations! I was so close, hehe...

Anyway, I'm now very close to having a polished product ready to demo. How should we go about this? I found https://gitlab.wikimedia.org/toolforge-repos/toolviews ; maybe we could add https://gitlab.wikimedia.org/musikanimal/toolviews as a git submodule, and have the Flask app render its HTML/JS/CSS? Or should I contribute to the toolviews repo directly? Only issue there is it adds a bunch of dependencies that you otherwise don't need, but none of the dependencies clash with the backend, at least. This also got me thinking -- maybe I could leverage the Python backend to render localized strings on page load, rather than doing it all clientside which means it will first load in English before the requested language can be shown. What do you think?

Sneak peek (yes there are visual bugs...):

Screenshot from 2023-06-02 10-54-38.png (841×1 px, 94 KB)

taavi renamed this task from Provide basic page view metrics for individual tools on toollabs to Provide basic page view metrics for individual tools on toolforge. Jun 2 2023, 3:05 PM

@bd808 I never saw you that final night at the Hackathon, but I did get a demo working just 30 mins or so after the presentations! I was so close, hehe...

\o/ I don't know that I've ever finished a project at the event, so I have empathy for almost getting there. The important bit is that you are still pushing forward. :)

Anyway, I'm now very close to having a polished product ready to demo. How should we go about this? I found https://gitlab.wikimedia.org/toolforge-repos/toolviews ; maybe we could add https://gitlab.wikimedia.org/musikanimal/toolviews as a git submodule, and have the Flask app render its HTML/JS/CSS? Or should I contribute to the toolviews repo directly? Only issue there is it adds a bunch of dependencies that you otherwise don't need, but none of the dependencies clash with the backend, at least.

Either linking in a UI repo as a submodule or adding the UI code directly into the existing backend repo is fine with me. If we do a submodule I would recommend that we make a https://gitlab.wikimedia.org/toolforge-repos/toolviews-ui repo via Striker to hold the code rather than a personal namespace repo. That should make it easier for the next set of maintainers to manage the code.

This also got me thinking -- maybe I could leverage the Python backend to render localized strings on page load, rather than doing it all clientside which means it will first load in English before the requested language can be shown. What do you think?

Sure. The current backend doesn't have any localization system set up, but that should be possible. I think we can get away without localization for an initial implementation if needed too.

Sneak peek (yes there are visual bugs...):

Screenshot from 2023-06-02 10-54-38.png (841×1 px, 94 KB)

Nice! Part of me knew that csp-report would end up being the most active tool, but I wasn't expecting it to be that much more used. Time for a new round of working with maintainers to find datasources that don't violate the Content-Security-Policy to bring that down I guess.

Either linking in a UI repo as a submodule or adding the UI code directly into the existing backend repo is fine with me. If we do a submodule I would recommend that we make a https://gitlab.wikimedia.org/toolforge-repos/toolviews-ui repo via Striker to hold the code rather than a personal namespace repo. That should make it easier for the next set of maintainers to manage the code.

Definitely... my repo is littered with development commits so I was going to squash it all anyway. I'll move it all to a development branch on the toolviews repo shortly!

The current backend doesn't have any localization system setup, but that should be possible. I think we can get away without localization for an initial implementation if needed too.

Awesome! Do you know of any Python packages for MediaWiki localization, as in something akin to Intuition? There aren't a whole lot of messages but there is a need for proper pluralization rules, etc. I also wasn't going to consider this a blocker for the initial release, but I have a feeling we'll want localization in the longer-term.

Part of me knew that csp-report would end up being the most active tool, but I wasn't expecting it to be that much more used. Time for a new round of working with maintainers to find datasources that don't violate the Content-Security-Policy to bring that down I guess.

I don't know how this tool works, but I gather there's automation somewhere to POST to the /collect endpoint. Could that be where the traffic is coming from? https://csp-report.toolforge.org doesn't even load for me, but I see https://csp-report.toolforge.org/tools does.

The question of bot traffic vs. human was the first thing that came up when I showed toolviews to people at the hackathon. There are other high-ranking tools like https://spellcheck.toolforge.org/ that don't seem to have a frontend, so I assume the traffic is from the service it provides to external clients. None of this is an issue, we may just want to put some clarification in a FAQ or something for toolviews.

The current backend doesn't have any localization system setup, but that should be possible. I think we can get away without localization for an initial implementation if needed too.

Awesome! Do you know of any Python packages for MediaWiki localization, as in something akin to Intuition? There aren't a whole lot of messages but there is a need for proper pluralization rules, etc. I also wasn't going to consider this a blocker for the initial release, but I have a feeling we'll want localization in the longer-term.

I don't know of any Python ports of the MediaWiki l10n engine. Generally Python projects use gettext-based localization engines. Gettext supports plurals but not gender. https://babel.pocoo.org/en/latest/ seems to be the one most commonly used with the Flask framework that the backend is written with. I haven't used Babel, but I do have a reasonable amount of experience with Django's l10n system. The two look to be quite similar based on a very quick review of the Babel docs.

Another direction that we could go would be to use a pure client-side JS framework like banana-i18n. Toolhub actually uses both gettext on the Django backend and banana-i18n on the Vue frontend.

Part of me knew that csp-report would end up being the most active tool, but I wasn't expecting it to be that much more used. Time for a new round of working with maintainers to find datasources that don't violate the Content-Security-Policy to bring that down I guess.

I don't know how this tool works, but I gather there's automation somewhere to POST to the /collect endpoint. Could that be where the traffic is coming from? https://csp-report.toolforge.org doesn't even load for me, but I see https://csp-report.toolforge.org/tools does.

Yes, the traffic is primarily browsers reporting Content-Security-Policy header violations by tools. Everything served by the Toolforge front proxy has a CSP header attached that contains a report-uri https://csp-report.toolforge.org/collect; stanza.

The landing page not loading is an intermittent problem that I've never spent enough time poking at to figure out, but historically I think it has been correlated with traffic spikes.

The question of bot traffic vs. human was the first thing that came up when I showed toolviews to people at the hackathon. There are other high-ranking tools like https://spellcheck.toolforge.org/ that don't seem to have a frontend, so I assume the traffic is from the service it provides to external clients. None of this is an issue, we may just want to put some clarification in a FAQ or something for toolviews.

"Bot traffic" is a made-up thing. :) The WMF pageview analytics do some user-agent based classification to put things into their bot bucket. The very streamlined data collection that toolviews is doing based on the front proxy nginx logs does not contain any user-agent based processing. The idea was to be able to show traffic patterns more than to provide any sort of detailed traffic analysis. I agree that we should probably put a FAQ section on the https://wikitech.wikimedia.org/wiki/Tool:Toolviews page that I've not yet bothered to create for this tool.