
Add logging to gauge TemplateWizard usage
Closed, Resolved | Public | 5 Estimated Story Points

Description

We want to gauge TemplateWizard usage once it has launched to get a sense of impact - especially on small wikis. This data can be helpful in recognizing which wikis we should do more outreach to. We don't want to store this data long-term. If we need to purge the data after 90 days, that works.

What do we want to know:

  • How often is TemplateWizard launched?
  • How often are templates inserted on pages using TemplateWizard?
  • How easily do users find the template they were looking for?
    • How often do users select templates that appear in the recently used list? (Not implemented yet but keep in mind when developing schema)
    • How often do users discard a template after selecting one from the search?
  • How satisfied are users with TemplateWizard?
    • How often do users click 'Cancel' to close the dialog?
    • How often are edits made with TemplateWizard reverted?
    • How often are edits made with TemplateWizard abandoned?

More specifically, we want to log the following for every instance of TemplateWizard use:

  • When TemplateWizard launches
  • When a template from recently used list is selected (Not implemented yet but keep in mind when developing schema)
  • When a template is discarded using the trash icon
  • When the 'Insert' button is clicked to insert a template
  • When the dialog is closed using the 'Close' button
  • When the page is saved after having one or more templates inserted (along with the template names)

Along with the above, we want to:

  • record the wiki on which the action occurred
  • record the parent edit revision ID so we can later see how many of the edits made with TemplateWizard got reverted.
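A minimal sketch of what one logged event could carry, based on the lists above. The field names and most of the action strings are hypothetical, since the real schema is defined in a separate task. Only the idea of recording the wiki, the parent revision ID, and the template names on save comes from this description.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical action names, one per item in the list of things to log.
ACTIONS = {
    "launch",
    "use-recent",
    "remove-template",
    "insert-template",
    "close-dialog",
    "save-page",
}


@dataclass(frozen=True)
class TemplateWizardEvent:
    action: str                   # one of ACTIONS
    wiki: str                     # e.g. "enwiki"
    parent_rev_id: Optional[int]  # None when the page does not exist yet
    template_names: tuple = ()    # only filled in for "save-page"

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")


def count_by_wiki_and_action(events):
    """Tally events per (wiki, action) pair, e.g. to compare small wikis."""
    return Counter((e.wiki, e.action) for e in events)


events = [
    TemplateWizardEvent("launch", "enwiki", 100),
    TemplateWizardEvent("insert-template", "enwiki", 100),
    TemplateWizardEvent("launch", "tawiki", 7),
]
counts = count_by_wiki_and_action(events)
```

Note that the record deliberately has no user name or user ID field.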

Research into how to know TW was used in a saved edit is covered in T201476.

The schema will be defined in a separate task.

Ensure that no errors are thrown or issues discovered if Event Logging is not available.
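The requirement above amounts to a guard around every logging call. The extension itself is JavaScript, so the following Python is only a language-neutral illustration of the pattern. `log_safely` and the `log_event` callable are hypothetical names, not part of Event Logging.

```python
from typing import Callable, Optional


def log_safely(log_event: Optional[Callable[[str, dict], None]],
               schema: str, event: dict) -> bool:
    """Send an event if a logger is available; never raise.

    Returns True when the event was handed to the logger, False otherwise.
    """
    if log_event is None:
        # Event Logging is not installed or not loaded: skip silently.
        return False
    try:
        log_event(schema, event)
        return True
    except Exception:
        # A logging failure must not break the editing workflow.
        return False


sent = []
log_safely(lambda schema, event: sent.append((schema, event)),
           "TemplateWizard", {"action": "launch"})
log_safely(None, "TemplateWizard", {"action": "launch"})
```

With the logger missing (`None`), the second call does nothing and the editor carries on as usual.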

Event Timeline


Discussion from meeting:

  • How do we do the logging?
    • EventBus is the preferred way to do it, rather than EventLogging.
  • We shouldn't track usernames/IDs
    • Removed
  • We don't always know the revision ID in advance
    • Let's record the parent revision ID? Other ideas?
  • It's hard to figure out if an edit which was made using TemplateWizard was saved.
    • How can we record that?

@Niharika How do you normally consume this data? How is it retrieved or displayed or shared?

I assume there's a system or process for these things as this task appears to only require the collection of the data.

If there's not already something in place, do we need a separate task to take care of the questions I asked above?

@aezell Good question. We don't have a system in place to continuously surface the data, sadly. This data is stored in a database which I can access. In the past I've gone through and run the queries myself to find out what's required. As we have a data analyst now, I believe he can take care of that work. I will file a new task for that when we need the data, once the extension has been deployed long enough.

Nope, all this data can be visualized via dashboards.

In reading through some of the documentation, it appears that there are recommended ways to build dashboards on top of this data. I'd like to see us give that a shot if you think you'd find value in it. I think it beats having to re-run reports and queries.

Additionally, there was discussion yesterday about whether to use EventBus or EventLogging. According to this documentation, it appears that EventLogging is what we should use.

Oh, I didn't realize we could build dashboards on top of eventlogging data. It definitely beats having to run queries. Let's see if Morten can help us with building that in our meeting today.

You beat me to it - I was going to point out we can absolutely build a dashboard that can represent ongoing data. It's a bit of work but not too bad, we will just need to agree on what to set up as dashboard information / queries.

I just found out that we have access to Graphite and Grafana as well. It's possible that's a better fit for this kind of data?

If we did that, there's less mess with EventLogging schemas and such. We would have to think about how to namespace the data points we send to Graphite, but that's less cumbersome than making a schema, getting it approved, and then worrying about whether we want to change it later.

Also, Grafana graphs are very easy to make and customize once the data points are in Graphite.
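To make the namespacing question concrete, here is a rough sketch of one possible key layout. The `TemplateWizard.<wiki>.<action>` scheme is purely an assumption for illustration, as is the idea of sending the counters in the StatsD counter line format.

```python
import re

PREFIX = "TemplateWizard"  # hypothetical top-level namespace


def metric_key(wiki: str, action: str) -> str:
    """Build a dotted Graphite path such as TemplateWizard.enwiki.launch."""
    def clean(part: str) -> str:
        # Dots separate path segments in Graphite, so keep them out of parts.
        return re.sub(r"[^A-Za-z0-9_-]", "_", part)
    return ".".join([PREFIX, clean(wiki), clean(action)])


def statsd_counter_line(wiki: str, action: str, count: int = 1) -> str:
    """Format a StatsD counter increment ("<key>:<count>|c")."""
    return f"{metric_key(wiki, action)}:{count}|c"
```

Putting the wiki before the action would let a dashboard sum one action across all wikis with a wildcard in the middle segment.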

aezell set the point value for this task to 5.

@Niharika There were discussions in the research for this task regarding saving a list of what was inserted and then when Publish was clicked also logging that list. That is not in the description here so we've estimated this work without it. If you'd like that work to happen as well, we'll need to create a new task and estimate that work separately.

Do you mean in the event that multiple templates were inserted? Or doing logging for saved edits versus all edits?

Maybe @Samwilson can explain it better? I was sort of transcribing the discussion and might have missed it.

Yeah, if multiple templates were inserted.

My understanding is that we'll keep track of each template that was inserted (including if the same one was inserted multiple times) and then on the 'publish' event we'll save this list.
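The bookkeeping described here could look roughly like the following. This is a sketch in Python rather than the extension's JavaScript, and `InsertionTracker`, `record_insert`, `on_publish`, and the event fields are all hypothetical names.

```python
class InsertionTracker:
    """Remember every template inserted during one editing session."""

    def __init__(self):
        self._inserted = []

    def record_insert(self, template_name):
        # Duplicates are kept on purpose: the same template may be
        # inserted several times in a single edit.
        self._inserted.append(template_name)

    def on_publish(self, log_event):
        """Log the collected list once, when the edit is published."""
        if not self._inserted:
            return None  # nothing was inserted, so there is nothing to log
        event = {"action": "save-page",
                 "template_names": list(self._inserted)}
        log_event(event)
        self._inserted.clear()
        return event


logged = []
tracker = InsertionTracker()
tracker.record_insert("Infobox")
tracker.record_insert("Cite web")
tracker.record_insert("Infobox")
tracker.on_publish(logged.append)
```

Clearing the list after publishing means a second publish without new insertions logs nothing.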

Okay. Is that not included in the ticket estimate then? Trying to understand what exactly are we not doing as part of this ticket.

Correct. We are going to provide the statistics in the description. We did not interpret the description to include the scenario that Sam described above so that is not currently part of this estimate or this work.

This is what we understood as the work to do:

More specifically, we want to log the following for every instance of TemplateWizard use (for every saved edit):

  • When TemplateWizard launches
  • When a template from recently used list is selected (Not implemented yet but keep in mind when developing schema)
  • When a template is discarded using the trash icon
  • When the 'Insert' button is clicked to insert a template
  • When the dialog is closed using the 'Close' button
  • record the wiki on which the action occurred
  • record the parent edit revision ID so we can later see how many of the edits made with TemplateWizard got reverted.

One thought I had is whether we are also interested in parameter usage: for instance, whether only a small set of parameters is used, whether users interact with the parameters in the wizard, etc. I am unsure whether that would provide much useful information. I suspect it would not, and it would come at the cost of storing a lot more data.

One thought that came up in our Product Analytics meeting today was whether an edit tag would be useful?

Lastly, do we want to log that a page was published in which at least one template was added or modified through the TemplateWizard (for example together with the parent revision ID and the timestamp)? I am not familiar with how edit tags work, but that might be a good use case for it, rather than push it through EventLogging. If we have an edit tag it should be straightforward to find edits that used TW and check if they were reverted. Otherwise, having a parent ID and a timestamp of when the "publish" button was clicked would make it easier to find the relevant revision with a fair amount of certainty.
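The parent-ID-plus-timestamp matching mentioned above might be sketched like this. The `Revision` tuple, the function name, and the 60-second window are assumptions for illustration, not anything the extension or MediaWiki defines.

```python
from typing import NamedTuple


class Revision(NamedTuple):
    rev_id: int
    parent_id: int
    timestamp: int  # seconds since the epoch


def find_saved_revision(revisions, parent_rev_id, publish_ts, max_delay=60):
    """Return the revision most likely created by a logged 'publish' click.

    A candidate must have the logged parent revision and a save time no
    earlier than the click and at most max_delay seconds after it.
    """
    candidates = [
        r for r in revisions
        if r.parent_id == parent_rev_id
        and 0 <= r.timestamp - publish_ts <= max_delay
    ]
    # Prefer the earliest matching save; None if nothing matches.
    return min(candidates, key=lambda r: r.timestamp, default=None)


revisions = [
    Revision(11, 10, 1000),
    Revision(12, 10, 1030),
    Revision(13, 11, 1040),
]
match = find_saved_revision(revisions, parent_rev_id=10, publish_ts=1025)
```

Once the revision is identified, checking whether it was later reverted is a separate lookup that this sketch does not cover.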

Change 461083 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/extensions/TemplateWizard@master] Add EventLogging

https://gerrit.wikimedia.org/r/461083

This data is only for the short-term to see if the wizard is being used at all. It's okay if we don't have perfect numbers. My feeling is that an edit-tag for this feature would be an overkill. It seems to be mostly used for bigger things like VE edits, mobile edits etc. (I added this comment a while back but forgot to submit :/)

@nettrom_WMF It would be great if you can review the schema @Samwilson linked to above (per the guideline).

@Niharika : I've reviewed the schema and from what I can tell it looks good to go!

@Samwilson any progress on this? Is this stuck somewhere? Are we collecting data?

After we collect the data, we need to also create a dashboard for this so we can see the logging visually and follow up.

Yes, sorry, I had in the back of my mind that this was waiting on something. It wasn't.

The code to add the logging is now ready for review: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/TemplateWizard/+/461083/

The only action we're not adding is use-recent.

Once the above patch is merged, the logging DB table will be created (automatically, I understand) and then it'll be time to figure out our SQL queries, and create a dashboard.

Change 461083 merged by jenkins-bot:
[mediawiki/extensions/TemplateWizard@master] Add EventLogging

https://gerrit.wikimedia.org/r/461083

Can edits made at eventually deleted pages be logged as well?

@gh87 I'm not sure what you mean. At the time when TemplateWizard logs e.g. that a user inserted a template, it is not known whether the page will one day be deleted. Or do you mean that when a page is deleted, an extra log entry should be made stating that the page once (may have) had a template inserted by TemplateWizard?

@gh87 That won't be feasible because we will have to check every page that gets deleted to see if it had a TemplateWizard template inserted in the past. This can create a big load on the database. Also I don't think this will give us any new information about the tool. If an edit is incorrect/bad, the edits will get reverted. Pages being deleted because of a template is extreme.

@Milimetric Hi! I want to access the data from the above EventLogging schema but I don't see the table on stat1006. Do you know if there is a config switch somewhere to turn on the data copying? The patch got merged a while ago. The schema in question is https://meta.wikimedia.org/wiki/Schema:TemplateWizard

@kaldari asked me to look into this, so I did and here's a bit of info. Data for this schema is being captured (ref this dashboard) and stored in the Data Lake (see more info below), but does not appear to be whitelisted to be ingested into MariaDB (ref this documentation and this file). I'm not sure what the process is to get this data into MariaDB if that's needed, but Analytics can hopefully advise.

Querying this schema in the Data Lake:

SELECT CONCAT(year, '-', month, '-', LPAD(day, 2, "0")) AS log_date,
       COUNT(*) AS num_events
FROM event.templatewizard
WHERE year = 2018 AND month >= 10
GROUP BY CONCAT(year, '-', month, '-', LPAD(day, 2, "0"));

Results in the following output at this point in time:

log_date	num_events
2018-10-31	13
2018-11-01	147
2018-11-02	179
2018-11-03	367
2018-11-04	116
2018-11-05	440
2018-11-06	1466
2018-11-07	1342
2018-11-08	1894
2018-11-09	1361
2018-11-10	1445
2018-11-11	1455
2018-11-12	1488
2018-11-13	1601
2018-11-14	1356
2018-11-15	1674
2018-11-16	1093
Time taken: 34.277 seconds, Fetched: 17 row(s)
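For quick follow-up analysis, tab-separated output like the above can be loaded with a few lines of Python. This is only a sketch: the sample string holds the first three rows of the output, and the function name is made up for illustration.

```python
import csv
import io

# First three rows of the query output, tab-separated as printed by Hive.
sample_output = (
    "log_date\tnum_events\n"
    "2018-10-31\t13\n"
    "2018-11-01\t147\n"
    "2018-11-02\t179\n"
)


def parse_daily_counts(text):
    """Turn the two-column output into a {log_date: num_events} dict."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["log_date"]: int(row["num_events"]) for row in reader}


daily = parse_daily_counts(sample_output)
total = sum(daily.values())              # 339 for the three sample rows
busiest_day = max(daily, key=daily.get)  # "2018-11-02" in the sample
```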

@nettrom_WMF Thanks for the info! It's very helpful. I'll await a reply from Analytics on the replication step.
Can you tell me more about how to access this data in the Data Lake while we wait for the replication? Do I need additional permissions to be able to do that? I have access to stat1006.

@Niharika : You're welcome! stat1006 doesn't have access to Hive (the query tool for the Data Lake), so you'll need access to either stat1004 or stat1007. There's also a browser interface called Hue, but I'm unsure if that's a tool that Analytics recommends.

I'll add @Nuria to this thread so they can correct me if I'm wrong and let you know how to proceed.

Sorry to have missed this ping for so long. If you're waiting on me for more than a day, and you need an answer quickly, do ping me on IRC.

On replication: yes, as @nettrom_WMF found and pointed out, you have to explicitly whitelist a schema for it to be inserted into MySQL: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#MariaDB. We're moving away from MySQL and working on making querying in Hadoop easier. Hue is definitely a way to query the data. Niharika, I'm happy to help with getting access or showing you how to query; open a task if you don't already have it all set up.

@Niharika +1 to @Milimetric's response.
We prefer to keep as little data in MySQL as possible, and we recommend Hive as the best system for looking at data. MySQL would not scale for our needs, and we cannot guarantee it will work for more schemas going forward.

Regarding the earlier concern about "getting it approved, and then worrying about whether we want to change it later":

Schemas are changeable, changes just have to be backwards compatible, just like you would for a public API.
You can also test your whole workflow in beta: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster
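One possible reading of "backwards compatible" is that existing fields keep their type and no new field becomes required. A rough check along those lines is sketched below. It assumes schemas represented as plain dicts with `properties` and `required` keys, which is an illustrative layout and may not match how EventLogging schemas are actually written.

```python
def is_backwards_compatible(old, new):
    """Check that events valid under `old` stay valid under `new`."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    # Every existing field must survive with the same type.
    for name, spec in old_props.items():
        if name not in new_props:
            return False
        if new_props[name].get("type") != spec.get("type"):
            return False
    # Nothing may become required that was not required before,
    # otherwise events from older clients would start failing.
    old_required = set(old.get("required", []))
    new_required = set(new.get("required", []))
    return new_required <= old_required


v1 = {"properties": {"action": {"type": "string"}}, "required": ["action"]}
# Adding an optional field is fine.
v2 = {"properties": {"action": {"type": "string"},
                     "wiki": {"type": "string"}},
      "required": ["action"]}
# Changing a type and requiring a new field is not.
v3 = {"properties": {"action": {"type": "integer"},
                     "wiki": {"type": "string"}},
      "required": ["action", "wiki"]}
```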

Niharika moved this task from QA to Q2 2018-19 on the Community-Tech-Sprint board.

@Nuria @Milimetric Thanks for the response! I will open up a ticket to get access to Hive.