Run script to collect one-time stats about TemplateData
Closed, Resolved | Public | 5 Estimated Story Points

Description

Fetch all TemplateData from an entire wiki, then analyze the collective JSON to create statistics. Run script on en, de, fa wikis and compare the patterns in use.

Questions
TemplateData Usage

  • How many templates already have TemplateData?

DE wiki is an outlier with 8.6% of all templates having TemplateData, a significant jump from the ~2% for the others. Still, all have relatively low percentages. With this data it is hard to know if that is because there are many templates which do not need it (because they do not have parameters or are rarely used) or if many commonly used templates are missing TemplateData. We hope to track the percentage of templates opened in VE with or without TemplateData and that will give a better idea of how many templates are missing it, which could benefit from having it. T259705: Log template metadata whenever a template dialog is opened

[Figure: TemplateData usage.jpg]

Parameter Types

  • How often is each parameter type used? (Boolean, Content, File, Line, Number, Date, Page, String, Template, Unbalanced wikitext, Unknown, URL, User)

Most parameters are unknown; this is the default when parameters are added. The most often chosen type is 'string.' On EN wiki, unknown and strings combined are 85.95% of all parameters. Both have no effect on VE or TemplateWizard and are essentially ‘blank’ types. DE wiki is an outlier and uses Line, Number, Content, and Wiki-page-name often (31-12%), but they are rarely used on the other wikis (7-1%). All four wikis rarely use Boolean and Date (4-1%), and almost never use Wiki-file-name, unbalanced-wikitext, wiki-template-name, URL, and wiki-user-name (under 1% on all wikis). These seven under-used types are some of the ones that offer special support in VE, such as autocomplete or validation.

  • How many deprecated, required, suggested, vs optional parameters are there?

Almost all parameters are optional. DE wiki is an outlier with 12% suggested and 6% required while all others have only 1-3% for either suggested or required. Very few parameters are deprecated (0.29%-1.94%).
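The per-parameter counts behind these two answers could be produced along the following lines. This is only a sketch, not the notebook's actual code: it assumes the dump is one TemplateData JSON object per line, that a missing `type` means "unknown", and that a parameter carrying several status flags is counted once by priority (deprecated, then required, then suggested). The function name `param_stats` and the sample rows are made up for illustration.

```python
import json
from collections import Counter

def param_stats(jsonlines):
    """Count parameter types and statuses, one TemplateData JSON object per line."""
    types, statuses = Counter(), Counter()
    for line in jsonlines:
        line = line.strip()
        if not line:
            continue
        # "params" may be missing or an empty list in some dumps; treat both as empty.
        params = json.loads(line).get("params") or {}
        for spec in params.values():
            types[spec.get("type", "unknown")] += 1
            # Assumption: count each parameter under exactly one status, by priority.
            if spec.get("deprecated"):
                statuses["deprecated"] += 1
            elif spec.get("required"):
                statuses["required"] += 1
            elif spec.get("suggested"):
                statuses["suggested"] += 1
            else:
                statuses["optional"] += 1
    return types, statuses

# Hypothetical sample rows, not real wiki data.
sample = [
    '{"params": {"1": {"type": "line", "required": true}, "date": {"suggested": true}}}',
    '{"params": {"url": {"type": "url"}, "old": {"deprecated": "use url"}}}',
]
types, statuses = param_stats(sample)
total = sum(types.values())
for name, count in types.most_common():
    print(f"{name}: {count / total:.1%}")
```

Dividing each count by the total number of parameters gives the percentages quoted above.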

[Figure: Parameter types.jpg]

Parameter Properties

  • How often is each additional property used? (Name, Alias, Label, Description, Example, Default, Auto value)

Label and description are used significantly more often than anything else, except on de wiki where aliases are used most often.

  • How often do templates have values for both Example and Default?

Examples and defaults are not used very often. When used, typically one or the other is used, not both.

  • How often do Boolean parameter types have default values?

Booleans are not a large percentage of parameters (2% or less), but for those that exist even fewer use a default. On smaller wikis, this essentially never occurred.
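For the last two questions, the overlap between Example and Default and the share of Booleans with a default can be counted in one pass. The sketch below makes assumptions beyond what is stated here: null, empty strings and empty containers are all treated as "not provided", and the names `is_set` and `property_stats` are illustrative rather than taken from the notebook.

```python
import json
from collections import Counter

def is_set(value):
    """Treat null, empty strings and empty containers as 'not provided' (an assumption)."""
    return value not in (None, "", [], {})

def property_stats(jsonlines):
    """Count example/default overlap and boolean defaults, per parameter."""
    counts = Counter()
    for line in jsonlines:
        if not line.strip():
            continue
        params = json.loads(line).get("params") or {}
        for spec in params.values():
            counts["params"] += 1
            has_example = is_set(spec.get("example"))
            has_default = is_set(spec.get("default"))
            if has_example and has_default:
                counts["example_and_default"] += 1
            elif has_example:
                counts["example_only"] += 1
            elif has_default:
                counts["default_only"] += 1
            if spec.get("type") == "boolean":
                counts["boolean"] += 1
                counts["boolean_with_default"] += has_default
    return counts

# Hypothetical sample rows, not real wiki data.
sample = [
    '{"params": {"a": {"example": "Paris"}, "b": {"type": "boolean", "default": "0"}, "c": {"type": "boolean"}}}',
    '{"params": {"d": {"example": "x", "default": "y"}, "e": {"label": "E"}}}',
]
counts = property_stats(sample)
print(dict(counts))
```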

[Figure: Parameter properties.jpg]

Results
https://gitlab.com/adamwight/templatedata-stats/-/blob/master/notebooks/Parameter%20analysis.ipynb

Key Insights

  1. Specific parameter types are mostly not being used, especially the ones with special properties like those with autocomplete or the boolean with checkmark.
  2. The most common behavior seems to be to add a parameter, give it a label and description, and leave the rest of the properties as defaults.
  3. DE wiki has very non-standard behavior for almost every measurement when compared with the other three wikis, which all have very similar behavior just at different scales.
  4. When inserting a new template, it is very unlikely that a user would see a checkbox for a boolean value or an autocomplete dropdown for any value.
  5. Flexibility seems most valued when choosing parameter types. It's also possible that people are not aware of the special behaviors of specific parameter types.

Event Timeline

ECohen_WMDE set the point value for this task to 8. (Jul 28 2020, 7:54 AM)

Clarified wording and added a question about booleans specifically since they are the only parameter type where 2-dimensional data between the first two questions is useful.

Script to scan a dump for template invocations: P631

Script to find all templates using TemplateData: https://fr.wikipedia.org/wiki/Utilisateur:Salix_alba/TDList.js

Query to count both usages and note whether TemplateData is present: https://quarry.wmflabs.org/query/14837

Not exactly related, just leaving a breadcrumb for later: this template does amazing things, https://en.wikipedia.org/wiki/Template:Template_usage

I still haven't rediscovered the toolforge projects I know I've seen before. These are probably linked from the (private) templates discovery doc.

These are two related tools that I'm aware of, though I'm not sure there is anything useful there related to this specific task:
Vorlagenmeister
TemplateParamWizard

Added one more question to answer. I've realized that people mix up the usage of example and default often, so I'm curious how common it is for a template to have both of them, vs those that have only example or only default. I've also noticed that people often type the example into the description, though I doubt there is a way to easily calculate how often this is done versus using the available parameter property.

In T258924#6341793, @ecohen wrote:

I've also noticed that people often type the example into the description, though I doubt there is a way to easily calculate how often this is done versus using the available parameter property.

I can imagine a heuristic where we go through each parameter value usage and search the description for matching text. It'll give lots of false positives, which we can manually screen.
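A rough sketch of that heuristic, under my own assumptions: we already have the description text and a list of values observed for the parameter, matching is a case-insensitive substring search, and very short values are skipped to cut down on false positives. The function name and the `min_length` threshold are hypothetical.

```python
def values_in_description(description, observed_values, min_length=3):
    """Return the observed parameter values that appear verbatim in the description."""
    haystack = description.casefold()
    hits = []
    for value in observed_values:
        needle = value.strip().casefold()
        # Skip very short values such as "1" or "no", which would match almost anything.
        if len(needle) >= min_length and needle in haystack:
            hits.append(value)
    return hits

# Hypothetical description and usage values.
description = "Date of birth, e.g. 1990-05-17"
observed = ["1990-05-17", "17 May 1990", "no", ""]
print(values_in_description(description, observed))  # ['1990-05-17']
```

Any hit would still need the manual screening mentioned above, since a value can appear in a description for reasons other than being an example.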

This script parses out template invocations and stores the values provided to parameters. Note, the maintainers do not want to publicize the tool itself yet: https://persondata.toolforge.org/vorlagen/

In T258924#6344143, @awight wrote:
I can imagine a heuristic where we go through each parameter value usage and search the description for matching text. It'll give lots of false positives, which we can manually screen.

Thanks for proposing an option to get this information. After thinking over it some more though, I think an understanding of how often the example and default properties are used will tell us enough about the usage patterns. I think I'll leave the task as it currently is for now.

Canonical, parsed TemplateData is stored as a page prop! For comparison, VisualEditor uses this data source via the action=templatedata API, but the TemplateData editor itself is more fragile and can only operate on a raw <templatedata> block in the template or its doc subpage.

A query to fetch this data looks like,

select * from page_props where pp_propname = 'templatedata' limit 1;

Gotcha: the pp_value must be run through gzdecode() :-D

Basic loop to dump as jsonlines:

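// $pdo is assumed to be an open PDO connection to the wiki's database replica.
$result = $pdo->query("select pp_value from page_props where pp_propname = 'templatedata' order by pp_page");
foreach ( $result as $row ) {
    // pp_value is gzip-compressed JSON; print one decoded object per line.
    print gzdecode( $row['pp_value'] ) . "\n";
}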

I scrapped the paging: even enwiki has only ~18k rows of templatedata, and the query finishes in under 1 second.

Raw jsonlines dumps are uploaded to the gitlab repo, above.

There was one corrupt template, page ID 39991427. It's in somebody's sandbox, and the only problem is that its templatedata is stored uncompressed. Maybe a test of something, maybe an early adopter... doesn't matter.
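If someone wants to post-process raw `pp_value` blobs outside PHP, a tolerant decoder could handle that uncompressed row too. This is a sketch on the assumption that compressed values are ordinary gzip streams (which is what gzdecode() reads) and that anything without the gzip magic bytes is plain JSON; `decode_pp_value` is a made-up name.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def decode_pp_value(raw: bytes) -> dict:
    """Decode a templatedata page prop, whether gzip-compressed or stored as plain JSON."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw)

# Both forms decode to the same structure.
blob = b'{"params": {"1": {"type": "line"}}}'
print(decode_pp_value(gzip.compress(blob)) == decode_pp_value(blob))  # True
```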

@ECohen_WMDE
Clarification question: these counts will tell us how many parameters have each property described, but should we also take counts on a per-template level? For example, a template with 100 parameters that all include a default would add +100 to "parameters with default" but only +1 towards "templates using a default parameter". I feel like both of these are meaningful, do you agree?

awight updated the task description. (Show Details)
awight changed the point value for this task from 8 to 5.

reduced story points.

@awight I see your point about the difference. In which way is it currently being collected? I think I would be interested in the first. All of the attribute measurements should be on a parameter basis not on a template basis. For example, we would like to know how many parameters have a default value, not how many templates have at least one parameter with a default value. I also think at the moment, this is the only measurement where this question would apply. Or are there others that I'm missing?

As I've been going through I realized I have another question. Is it possible to quickly collect the total number of templates on each of the four wikis so we can also find the percentage of all templates with TemplateData?

In which way is it currently being collected?

The current data is counting the attributes per parameter. I agree this is the most common-sense way to analyze the statistics, but I'll explain briefly what I was thinking with regard to per-template distribution: As an artificial example, some high-attention, complex templates like Infobox might present 100 parameters with the "number" type, but meanwhile 100 other random templates have only "unknown" or "string" parameters. Strictly per-parameter statistics would make it look like ~half of all parameters are using the more sophisticated, specific type, but in practice these types are used in < 1% of the templates. I guess the numbers I'm imagining are "How likely is it that a template user will encounter attribute X".

Of course, counting per-template still wouldn't answer the question, we would need to know something about frequency of usage. And another big caveat: some very common templates will exclusively be added by bots, or are transparently transcluded from other templates, so are rarely encountered by humans even though the database count is high. We can't do anything about this yet, so let's ignore my question, I just wanted to show several reasons that per-parameter counts might not correspond well to what editors actually see.
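To make the gap between the two ways of counting concrete, here is the artificial example worked through in a few lines. The data is invented to mirror the scenario: one complex template with 100 "number" parameters, and 100 simple templates that I assume have a single "unknown" parameter each.

```python
# Invented data: one complex template plus 100 simple ones (one parameter each, assumed).
templates = {"Infobox": ["number"] * 100}
templates.update({f"Simple {i}": ["unknown"] for i in range(100)})

all_params = [t for types in templates.values() for t in types]

# Share of parameters that are typed "number".
per_parameter = all_params.count("number") / len(all_params)
# Share of templates that contain at least one "number" parameter.
per_template = sum("number" in types for types in templates.values()) / len(templates)

print(f"per-parameter: {per_parameter:.2%}")  # 50.00%
print(f"per-template:  {per_template:.2%}")   # 0.99%
```

Neither number says how often an editor actually meets a "number" field, which is the caveat about usage frequency raised above.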

Also on this tangent: maybe we should start recording the template name when a template insert dialog is opened?

As I've been going through I realized I have another question. Is it possible to quickly collect the total number of templates on each of the four wikis so we can also find the percentage of all templates with TemplateData?

For sure, I'll add those counts and some pie charts. Again, this might not be very meaningful until we can correlate actual usage frequency with TemplateData completeness, but it's a start...

Thanks for the extra explanation! I understand what you mean (incl the limitations it points to in how we can understand the data we have so far). I've attached my prelim analysis (very much a draft). It is helpful for me to visualize things to understand their meaning, so I made a few extra charts to compare the four. Page three, I think, is a good example of what you mean. Almost all parameters are optional - this could mean most templates have a single required parameter or a few templates have many required params (based on experience I think it's the first). At the same time, I think it tells us what we need to know - which is that most parameters need to be added to templates using the difficult-to-use dropdown menu since that's where optional parameters end up.

I think you're saying though that we can't separate out the stats for commonly used vs not commonly used because it's hard to count this. (Bots give some artificially high counts, and even then we don't know which are used in VE vs source code.) I like the idea to start recording the template name when the dialog is opened to see if there is any pattern to which types of templates VE is actually being used to edit, and how diverse. It might also be useful to know how many of the templates edited in VE have TemplateData or not since it can be used for both (with the second one auto-generating parameter names).

Also to note, my main high level takeaways are:

  1. My primary hypothesis is true - specific parameter types are mostly not being used, especially the ones with special properties like those with autocomplete or the boolean with checkmark.
  2. A surprise - de wiki has very non-standard behavior when compared with the other three wikis, which all have very similar behavior just at different scales.
  3. When inserting a new template, it is very unlikely that a user would see a checkbox for a boolean value.
  4. Examples and defaults are not used very often. When used, typically one or the other is used, not both.

I've attached my prelim analysis

Really inspiring report so far! I've removed the crude pie chart from the data collection script, it's redundant if there is technology from the future available. However, I'm so sorry about the PNG, you must have typed in half the numbers—I'll include plain text tables in the next revision. Or is there anything else I can do to make the data interchange easier? Do you want to apply for Jupyter access, for example?

At the same time, I think it tells us what we need to know - which is that most parameters need to be added to templates using the difficult to use dropdown menu since that's where optional parameters end up.

Right on, this looks like an important fix and is now well-justified :-)

see if there is any pattern to which types of templates VE is actually being used to edit
[....]
It might also be useful to know how many of the templates edited in VE have TemplateData or not since it can be used for both (with the second one auto-generating parameter names).

These are entasked as T259705: Log template metadata whenever a template dialog is opened. Good point that we can directly measure TemplateData availability rather than guessing during post-processing.

Cool! Glad you like it. Just trying to figure out what the most relevant insights are for us. And don't worry, it wasn't too many numbers! Otherwise I might have asked. And the charts were still helpful. Just felt I wanted to understand them in relation to each other. I'm not sure if I need access to Jupyter, is that what you're using? It seems separate?

Also, thanks for updating the notebook and adding in the total number of templates. The percentage with TemplateData is even lower than I expected, though I know there are many templates where adding it doesn't make any sense and it's hard to separate that out. I think if we're able to track how often a template without TemplateData is opened in VE, that will give a much better idea of what is 'missing.'

@awight Was going through the data and just noticed that there isn't a count for how often the description property is used. Is it possible to add that? Thanks!

@awight Was going through the data and just noticed that there isn't a count for how often the description property is used. Is it possible to add that? Thanks!

Good catch, that's an especially important one! The report is updated.

Updated task description with summary and key insights.

New questions have come up, which I think we can answer:

@awight Are these easy things to check since the data is already there? I don't think they are essential if it's a lot of extra work, but they're questions that have come up as these other topics have developed.

@awight Are these easy things to check since the data is already there?

For sure, this is an especially quick and easy dataset to process. I'll put this back into the new sprint for visibility.

We should pull in a multi-language wiki such as mediawiki.org or meta.wikimedia.org, to give this a fair chance. I'll include metawiki to start with.

Very interesting results! Again de wiki is an outlier with zero, but it seems the rest use it, and non-English wikis do so at a higher rate. Thanks for also adding meta wiki as a multi-language wiki for comparison - very helpful.

I looked into the VisualEditor source and can verify that there is zero difference between the types content, string, and unknown.

line behaves almost the same as well, with a subtle difference:

  • line doesn't allow linebreaks. But it allows switching to raw wikitext mode with that [[]] button. In this mode it behaves like the other types again, including linebreaks.
  • When a value in a line field already contains a linebreak, the type line is effectively ignored, and it behaves as the other types.
  • In a line field, very long values don't wrap. The text starts to scroll horizontally.

From this, I draw a few speculations:

  • Some might prefer line to block users from entering line breaks. They probably assume most inputs are short, or assume users can handle horizontal scrolling.
  • Some might avoid line because they consider the horizontal scrolling bad usability. They probably assume users typically do not enter linebreaks, or assume templates are fine with linebreaks.

Another speculation: suggested allows parameters to be visually split into 2 groups – a feature we already identified as being missing. All input fields for suggested parameters automatically show up when adding a new template. You don't need to manually pick every single one.

Minor observation: I was curious to see whether the database contains TemplateData for a template page and for its /doc subpage, and if these are redundant. I discovered they are not, the data is only attached to the template page and not its subpage.

When we did our TemplateData drives back in the day, we sorted out lists by how frequently the template was used, so I imagine the higher volume templates have more TemplateData on average.

If you want to get a feel for how often users encounter TemplateData you could weight these results by how often each template is used. We could also add tracking to the dialog to report if TemplateData was found.

Thanks for raising this point @Esanders ! Agree that this will give us a more useful idea of what percentage of templates 'needing' TemplateData already have it. As part of our investigation into findability and possibly sorting by 'popularity', @awight started investigating how to define this in this ticket T261112: Investigation: popular templates. Maybe there is a useful way to combine the two, since how often a template is transcluded is not necessarily a reflection of how often it's used, because of nested templates. Not sure if this overcomplicates it though. Weighting by use might be enough.
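Weighting by use could look roughly like the sketch below. It assumes we have, per template, some usage count and a flag for whether TemplateData exists; the field names and numbers are invented, and the `uses` column would inherit the nested-template caveat if it is filled from raw transclusion counts.

```python
def coverage(templates):
    """Return (unweighted, usage-weighted) share of templates that have TemplateData."""
    unweighted = sum(t["has_templatedata"] for t in templates) / len(templates)
    total_uses = sum(t["uses"] for t in templates)
    weighted = sum(t["uses"] for t in templates if t["has_templatedata"]) / total_uses
    return unweighted, weighted

# Invented numbers: one heavily used template with TemplateData, three rarer ones without.
templates = [
    {"name": "A", "uses": 9000, "has_templatedata": True},
    {"name": "B", "uses": 900, "has_templatedata": False},
    {"name": "C", "uses": 90, "has_templatedata": False},
    {"name": "D", "uses": 10, "has_templatedata": False},
]
unweighted, weighted = coverage(templates)
print(f"unweighted: {unweighted:.0%}, weighted by use: {weighted:.0%}")  # 25%, 90%
```

With data shaped like this, a low unweighted percentage can coexist with a high chance that the template a user opens has TemplateData.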

Great to see this discussion happening!

If you want to get a feel for how often users encounter TemplateData you could weight these results by how often each template is used. We could also add tracking to the dialog to report if TemplateData was found.

Yes, that's an excellent idea. We're hoping to collect data for what you suggested, and a few more metrics, under T258917: Record template use and dialog interaction metrics.