Page MenuHomePhabricator

Please make me a list of commonly used templates in Wikipedia articles
Closed, ResolvedPublic0 Story Points

Description

Here's what I really want to know: If I'm doing fairly typical editing work in VisualEditor, and I insert a template, how likely is it that I'm going to get (some, ideally complete, correct, and useful) TemplateData? And if the answer is anything other than "Very high likelihood of good TemplateData existing", then what are the templates that should most urgently be fixed?

I imagine that the result from this task will be a list of the top 1000 directly transcluded templates, or the top 10%, or templates that are used more than some sensible amount of times, or something like that. I think that the main problem will be getting the templates that editors insert into the mainspace, and not the meta-templates like {{ambox}} and {{navbox}}.

I would like the list to indicate which ones have TemplateData (either on the template page or on a transcluded subpage, like /doc). I assume that it's not possible to figure out which ones have complete TemplateData.

I need this for en.wp (the sooner, the better). I would also like to have this for another dozen large communities (de, es, fr, ja, ru, it, zh, pt, tr, nl, pl, ar). At most of these communities, if I can get the list, I can get people to add the TemplateData.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Neither; Quim is pointing to something that doesn't give the information that's wanted, unfortunately (I know because I generated it). I'm working on this tomorrow, but I'm trying to optimise the query before unleashing it on 20 databases.

@Rdicerb: This has nothing to do with the Qualtrics survey. This has to do with the inconvenience of adding a template in VisualEditor if TemplateData doesn't exist on the template that you want to add. It's a basic usability problem that can be largely solved if we can figure out which templates are (still) missing TemplateData.

Okay, this is without a doubt one of the slowest and most inefficient database structures _possible_ for asking for this data. I have no idea if gathering it is even viable, let alone how long it'll take. I'll keep trying but I'm not promising that I won't come back with "this query is so slow we lost the connection to the server three times before it finally gave up, I quit forever".

Thank you for the reality check @Ironholds - standing by

@Ironholds any update on this? Curious.

As I reported to Rachel just two days ago, I've been in wall-to-wall S&D work this week and haven't had a time to work on these tasks. If you're interested in getting them done more quickly I'd suggest either formally making a pitch to Wes for some of my time or accelerating James's hiring of dedicated VE data analysts (I'm helping with that process too)

OK, good to know - thanks! (And obviously she's the right one to talk to about this stuff :)

Hmmm - had a quick chat with @Jdforrester-WMF and he mentioned this kind of query frankly could take several days. He suggested perhaps the list of templates transcluded on most pages might be a helpful place to list to have updated (it hasn't been updated since 2013, but it might be possible to have refreshed. It might not be perfect, but might it help, somewhat?

That we could update terribly conveniently - but it won't note which ones have TemplateData entries :/.

Elitre added a comment.May 1 2015, 6:52 PM

Which, incidentally, is what I need to know for several of those wikis.

Rdicerb added a comment.EditedMay 6 2015, 10:30 PM

Is there a 50% solution within this? Perhaps it's somewhat effective to have a 50% solution than a 0% solution, isn't it? (this is my suggesting that the templates transcluded on most pages, while not perfect, might be something? However, if it would cause more work and frustration (which, er, it very well could), it sounds like we might get stuck in 0% land?)

That would be my question too. @Elitre when you said "that is what I need to know", did you mean you needed to know the top templates, or you needed to know the templatedata presence?

Whatamidoing-WMF added a comment.EditedMay 7 2015, 7:12 PM

https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Templates_transcluded_on_the_most_pages is pretty much what we DON'T want. We don't want a list of meta templates (90% of the first 50 in that list). We don't want a list of talk-page templates (all of those "WPBannerMeta" ones, and several others as well).

We want templates whose names are directly in the wikitext in the mainspace. This means that we want Template:Tumors on the list (if 72 transclusions is enough to make the list), but not Template:Navbox, which is inside Template:Tumors.

You shouldn't be going to Insert > Template and typing in "Navbox". Therefore, TemplateData is superfluous for {{Navbox}}. But you might go to to Insert > Template and type in "Tumors". Therefore we do want to know whether {{Tumors}} has TemplateData (it doesn't).

That doesn't actually answer the question.

Elitre added a comment.May 7 2015, 7:35 PM

Ol, I was talking about TD :)
Templates in the list you'll give us should not be in https://en.wikipedia.org/wiki/Category:TemplateData_documentation (which is a thing at some wikis, but not at all of them).

Gotcha. Then yeah, the 50% solution does not work :(

Hmmm, this is an interesting enough request that I might be willing to do it. You've piqued my curiosity, anyway. Are we offering any sort of prize for the report (for one language)? Or playing code golf perhaps? ;-) Just looking for a way to make this fun.

Keegan added a subscriber: Keegan.EditedMay 8 2015, 4:43 AM

Hmmm, this is an interesting enough request that I might be willing to do it. You've piqued my curiosity, anyway. Are we offering any sort of prize for the report (for one language)? Or playing code golf perhaps? ;-) Just looking for a way to make this fun.

I will donate a crisp two-dollar bill from my pocket to Sam Nerk.

The one that hasn't been updated since 2013 is indeed the same as the regular database report that superseded it, yes.

https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Templates_transcluded_on_the_most_pages is pretty much what we DON'T want.

I know its not what you want, but isn't that report the exact same as
https://en.wikipedia.org/wiki/Special:MostTranscludedPages

Looking at https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Templates_transcluded_on_the_most_pages/Configuration, yeah, it's pretty close. Expensive Special pages, particularly on the larger wikis such as the English Wikipedia, used to be disabled, I believe. I don't think these types of reports have gotten any cheaper to generate, we're just more willing to throw (production) hardware at them every few days now, as I understand it.

Or playing code golf perhaps? ;-) Just looking for a way to make this fun.

At 637 points, my entry is:

select a.tl_title,count(a.tl_title) 'm',(select count(*) from templatelinks b join page e on e.page_id=b.tl_from join templatelinks c on c.tl_namespace=e.page_namespace and c.tl_title=e.page_title and c.tl_from_namespace=0 where b.tl_namespace=a.tl_namespace and b.tl_title=a.tl_title and b.tl_from_namespace>0 and d.page_id!=b.tl_from)'o' from templatelinks a join page d on d.page_namespace=a.tl_namespace and d.page_title=a.tl_title left join page_props on pp_page=a.tl_from and pp_propname='templatedata' where a.tl_namespace=10 and a.tl_from_namespace=0 and pp_page is null group by 1 order by count(a.tl_title)-o desc limit 500;

(Oh wait, did you mean you wanted something that would actually finish this century)

Or playing code golf perhaps? ;-) Just looking for a way to make this fun.

At 637 points, my entry is:

select a.tl_title,count(a.tl_title) 'm',(select count(*) from templatelinks b join page e on e.page_id=b.tl_from join templatelinks c on c.tl_namespace=e.page_namespace and c.tl_title=e.page_title and c.tl_from_namespace=0 where b.tl_namespace=a.tl_namespace and b.tl_title=a.tl_title and b.tl_from_namespace>0 and d.page_id!=b.tl_from)'o' from templatelinks a join page d on d.page_namespace=a.tl_namespace and d.page_title=a.tl_title left join page_props on pp_page=a.tl_from and pp_propname='templatedata' where a.tl_namespace=10 and a.tl_from_namespace=0 and pp_page is null group by 1 order by count(a.tl_title)-o desc limit 500;

(Oh wait, did you mean you wanted something that would actually finish this century)

Actually, let me modify that to be:

select a.tl_title,count(a.tl_title) 'm',(select count(*) from templatelinks b join page e on e.page_id=b.tl_from join templatelinks c on c.tl_namespace=e.page_namespace and c.tl_title=e.page_title and c.tl_from_namespace=0 where b.tl_namespace=a.tl_namespace and b.tl_title=a.tl_title and b.tl_from_namespace>0 and d.page_id!=b.tl_from)'o' from page d join templatelinks a on d.page_namespace=a.tl_namespace and d.page_title=a.tl_title left join page_props on pp_page=a.tl_from and pp_propname='templatedata' where a.tl_from_namespace=0 and pp_page is null and d.page_namespace=10 group by 1 order by count(a.tl_title)-o desc limit 500;

(still 637 points)

Keegan added a comment.EditedMay 9 2015, 4:42 AM

@MZMcBride, @Bawolff - so we've established this is theoretically possible to query. Is it actually possible to query in a manner that is worthwhile considering expense of time and performance? I think taking ~3-7 days is reasonable for a one-off request, but if it's going to be longer than that...

MZMcBride has an alternative plan which will probably give more accurate results

Keegan added a comment.May 9 2015, 6:33 AM

MZMcBride has an alternative plan which will probably give more accurate results

Wonderful. Thanks!

btw, initial results of my query (I optimized it a bit. Still super slow, but I got results just for the A's) end up with templates that have no parameters. Does templatedata need to be added if there are 0 parameters?

Some examples I got so far:

Aviation_lists
Amg_movie
Aut
Aircontent
Ahnentafel-compact5
Ann
Ahnentafel_top

It should be emphasized that the query I made is more getting 90% of what you're asking for. It will probably not be precisely right but miss some templates, and rank some templates inappropriately.

@MZMcBride, @Bawolff - so we've established this is theoretically possible to query. Is it actually possible to query in a manner that is worthwhile considering expense of time and performance? I think taking ~3-7 days is reasonable for a one-off request, but if it's going to be longer than that...

I don't think you can use templatelinks for this, so the information that WhatamIdoing-WMF is after can't be queried directly with the replicated databases. I think you need to do a dump scan. Fortunately, most of the time involved in dump scans is either downloading the dump or doing a linear scan of it. Generally the programming time is less than the time it would take to download and read through the dump. In theory, anyway!

It's not really helpful to throw out estimates like "~3-7 days" unless you both qualify whose time you're measuring in (Keegan's? mine? bawolff's?) and what time you're measuring (business days? calendar days? 24-hour days?). And there's no clear breakdown of what the time would consist of (waiting around for processes to complete? actively coding? testing?). Some versions of the estimate you provided ("~3-7 days") would be enough for some people to redo the templatelinks infrastructure altogether in MediaWiki, while in other scenarios it would just result in several thousand browser tabs being opened.

Krenair added a subscriber: Krenair.May 9 2015, 2:22 PM

I don't know what the deal is with tool labs. Now its completing super quickly, particularly when I split it up into grops.

Anyways, my version of the report is split alphabetically into 4 groups:
*https://tools.wmflabs.org/bawolff/templatedata!-C.txt
*https://tools.wmflabs.org/bawolff/templatedataC-D.txt (Did you know, 1 out of 6 things in template namespace start with the letter C.)
*https://tools.wmflabs.org/bawolff/templatedataD-N.txt
*https://tools.wmflabs.org/bawolff/templatedataN-z.txt

These should be considered the best I can do using the db. MZMcbride's method should yield much better results. First column is template name (always template namespace). Second column is how many links to main namespace (both direct and indirect). Third column is a rough estimate of how many indirect links to main namespace. You want results that have a high number in second column but a low number in third

mzmcbride@tools-bastion-01:/data/scratch/dumps/enwiki/20150403$ time wget "https://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-meta-current.xml.bz2"
--2015-05-09 14:22:36--  https://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-pages-meta-current.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.11, 2620:0:861:1:208:80:154:11
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23036371658 (21G) [application/octet-stream]
Saving to: ‘enwiki-20150403-pages-meta-current.xml.bz2’

100%[===================================>] 23,036,371,658 1.98MB/s   in 3h 23m

2015-05-09 17:46:02 (1.80 MB/s) - ‘enwiki-20150403-pages-meta-current.xml.bz2’ saved [23036371658/23036371658]


real    203m26.209s
user    1m37.611s
sys     3m57.184s

The dump is now downloaded.

My reading of this task is that you're basically asking for "templates by number of invocations in the article namespace" (which you call "directly transcluded"). So {{cite news}} and other reference templates that are often used in articles would probably be really high in this list. Templates such as {{ambox}} would be really low in the list (a few weird articles probably call {{ambox}} directly, but probably not more than a 100 articles... I guess we'll find out!).

After thinking about this request for a bit, here are my thoughts, specific to the English Wikipedia:

  • this will probably result in a lot of "stub", "R from [X]", "disambig", and "infobox" templates in the top results, as those are common to many, many articles
  • it's not clear whether redirects and disambiguation pages should be included in the scan
  • template invocations are often made using redirects or all lowercase spelling (e.g., {{SCOTUSCase instead of {{Infobox SCOTUS case or {{cite news instead of {{Cite news), so title normalization may need to be accounted for

If we ignore these considerations, it's a fairly simple report to generate. Accounting for these considerations adds complexity.

mzmcbride@tools-bastion-01:/data/scratch/dumps/enwiki/20150403$ time openssl dgst -md5 enwiki-20150403-pages-meta-current.xml.bz2 
MD5(enwiki-20150403-pages-meta-current.xml.bz2)= a6187fbd93dfa44876dbe02e22180514

real	20m55.525s
user	0m54.704s
sys	0m14.920s

Which matches the value from https://dumps.wikimedia.org/enwiki/20150403/enwiki-20150403-md5sums.txt. Nice.

jeremyb-phone set Security to None.May 10 2015, 2:58 AM
jeremyb-phone added a subscriber: jeremyb.
Bawolff added a comment.EditedMay 10 2015, 3:24 AM

So the final combined version of the results of my query is at https://tools.wmflabs.org/bawolff/templatedata.htm . I'm very curious as how the results will compare to what MzMcbride is doing.

For reference, the final query I used was:

set wait_timeout=172800; SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; select /* SLOW_OK */ a.tl_title,count(a.tl_title) 'm',(select count(*) from templatelinks b join page e on e.page_id=b.tl_from join templatelinks c on c.tl_namespace=e.page_namespace and c.tl_title=e.page_title and c.tl_from_namespace=0 where b.tl_namespace=a.tl_namespace and b.tl_title=a.tl_title and b.tl_from_namespace>0 and d.page_id!=b.tl_from)'o' from page d join templatelinks a on d.page_namespace=a.tl_namespace and d.page_title=a.tl_title left join page_props on pp_page=d.page_id and pp_propname='templatedata' where a.tl_from_namespace=0 and pp_page is null and d.page_namespace=10 and d.page_title BETWEEN 'N' AND 'z'
and d.page_id NOT IN (543510, 568524, 577224, 586123, 689990, 692091, 811789, 827297, 828359, 843603, 891845, 942370, 942373, 945764, 946440, 973377, 976819, 995954, 1011446, 1021476, 1090321, 1142901, 1156461, 1169824, 1185518, 1188537, 1208356, 1239772, 1272240, 1275099, 1346385, 1406921, 1430094, 1432586, 1440745, 1459859, 1487430, 1567462, 1627975, 1664277, 1695548, 1721254, 1780010, 1808502, 1875763, 1938204, 1960394, 2048255, 2048472, 2075417, 2095006, 2123757, 2159001, 2219293, 2234172, 2238249, 2352275, 2372168, 2379594, 2385304, 2392235, 2468023, 2537558, 2557713, 2624944, 2744143, 2884643, 2966417, 2981784, 3048892, 3049117, 3049121, 3049127, 3164016, 3218295, 3218301, 3218306, 3233169, 3382507, 3431775, 3483283, 3501055, 3503011, 3529988, 3730177, 3797227, 3883451, 3883454, 4086375, 4148498, 4215041, 4292385, 4321630, 4332114, 4466805, 4478857, 4497810, 4592538, 4651605, 4740319, 4788776, 4816819, 4844043, 4858813, 4859929, 4920565, 5117873, 5154028, 5210580, 5219693, 5277509, 5295636, 5347876, 5591450, 5632010, 5660335, 5705757, 5762361, 5837249, 5902435, 5916135, 5994763, 6007199, 6594285, 6658694, 6707647, 6798076, 7233408, 7256413, 7342482, 7437205, 7467901, 7506837, 7581587, 7585648, 7599646, 7663734, 7736359, 7855881, 7890381, 8005706, 8189098, 8387047, 8488620, 8592093, 8623434, 8719208, 8763352, 8819754, 8874317, 8878970, 8886607, 8905177, 8926619, 8953176, 8953180, 9033455, 9172159, 9363453, 9378150, 9391283, 9538337, 9599636, 9620042, 9645242, 9682137, 9794950, 9868228, 9950598, 10023179, 10118245, 10144484, 10160967, 10409927, 10418232, 10511155, 10517961, 10696542, 10825440, 10891410, 10964158, 10981056, 11437907, 11437931, 11656899, 12033623, 12140399, 12148853, 12162599, 12546911, 12549672, 12679552, 12973994, 13179742, 13219644, 13219896, 13319244, 13358059, 13373859, 13431031, 13505276, 13519481, 13529042, 13594145, 13676323, 13716067, 13777826, 13984942, 13995840, 14037603, 14038656, 14120502, 14222047, 14314129, 15560851, 15682776, 15682790, 15682792, 15825878, 16176610, 16176634, 16176890, 16178273, 16451497, 16635679, 16639086, 16862406, 16909074, 16990712, 17006843, 17006857, 17102136, 17171798, 17219316, 17220251, 17220513, 17433896, 17522403, 17586236, 17595576, 18244846, 18284416, 18311709, 18728632, 18788148, 19027526, 19086232, 19090147, 19121556, 19375848, 19662365, 19849438, 20167758, 20207957, 20345543, 20467664, 20488681, 20504849, 20505184, 20709219, 20980600, 20999385, 21044097, 21067859, 21073188, 21120401, 21140240, 21161485, 21198408, 21250838, 21401490, 21418395, 21520104, 21651755, 21696393, 21706434, 21725841, 21773865, 21815705, 21843384, 21876873, 21911302, 21949014, 22135277, 22235001, 22251764, 22255088, 22255387, 22278348, 22281215, 22281698, 22322043, 22331658, 22589696, 22598546, 22637691, 22663708, 22663831, 22668599, 22962446, 23041294, 23062402, 23092479, 23143954, 23208067, 23226218, 23298190, 23313287, 23327809, 23660785, 23810672, 24067846, 24086003, 24238435, 24382166, 24389988, 24433394, 24480860, 24755203, 24845882, 24961578, 25071298, 25084818, 25180967, 25297702, 25371621, 25520545, 26061495, 26108914, 26561413, 26822887, 27150839, 27169626, 27177567, 27199162, 27238919, 27246435, 27461703, 27486458, 27566674, 27579816, 27661849, 27662093, 27707370, 28130201, 28372034, 28637971, 28710486, 28991697, 29133768, 29216387, 29216399, 29263647, 29471370, 29644322, 29780699, 29848371, 30505076, 30858554, 31103313, 31509679, 31608179, 32357517, 33172134, 33825240, 34017846, 34647240, 34652660, 34701984, 36169224, 37309562, 37516233, 37858377, 38040101, 38479070, 38650247, 39277816, 39363460, 39371122, 40018205, 40797933, 41732321, 42217688, 43261306, 44066567, 44221346, 44988845, 46273068, 46273076, 46311214 )
 group by 1 having count(a.tl_title) > 2000 order by count(a.tl_title)-o desc limit 5000;

p.s. Well this query is extremely slow for all of wikipedia, dividing up the templates in 4 groups and recombining is quite fast (on the order of about 7 minutes). I don't know if it hits some limits where a temp table has to be written to disk or what triggers the slowness, but as long as the query is divided up, it is very feasible

The script I used is at P631. It's pretty bad code, but it mostly gets the job done.

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time ./templateinvocations.py
[...]
real    180m21.028s
user    178m9.952s
sys     0m28.462s

This generated an output file that was 1000726462 bytes (about 955 MB). This is just a plain text file with every "template" from every article (page namespace == 0) in the dump on its own line. There are a few false positives such as {{DEFAULTSORT:}}, but they don't matter for our purposes.

I ran the following command to aggregate the template invocations:

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time sort templates-by-invocations-in-articles-enwiki-20150403.txt | uniq -c | sort -n | tail -2000
[...]
real	4m10.340s
user	9m23.579s
sys	0m11.717s

The results are available at P632. The final step (for a single wiki) would be to figure out how to answer "does this template have TemplateData?" programmatically and then mark the templates from P632 as having or not having TemplateData, I suppose.

Hmm, one interesting thing. That list has:
​ 62460 Cr

But select count(*) from templatelinks where tl_title = 'Cr' and tl_namespace = 10 limit 4;

count(*)
2169

Hmm, one interesting thing. That list has:
​ 62460 Cr
But select count(*) from templatelinks where tl_title = 'Cr' and tl_namespace = 10 limit 4;

count(*)
2169

Oh. I bet the difference is if the template is used multiple times on a single page, whether that counts once per page or once per use.

Hmm, one interesting thing. That list has:
​ 62460 Cr
But select count(*) from templatelinks where tl_title = 'Cr' and tl_namespace = 10 limit 4;

count(*)
2169

Oh. I bet the difference is if the template is used multiple times on a single page, whether that counts once per page or once per use.

Right. :-) If we look at a template such as {{reflist}} which is typically used only once on a page, the report at P632 says:

3289681 Reflist

Compare with this data, from about a month later (the dump is from April 2015 and we're now in May 2015):

MariaDB [enwiki_p]> select count(*) from templatelinks join page on tl_from = page_id where tl_namespace = 10 and tl_title = 'Reflist' and page_namespace = 0;
+----------+
| count(*) |
+----------+
|  3283504 |
+----------+
1 row in set (1 min 15.91 sec)

This isn't to say that the script in P631 doesn't have false positives (DEFAULTSORT and #tag:ref) and plenty of goofiness (<!-- comments -->, etc.). This is what happens when you use a silly regex instead of a proper parser such as mwparserfromhell.

After thinking about this request for a bit, here are my thoughts, specific to the English Wikipedia:

  • this will probably result in a lot of "stub", "R from [X]", "disambig", and "infobox" templates in the top results, as those are common to many, many articles
  • it's not clear whether redirects and disambiguation pages should be included in the scan
  • template invocations are often made using redirects or all lowercase spelling (e.g., {{SCOTUSCase instead of {{Infobox SCOTUS case or {{cite news instead of {{Cite news), so title normalization may need to be accounted for

If we ignore these considerations, it's a fairly simple report to generate. Accounting for these considerations adds complexity.

It's not my request, but from my understanding leaving out redirects and disambiguation pages might be okay. It's up to @WhatamIdoing.

  • I definitely want a lot of popular stub templates, etc., so long as I'm getting {{hospital-stub}} and not {{asbox}} (what {{hospital-stub}} is built from). If a reasonable editor would actually add it to any page in the mainspace, then I want it counted.
  • Please feel free to include redirects and dab pages. (I assume that this is the easiest option.)
  • I'd rather have the counts "unified", if that makes sense (and is reasonably possible). I want to hear that there are a million instances of Template:Citation needed, not a quarter million each of {{fact}}, {{Fact}}, {{citation needed}}, and {{Citation needed}}.
  • Please feel free to include redirects and dab pages. (I assume that this is the easiest option.)

Okay, all pages in namespace 0 were included in the scan of enwiki-20150403-pages-meta-current.xml.bz2.

  • I definitely want a lot of popular stub templates, etc., so long as I'm getting {{hospital-stub}} and not {{asbox}} (what {{hospital-stub}} is built from). If a reasonable editor would actually add it to any page in the mainspace, then I want it counted.
  • I'd rather have the counts "unified", if that makes sense (and is reasonably possible). I want to hear that there are a million instances of Template:Citation needed, not a quarter million each of {{fact}}, {{Fact}}, {{citation needed}}, and {{Citation needed}}.

Can you take a look at P632 and let me know your thoughts? It's the first 2,000 results in the list. It includes a number of stub templates (e.g., Spilomelinae-stub).

It's possible to merge redirects with their targets, it just requires another small script. I'm guessing you'd build a dictionary of every redirect --> target pairing and then simply iterate through the 955 MB text file and replace any redirect title with the target title.

This looks like a fine list of templates. Some already have TemplateData (usually on the /doc pages, like https://en.wikipedia.org/wiki/Template:Unreferenced#TemplateData).

It's possible to merge redirects with their targets, it just requires another small script. I'm guessing you'd build a dictionary of every redirect --> target pairing and then simply iterate through the 955 MB text file and replace any redirect title with the target title.

If by "you", you mean "someone who is not me", then that sounds fine. ;-)

What sort of cutoff do you think would be most sensible? The top few-thousand templates, or those that are used more than 1,000 times, or ??? Presumably it should be something less than every single template at the English Wikipedia. (I don't have strong opinions. The results need to be manageable for the volunteers who will be working on this. You can chooes whatever you believe will be best.)

How difficult would it be to get one for another wiki (the Spanish Wikipedia, for example)?

Also, @MZMcBride, could you change the format for the list?

1855 Infobox baseball biography
1856 Smallsup
1858 NASDAQ
1859 Mountain elev row

(with a plain space between the number and the name of the template) is hard to turn into what I ultimately need on wiki, which is something like:

* 1855 {{tl|Infobox baseball biography}}
* 1856 {{tl|Smallsup}}
* 1858 {{tl|NASDAQ}}
* 1859 {{tl|Mountain elev row}}

A simple search-and-replace operation won't be able to tell the difference between the spaces in "Infobox baseball biography" and the space between the count and the template name. A tab or any sort of separator that isn't present in the names of the templates would work.

Oh, and one more thing I noticed, more of an oddity that would be difficult for handling redirects: Please search through the list for the "Persondata" entries. It's picked up the code for hidden HTML comments, like this:

793692 Persondata <!-- Metadata: see [[Wikipedia:Persondata]]. -->

How difficult would it be to get one for another wiki (the Spanish Wikipedia, for example)?

I tried to include all of the relevant information needed to run this report in this task (or at least linked from it). Basically you would download a dump of the Spanish Wikipedia and run this script on it and then sort the results.

A simple search-and-replace operation won't be able to tell the difference between the spaces in "Infobox baseball biography" and the space between the count and the template name. A tab or any sort of separator that isn't present in the names of the templates would work.

The word you're struggling for here is "delimiter." :-) The counts are coming from the use of uniq -c. I guess I could do a bit of additional post-processing to add a delimiter... though any decent text editor can do this. You just search for strings that are like ^[ ]*(\d+) (.+?)$ and replace with \1\t\2 or something like that.

Oh, and one more thing I noticed, more of an oddity that would be difficult for handling redirects: Please search through the list for the "Persondata" entries. It's picked up the code for hidden HTML comments, like this:

793692 Persondata <!-- Metadata: see [[Wikipedia:Persondata]]. -->

Search through the list and then do what?

The counts are coming from the use of uniq -c. I guess I could do a bit of additional post-processing to add a delimiter... though any decent text editor can do this. You just search for strings that are like ^[ ]*(\d+) (.+?)$ and replace with \1\t\2 or something like that.

I updated P632 to use tab delimiters.

Seriously, if you download TextWrangler or some other decent text editor and play around with using find and replace with regular expressions, it will immeasurably improve your life. I promise!

If by "you", you mean "someone who is not me", then that sounds fine. ;-)

That's the spirit.

What sort of cutoff do you think would be most sensible? The top few-thousand templates, or those that are used more than 1,000 times, or ??? Presumably it should be something less than every single template at the English Wikipedia. (I don't have strong opinions. The results need to be manageable for the volunteers who will be working on this. You can chooes whatever you believe will be best.)

I chose 2000 arbitrarily. It makes no difference to me what limit is used.

How difficult would it be to get one for another wiki (the Spanish Wikipedia, for example)?

I tried to include all of the relevant information needed to run this report in this task (or at least linked from it). Basically you would download a dump of the Spanish Wikipedia and run this script on it and then sort the results.

Here's a demo.

The Spanish Wikipedia is considerably smaller than the English Wikipedia.

mzmcbride@tools-bastion-01:/data/scratch/dumps/eswiki/20150518$ time wget "https://dumps.wikimedia.org/eswiki/20150518/eswiki-20150518-pages-meta-current.xml.bz2"
--2015-06-06 01:52:01--  https://dumps.wikimedia.org/eswiki/20150518/eswiki-20150518-pages-meta-current.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.11, 2620:0:861:1:208:80:154:11
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2995463505 (2.8G) [application/octet-stream]
Saving to: ‘eswiki-20150518-pages-meta-current.xml.bz2’

100%[====================================>] 2,995,463,505 1.88MB/s   in 26m 2s

2015-06-06 02:18:04 (1.83 MB/s) - ‘eswiki-20150518-pages-meta-current.xml.bz2’ saved [2995463505/2995463505]

real    26m2.808s
user    0m10.465s
sys     0m25.138s

Comparing with https://dumps.wikimedia.org/eswiki/20150518/eswiki-20150518-md5sums.txt, we want 8451eb4b0aa2063aa414d4143478ddeb. Let's see what we get:

mzmcbride@tools-bastion-01:/data/scratch/dumps/eswiki/20150518$ time openssl dgst -md5 eswiki-20150518-pages-meta-current.xml.bz2
MD5(eswiki-20150518-pages-meta-current.xml.bz2)= 8451eb4b0aa2063aa414d4143478ddeb

real    2m34.054s
user    0m6.785s
sys     0m1.930s

Looks good to me! Though this has already taken about 30 minutes. If there were a reliable place for database dumps on Tool Labs (reliable meaning that the dumps are already downloaded there and have been checked for corruption), we could save ourselves a fair bit of time.

Now it's just a matter of tweaking this templateinvocations.py script (from P631) and re-running it.

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time python templateinvocations.py
[...]
real    20m14.155s
user    20m8.533s
sys     0m2.677s

(Yes, the directory for this script is located at ~/scripts/enwiki, but we're actually processing a dump of eswiki. Naming things is hard.)

Processing a dump of the Spanish Wikipedia in about 21 minutes is pretty good time. Next we take this list and run it through the command mentioned above...

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time sort templates-by-invocations-in-articles-eswiki-20150518.txt | uniq -c | sort -n | tail -2000
[...]
real	0m18.903s
user	0m45.870s
sys	0m2.323s

And finally we do the bit of post-processing using find and replace in a text editor (adding tab delimiters to each entry). This gives us P736.

@Elitre: how does the list at P736 look to you?

Could you or @Whatamidoing-WMF repeat this process for the other wikis?

Elitre added a comment.Jun 8 2015, 9:00 AM

Thanks again for your help.
That list includes default sort keys, I think, and a few other weird things - I'll list some examples here, HTH!
line 18, {5 ;
line 153, #if:{{{ancho ;
line 192, Formatnum:{{#expr: ;
line 207, <includeonly>subst:</includeonly>CURRENTDAY ;
line 700, Ficha de escritor <!-- For more information see [[:Template:Infobox Writer/doc]]. --> .

Oh, and one more thing I noticed, more of an oddity that would be difficult for handling redirects: Please search through the list for the "Persondata" entries. It's picked up the code for hidden HTML comments, like this:

793692 Persondata <!-- Metadata: see [[Wikipedia:Persondata]]. -->

Search through the list and then do what?

And then remove or filter the <!-- hidden comments -->, assuming that the presence of such strings is going to interfere with merging redirects. (I don't understand why it treats "{{template}}<!-- comment -->" as the name of a template in the first place.) Having said that, if that's more than an hour's work, then it might not be worth it. There aren't that many, and they could be fixed manually in each list if necesary.

BTW, I think it might be necessary to resolve redirects before checking for the presence of TemplateData, because there is (presumably) no TemplateData at [[Template:Fact]], but there is at [[Template:Citation needed]].

Updated lists for nl, es and hy wikis, per request:

@Jdforrester-WMF , the list for es.wiki also displays weird stuff as outlined in my comment above (see for example line 12 or 22).
Is it worth fixing and re-running the query? Thanks.

Elitre added a comment.Aug 7 2015, 4:46 PM

Ok, actually all the last comments by me were directed at @MZMcBride :)

Jdforrester-WMF triaged this task as Low priority.Sep 17 2015, 6:30 PM

@Elitre, are you still looking for data here, or is this more or less resolved?

(Sorry for all the bugspam! I'm just trying to groom my backlog :)

It isn't resolved, but looks like MZ isn't going to work on it anymore, so we can just declare it closed.
If anyone's interested in lists of templates with TemplateData,
@SalixAlba's got a script which will generate them.

Jdforrester-WMF closed this task as Resolved.Nov 11 2015, 3:23 PM
Jdforrester-WMF claimed this task.
Jdforrester-WMF moved this task from Backlog to Doing on the TemplateData board.Jan 15 2016, 9:21 PM
Restricted Application added a project: VisualEditor. · View Herald TranscriptDec 15 2016, 4:27 PM

I've run a modified version if IKhitron's query on en-wiki The query is at https://quarry.wmflabs.org/query/14837 and the results are at https://en.wikipedia.org/wiki/Wikipedia_talk:TemplateData#Top_templates_with_no_template_data.

To get the query to run in under 30 min I had to split it into separate initial letters.

Yes, enwiki can be problematic for such kind of queries because of time limit. Other wikis have no problem - I ran the query in about 15 seconds. Maybe enwiki should use it in downloaded sql tools.

Jdforrester-WMF set the point value for this task to 0.
Neil_P._Quinn_WMF raised the priority of this task from Low to Needs Triage.Mar 30 2018, 10:38 AM
Neil_P._Quinn_WMF moved this task from Backlog to Radar on the Contributors-Analysis board.
Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptMar 30 2018, 10:38 AM