Let's audit the uses of inline styling in the main namespace on the English Wikipedia!
Event Timeline
tools.mzmcbride@tools-bastion-01:/data/scratch/dumps/enwiki/20151002$ time wget "https://dumps.wikimedia.org/enwiki/20151002/enwiki-20151002-pages-meta-current.xml.bz2"
--2015-10-12 01:02:40-- https://dumps.wikimedia.org/enwiki/20151002/enwiki-20151002-pages-meta-current.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.11, 2620:0:861:1:208:80:154:11
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24035886897 (22G) [application/octet-stream]
Saving to: ‘enwiki-20151002-pages-meta-current.xml.bz2’
100%[===================================>] 24,035,886,897 1.99MB/s in 3h 18m
2015-10-12 04:20:41 (1.93 MB/s) - ‘enwiki-20151002-pages-meta-current.xml.bz2’ saved [24035886897/24035886897]
real 198m1.127s
user 1m28.630s
sys 3m42.090s

tools.mzmcbride@tools-bastion-01:/data/scratch/dumps/enwiki/20151002$ time openssl dgst -md5 enwiki-20151002-pages-meta-current.xml.bz2
MD5(enwiki-20151002-pages-meta-current.xml.bz2)= 358832ae3dd2737e7d053467d12e4e22
real 3m44.017s
user 0m55.471s
sys 0m17.184s
https://dumps.wikimedia.org/enwiki/20151002/enwiki-20151002-md5sums.txt says:
358832ae3dd2737e7d053467d12e4e22 enwiki-20151002-pages-meta-current.xml.bz2
So we're good to go!
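For anyone who would rather script the checksum comparison than eyeball it, here is a minimal Python sketch. The helper names `md5_of_file` and `expected_md5` are made up for illustration, and it assumes the md5sums file has one "hash, whitespace, filename" entry per line, as in the line quoted above.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_md5(sums_text, filename):
    """Look up a filename in md5sums-style text; return None if it is absent."""
    for line in sums_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == filename:
            return parts[0]
    return None
```

Comparing `md5_of_file(path)` with `expected_md5(sums_text, name)` gives the same yes/no answer as the manual check. Reading in chunks matters here because the dump is 22G and should not be loaded into memory at once.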
The XML dump reader script is probably largely finished now: https://github.com/mzmcbride/dump-reports/blob/a8dbbcb3/xmldumpreader.py.
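As a rough illustration of the general approach (this is not the linked script, which may work quite differently), a streaming reader could look something like the sketch below. The regular expression, the function names, and the choice to match only double-quoted `style="..."` attributes are all assumptions made for the example.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Assumed pattern: a double-quoted style="..." attribute anywhere in the wikitext.
STYLE_RE = re.compile(r'style\s*=\s*"([^"]*)"', re.IGNORECASE)

def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_main_namespace_styles(stream):
    """Yield (title, [style values]) for each namespace-0 page in a dump stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ns = text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "text":
                text = child.text or ""
        if ns == "0":
            yield title, STYLE_RE.findall(text or "")
        elem.clear()  # drop the page's subtree to limit memory use

def read_dump(path):
    """Open a .xml.bz2 dump and collect results (fine for small test files only)."""
    with bz2.open(path, "rb") as stream:
        return list(iter_main_namespace_styles(stream))
```

The point of `iterparse` is that the dump is never held in memory as a whole; each page is handled and then cleared. For a full dump you would iterate over the generator and write results out as you go rather than building a list.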
The pattern instances output is going to require a small amount of normalization, so I'll write a separate script for that.
tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time ./xmldumpreader.py
[...]
real 197m38.315s
user 196m30.590s
sys 0m38.797s

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ du -hs enwiki-20151002-*
238M enwiki-20151002-all-main-namespace-pages.txt
11M enwiki-20151002-main-namespace-pages-containing-pattern.txt
189M enwiki-20151002-main-namespace-pattern-instances.txt

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ wc -l enwiki-20151002-*
11998298 enwiki-20151002-all-main-namespace-pages.txt
408777 enwiki-20151002-main-namespace-pages-containing-pattern.txt
8772726 enwiki-20151002-main-namespace-pattern-instances.txt
21179801 total
The rudimentary normalization script has been written and is now available at P2229.
tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time ./normalize-inline-styling.py
real 1m2.802s
user 1m1.686s
sys 0m0.622s

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ wc -l enwiki-20151002-main-namespace-pattern-instances*
8772726 enwiki-20151002-main-namespace-pattern-instances-normalized.txt
8772726 enwiki-20151002-main-namespace-pattern-instances.txt
17545452 total
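To give a feel for what "normalization" could mean here, the following is a hypothetical sketch and not the script at P2229. It assumes the goal is to make equivalent spellings of the same declarations compare equal by lowercasing, trimming whitespace, and dropping empty fragments such as a trailing semicolon. The function name `normalize_style` is invented for the example.

```python
import re

def normalize_style(value):
    """Canonicalize one inline style value so equivalent spellings compare equal."""
    declarations = []
    for declaration in value.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed fragments
        prop, _, val = declaration.partition(":")
        prop = prop.strip().lower()
        # Collapse internal runs of whitespace in the value, then lowercase it.
        val = re.sub(r"\s+", " ", val.strip()).lower()
        if prop:
            declarations.append(f"{prop}:{val}")
    return ";".join(declarations)
```

Note that a line-for-line transform like this keeps the instance count unchanged, which matches the identical `wc -l` totals above. It is deliberately crude: lowercasing values and splitting on every semicolon would mangle things like URLs, so treat it as a starting point only.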
The top 1,000 instances of inline styling in main namespace pages on the English Wikipedia are available at P2230.
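A top-N list like this can be produced with a simple frequency count. The sketch below assumes one normalized instance per line, as in the normalized output file; `top_instances` is a hypothetical helper name, and the paste itself may have been generated some other way.

```python
from collections import Counter

def top_instances(lines, limit=1000):
    """Count identical lines and return the most common ones with their counts."""
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    return counts.most_common(limit)
```

Passing an open file handle as `lines` works, since file objects iterate line by line. With roughly 8.8 million lines the counter should fit comfortably in memory as long as the number of distinct instances is much smaller than the number of lines.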
Looking at http://en.wikipedia.org/w/index.php?title=Wikipedia:Database_reports/Page_count_by_namespace&oldid=684346035, we can see that there were about 7,022,747 redirects in the main namespace on the English Wikipedia at the beginning of October 2015. Subtracting the redirects from 11,998,298 (all pages in the main namespace), we find there are about 4,975,551 non-redirects. Non-redirects presumably have almost all of the inline styling, as styling redirect pages would be pretty strange. We also know that 408,777 pages in the main namespace on the English Wikipedia contained inline styling (from the wc -l output above).
Given these stats, we can say that about 8.22% of articles on the English Wikipedia contain inline styling ((408,777/4,975,551)*100). Not bad.
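The arithmetic is easy to re-check. A few lines of Python using only the figures quoted above (the variable names are just for illustration):

```python
all_main_namespace_pages = 11_998_298   # wc -l of the all-pages file
redirects = 7_022_747                   # from the page count by namespace report
pages_with_inline_styling = 408_777     # wc -l of the pages-containing-pattern file

non_redirects = all_main_namespace_pages - redirects
share = pages_with_inline_styling / non_redirects * 100

print(non_redirects)      # 4975551
print(f"{share:.2f}%")    # 8.22%
```

The redirect count and the dump come from slightly different points in time, so the percentage is an estimate rather than an exact figure.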
Related thread on wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2015-October/083683.html.