Audit inline styling in the main namespace on the English Wikipedia
Closed, Resolved; Public

Description

Let's audit the uses of inline styling in the main namespace on the English Wikipedia!

Event Timeline

MZMcBride claimed this task.
MZMcBride raised the priority of this task to Low.
MZMcBride updated the task description.
MZMcBride subscribed.
tools.mzmcbride@tools-bastion-01:/data/scratch/dumps/enwiki/20151002$ time wget "https://dumps.wikimedia.org/enwiki/20151002/enwiki-20151002-pages-meta-current.xml.bz2"
--2015-10-12 01:02:40--  https://dumps.wikimedia.org/enwiki/20151002/enwiki-20151002-pages-meta-current.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.11, 2620:0:861:1:208:80:154:11
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24035886897 (22G) [application/octet-stream]
Saving to: ‘enwiki-20151002-pages-meta-current.xml.bz2’

100%[===================================>] 24,035,886,897 1.99MB/s   in 3h 18m

2015-10-12 04:20:41 (1.93 MB/s) - ‘enwiki-20151002-pages-meta-current.xml.bz2’ saved [24035886897/24035886897]


real    198m1.127s
user    1m28.630s
sys     3m42.090s
tools.mzmcbride@tools-bastion-01:/data/scratch/dumps/enwiki/20151002$ time openssl dgst -md5 enwiki-20151002-pages-meta-current.xml.bz2 
MD5(enwiki-20151002-pages-meta-current.xml.bz2)= 358832ae3dd2737e7d053467d12e4e22

real    3m44.017s
user    0m55.471s
sys     0m17.184s

https://dumps.wikimedia.org/enwiki/20151002/enwiki-20151002-md5sums.txt says:

358832ae3dd2737e7d053467d12e4e22  enwiki-20151002-pages-meta-current.xml.bz2

The checksums match, so we're good to go!
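For anyone who would rather script this comparison than eyeball it, here is a minimal sketch in Python. It is not part of the original workflow (which used openssl dgst by hand); the helper names md5_of_file and expected_md5 are made up for illustration, and the md5sums file is assumed to use the "<digest>  <filename>" layout shown above.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5sums_text, filename):
    """Look up filename in md5sums-style text; return its digest, or None if absent."""
    for line in md5sums_text.splitlines():
        parts = line.split(None, 1)  # assumed layout: "<digest>  <filename>"
        if len(parts) == 2 and parts[1].strip() == filename:
            return parts[0]
    return None
```

With the dump downloaded and the contents of the md5sums file in a string, the check would be md5_of_file(path) == expected_md5(text, "enwiki-20151002-pages-meta-current.xml.bz2").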

The XML dump reader script is probably largely finished now: https://github.com/mzmcbride/dump-reports/blob/a8dbbcb3/xmldumpreader.py.
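For readers who don't want to click through, the general approach can be sketched roughly as follows. This is not the linked script, just an illustration under stated assumptions: the pattern (a double-quoted style attribute) is a guess at what counts as inline styling, the names scan_dump and scan_dump_file are hypothetical, and main-namespace pages are taken to be those whose ns element is 0.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Assumed pattern: a double-quoted style="..." attribute. The real script may differ.
STYLE_RE = re.compile(r'style\s*=\s*"([^"]*)"', re.IGNORECASE)


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def scan_dump(stream):
    """Yield (title, [style values]) for each main-namespace page in an XML dump stream."""
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        title = ns = None
        text = ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "text":
                text = child.text or ""
        if ns == "0":
            yield title, STYLE_RE.findall(text)
        root.clear()  # discard finished pages so memory use stays flat


def scan_dump_file(path):
    """Stream a .xml.bz2 dump from disk without decompressing it first."""
    with bz2.open(path, "rb") as stream:
        yield from scan_dump(stream)
```

Pages that yield a non-empty list would go into the "pages containing pattern" file, and each matched value would become one line of the "pattern instances" file.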

The pattern instances output is going to require a small amount of normalization, so I'll write a separate script for that.

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time ./xmldumpreader.py
[...]
real    197m38.315s
user    196m30.590s
sys     0m38.797s
tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ du -hs enwiki-20151002-*
238M	enwiki-20151002-all-main-namespace-pages.txt
11M	enwiki-20151002-main-namespace-pages-containing-pattern.txt
189M	enwiki-20151002-main-namespace-pattern-instances.txt
tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ wc -l enwiki-20151002-*
 11998298 enwiki-20151002-all-main-namespace-pages.txt
   408777 enwiki-20151002-main-namespace-pages-containing-pattern.txt
  8772726 enwiki-20151002-main-namespace-pattern-instances.txt
 21179801 total

The rudimentary normalization script has been written and is now available at P2229.
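To give a feel for what "rudimentary normalization" could mean here, below is a hedged sketch. It does not reproduce P2229; every rule in it is an assumption (lowercasing, collapsing whitespace, tightening spacing around ":" and ";", dropping a trailing ";"), and normalize_style is a hypothetical name. It maps one input line to one output line, which at least agrees with the identical line counts below.

```python
import re


def normalize_style(value):
    """Normalize one inline style value under assumed rules (see the comments)."""
    value = value.strip().lower()                  # assumption: treat case as insignificant
    value = re.sub(r"\s+", " ", value)             # collapse runs of whitespace
    value = re.sub(r"\s*([:;])\s*", r"\1", value)  # remove spaces around ':' and ';'
    return value.rstrip(";")                       # assumption: ignore a trailing ';'
```

For example, normalize_style(" Color : RED ; ") and normalize_style("color:red") both come out as "color:red", so they would be counted as the same instance. Lowercasing is lossy for case-sensitive values such as URLs, so a careful version would need to be more selective.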

tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ time ./normalize-inline-styling.py

real	1m2.802s
user	1m1.686s
sys	0m0.622s
tools.mzmcbride@tools-bastion-01:~/scripts/enwiki$ wc -l enwiki-20151002-main-namespace-pattern-instances*
  8772726 enwiki-20151002-main-namespace-pattern-instances-normalized.txt
  8772726 enwiki-20151002-main-namespace-pattern-instances.txt
 17545452 total

The top 1,000 instances of inline styling in main namespace pages on the English Wikipedia are available at P2230.
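A list like this can be produced from the normalized file with a simple frequency count. The sketch below is one way to do it, not necessarily how P2230 was generated; top_instances is a hypothetical name and the input is assumed to be one instance per line.

```python
from collections import Counter


def top_instances(lines, limit=1000):
    """Return the `limit` most frequent lines as (line, count) pairs, most common first."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return counts.most_common(limit)
```

Passing an open file handle for enwiki-20151002-main-namespace-pattern-instances-normalized.txt as lines would count all 8,772,726 instances in a single pass.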

Looking at http://en.wikipedia.org/w/index.php?title=Wikipedia:Database_reports/Page_count_by_namespace&oldid=684346035, we can see that there were 7,022,747 redirects in the main namespace on the English Wikipedia at the beginning of October 2015. Subtracting the redirects from the 11,998,298 pages in the main namespace (from the wc -l output above) leaves about 4,975,551 non-redirects (11,998,298 - 7,022,747). The figure is approximate because the report and the dump are not necessarily snapshots of the same moment. Non-redirects presumably contain almost all of the inline styling, as styling redirect pages would be pretty strange. We also know that 408,777 pages in the main namespace on the English Wikipedia contained inline styling (also from the wc -l output above).

Given these stats, we can say that about 8.22% of articles on the English Wikipedia contain inline styling ((408,777/4,975,551)*100). Not bad.
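The arithmetic is easy to re-check. The snippet below just redoes the calculation with the three numbers quoted in this task; inline_styling_share is a hypothetical helper name.

```python
def inline_styling_share(all_pages, redirects, pages_with_styling):
    """Return (non-redirect page count, percentage of those pages with inline styling)."""
    non_redirects = all_pages - redirects
    return non_redirects, 100.0 * pages_with_styling / non_redirects


# Figures from the wc -l output and the database report cited above.
non_redirects, percent = inline_styling_share(11998298, 7022747, 408777)
print(non_redirects, round(percent, 2))  # prints: 4975551 8.22
```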