Page MenuHomePhabricator

Determine total number of external links in all Wikipedias
Open, LowPublic

Description

A question was raised by WMF legal dept. on the total number of external links in all Wikipedias.

Wikistats doesn't have numbers on this after 2010. This is because Wikistats parses stub dumps (stub = without page content). It used to parse so called full archive dumps (all page revisions, all page content).
After 2010 it was no longer possible to parse these full archive dumps for our largest wikis, due to performance constraints.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJun 16 2016, 5:35 PM

First let me state that the upcoming Wikistats 2.0 (parts of Wikistats migrated to hadoop) will be authoritative on this in a few months time, if not earlier.

For now, I first tried a small wiki: Frisian Wikipedia.

For every wiki there is a periodical dump of external links: in this case a.o. https://dumps.wikimedia.org/fywiki/20160601/fywiki-20160601-externallinks.sql.gz (1.3 MB)

Counting the occurrences of 'http' in this dump, as follows:
zgrep -o http fywiki-20160601-externallinks.sql.gz | wc -l
-> 94660

However every record can contain two links.
Counting the records themselves
zgrep -o \( fywiki-20160601-externallinks.sql.gz | wc -l
-> 56257

But are all of these links still in use?

Counting occurrences of '\[http' in the article dump says no
bunzip2 -c fywiki-20160601-pages-articles.xml.bz2| grep -o "\[http" |wc -l
-> 35930

This seems the best quick way for a ball-park figure. By adding the \[ most uses of 'http' in other contexts (2019 occurrences) than external links are filtered out.

Note that unlike Wikistats this number is for all external links in any namespace. Wikistats only counts 'content namespaces', mostly ns 0. This will be a very minor difference.

Same process as above for fywiki yields for enwiki
19,635,379 external links

Nemo_bis added a subscriber: Nemo_bis.EditedJun 17 2016, 7:39 AM

The externallinks table is authoritative here; looking for "http" may exclude protocol-relative links (as well as links to other protocols, although these are rare) and grepping the dumps doesn't find links added by templates.

PleaseStand added a subscriber: PleaseStand.EditedJun 17 2016, 10:13 PM

looking for "http" may exclude protocol-relative links (as well as links to other protocols, although these are rare)

And looking for "[http" would also exclude bare URLs that are automatically hyperlinked by MediaWiki.

The externallinks table is authoritative here; [...] and grepping the dumps doesn't find links added by templates.

Some things to consider:

The externallinks table includes all (non-special) page namespaces, including the User namespace and all talk namespaces, unlike the pages-articles dumps. To filter by namespace, you have to join to the page table, because the el_from_namespace field either does not exist or is left unpopulated (see T114117: Drop externallinks.el_from_namespace on wmf databases).

The externallinks table, for each page, counts each external link URL once, regardless of how many links to the same URL are on the page, though not in the case below:

The externallinks table, since rMWd045b999ec4f83de (rSVN102951), contains two rows for each protocol-relative external link on each page: one with el_index LIKE 'http:%' and one with el_index LIKE 'https:%'. Except for el_id, the rows should otherwise be identical. You need to account for this to avoid double-counting these links.

The externallinks table includes links to pages on other Wikimedia sites. There are also some entries for links to the same site, probably added because of T34951 and not yet cleaned up.

The externallinks table does not include interwiki links. Some interwiki links are to the same site, and some are to non-WMF sites. You'll have to match iwlinks.iwl_prefix against the interwiki map, which you can obtain by querying each site's API to get the possible interwiki prefixes and corresponding URLs on that site. Or you can somehow extract the data from the interwiki cache kept in operations/mediawiki-config as wmf-config/interwiki.php. I think all the non-local interwiki prefixes come from m:Interwiki map, so should be the same for all sites.

Note that interwiki links can be multi-prefixed. For example, from English Wikipedia, you can link to the English Wikipedia page "Wikipedia" as [[wikt:de:w:en:Wikipedia]]. From en.wikipedia.org, this redirects through en.wiktionary.org, de.wiktionary.org, de.wikipedia.org, and finally back to en.wikipedia.org. Fortunately, this does not work for "non-local" interwikis (those not marked as "local" in the above API query). None of the "local" interwiki prefixes should point to non-WMF sites, though there are a few that do, because they contain as a substring one of WMF's domain names. (Also note that the sidebar language links are in a separate table, langlinks, and apparently can be multi-prefixed as well.)

The externallinks table also does not include external links generated by some extensions, such as EasyTimeline.


I don't know exactly what "total number of external links in all Wikipedias" means for the purposes of this request. Assuming that means:

  • All links from a wiki page to a non-WMF site should be counted
  • All namespaces should be included
  • Only links from Wikipedia sites, and not other projects, should be counted
  • It's OK to count each distinct URL (in terms of byte-for-byte equality) on a page once
  • It's OK to count distinct external interwiki links separately from the rest of the external links
  • The issue of non-WMF sites having local interwiki prefixes, combined with multi-prefixing, can be ignored
  • The issue of some extensions not recording links in the links tables can be ignored

Here is one possible way to count external links:

  1. In a Wikipedia site's externallinks table, count the distinct (el_from, el_to) pairs for which el_index starts with other than "http:" or "https:" or does not match any of WMF's domain names. (This is simple, though counts other protocols such as ftp:// and irc:// even if one of WMF's domain names is included.)
  2. In a Wikipedia site's iwlinks table, count the rows for which iwl_prefix corresponds to an URL that does not match any of WMF's domain names. (The non-local interwiki prefixes should be included, as well as a few local interwiki prefixes that should be non-local prefixes; see above.)
  3. Add the results of steps 1 and 2.
  4. Repeat steps 1-3 for each remaining Wikipedia site.
  5. Add all the results of step 3.
PleaseStand added a comment.EditedJun 17 2016, 11:20 PM

Here is one possible way to count external links:

Step 1, using the aforementioned fywiki dump as an example:

MariaDB [fywiki_subset]> SELECT COUNT(*) FROM (SELECT DISTINCT el_from, UNHEX(MD5(el_to)) FROM externallinks WHERE el_index NOT REGEXP '^https?://org\\.(wikipedia|wiktionary|wikisource|wikiquote|wikibooks|wikimedia|wikinews|wikiversity|wikivoyage|wikimediafoundation|mediawiki|wikidata)\\.') tmp;
+----------+
| COUNT(*) |
+----------+
|    37133 |
+----------+
1 row in set (1.73 sec)
PleaseStand added a comment.EditedJun 18 2016, 12:48 AM

Here is one possible way to count external links:

Step 2, using the fywiki-20160601-iwlinks.sql.gz dump as an example:

MariaDB [fywiki_subset]> SELECT COUNT(*) FROM iwlinks WHERE iwl_prefix IN ('acronym','advogato','aew','appropedia','aquariumwiki','arborwiki','arxiv','atmwiki','baden','battlestarwiki','bcnbio','beacha','betawiki','bluwiki','blw','botwiki','boxrec','brickwiki','bulba','c2','c2find','cache','cellwiki','centralwikia','chej','choralwiki','citizendium','ckwiss','comixpedia','communityscheme','communitywiki','comune','crazyhacks','creativecommons','creativecommonswiki','creatureswiki','cxej','dcc','dcdatabase','dcma','delicious','devmo','dict','dictionary','disinfopedia','distributedproofreaders','distributedproofreadersca','dmoz','dmozs','doi','doom_wiki','dpd','drae','dreamhost','drumcorpswiki','dwjwiki','ecoreality','ecxei','elibre','emacswiki','encyc','energiewiki','englyphwiki','enkol','eokulturcentro','esolang','ethnologue','ethnologuefamily','evowiki','exotica','eĉei','fanimutationwiki','fedora','finalfantasy','finnix','flickrphoto','flickruser','floralwiki','foldoc','forthfreak','foxwiki','freebio','freebsdman','freeculturewiki','freedomdefined','freefeel','freekiwiki','freenode','freesoft','ganfyd','gardenology','gausswiki','gentoo','genwiki','globalvoices','glossarwiki','glossarywiki','google','googledefine','googlegroups','greatlakeswiki','guildwarswiki','guildwiki','gutenberg','gutenbergwiki','h2wiki','hackerspaces','hammondwiki','hdl','heroeswiki','hrfwiki','hrwiki','hupwiki','iarchive','imdbcharacter','imdbcompany','imdbname','imdbtitle','infosecpedia','infosphere','irc','iso639-3','issn','iuridictum','jaglyphwiki','javanet','javapedia','jefo','jerseydatabase','jira','jspwiki','jstor','kamelo','karlsruhe','kinowiki','kmwiki','komicawiki','kontuwiki','koslarwiki','kpopwiki','libreplanet','linguistlist','linuxwiki','linuxwikide','liswiki','literateprograms','livepedia','localwiki','lojban','lostpedia','lqwiki','luxo','mariowiki','marveldatabase','meatball','memoryalpha','metawikisearch','mineralienatlas','moinmoin','monstropedia','mosapedia','mozcom','mozillawiki','mozillazinekb','musicbrainz','mwod','mwot','nara','nkcells','nosmoke','oeis','olpc','onelook','openfacts','openlibrary','openstreetmap','openwetware','openwiki','opera7wiki','organicdesign','orthodoxwiki','osmwiki','ourmedia','owasp','panawiki','patwiki','personaltelco','phpwiki','phwiki','planetmath','pmeg','pokewiki','pokéwiki','proofwiki','psycle','pythoninfo','pythonwiki','pywiki','quarry','reuterswiki','revo','rfc','rheinneckar','robowiki','rodovid','rowiki','rtfm','s23wiki','scholar','schoolswp','scores','scoutwiki','scramble','seapig','seattlewiki','seattlewireless','securewikidc','semantic-mw','senseislibrary','sharemap','silcode','slashdot','slwiki','sourceforge','squeak','stewardry','strategywiki','swinbrain','swtrain','tabwiki','tclerswiki','technorati','tfwiki','thelemapedia','theopedia','thinkwiki','tibiawiki','tmbw','tmnet','tmwiki','toollabs','tools','translatewiki','tviv','tvtropes','twiki','tyvawiki','uncyclopedia','unihan','unreal','urbandict','usej','usemod','vd','viaf','vikidia','vinismo','vkol','vlos','voipinfo','werelate','wikia','wikiapiary','wikiasite','wikichristian','wikicities','wikicity','wikif1','wikifur','wikihow','wikiindex','wikilemon','wikilivres','wikilivresru','wikimac-de','wikinfo','wikinvest','wikiotics','wikipapers','wikiskripta','wikisophia','wikispot','wikiti','wikitree','wikiwikiweb','wipipedia','wlug','wmar','wmat','wmau','wmch','wmcl','wmcz','wmdc','wmde','wmdeblog','wmes','wmfr','wmhk','wmhu','wmid','wmil','wmin','wmit','wmke','wmph','wmpt','wmsk','wmtw','wmuk','wmve','wmza','wookieepedia','wowwiki','wqy','wurmpedia','zrhwiki','zum','zwiki','ĉej');
+----------+
| COUNT(*) |
+----------+
|       29 |
+----------+
1 row in set (0.01 sec)

Step 3:

37133 + 29 = 37162 external links on fywiki, in all namespaces. Do not rely on this number.

PleaseStand added a comment.EditedJun 18 2016, 1:22 AM

Here is one possible way to count external links:

Step 3:
37133 + 29 = 37162 external links on fywiki, in all namespaces

Repeating this while excluding the User namespace and all talk namespaces (as in the pages-articles dump):

MariaDB [fywiki_subset]> SELECT COUNT(*) FROM (SELECT DISTINCT el_from, UNHEX(MD5(el_to)) FROM externallinks JOIN page ON el_from = page_id WHERE page_namespace % 2 = 0 AND page_namespace <> 2 AND el_index NOT REGEXP '^https?://org\\.(wikipedia|wiktionary|wikisource|wikiquote|wikibooks|wikimedia|wikinews|wikiversity|wikivoyage|wikimediafoundation|mediawiki|wikidata)\\.') tmp;
+----------+
| COUNT(*) |
+----------+
|    36295 |
+----------+
1 row in set (1.80 sec)
MariaDB [fywiki_subset]> SELECT COUNT(*) FROM iwlinks JOIN page ON iwl_from = page_id WHERE page_namespace % 2 = 0 AND page_namespace <> 2 AND iwl_prefix IN ('acronym','advogato','aew','appropedia','aquariumwiki','arborwiki','arxiv','atmwiki','baden','battlestarwiki','bcnbio','beacha','betawiki','bluwiki','blw','botwiki','boxrec','brickwiki','bulba','c2','c2find','cache','cellwiki','centralwikia','chej','choralwiki','citizendium','ckwiss','comixpedia','communityscheme','communitywiki','comune','crazyhacks','creativecommons','creativecommonswiki','creatureswiki','cxej','dcc','dcdatabase','dcma','delicious','devmo','dict','dictionary','disinfopedia','distributedproofreaders','distributedproofreadersca','dmoz','dmozs','doi','doom_wiki','dpd','drae','dreamhost','drumcorpswiki','dwjwiki','ecoreality','ecxei','elibre','emacswiki','encyc','energiewiki','englyphwiki','enkol','eokulturcentro','esolang','ethnologue','ethnologuefamily','evowiki','exotica','eĉei','fanimutationwiki','fedora','finalfantasy','finnix','flickrphoto','flickruser','floralwiki','foldoc','forthfreak','foxwiki','freebio','freebsdman','freeculturewiki','freedomdefined','freefeel','freekiwiki','freenode','freesoft','ganfyd','gardenology','gausswiki','gentoo','genwiki','globalvoices','glossarwiki','glossarywiki','google','googledefine','googlegroups','greatlakeswiki','guildwarswiki','guildwiki','gutenberg','gutenbergwiki','h2wiki','hackerspaces','hammondwiki','hdl','heroeswiki','hrfwiki','hrwiki','hupwiki','iarchive','imdbcharacter','imdbcompany','imdbname','imdbtitle','infosecpedia','infosphere','irc','iso639-3','issn','iuridictum','jaglyphwiki','javanet','javapedia','jefo','jerseydatabase','jira','jspwiki','jstor','kamelo','karlsruhe','kinowiki','kmwiki','komicawiki','kontuwiki','koslarwiki','kpopwiki','libreplanet','linguistlist','linuxwiki','linuxwikide','liswiki','literateprograms','livepedia','localwiki','lojban','lostpedia','lqwiki','luxo','mariowiki','marveldatabase','meatball','memoryalpha','metawikisearch','mineralienatlas','moinmoin','monstropedia','mosapedia','mozcom','mozillawiki','mozillazinekb','musicbrainz','mwod','mwot','nara','nkcells','nosmoke','oeis','olpc','onelook','openfacts','openlibrary','openstreetmap','openwetware','openwiki','opera7wiki','organicdesign','orthodoxwiki','osmwiki','ourmedia','owasp','panawiki','patwiki','personaltelco','phpwiki','phwiki','planetmath','pmeg','pokewiki','pokéwiki','proofwiki','psycle','pythoninfo','pythonwiki','pywiki','quarry','reuterswiki','revo','rfc','rheinneckar','robowiki','rodovid','rowiki','rtfm','s23wiki','scholar','schoolswp','scores','scoutwiki','scramble','seapig','seattlewiki','seattlewireless','securewikidc','semantic-mw','senseislibrary','sharemap','silcode','slashdot','slwiki','sourceforge','squeak','stewardry','strategywiki','swinbrain','swtrain','tabwiki','tclerswiki','technorati','tfwiki','thelemapedia','theopedia','thinkwiki','tibiawiki','tmbw','tmnet','tmwiki','toollabs','tools','translatewiki','tviv','tvtropes','twiki','tyvawiki','uncyclopedia','unihan','unreal','urbandict','usej','usemod','vd','viaf','vikidia','vinismo','vkol','vlos','voipinfo','werelate','wikia','wikiapiary','wikiasite','wikichristian','wikicities','wikicity','wikif1','wikifur','wikihow','wikiindex','wikilemon','wikilivres','wikilivresru','wikimac-de','wikinfo','wikinvest','wikiotics','wikipapers','wikiskripta','wikisophia','wikispot','wikiti','wikitree','wikiwikiweb','wipipedia','wlug','wmar','wmat','wmau','wmch','wmcl','wmcz','wmdc','wmde','wmdeblog','wmes','wmfr','wmhk','wmhu','wmid','wmil','wmin','wmit','wmke','wmph','wmpt','wmsk','wmtw','wmuk','wmve','wmza','wookieepedia','wowwiki','wqy','wurmpedia','zrhwiki','zum','zwiki','ĉej');
+----------+
| COUNT(*) |
+----------+
|       21 |
+----------+
1 row in set (0.01 sec)

36295 + 21 = 36316 external links on fywiki, excluding the User namespace and all talk namespaces. Do not rely on this number.

PleaseStand added a comment.EditedJun 20 2016, 9:42 AM

Here is one possible way to count external links:

Now I ran a modified version of my queries, which break the (de-duplicated) entries down by page namespace, against the 20160601 dumps of both fywiki and enwiki, then used a spreadsheet to add the per-namespace results (details in P3271):

tableall namespacesexcluding User: and talk pagescontent namespaces only
fywiki.externallinks371333629533477
fywiki.iwlinks292115
fywiki TOTAL371623631633492
enwiki.externallinks760575735810408639800352
enwiki.iwlinks4833548615130791
enwiki TOTAL765409275819023739831143

Some observations:

  • The external iwlinks entries (meaning interwiki links to non-WMF sites, including Wikimedia Labs sites) are relatively insignificant in number, compared to the external externallinks entries.
  • For enwiki, only 52.0% of the total external links are in content namespaces (the main namespace). 24.0% are in User: pages and talk namespaces, and the remaining 24.0% are in other namespaces. By comparison, about 32% of all pages, and 17% of non-redirect pages, are in the main namespace. So while as a group, content pages have more external links than meta pages, the choice of namespaces to include still has a big effect on the numbers.
  • For enwiki, the external externallinks entries (after de-duplication) are only 82.9% of the total number of rows in the table (91773151). This suggests that not excluding internal externallinks entries and/or duplicate entries for protocol-relative links may inflate the numbers. (Of course, the extent to which this is true may depend on what you consider to be an external link. As noted above, I considered any link to another WMF site, as determined by the domain name, to be an internal link.)
  • I also ran yet another version of the enwiki.externallinks query in which the domains "wmflabs.org", "toolserver.org", and "tools.wikimedia.de" are excluded. The respective counts are 67296104 (-11.5%) for all namespaces, 53579211 (-7.9%) when excluding User: and talk pages, and 37839513 (-5.0%) for content namespaces only. Probably many of the links to Wikimedia Labs in content are added by templates such as {{coord}}, though that particular template is only used on about a million pages. I don't know, as that would require further analysis.
  • My counts of external links from enwiki (though not fywiki) are significantly higher than the estimate from T137984#2386215 (by grepping the pages-articles dump for "[http"). Since links made using citation templates and infoboxes often do not require the left bracket, I would try grepping for "http://" and "https://" instead (preferably excluding the page titles, edit summaries, and so on), and would compare that to the other estimates for both wikis.

However, as I stated above, do not rely on my numbers. I only counted the links from these two Wikipedias (not all Wikipedias as requested), and I probably did not determine even those counts according to your requirements. And my numbers may very well be totally wrong. I'm mainly just questioning the assumption that counting the occurrences of "[http" in the wikitext dumps will produce a reasonable estimate.

ezachte added a comment.EditedJun 22 2016, 3:13 PM

@PleaseStand kudoos for the in-depth analysis! This will be really useful also later on when the hadoop numbers for external links need to be vetted/understood.

For now assume the question from the legal dept is for a ball-park estimate (will ask), and all namespaces count for legal. If so I will go with @Nemo_bis suggestion to treat externallinks table as authoritative.

I ran in stat1002:/mnt/data/xmldatadumps/public/ the following bash command:

find -maxdepth 1 -path './*wiki' -type d -exec bash -c "zgrep -o \( '{}'/latest/*ext*sql.gz | wc -l " \; > somefile

adding all numbers (after removing commons and wikidata) brings us to 372 million external links

Aklapper removed ezachte as the assignee of this task.Mon, Nov 18, 11:52 AM

Removing assignee @ezachte as that Phabricator account has been deactivated. (If there are questions, it seems that @erik_zachte could be contacted.)

Restricted Application added a project: Analytics. · View Herald TranscriptMon, Nov 18, 11:52 AM
Ottomata triaged this task as Low priority.Mon, Nov 18, 4:36 PM
Ottomata moved this task from Incoming to Analytics Query Service on the Analytics board.