
Examine non-ISBN refs on svwp
Closed, Resolved | Public | 6 Estimated Story Points

Description

[…] is a handy tool to grab all the refs from a wiki. How can we best proceed to analyze non-ISBN book references (either old or badly cited) on svwp?

The raw refdump will be of use for T175348, so make sure to share it before processing!

Event Timeline

This will be done using the svwiki-20180801-pages-articles.xml.bz2 dump.

The output of the tool looks like:

page_id	page_title	rev_id	rev_timestamp	reference
6097238	Phoebe Point	41704536	20171025051715	<ref>…ref content…</ref>

Thus it will have to be processed further by extracting only the last column of the TSV file.

cut -f 5 svwp.tsv > svwp_refs_only.tsv
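If a reference happens to contain a literal tab, `cut -f 5` would keep only the part before it, and the header row is carried along as well. A minimal Python sketch of the same extraction, assuming the five-column layout shown above (the function name `iter_references` is made up for illustration):

```python
def iter_references(lines):
    """Yield the reference column from refdump lines, skipping the header row."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Skip the header row if the dump starts with one.
        if i == 0 and line.startswith("page_id\t"):
            continue
        # Split on the first four tabs only, so any tab inside the
        # reference text stays part of the fifth field.
        fields = line.split("\t", 4)
        if len(fields) == 5:
            yield fields[4]
```

Any iterable of lines works as input, for example a file opened with `open("svwp.tsv", encoding="utf-8")`.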

Then: find all instances of {{Bokref}} → create author : title pairs → create a frequency list of those.

Note that the author can be encoded in different ways, either as |efternamn= plus |förnamn= or as a single |författare=; the two forms can be normalized by joining the separate names.
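The steps above could be sketched roughly as follows. This is a simplified illustration, not the script actually used: it assumes the title lives in a `titel` parameter and the ISBN in an `isbn` parameter, it ignores {{Bokref}} instances that contain nested templates, and it splits naively on `|`, so piped wikilinks inside a value get truncated. All function names are hypothetical.

```python
import re
from collections import Counter

# Matches {{Bokref|...}} as long as the body holds no nested templates.
BOKREF_RE = re.compile(r"\{\{\s*bokref\s*\|([^{}]*)\}\}", re.IGNORECASE)


def parse_params(body):
    """Turn a template body into a {name: value} dict of named parameters."""
    params = {}
    for part in body.split("|"):
        name, sep, value = part.partition("=")
        if sep:  # ignore positional (unnamed) parameters
            params[name.strip().lower()] = value.strip()
    return params


def author_of(params):
    """Use författare if set, otherwise join förnamn and efternamn."""
    if params.get("författare"):
        return params["författare"]
    names = [params.get("förnamn", ""), params.get("efternamn", "")]
    return " ".join(n for n in names if n)


def author_title_pairs(ref_text):
    """Yield (author, title) for each {{Bokref}} without an ISBN."""
    for match in BOKREF_RE.finditer(ref_text):
        params = parse_params(match.group(1))
        if params.get("isbn"):  # assumption: only non-ISBN books are of interest
            continue
        yield author_of(params), params.get("titel", "")


def frequency_list(refs):
    """Count how often each (author, title) pair occurs across all refs."""
    counts = Counter()
    for ref in refs:
        counts.update(author_title_pairs(ref))
    return counts
```

Calling `frequency_list` on the lines of svwp_refs_only.tsv and then `most_common()` on the result gives the pairs sorted by frequency. A proper wikitext parser would be more robust than the regular expression used here.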