
[ge-ka]Clean up type list
Closed, ResolvedPublic

Description

The current types list contains line breaks and a variety of spaces, which should be cleaned up.
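As a minimal sketch of the space cleanup, assuming the "variety of spaces" includes characters such as non-breaking spaces and tabs (the task does not list them), a single keyword could be normalised like this. The helper name `normalise_whitespace` is hypothetical, and it is meant to be applied to each keyword after the list has been split.

```python
import re


def normalise_whitespace(text):
    """Collapse each run of whitespace into one space and trim the ends."""
    # For str patterns, \s also matches Unicode whitespace such as U+00A0.
    return re.sub(r'\s+', ' ', text).strip()


# Illustrative input: a non-breaking space, a tab and trailing spaces.
print(repr(normalise_whitespace(' ეროვნული\u00a0\tმუზეუმი  ')))
```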

Both line breaks and commas should be treated as separators between keywords. The keyword ეროვნული can be removed when it stands alone, since it indicates heritage status rather than instance type. Removing this keyword will leave duplicate entries, which can then be merged.
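The effect of these rules can be sketched on a few made-up rows. The rows and counts below are hypothetical, and `keywords` is an illustrative helper, not part of any library. Note that the keyword is only dropped when it is a whole entry, so a longer phrase containing it is left alone.

```python
from collections import Counter

NATIONAL = "ეროვნული"  # heritage-status keyword named in the task


def keywords(raw):
    """Split on "<br />" and ",", then drop empties and the stand-alone keyword."""
    parts = raw.lower().replace("<br />", ",").split(",")
    cleaned = [part.strip() for part in parts]
    return [part for part in cleaned if part and part != NATIONAL]


# Hypothetical rows: (raw type, number of uses)
rows = [
    ("ეკლესია", 10),
    ("ეკლესია<br />ეროვნული", 4),
    ("ეროვნული მუზეუმი", 2),  # keyword inside a longer phrase, so it stays
]

merged = Counter()
for raw, num in rows:
    merged[", ".join(keywords(raw))] += num  # duplicates add up here

for name, num in merged.most_common():
    print(name, num)
```

The first two rows collapse into a single entry with a combined count of 14, while the third keeps its full name.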

Of course a similar cleanup needs to be done on the values we compare to these types.
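One way to keep both sides consistent is to run the table keys and the incoming values through the same function before comparing. The sketch below assumes a simple dictionary lookup; `normalise`, `lookup`, the table contents and the `Q_EXAMPLE` id are all placeholders rather than the actual matching code.

```python
NATIONAL = "ეროვნული"  # heritage-status keyword named in the task


def normalise(raw):
    """Lower-case, split on "<br />" and ",", and drop the stand-alone keyword."""
    parts = [part.strip() for part in raw.lower().replace("<br />", ",").split(",")]
    return ", ".join(part for part in parts if part and part != NATIONAL)


# Hypothetical cleaned table: type -> item id (placeholder, not a real mapping)
raw_table = {"ეკლესია": "Q_EXAMPLE"}
table = {normalise(key): qid for key, qid in raw_table.items()}


def lookup(value):
    """Return the mapped id for a raw value, or None if nothing matches."""
    return table.get(normalise(value))


print(lookup(" ეკლესია <br /> ეროვნული"))
print(lookup("ციხე"))
```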

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 8 2017, 8:52 AM

Built the following script to do the processing:

from collections import OrderedDict
import pywikibot as pwb

# Fetch the mapping table and split it into header, rows and footer.
site = pwb.Site('wikidata', 'wikidata')
page = pwb.Page(site, 'Wikidata:WikiProject WLM/Mapping tables/ge (ka)/types')
contents = page.get()
header, sep, rest = contents.partition('|-')
rest, sep, footer = rest.rpartition('|}')
footer = '|}' + footer

NATIONAL_IMPORTANCE_STR = "ეროვნული"


def clean_type(text):
    """
    Return a cleaned version of the given type text.

    Multiple types may exist either separated by "<br />" or ",".
    Types may include NATIONAL_IMPORTANCE_STR which should be used only
    for heritage status.
    """
    raw_type = text.lower()
    raw_type = raw_type.replace("<br />", ",")
    types = [typ.strip() for typ in raw_type.split(',')]
    # Drop empty entries and every stand-alone status keyword.
    types = [typ for typ in types if typ and typ != NATIONAL_IMPORTANCE_STR]
    return ', '.join(types)


entries = rest.split('|-')
d = {}

# Merge rows that share a cleaned name, summing their counts.
for entry in entries:
    parts = entry.split('\n|')
    name = clean_type(parts[1].strip())
    num = parts[2].strip() or "0"
    qid = parts[3].strip()
    com = parts[4].strip()
    if name not in d:
        d[name] = {'num': 0, 'qid': '', 'com': '', 'orig': []}
    d[name]['num'] += int(num)
    # Warn when two merged rows both carry an item id or a comment;
    # the first one seen is kept.
    if qid and d[name]['qid']:
        print('doh qid: {} {}'.format(qid, d[name]['qid']))
    if com and d[name]['com']:
        print('doh com: {} {}'.format(com, d[name]['com']))
    d[name]['qid'] = d[name]['qid'] or qid
    d[name]['com'] = d[name]['com'] or com
    d[name]['orig'].append(parts[1].strip())

# Sort the merged rows by count, largest first.
od = OrderedDict(sorted(d.items(), key=lambda t: t[1]['num'], reverse=True))

txt = ''
for k, v in od.items():
    txt += '|- \n| {}\n| {}\n| {}\n| {}\n'.format(k, v['num'], v['qid'], v['com'])

page_text = header + txt + footer

# Write the rebuilt table to a local file for review.
with open('tmp.wiki', 'w', encoding='utf-8') as f:
    f.write(page_text)
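The script assumes every table row has four cells in the layout it writes back out (`|- ` followed by one `| cell` per line). As a hedged sketch, a row parser that skips rows not matching that layout, rather than raising an `IndexError`, might look like the following. `parse_row` and the sample row are hypothetical.

```python
def parse_row(entry):
    """Split one table row into (name, num, qid, comment); None if too short."""
    parts = entry.split('\n|')
    if len(parts) < 5:
        return None  # malformed or empty row
    name, num, qid, com = (part.strip() for part in parts[1:5])
    return name, int(num or 0), qid, com


# Sample row text as it appears between two '|-' separators.
sample = ' \n| ეკლესია\n| 14\n| \n| \n'
print(parse_row(sample))
print(parse_row('\n'))
```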

Suhadakashter closed this task as a duplicate of T175367: Page wikipedia.
Reedy reopened this task as Open. Sep 8 2017, 2:11 PM
Lokal_Profil renamed this task from Clean up type list to [ge-ka]Clean up type list. Nov 9 2017, 11:15 AM
Lokal_Profil closed this task as Resolved.