Maniphest T203380

Add descriptions
Closed, Resolved, Public

Authored By

	Alicia_Fagerving_WMSE
	Sep 3 2018, 9:51 AM

Description

Some of the Libris records have short descriptions of the person, equivalent to the Wikidata description (in Swedish).

Sometimes they consist of more than one sentence, in which case the first sentence is most relevant (e.g. Svensk författare). Thus, we need to isolate the first sentence, and also convert the first word to lower-case.
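A minimal sketch of the naive approach described above, assuming a sentence ends at the first full stop followed by a space. The function name is hypothetical. Note that this assumption breaks on abbreviations that contain a full stop, so it is an illustration of the idea rather than a usable solution.

```python
def first_sentence_description(text: str) -> str:
    """Keep the text before the first '. ' and lower-case its first letter.

    Assumption: a full stop followed by a space always ends a sentence,
    which is not true for abbreviations.
    """
    sentence = text.strip().split(". ", 1)[0].rstrip(".")
    # Slicing keeps this safe for empty strings.
    return sentence[:1].lower() + sentence[1:]


print(first_sentence_description("Brittisk historiker."))
```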

Related Objects

Status	Assigned	Task
Resolved	Alicia_Fagerving_WMSE	T202527 Process and upload the Libris authorities database
Duplicate	Alicia_Fagerving_WMSE	T202528 Mass upload of person data
Resolved	Alicia_Fagerving_WMSE	T203380 Add descriptions

Event Timeline

Alicia_Fagerving_WMSE created this task. Sep 3 2018, 9:51 AM

Alicia_Fagerving_WMSE moved this task from 🗃️ Inbox to 📆 This week on the User-Alicia_Fagerving_WMSE board.

A sample of descriptions in the dataset: https://gist.github.com/Vesihiisi/1925cecc8b1d94c8a7bcd1d90fe74b2f

A suggestion of how they could be handled:

In order to be included, a description must:

Not start with ämne; if it does, it is excluded.
Not end with a full stop; a trailing full stop is removed.
Be at most x (8?) words long, to exclude essays like Farmaceut, teaterman, fotläkare; dansk cand. pharm., frivillig i slesvig-holsteinska kriget 1848-1850, chef för Stockholms stadsbudskår (från 1863), teaterdirektör, liktornsoperatör. Död i Bergen. Longer descriptions are excluded because it is difficult to shorten them elegantly: splitting on a full stop and treating the first part as a sentence does not work with abbreviations like med. lic.
Start with a lowercase letter, as per Wikidata standards; the first letter is converted if needed.

(Ok, deep inside I just kinda want to ditch everything longer than like 3 words and containing any punctuation, but let's try :/)

Result:

Brittisk historiker. → brittisk historiker
Verksam vid institutionen för ekologi, miljö och geovetenskap, Umeå universitet (2017) → excluded due to >8 words
Ämnen: miljö och hållbar utveckling. Givit ut diktsamling. → excluded due to blacklisted word
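The suggested rules could be sketched roughly as follows. This is an illustration only: the function name, the constant, and the order in which the checks run are assumptions, and the 8-word limit is just the tentative value floated above.

```python
from typing import Optional

MAX_WORDS = 8  # tentative limit from the suggestion above, not a settled value


def propose_description(raw: str, max_words: int = MAX_WORDS) -> Optional[str]:
    """Return a cleaned description, or None if the string is excluded."""
    text = raw.strip()
    # Exclude anything that starts with the blacklisted word.
    if text.lower().startswith("ämne"):
        return None
    # Remove a single trailing full stop.
    if text.endswith("."):
        text = text[:-1].rstrip()
    # Exclude empty strings and overly long descriptions.
    if not text or len(text.split()) > max_words:
        return None
    # Lower-case the first character only.
    return text[0].lower() + text[1:]


print(propose_description("Brittisk historiker."))
```

Returning None for excluded strings keeps the exclusion decision separate from the cleaned text, so the caller can count or log what was dropped.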

This logic could benefit from an extra pair of eyes, so ping @Lokal_Profil

Alicia_Fagerving_WMSE added a subscriber: Lokal_Profil. Sep 5 2018, 12:39 PM

Alicia_Fagerving_WMSE moved this task from Backlog to In progress on the WMSE-Library-Data-2018 board. Sep 5 2018, 1:03 PM

I have since noticed that ämne(n): can also occur in the middle of a description, so it should be searched for anywhere in the string, and any match should disqualify the description.
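A small sketch of such a check, assuming the marker always appears as ämne or ämnen directly followed by a colon. The pattern and function name are hypothetical, and the mid-string example is made up for illustration. A plain substring test for "ämne" would be broader and catch more cases.

```python
import re

# Assumed form of the marker: "ämne:" or "ämnen:", in any letter case.
AMNE_MARKER = re.compile(r"ämnen?:", re.IGNORECASE)


def mentions_amne(description: str) -> bool:
    """Return True if the marker occurs anywhere in the description."""
    return AMNE_MARKER.search(description) is not None


# Made-up example with the marker in the middle of the string.
print(mentions_amne("Författare. Ämnen: historia."))
```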

Alicia_Fagerving_WMSE moved this task from 📆 This week to 🗃️ Inbox on the User-Alicia_Fagerving_WMSE board. Sep 10 2018, 5:52 AM

Alicia_Fagerving_WMSE moved this task from 🗃️ Inbox to 📆 This week on the User-Alicia_Fagerving_WMSE board. Sep 10 2018, 6:08 AM

Alicia_Fagerving_WMSE moved this task from In progress to Backlog on the WMSE-Library-Data-2018 board. Sep 10 2018, 7:55 AM

Alicia_Fagerving_WMSE moved this task from Backlog to In progress on the WMSE-Library-Data-2018 board.

Alicia_Fagerving_WMSE moved this task from 📆 This week to ♾️ Watching on the User-Alicia_Fagerving_WMSE board. Sep 14 2018, 11:10 AM

I've taken a quick look/stab at this. I think your gut instinct is right and 3-5 words is a more reasonable limit than 8 words.

I think the descriptions can be used if:

any string containing one of the following is discarded: 'ämne', 'birthday.se', 'lc auth'* (lower-case comparison)
any string with more than 4** words is discarded
any string not starting with an alphabetic character is discarded (i.e. one starting with a number, punctuation etc.)
any initial NE: or DB: strings are removed
any trailing full stop is removed
the first character is converted to lower case***

I put a gist of the cleaned results and the cleaning script at https://gist.github.com/lokal-profil/37547b9925a5489dbc6d533fd0be662c . Out of the 1933 strings, 977 could be salvaged, which isn't bad.

*There might be more such strings, which a skim read might reveal.
**When I got down to that length (4 words), most strings seemed ok. Going to 5 words (also in gist) salvages another 130ish strings. A quick look suggests these might also be ok.
***Converting the first character to lower case gives a few bad results (e.g. TV->tV), but they are rare enough that it should be acceptable.
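For illustration, the rules in this comment could be sketched as below. This is not the script from the gist: the names are hypothetical, and stripping the NE:/DB: prefix and the trailing full stop before the word count and first-character checks is an assumption about the intended order.

```python
import re
from typing import Optional

BLOCKLIST = ("ämne", "birthday.se", "lc auth")  # compared in lower case
MAX_WORDS = 4  # 5 would salvage more strings, per the footnote above
PREFIX = re.compile(r"^(NE|DB):\s*")  # assumes the prefixes are upper case


def salvage_description(raw: str, max_words: int = MAX_WORDS) -> Optional[str]:
    """Return a cleaned description, or None if the string is discarded."""
    text = raw.strip()
    lowered = text.lower()
    # Discard strings containing any blocklisted term.
    if any(term in lowered for term in BLOCKLIST):
        return None
    # Remove an initial "NE:" or "DB:" marker.
    text = PREFIX.sub("", text)
    # Remove a single trailing full stop.
    if text.endswith("."):
        text = text[:-1].rstrip()
    # Discard empty strings and strings not starting with a letter.
    if not text or not text[0].isalpha():
        return None
    # Discard strings that are too long.
    if len(text.split()) > max_words:
        return None
    # Lower-case the first character only.
    return text[0].lower() + text[1:]


print(salvage_description("Brittisk historiker."))
```

This sketch reproduces the TV->tV problem mentioned above, since it lower-cases the first character unconditionally. One possible guard, not part of the rules listed, would be to leave the first character alone when the second character is also upper case.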