Add descriptions
Closed, Resolved, Public

Description

Some of the Libris records have short descriptions of the person, equivalent to the Wikidata description (in Swedish).

Sometimes they consist of more than one sentence, in which case the first sentence is most relevant (e.g. Svensk författare). Thus, we need to isolate the first sentence, and also convert the first word to lower-case.

Event Timeline

A sample of descriptions in the dataset: https://gist.github.com/Vesihiisi/1925cecc8b1d94c8a7bcd1d90fe74b2f

A suggestion of how they could be handled:

In order to be included, a description must:

  • Not start with ämne; if it does, it's excluded.
  • Not end with a full stop; final full stop is removed
  • Be at most x (8?) words long, to exclude essays like "Farmaceut, teaterman, fotläkare; dansk cand. pharm., frivillig i slesvig-holsteinska kriget 1848-1850, chef för Stockholms stadsbudskår (från 1863), teaterdirektör, liktornsoperatör. Död i Bergen." Longer descriptions are excluded as it's difficult to shorten them elegantly. Splitting on a full stop and treating the first part as a sentence doesn't work with abbreviations like "med. lic."
  • Start with a lowercase letter, as per Wikidata standards. Thus conversion is done.
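
The four rules above can be sketched roughly as follows. This is only an illustration, not the code used in the project: the function name `clean_description` and the constant `MAX_WORDS` are made up for the example, and 8 is just the tentative limit mentioned in the list.

```python
from typing import Optional

MAX_WORDS = 8  # hypothetical constant: the "x (8?)" limit is still undecided


def clean_description(raw: str, max_words: int = MAX_WORDS) -> Optional[str]:
    """Return a cleaned description, or None if it should be excluded."""
    text = raw.strip()
    # Rule 1: descriptions starting with "ämne" are excluded.
    if text.lower().startswith("ämne"):
        return None
    # Rule 2: remove a single trailing full stop.
    if text.endswith("."):
        text = text[:-1].rstrip()
    if not text:
        return None
    # Rule 3: exclude anything longer than max_words (whitespace-separated).
    if len(text.split()) > max_words:
        return None
    # Rule 4: lower-case the first character only, leave the rest untouched.
    return text[0].lower() + text[1:]
```

Counting words by splitting on whitespace means punctuation stays attached to its word, so "(2017)" counts as one word.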

(Ok, deep inside I just kinda want to ditch everything longer than like 3 words and containing any punctuation, but let's try :/)

Result:

  • Brittisk historiker. → brittisk historiker
  • Verksam vid institutionen för ekologi, miljö och geovetenskap, Umeå universitet (2017) → excluded due to >8 words
  • Ämnen: miljö och hållbar utveckling. Givit ut diktsamling. → excluded due to blacklisted word

This logic could benefit from an extra pair of eyes, so ping @Lokal_Profil

Now that I have noticed that ämne(n): can also occur in the middle of the description, it should be searched for anywhere in the string (and disqualify it).
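
A minimal sketch of that change, assuming a case-insensitive substring test is enough (the helper name `mentions_subject` is hypothetical):

```python
def mentions_subject(description: str) -> bool:
    """True if "ämne" occurs anywhere in the description, ignoring case."""
    return "ämne" in description.lower()
```

Note that a plain substring test also matches compound words that merely contain "ämne", so it errs on the side of discarding.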

I took a quick look/stab at this. I think your gut instinct is right and 3-5 words is a more reasonable limit than 8 words.

I think the descriptions can be used if:

  • any string containing one of the following is discarded: 'ämne', 'birthday.se', 'lc auth'* (lower case comparison).
  • any string with more than 4** words is discarded
  • any string not starting with an alphabetic character is discarded (i.e. those starting with numbers, punctuation etc.)
  • any initial NE: or DB: strings are removed
  • any trailing full stop is removed
  • the first character is converted to lower case***
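
As a rough illustration of these rules (this is not the script from the gist), they could be combined like this. The names `salvage_description`, `BLACKLIST`, `MAX_WORDS` and `PREFIX_RE` are invented for the example, and the order of the steps is an assumption: here the NE:/DB: prefix and the trailing full stop are removed before the first-character and word-count checks.

```python
import re
from typing import Optional

BLACKLIST = ("ämne", "birthday.se", "lc auth")  # compared in lower case
MAX_WORDS = 4
# Assumption: the prefixes appear in upper case, followed by optional spaces.
PREFIX_RE = re.compile(r"^(?:NE|DB):\s*")


def salvage_description(raw: str, max_words: int = MAX_WORDS) -> Optional[str]:
    """Return the cleaned description, or None if it is discarded."""
    text = raw.strip()
    # Discard anything containing a blacklisted substring.
    lowered = text.lower()
    if any(term in lowered for term in BLACKLIST):
        return None
    # Remove an initial "NE:" or "DB:" and a single trailing full stop.
    text = PREFIX_RE.sub("", text)
    if text.endswith("."):
        text = text[:-1].rstrip()
    # Discard strings that do not start with a letter.
    if not text or not text[0].isalpha():
        return None
    # Discard strings with more than max_words whitespace-separated words.
    if len(text.split()) > max_words:
        return None
    # Lower-case the first character only.
    return text[0].lower() + text[1:]
```

Calling `salvage_description(s, max_words=5)` gives the more lenient 5-word variant. Because only the first character is lower-cased, an input such as "TV-producent" comes out as "tV-producent".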

I put a gist of the cleaned results and the cleaning script at https://gist.github.com/lokal-profil/37547b9925a5489dbc6d533fd0be662c . Out of the 1933 strings, 977 could be salvaged, which isn't bad.

*There might be more strings which a skim read might reveal.
**When I got down to that length (4 words) most strings seemed ok. Going to 5 words (also in gist) salvages another 130ish strings. A quick look suggests these might also be ok.
***Converting the first character to lower case gives a few bad results (e.g. TV->tV) but they are rare enough that it should be acceptable.