
[wiki-nlp-tools] Remove numbers from abbreviation list
Open, Needs Triage, Public

Description

Numbers followed by a period (e.g. "25.") end up in the abbreviation list and cause false positives. They could be removed, but there are other considerations.

Example of issue in iswiki:

['Greint var frá fyrstu tilfellum sjúkdómsins á þessu tímabili, í lok júlí. ', 'Curtis Cooper, prófessor í stærðfræði og tölvunarfræði við háskólann í miðhluta Missouri, hefur uppgötvað stærstu prímtöluna sem þekkist í dag þann 25. ', 'janúar.']

Notice the split here: '...dag þann 25. ', 'janúar.'

Isaac:

I feel there are probably legitimate sentences that end with numbers, e.g., "In 2023, he turned 33."

Martin Gerlach:

Numbers such as "25." are tricky. In some languages they should actually be treated as abbreviations. For example, in German we write "25th" as "25." (e.g. as part of a date), so the sentence should not be split there.

Isaac:

Yeah, might need a language-specific rule here for whether this is allowed. But I'm also realizing my example sentence would be even more typical if it said something like "...in 2023.", i.e. years end sentences quite commonly, which is a reason we can't just rule out periods after numbers. But maybe there is some middle ground we can find.
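
One possible shape for such a middle ground, as a rough sketch only and not the wiki-nlp-tools implementation. The function name, the language set, and the digit-length heuristic are all assumptions made for illustration: in languages where "25." can be an ordinal, short numbers are kept attached to the following word, while longer numbers (such as years) still end the sentence.

```python
import re

# Hypothetical set of languages where "<number>." can be an ordinal:
# German (from the comment above) and Icelandic (from the iswiki example).
# This list is an assumption, not taken from wiki-nlp-tools.
ORDINAL_PERIOD_LANGS = {"de", "is"}

# A token consisting only of digits followed by a single period, e.g. "25."
NUMBER_WITH_PERIOD = re.compile(r"^(\d+)\.$")


def splits_after_number(token: str, next_token: str, lang: str) -> bool:
    """Guess whether a sentence boundary follows a "<number>." token.

    Heuristic (an assumption for illustration):
    - at the end of the text, always split;
    - in languages without ordinal periods, always split;
    - otherwise treat 1-2 digit numbers as ordinals (no split) and
      longer numbers, such as years, as sentence-final (split).
    """
    match = NUMBER_WITH_PERIOD.match(token)
    if match is None:
        raise ValueError(f"not a number followed by a period: {token!r}")
    if not next_token:
        return True
    if lang not in ORDINAL_PERIOD_LANGS:
        return True
    return len(match.group(1)) > 2


print(splits_after_number("25.", "janúar.", "is"))   # ordinal in a date
print(splits_after_number("2023.", "Danach", "de"))  # year ending a sentence
```

This still gets a sentence like "...he turned 33." wrong in German or Icelandic, so a real rule would likely need more context (for example, whether the next word is a month name).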