
Run wikidata entity autocomplete optimizer for more languages
Closed, Resolved (Public)


Languages for entity search (per WikidataCompletionSearchClicks schema) with >1k searches per day over the last 30 days are:

  • en
  • fr
  • es
  • de

Run the same optimization process that was used for en on these languages.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 7 2019, 6:07 PM

Tuning reports for fr, de, es. Trained on data from Oct through Jan. All show improvements similar to English, with the strongest improvements in German.

TJones added a subscriber: TJones. Jan 15 2019, 6:25 PM
TJones added a comment. Edited Jan 15 2019, 10:22 PM

Very nice! This is one of those subtle improvements that doesn't make a ton of difference to any one search (one whole character saved!) but adds up over all the users who will benefit. Good stuff!

P.S.: You skipped Russian because of a frequency cutoff, right?

debt closed this task as Resolved. Jan 18 2019, 7:09 PM
EBernhardson updated the task description. Jan 23 2019, 7:54 PM

Yes, I ended up dropping Russian for frequency. English was run with a 120k-observation dataset (5% of the total, iirc). The languages above are all those with >= 90k (~1k/day) observations. This technique can potentially apply with less data, but I started seeing more overfitting in the drop from 120k to 90k observations and decided not to look into it too deeply.
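
A minimal sketch of the frequency cutoff described above, assuming per-language observation counts are available as a plain mapping. The function name, the placeholder language codes (xx, yy, zz), and the counts are hypothetical and are not taken from the actual tuning pipeline; only the 90k threshold comes from this thread.

```python
from typing import Dict, List

# Cutoff quoted in the thread: >= 90k observations (~1k/day).
MIN_OBSERVATIONS = 90_000


def languages_above_cutoff(
    counts: Dict[str, int], minimum: int = MIN_OBSERVATIONS
) -> List[str]:
    """Return language codes with at least `minimum` observations, sorted by code."""
    return sorted(lang for lang, n in counts.items() if n >= minimum)


# Placeholder counts for illustration only; these are not real measurements.
example_counts = {"xx": 120_000, "yy": 90_000, "zz": 89_999}
print(languages_above_cutoff(example_counts))
```

The comparison is inclusive (>=), matching the wording of the cutoff, so a language with exactly 90k observations is kept and one just below it is dropped.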