
Run wikidata entity autocomplete optimizer for more languages
Closed, Resolved, Public


Languages for entity search (per WikidataCompletionSearchClicks schema) with >1k searches per day over the last 30 days are:

  • en
  • fr
  • es
  • de

Run the same optimization process that was used for en on the other languages in this list.

Event Timeline

Tuning reports for fr, de, es. Trained on data from October through January. All show improvements similar to English, with the strongest improvements in German.

Very nice! This is one of those subtle improvements that doesn't make a ton of difference to any one search (one whole character saved!) but adds up over all the users who will benefit. Good stuff!

P.S.: You skipped Russian because of a frequency cutoff, right?

Yes, I ended up dropping Russian for frequency. English was run with a 120k observation dataset (5% of the total, IIRC). The languages above are all those with >= 90k (~1k/day) observations. This technique can potentially apply with less data, but I started seeing more overfitting in the drop from 120k to 90k observations and decided not to look into it too deeply.
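
For anyone repeating this for other wikis, the frequency cutoff amounts to a simple filter on per-language observation counts. A minimal sketch, assuming the counts are already aggregated into a dict; the function name and the example counts are hypothetical placeholders, not figures from the schema, and only the 90k threshold comes from this task:

```python
# Cutoff taken from the comment above: >= 90k observations (~1k/day).
MIN_OBSERVATIONS = 90_000


def languages_above_cutoff(observation_counts, min_observations=MIN_OBSERVATIONS):
    """Return language keys with at least `min_observations` observations,
    ordered from most to fewest observations."""
    ranked = sorted(observation_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [lang for lang, count in ranked if count >= min_observations]


# Placeholder counts for illustration only; these are NOT real search volumes.
example_counts = {"lang_a": 150_000, "lang_b": 90_000, "lang_c": 89_999}
selected = languages_above_cutoff(example_counts)
print(selected)
```

The comparison is inclusive, matching the ">= 90k" wording. Lowering `min_observations` would admit more languages, at the risk of the overfitting mentioned above.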