
[Session] Let's share our Search challenges
Closed, Resolved, Public

Description

Guillaume:

Categories/Tags/Keywords (up to 5):

  • Search
  • Discovery
  • Language

Session type (select one):

  • Presentation (including Q/A) - 25 mins
  • Discussion (including Q/A) - 55 mins
  • Workshop (including Q/A) - 55 mins
  • Lightning talk - 5 mins

Venue (select one):

  • I would like to be on the main track
  • I wouldn't mind being on the main track
  • I need a Jitsi room for the session

When are you available to have the session?

1pm to 3pm UTC, 22 - 23 May

Session Details

Short description of the session (~150 words):

The Search Platform team is answering your questions. Come and tell us what issues you have with Search, share your ideas to improve Search, tell us what works and what does not.

Target audience:

Anyone who uses Search, either to just find content, or to build higher level workflows. Casual searchers as well as experienced Search users are welcomed!

What will participants get out of this session? (~50 words)

A better understanding of how to use search, as a reader, editor, or developer. An opportunity to collaborate with the search team to make improvements.

(Optional) Additional resources:

Share any documentation or links where participants can learn more about the session topics
General search documentation:

Event Timeline

Gehel renamed this task from [Session] Let's share our Search problems to [Session] Let's share our Search challenges. May 11 2021, 2:52 PM

Hello @Gehel and thanks a lot for proposing this session!

Feel free to have a look at the remaining free slots in the two hacking rooms in the schedule (times are presented in UTC/GMT), and please add your session directly in the schedule on wiki, in one of the two hacking rooms, before May 20th. If you have issues editing the wiki, we can also do it for you, feel free to ask for help.

As a speaker in a hacking room, you will use Jitsi, where you will be able to present, share your screen, and interact directly with the participants. The session will not be recorded.

If you have any questions, feel free to reach out to me. Thanks!

I've scheduled us for Saturday at 13:00 UTC.

Hey @TJones @Gehel, as you're doing this session as a team, do you have a need for an extra facilitator? If so, we'd try to get you one, or I'd join!

Thanks for the offer, @Bmueller! We seem to have done okay; a few other people in the meeting jumped in to help keep track of questions, too. We had a big group—it was great!

https://etherpad.wikimedia.org/p/wmhack21-search-challenges

Current search-related projects:

WDQS: Wikidata Query Service
WCQS: Wikimedia Commons Query Service

Is there something that the Wikidata team can help with WDQS/WCQS?

  • WCQS is in an experimental beta; the timeline for production hardware/stability is uncertain.
  • WCQS doesn't support application authentication. The next stage (beta 2) is moving to production infrastructure, then authentication. The service won't be production-stable yet and may have unplanned outages. After that come monitoring and increased stability.
  • WCQS will continue to require authentication

Any plans for improving performance of WDQS?

  • Broad topic: includes response time, service stability, and the update process
    • All three have areas for improvement
    • New updater is coming soon (Streaming updater)
      • Important because Wikidata implements ratelimits based on WDQS update lag
      • Currently running from the Analytics cluster
    • Next is dealing with Blazegraph
      • Blazegraph is not well supported anymore, WMF can't maintain it alone
      • Thinking about splitting the service for more specialized use-cases and/or replacing Blazegraph
      • Blazegraph doesn't scale well
        • Can't grow a single graph forever
  • WDQS contains its own model, data has to be imported from Wikidata

Would it be possible to return partial results instead of timeouts (but of course with a prominent warning that results are not complete)?

  • Yes and no
  • Technically it's already possible in some cases, but not most of the time in our use cases
  • Not something we'll add to Blazegraph, but might be possible in a different backend
  • Not currently a priority in choosing a new backend

Why does incategory: / deepcategory: just return errors on basically everything?

  • Very backend intensive, especially on large sites with large category trees
  • Try to return partial results where possible
  • Likely just timing out
  • MW category hierarchy is ... suboptimal

Lexemes use a lot of items and are used by some items, but a separate endpoint for lexemes would be very interesting!

  • We might be able to split lexemes, but the main graph will continue to grow
  • No clear domain splits for wikidata

Do we have some map of Wikidata? (a bit like the old "internet map" https://internet-map.net)
Wondering how dense and archipelago-like it is (e.g. are lexemes/citations or humans really isolated?)

  • want to have a better understanding of that
  • wikidata is big, so analyzing is hard
  • just an analysis of data is not usually enough if queries combine data from across the graph
    • Need to analyze data and queries
  • Wikidata has no rigid structure, can't design based on a schema

Why weren't categories implemented as tags, so that a search for "Italian poets" applies the tags "Italian" + "poet"? With categories one can follow different trees and is not guaranteed to get where one wants.

Why is page_id not the best way to access the wiki API?
There are no search functions based on it.

  • page_id is not search, if you know the page_id you don't need search
  • search is more about fuzzy content matching & what users want to surface
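The distinction above can be sketched with two request URLs: a direct lookup when the page_id is known, and a full-text search when it is not. This is a minimal sketch that only builds the URLs and sends nothing. The endpoint is an assumption (any MediaWiki `api.php` would do), and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: an example endpoint; substitute the api.php of the wiki you use.
API_URL = "https://en.wikipedia.org/w/api.php"


def page_lookup_url(page_ids, api_url=API_URL):
    """Build an Action API URL that fetches basic info for known page IDs."""
    params = {
        "action": "query",
        "pageids": "|".join(str(i) for i in page_ids),  # IDs are pipe-separated
        "prop": "info",
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


def search_url(text, api_url=API_URL, limit=10):
    """Build an Action API URL that runs a full-text search instead."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": text,
        "srlimit": limit,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(page_lookup_url([12, 345]))
    print(search_url("Italian poets"))
```

The lookup URL names exact pages and involves no ranking, while the search URL carries free text that the search backend has to match and rank.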

Any plans for global search?

  • No.
  • "Computationally insanely intensive"
  • Possible on CloudElastic indexes (https://global-search.toolforge.org/) , but difficult to do at scale on-wiki
  • It's slow, and would slow down everyone else if done on the main cluster
  • Would be useful for global interface admins searching for deprecated JS
  • Production shell access users have mwgrep
  • Different languages have different indexes & other configuration
  • Different wikis can't be sorted together, ranking of results between wikis not possible
  • "on other projects" does already exist

Something that would be useful that doesn't need global search is better integration of Wikidata search results in your local wiki search results

  • Difficult to merge results from multiple sites
  • Several wikis (frwiki) include wikidata results in search
  • Would be useful if a local page doesn't exist on a topic, to see if WD/other projects have pages + doing the search in your local wiki gives you the red link to easily start the article
  • Could be added to sister projects search

Regular expressions in search? Are \d \w \s really that expensive?

  • want \b ^ $ too
  • Probably could implement those char classes (\d \w \s)
  • Haven't spent much time on insource/intitle regex
  • we use regex search too, would like it to be better, not a priority
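Until such classes are supported, one workaround is to spell them out as explicit bracket expressions before building an insource: regex query. The sketch below assumes the search regex flavour accepts ranges such as [0-9]. The helper names and the expansion table are hypothetical, and the expansions are ASCII-only, so they are narrower than what \w and \s mean in most regex engines.

```python
import re

# Hypothetical, ASCII-only stand-ins for the shorthand classes.
EXPANSIONS = {
    "d": "[0-9]",
    "w": "[A-Za-z0-9_]",
    "s": "[ ]",  # space only; tabs and newlines are not covered
}


def expand_classes(pattern):
    r"""Replace \d, \w and \s with bracket expressions.

    Other escapes (for example \$ or \\) are left untouched. Shorthand
    classes that already sit inside a bracket expression are not handled.
    """
    def repl(match):
        return EXPANSIONS.get(match.group(1), match.group(0))

    # Consume escapes pairwise so an escaped backslash is never misread.
    return re.sub(r"\\(.)", repl, pattern)


def insource_query(pattern):
    """Wrap the expanded pattern in insource:/.../ search syntax."""
    return f"insource:/{expand_classes(pattern)}/"


if __name__ == "__main__":
    print(insource_query(r"\d\d:\d\d"))
```

A pattern containing a literal slash would still need that slash escaped by the caller before it is wrapped.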

Should all the PDFs that are now on Commons be deprioritized? They're almost always not useful.

  • index got cut to 50K, because metadata included entire transcriptions, which caused problems
  • not intentionally deprioritized
  • -filemime:pdf
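The -filemime:pdf filter from the last bullet can be appended to any query to drop PDFs from the results. A minimal sketch, assuming the Commons Action API endpoint and a search restricted to the File namespace; the helper names are hypothetical and no request is sent.

```python
from urllib.parse import urlencode

# Assumption: the Commons endpoint; namespace 6 is the File namespace.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def exclude_pdfs(query):
    """Append the negated filemime filter so PDF files are left out."""
    return f"{query} -filemime:pdf"


def commons_search_url(query, api_url=COMMONS_API, limit=10):
    """Build a file search URL with PDFs excluded."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": exclude_pdfs(query),
        "srnamespace": 6,
        "srlimit": limit,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(commons_search_url("old maps"))
```

The same query string also works when typed directly into the search box.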

Oh, yes, this morning I wanted to look for "$" in Esperanto projects to see if the currency symbol is generally used before or after the amount ($1 or 1$) in this language. How would you recommend doing that?

when inserting wikilinks in languages other than English, more specifically the inflected languages, terms like "quantumsuffix statisticalsuffix physicssuffix" need to be cleaned up to something like "quantum statistical physicssuffix". can this be improved? is a machine learning model responsible for these suggestions or is it just literal search?

  • Not ML, no work in that area
  • only support for suffixes on the final term

WM Search Platform has monthly office hours, feel free to join https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
https://www.youtube.com/watch?v=P_xJaqQV71s -- talk on language-based processing

Thanks for participating in the Wikimedia Hackathon 2021! We hope you had a great time.

  • If this session / event took place: Please change the task status to "resolved" via the Add Action...Change Status dropdown.
    • If there are specific follow-up tasks from this session / event: Please create dedicated tasks and add another active project tag to those tasks, so others can find those tasks (as likely nobody in the future will look back at Wikimedia-Hackathon-2021 tasks when trying to find something they are interested in).
  • If this session / event did not take place: Please set the task status to "declined".

Thank you,
your Hackathon venue housekeeping service

Good discussion and excellent notes! Thanks!