[Session] Let's share our Search challenges
Closed, Resolved · Public

Assigned To

Authored By

	Gehel
	May 11 2021, 2:44 PM

Description

Guillaume:

Categories/Tags/Keywords (up to 5):

Search
Discovery
Language

Session type (select one):

Presentation (including Q/A) - 25 mins
Discussion (including Q/A) - 55 mins
Workshop (including Q/A) - 55 mins
Lightning talk - 5 mins

Venue (select one):

I would like to be on the main track
I wouldn't mind being on the main track
I need a Jitsi room for the session

When are you available to have the session?

1pm to 3pm UTC, 22 - 23 May

Session Details

Short description of the session (~150 words):

The Search Platform team is answering your questions. Come and tell us what issues you have with Search, share your ideas to improve Search, tell us what works and what does not.

Target audience:

Anyone who uses Search, either just to find content or to build higher-level workflows. Casual searchers as well as experienced Search users are welcome!

What will participants get out of this session? (~50 words)

A better understanding of how to use search, as a reader, editor, or developer. An opportunity to collaborate with the search team to make improvements.

(Optional) Additional resources:

Share any documentation or links where participants can learn more about the session topics
General search documentation:

Event Timeline

Gehel created this task. May 11 2021, 2:44 PM

Restricted Application added a subscriber: Aklapper. May 11 2021, 2:44 PM

Gehel updated the task description. May 11 2021, 2:48 PM

Gehel renamed this task from [Session] Let's share our Search problems to [Session] Let's share our Search challenges. May 11 2021, 2:52 PM

dcausse subscribed. May 11 2021, 3:03 PM

TJones updated the task description. May 11 2021, 3:49 PM

Ladsgroup moved this task from Backlog to Session proposals on the Wikimedia-Hackathon-2021 board. May 12 2021, 4:58 AM

Hello @Gehel and thanks a lot for proposing this session!

Feel free to have a look at the remaining free slots in the two hacking rooms in the schedule (times are presented in UTC/GMT), and please add your session directly in the schedule on wiki, in one of the two hacking rooms, before May 20th. If you have issues editing the wiki, we can also do it for you, feel free to ask for help.

As a speaker in a hacking room, you will use Jitsi, where you will be able to present, share your screen, and interact directly with the participants. The session will not be recorded.

If you have any questions, feel free to reach out to me. Thanks!

TJones subscribed. May 18 2021, 2:42 PM

I've scheduled us for Saturday at 13:00 UTC.

Awesome, thanks!

Aklapper assigned this task to Gehel. May 20 2021, 8:14 PM

WMDE-Fisch subscribed. May 21 2021, 10:50 AM

Hey @TJones @Gehel, as you're doing this session as a team, do you have a need for an extra facilitator? If so, we'd try to get you one, or I'd join!

Thanks for the offer, @Bmueller! We seem to have done okay; a few other people in the meeting jumped in to help keep track of questions, too. We had a big group—it was great!

https://etherpad.wikimedia.org/p/wmhack21-search-challenges

Current search-related projects:

Wikidata person search
- https://orator-matcher.toolforge.org/sparql.php?names=Alexis+Figueroa%0D|Pablo+Rumel+Espinoza%0D|Ignacio+Fritz
Template discovery

WDQS: Wikidata Query Service
WCQS: Wikimedia Commons Query Service

Is there something that the Wikidata team can help with WDQS/WCQS?

WCQS is in an experimental beta, timeline for production hardware/stability is uncertain.
WCQS doesn't support application authentication. Next stage (beta 2) is moving to production infrastructure, then authentication. The service won't be production-stable yet and may have unplanned outages. After that comes monitoring and increased stability.
WCQS will continue to require authentication

Any plans for improving performance of wdqs?

Broad topic, includes both response time, service stability, and update process
- All three have areas for improvement
- New updater is coming soon (Streaming updater)
  - Important because Wikidata implements ratelimits based on WDQS update lag
  - Currently running from the Analytics cluster
- Next is dealing with Blazegraph
  - Blazegraph is not well supported anymore, WMF can't maintain it alone
  - Thinking about splitting the service for more specialized use-cases and/or replacing Blazegraph
  - Blazegraph doesn't scale well
    - Can't grow a single graph forever
WDQS contains its own model, data has to be imported from Wikidata

Would it be possible to return partial results instead of timeouts (but of course with a prominent warning that results are not complete)?

Yes and no
Technically it's already possible in some cases, but not most of the time in our use cases
Not something we'll add to Blazegraph, but might be possible in a different backend
Not currently a priority in choosing a new backend

Why does incategory: / deepcategory: just return errors on basically everything?

Very backend intensive, especially on large sites with large category trees
Try to return partial results where possible
Likely just timing out
MW category hierarchy is ... suboptimal

Lexemes use a lot of items and are used by some items, but a separate endpoint for lexemes would be very interesting!

We might be able to split lexemes, but the main graph will continue to grow
No clear domain splits for wikidata

Do we have some map of Wikidata? (a bit like the old "internet map" https://internet-map.net)
Wondering how dense and archipelago-like it is (e.g. are lexemes/citations or humans really isolated?)

want to have a better understanding of that
wikidata is big, so analyzing is hard
just an analysis of data is not usually enough if queries combine data from across the graph
- Need to analyze data and queries
Wikidata has no rigid structure, can't design based on a schema

Why weren't categories implemented as tags, so if one searches for "Italian poets" then tags "Italian"+"poet" are applied. With categories one can follow different trees and is not guaranteed to get where she wants.

No one remembers
Commons Structured Data somewhat does that
- Some ideas to bring that to more wikis https://www.mediawiki.org/wiki/Structured_data_across_Wikimedia
ORES machine-generated general topics can be used to refine searches
current structure has scaling problems

Why is page_id not the best way to access the wiki API?
There are no search functions based on it.

page_id is not search, if you know the page_id you don't need search
search is more about fuzzy content matching & what users want to surface
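
A minimal sketch of the point above, assuming the standard MediaWiki Action API (`action=query` with the `pageids` parameter): when the page ID is already known, a direct lookup replaces search entirely. The helper name `page_lookup_url`, the default host, and the example ID are hypothetical and only serve as illustration.

```python
from urllib.parse import urlencode


def page_lookup_url(page_id: int, host: str = "en.wikipedia.org") -> str:
    """Build an Action API URL that fetches basic info for a known page ID."""
    params = {
        "action": "query",
        "pageids": page_id,  # direct lookup by ID, no search involved
        "prop": "info",
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


# 12345 is an arbitrary example ID, not a page referenced in the session.
print(page_lookup_url(12345))
```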

Any plans for global search?

No.
"Computationally insanely intensive"
Possible on CloudElastic indexes (https://global-search.toolforge.org/) , but difficult to do at scale on-wiki
It's slow, and would slow down everyone else if done on the main cluster
Would be useful for global interface admins searching for deprecated JS
Production shell access users have mwgrep
Different languages have different indexes & other configuration
Different wikis can't be sorted together, ranking of results between wikis not possible
"on other projects" does already exist
- Not exactly global search
- More likely target for further improvements -- highlighting likely relevant results on limited set of other wikis
- https://fr.wikipedia.org/w/index.php?search=%22escopateur%22&title=Sp%C3%A9cial%3ARecherche&profile=advanced&fulltext=1&ns0=1
- 7 searches vs 900 searches
- Complex queries don't always work well cross-wiki
- False positives likely in cross-language search

Something that would be useful that doesn't need global search is better integration of Wikidata search results in your local wiki search results

Difficult to merge results from multiple sites
Several wikis (frwiki) include wikidata results in search
Would be useful if a local page doesn't exist on a topic, to see if WD/other projects have pages + doing the search in your local wiki gives you the red link to easily start the article
Could be added to sister projects search

regular expressions in search? are \d \w \s really that expensive?

want \b ^ $ too
Probably could implement those char classes (\d \w \s )
Haven't spent much time on insource/intitle regex
we use regex search too, would like it to be better, not a priority

Should all the PDFs that are now on Commons be deprioritized? They're almost never useful.

index got cut to 50K, because metadata included entire transcriptions, which caused problems
not intentionally deprioritized
-filemime:pdf
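
As a rough illustration of the `-filemime:pdf` suggestion, the sketch below builds a Commons search request that excludes PDFs. It assumes the Action API's `list=search` module with `srsearch`, `srnamespace` (6 is the File namespace) and `srlimit`; the helper name `commons_search_url` and the search term are made up for the example.

```python
from urllib.parse import urlencode


def commons_search_url(terms: str, exclude_pdf: bool = True, limit: int = 10) -> str:
    """Build a Commons full-text search URL, optionally filtering out PDFs."""
    # Prefixing a keyword with "-" negates it, as in the note above.
    query = f"{terms} -filemime:pdf" if exclude_pdf else terms
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srnamespace": 6,  # File namespace
        "srlimit": limit,
        "format": "json",
    }
    return "https://commons.wikimedia.org/w/api.php?" + urlencode(params)


print(commons_search_url("sunset"))
```

The same query string (`sunset -filemime:pdf`) can be typed directly into the Commons search box.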

Oh, yes, this morning I wanted to look for "$" in Esperanto projects to see if the currency symbol is generally used before or after the amount ($1 or 1$) in this language. How would you recommend doing that?

insource:/[0-9] ?$/
Not really a good way to search across multiple projects (see previous discussion). Search on each wiki separately.
some regular expression docs at https://en.wikipedia.org/wiki/Help:Searching https://www.mediawiki.org/wiki/Help:CirrusSearch#Regular_expression_searches
Wouldn't it be something like "insource:/[0-9]\s\?\$/"?
- $ is not a special character in CirrusSearch regex (right now)
- \s is not supported either
  - You can use [] with arbitrary unicode chars to approximate \s
https://eo.wikipedia.org/w/index.php?search=insource%3A%2F%24%5B++%E2%80%AF%5D%3F%5B0-9%5D%2F&title=Speciala%C4%B5o%3ASer%C4%89i&profile=advanced&fulltext=1&ns0=1&ns9=1&ns12=1
https://eo.wikipedia.org/w/index.php?search=insource%3A%2F%5B0-9%5D%5B+%C2%A0%E2%80%AF%5D%3F%24%2F&title=Speciala%C4%B5o%3ASer%C4%89i&profile=advanced&fulltext=1&ns0=1&ns9=1&ns12=1
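
The two queries above can be sketched as follows. This is only an illustration under the assumptions stated in the notes: `$` is literal in CirrusSearch regex, `\s` is unsupported, and a bracketed set of Unicode space characters stands in for it. The choice of space characters (space, U+00A0, U+202F) and the helper names are assumptions made for the example. The second function does the same count locally on a piece of text with Python's `re`, where `$` must be escaped.

```python
import re
from urllib.parse import quote_plus

# Space-like characters accepted between digit and symbol:
# ordinary space, no-break space (U+00A0), narrow no-break space (U+202F).
SPACES = " \u00a0\u202f"


def insource_query(symbol_first: bool) -> str:
    """Build an insource: regex query; the dollar sign is left unescaped."""
    body = f"$[{SPACES}]?[0-9]" if symbol_first else f"[0-9][{SPACES}]?$"
    return f"insource:/{body}/"


def count_orders(text: str) -> tuple[int, int]:
    """Count symbol-before-amount and amount-before-symbol occurrences."""
    before = len(re.findall(rf"\$[{SPACES}]?[0-9]", text))
    after = len(re.findall(rf"[0-9][{SPACES}]?\$", text))
    return before, after


# URL-encoded form of the "amount first" query, usable as the search= parameter.
print(quote_plus(insource_query(symbol_first=False)))
print(count_orders("kostas $5 aŭ 5 $ aŭ 7$"))
```

Running each query on a single wiki and comparing the hit counts gives a rough answer to the before-or-after question for that wiki.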

when inserting wikilinks in languages other than English, more specifically the inflected languages, terms like "quantumsuffix statisticalsuffix physicssuffix" need to be cleaned up to something like "quantum statistical physicssuffix". can this be improved? is a machine learning model responsible for these suggestions or is it just literal search?

Not ML, no work in that area
only support for suffixes on the final term

WM Search Platform has monthly office hours, feel free to join https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
https://www.youtube.com/watch?v=P_xJaqQV71s -- talk on language-based processing

Ponor subscribed. May 22 2021, 2:55 PM

Frostly subscribed. May 22 2021, 10:48 PM

Thanks for participating in the Wikimedia Hackathon 2021! We hope you had a great time.

If this session / event took place: Please change the task status to "resolved" via the Add Action... → Change Status dropdown.
- If there are specific follow-up tasks from this session / event: Please create dedicated tasks and add another active project tag to those tasks, so others can find those tasks (as likely nobody in the future will look back at Wikimedia-Hackathon-2021 tasks when trying to find something they are interested in).
If this session / event did not take place: Please set the task status to "declined".

Thank you,
your Hackathon venue housekeeping service

Good discussion and excellent notes! Thanks!
