
Implement search indexing for Jade entity pages
Closed, Declined · Public

Description

We already have basic search, but we might want to support queries like "get all recent judgments of 'bad faith'". Figure out the use cases before coding.

I believe we get Elasticsearch integration by implementing ContentHandler::getDataForSearchIndex.

Extract fields:

  • Schema values
  • notes
  • endorsement user
  • endorsement comment
  • endorsement origin
  • endorsement timestamp
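As a rough illustration of the extraction listed above, here is a minimal Python sketch. It is not the MediaWiki PHP that ContentHandler::getDataForSearchIndex would require, and the JSON layout (`judgments`, `schema`, `endorsements`, `created`) and the output field names are assumptions made for the example; the real Jade schema may differ.

```python
import json

# Hypothetical judgment page layout, for illustration only;
# the real Jade schema may differ.
SAMPLE_PAGE = json.dumps({
    "judgments": [{
        "schema": {"damaging": False, "goodfaith": True},
        "notes": "Looks like a #keymash revert",
        "endorsements": [{
            "user": "ExampleUser",
            "comment": "Agreed",
            "origin": "Special:Diff",
            "created": "2018-10-11T17:05:00Z",
        }],
    }]
})

# Maps assumed endorsement keys to assumed index field names.
ENDORSEMENT_FIELDS = (
    ("user", "endorsement_user"),
    ("comment", "endorsement_comment"),
    ("origin", "endorsement_origin"),
    ("created", "endorsement_timestamp"),
)


def extract_index_fields(raw_page):
    """Flatten judgment JSON into a dict of field name -> list of values."""
    fields = {"schema": [], "notes": []}
    for _, field in ENDORSEMENT_FIELDS:
        fields[field] = []
    for judgment in json.loads(raw_page).get("judgments", []):
        # Store schema values as "name:value" tokens, e.g. "damaging:false".
        for name, value in sorted(judgment.get("schema", {}).items()):
            fields["schema"].append(f"{name}:{str(value).lower()}")
        if judgment.get("notes"):
            fields["notes"].append(judgment["notes"])
        for endorsement in judgment.get("endorsements", []):
            for source_key, field in ENDORSEMENT_FIELDS:
                if endorsement.get(source_key):
                    fields[field].append(endorsement[source_key])
    return fields
```

Calling `extract_index_fields(SAMPLE_PAGE)` yields one list per field, which is roughly the shape a search backend would expect for multi-valued fields.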

Event Timeline

Restricted Application added a subscriber: Aklapper.

This will allow indexing the data; using that data for something depends on the use case. The most direct method is to implement a full text search keyword via the CirrusSearchAddQueryFeatures hook.

EBjune triaged this task as Medium priority. Oct 11 2018, 5:05 PM
EBjune moved this task from needs triage to watching / waiting on the Discovery-Search board.
awight added a subscriber: Halfak.

@Halfak Can you confirm that researchers might be interested in the fields I've listed so far? I don't want to burn too many search cluster resources unless we think the indexes will get some use.

/me scratches head.

I'm realizing that I've been on the wrong track. Elasticsearch is only for the UI, so I doubt we want to index anything other than judgment.notes and endorsement.comment, for finding judgments with common themes. Maybe endorsement.user for admin stuff. The other fields I was imagining we would index for researchers, but they'll actually be using a replica, and will need some other type of index. <- @Halfak can you inform us about what researchers normally use for that?

Using the cirrussearch mw-vagrant role, I can confirm that the default search indexing isn't going to work well; we need to customize. Reading the base class hooks, I'm surprised that this is the case. TextContent::getTextForSearchIndex returns getNativeData, which is a JSON string for judgment content.

Two issues to work around, so far:

  • Fulltext search can't find words in the judgment.notes.
  • Judgment summary displayed in search results is JSON content with no line breaks, so almost never useful.
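One way to picture a fix for both issues is to index only the free-text values, one per line, instead of the raw JSON string. The Python sketch below is an illustration of that idea only, not the PHP change itself; the key names `notes` and `comment` and the nesting of the sample are assumptions.

```python
import json


def text_for_search(raw_page, text_keys=("notes", "comment")):
    """Collect free-text values from nested JSON, one per line."""
    lines = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in text_keys and isinstance(value, str):
                    # Keep the human-written text, drop the JSON syntax.
                    lines.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(raw_page))
    return "\n".join(lines)
```

The result is plain text with line breaks, which reads better as a search result summary than a single line of JSON.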

@Harej: Just a heads-up that we can consider end-user use cases here, for example the ORES and JADE extensions could support an advanced search syntax like "ores_damaging:true jade_damaging:false has_endorsements:0" to give judgments about all false positive reports waiting for an endorsement.
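To make the proposed syntax concrete, here is a minimal Python sketch that splits such a query into keyword filters and leftover free-text terms. The keyword names come from the example above; the `KNOWN_KEYWORDS` set and the function itself are hypothetical and do not reflect how CirrusSearch parses queries.

```python
# Keywords taken from the example query; the set itself is hypothetical.
KNOWN_KEYWORDS = {"ores_damaging", "jade_damaging", "has_endorsements"}


def parse_query(query):
    """Split a query into (filters, terms).

    Tokens shaped like "keyword:value" with a known keyword become filters;
    everything else is kept as a free-text search term.
    """
    filters, terms = {}, []
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and value and key in KNOWN_KEYWORDS:
            filters[key] = value
        else:
            terms.append(token)
    return filters, terms
```

Restricting filters to known keywords keeps tokens such as `Special:Diff` as ordinary search terms instead of misreading them as a filter.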

Change 469154 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Streamline search results summary

https://gerrit.wikimedia.org/r/469154

awight moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.

We might want to merge the minor change above, and circle back to the question of additional, custom indexes once the use cases are clear.

Change 469154 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Streamline search results summary

https://gerrit.wikimedia.org/r/469154

Change 469546 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Test for JudgmentContentHandlerTest

https://gerrit.wikimedia.org/r/469546

Change 469546 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Test for JudgmentContentHandlerTest

https://gerrit.wikimedia.org/r/469546

Finally got to respond here. Sorry for being late.

I want to split use-cases by user. For the primary use-cases, I want to target Wikipedians who are using the search UI to look for relevant judgments. I imagine use-cases for searching "damaging:true" with notes containing "#keymash" or something similar. It does seem like filtering based on judgment data is a key use-case.

I think that indexing on user-id is less interesting and should be relegated to a secondary feature. It would be interesting primarily from a sorting/filtering perspective for researchers. I'm assuming we'll still have secondary tables in the relational DB for linking users to their judgments so one can ask basic (non-search/ranking) questions like "What endorsements has this user made?" For the researcher use cases, I imagine that we'll have substantially more batch processing where researchers will create their own indexes for sorting and filtering. I don't expect that they will be using Elasticsearch very much since it'll be hard for them to describe consistently in their Methods discussions.

Does that make sense?

One more note. I do think that it would be interesting to answer the question:

"Which judgments of damaging:false contain "#keymash" in their notes and have an endorsement by someone using Special:Diff?"
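Spelled out as a predicate, that question combines three conditions. The Python sketch below runs it over in-memory records purely for illustration; the record layout (`schema`, `notes`, `endorsements`, `origin`) is an assumption, and a real answer would come from a search index or a database query instead.

```python
def matches_question(judgment, tag="#keymash", origin="Special:Diff"):
    """True if the judgment is damaging:false, mentions the tag in its
    notes, and has at least one endorsement made via the given origin."""
    return (
        judgment.get("schema", {}).get("damaging") is False
        and tag in judgment.get("notes", "")
        and any(
            endorsement.get("origin") == origin
            for endorsement in judgment.get("endorsements", [])
        )
    )


def find_judgments(judgments, **criteria):
    """Return the judgments that satisfy all three conditions."""
    return [j for j in judgments if matches_question(j, **criteria)]
```

Each condition maps to one of the candidate indexes discussed in this task: a schema value, the notes text, and the endorsement origin.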

Thanks for the helpful notes! I'm approaching this incrementally, so following your comments, I'll code the following indexes:

  • schema value (damaging, goodfaith, contentquality)
  • origin

The "notes" field is already covered by an earlier patch which makes all wikitext fields full-text searchable. Or do you think we should also offer specific indexes to distinguish between judgment and endorsement notes?

For now, I think one index sounds totally reasonable.

Change 470061 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] [WIP] Index some data extracted from judgment page content

https://gerrit.wikimedia.org/r/470061

I haven't found any examples of AdvancedSearch integration, so it might not be extensible yet. At least, the CirrusSearch fields are hardcoded into ext.advancedSearch.AdvancedOptionsConfig, probably a shortcut while this feature is in beta.

awight renamed this task from Extract judgment data for search indexing to Extract judgment data for end-user search indexing. Oct 29 2018, 7:42 PM

@Harej I came up with some use cases off-the-cuff, plus some input from @Halfak above, but I'd like to hand off the user-facing questions to you for the moment. It turns out that the technical side doesn't allow for my simplistic approach, yet, so I think we should more carefully define what our priorities are before asking for code changes in repos owned by other teams.

Take everything I've done so far with a grain of salt, obviously!

I don't want us to overthink CirrusSearch integration since I am not sure how much value there is for it. I imagine researchers would want to use more specialized tools, rather than pretending that the search engine is a relational database. At minimum, searching for key words used in judgments and being able to associate judgments with articles seem like the main requirements. Being able to do specialized searches like judgmentauthor:Harej would be cool, but I'm not convinced it's a priority.

@Harej It would still be good to list the use cases. For example, will editors want a workflow like "browse all recent revisions where ORES predicted damaging which do not have a JADE judgment", or "all revisions where ORES predicts damaging, where there's a non-damaging judgment suggesting a false positive, but with no endorsements, I will confirm and endorse"?

It's fine to leave this for later iterations, of course!

I should have made this task more granular. Wikitext content is now searchable, which is probably enough for our initial release. I'm moving this to the backlog, and we can plan the next iteration.

awight moved this task from Inbox to Feature Requests on the Jade board.
awight renamed this task from Extract judgment data for end-user search indexing to Advanced field indexing to support search.. Nov 14 2018, 11:06 PM
awight updated the task description.
Harej lowered the priority of this task from Medium to Low. Nov 14 2018, 11:39 PM
Ladsgroup raised the priority of this task from Low to Needs Triage. Nov 28 2018, 6:36 AM
Ladsgroup moved this task from Unsorted to New development on the Machine-Learning-Team board.
Harej triaged this task as Medium priority. Mar 26 2019, 9:15 PM

In case this is of interest here: The extension AdvancedSearch now allows gadgets or other extensions to add keywords to the advancedParameters panel, see T217446.

Change 470061 abandoned by Ladsgroup:
[WIP] Index some data extracted from judgment page content

Reason:
This has a merge conflict and, according to the notes made by the search platform team, requires a lot of work to bring it to a useful state. Also, this extension is being archived in favor of its clone "Jade". Feel free to cherry-pick it there if you want to work on it.

https://gerrit.wikimedia.org/r/470061

Halfak renamed this task from Advanced field indexing to support search. to Implement search indexing for Jade entity pages. Aug 6 2019, 8:38 PM
ACraze subscribed.

Declining as Jade has changed quite a bit since this ticket was written and there is not a ton of value in implementing this for now.