Maniphest T206352

Implement search indexing for Jade entity pages
Closed, Declined · Public

Description

We already have basic search, but we might want to support queries like "get all recent judgments of 'bad faith'". Figure out the use cases before coding.

I believe we get Elasticsearch integration by implementing ContentHandler::getDataForSearchIndex.

Extract fields:

Schema values
notes
endorsement user
endorsement comment
endorsement origin
endorsement timestamp

Details

Subject	Repo	Branch	Lines +/-
[WIP] Index some data extracted from judgment page content	mediawiki/extensions/JADE	master	+243 -3
Test for JudgmentContentHandlerTest	mediawiki/extensions/JADE	master	+57 -0
Streamline search results summary	mediawiki/extensions/JADE	master	+32 -0

Related Objects

Status	Assigned	Task
Declined	None	T212435 Review real-world query plans and performance for Jade
Declined	None	T238877 Write Huggle labels to Jade
Declined	calbon	T183381 Deploy pilot of Jade to a small set of wikis.
Resolved	• ACraze	T229973 Implement Jade Entity pages
Resolved	• ACraze	T229974 Implement secondary Jade Integrations
Declined	None	T206352 Implement search indexing for Jade entity pages

Event Timeline

awight created this task.Oct 5 2018, 6:31 PM

Restricted Application added a project: Discovery-Search.Oct 5 2018, 6:31 PM

Restricted Application added a subscriber: Aklapper.

This will allow indexing the data; how that data then gets used depends on the use case. The most direct method is to implement a full-text search keyword via the CirrusSearchAddQueryFeatures hook.

• EBjune triaged this task as Medium priority.Oct 11 2018, 5:05 PM

• EBjune moved this task from needs triage to watching / waiting on the Discovery-Search board.

Harej moved this task from Inbox to Feature Requests on the Jade board.Oct 12 2018, 11:08 PM

awight edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.Oct 18 2018, 10:19 PM

awight claimed this task.Oct 19 2018, 11:48 PM

@Halfak Can you confirm that researchers might be interested in the fields I've listed so far? I don't want to burn too many search cluster resources unless we think the indexes will get some use.

/me scratches head.

I'm realizing that I've been on the wrong track. Elasticsearch is only for the UI, so I doubt we want to index anything other than judgment.notes and endorsement.comment, for finding judgments with common themes. Maybe endorsement.user for admin stuff. The other fields I was imagining we would index for researchers, but they'll actually be using a replica, and will need some other type of index. <- @Halfak can you inform us about what researchers normally use for that?

Using the cirrussearch mw-vagrant role, I can confirm that the default search indexing isn't going to work well; we need to customize it. Reading the base class hooks, I'm surprised that this is the case. TextContent::getTextForSearchIndex returns getNativeData, which is a JSON string for judgment content.

Two issues to work around so far:

Fulltext search can't find words in the judgment.notes.
Judgment summary displayed in search results is JSON content with no line breaks, so almost never useful.
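
Both issues come from indexing the raw JSON string. Below is a minimal Python sketch of the general idea: pull only the human-written fields out of the JSON so they can be indexed as plain text and shown as a readable summary. The JSON layout (`judgments`, `notes`, `endorsements`, `comment`) and the sample values are hypothetical stand-ins, not the actual Jade schema, and a real fix would live in the extension's PHP content handler rather than in Python.

```python
import json

# Hypothetical judgment page content; the real Jade schema may differ.
RAW = json.dumps({
    "judgments": [{
        "schema": {"damaging": False},
        "notes": "Looks like bluffing, not vandalism.",
        "endorsements": [
            {"user": "ExampleUser", "comment": "Agreed.", "origin": "Special:Diff"}
        ],
    }]
})


def text_for_search_index(raw_json):
    """Return only the free-text fields, one per line, without JSON syntax."""
    content = json.loads(raw_json)
    lines = []
    for judgment in content.get("judgments", []):
        if judgment.get("notes"):
            lines.append(judgment["notes"])
        for endorsement in judgment.get("endorsements", []):
            if endorsement.get("comment"):
                lines.append(endorsement["comment"])
    return "\n".join(lines)


print(text_for_search_index(RAW))
```

Keeping one field per line also gives the search results page something with line breaks to display, instead of a single run of JSON.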

@Harej: Just a heads-up that we can consider end-user use cases here, for example the ORES and JADE extensions could support an advanced search syntax like "ores_damaging:true jade_damaging:false has_endorsements:0" to give judgments about all false positive reports waiting for an endorsement.
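
To make the proposed syntax concrete, here is a rough Python sketch of splitting such a query into keyword filters and plain search terms. The keyword names are taken from the example above and are only proposals; nothing here reflects how CirrusSearch actually parses keywords.

```python
from typing import Dict, List, Tuple

# Proposed keywords from the example query; none of these exist yet.
KNOWN_KEYWORDS = {"ores_damaging", "jade_damaging", "has_endorsements"}


def parse_query(query: str) -> Tuple[Dict[str, str], List[str]]:
    """Split a query into known keyword:value filters and plain terms."""
    filters: Dict[str, str] = {}
    terms: List[str] = []
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and value and key in KNOWN_KEYWORDS:
            filters[key] = value
        else:
            # Unknown prefixes (e.g. "Special:Diff") stay ordinary terms.
            terms.append(token)
    return filters, terms


filters, terms = parse_query(
    "ores_damaging:true jade_damaging:false has_endorsements:0 bluffing"
)
print(filters)
print(terms)
```

Restricting the split to a known keyword list keeps page titles that contain a colon from being mistaken for filters.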

Change 469154 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Streamline search results summary

https://gerrit.wikimedia.org/r/469154

gerritbot added a project: Patch-For-Review.Oct 23 2018, 1:07 AM

awight mentioned this in rEJADd88a20351c3f: Streamline search results summary.Oct 23 2018, 1:07 AM

We might want to merge the minor change above, and circle back to the question of additional, custom indexes once the use cases are clear.

awight mentioned this in rEJAD5fa56e0e894a: Streamline search results summary.Oct 24 2018, 1:11 AM

awight mentioned this in rEJADf1ad61aabcd9: Streamline search results summary.Oct 24 2018, 1:22 AM

awight mentioned this in rEJADc4b6336b1e7a: Streamline search results summary.Oct 24 2018, 6:23 PM

awight mentioned this in rEJADc38d133cf442: Streamline search results summary.Oct 24 2018, 6:36 PM

Change 469154 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Streamline search results summary

https://gerrit.wikimedia.org/r/469154

ReleaseTaggerBot added a project: MW-1.33-notes (1.33.0-wmf.2; 2018-10-30).Oct 24 2018, 7:00 PM

Change 469546 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Test for JudgmentContentHandlerTest

https://gerrit.wikimedia.org/r/469546

awight mentioned this in rEJADe01c65fbe2d8: Test for JudgmentContentHandlerTest.Oct 25 2018, 1:01 AM

Change 469546 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Test for JudgmentContentHandlerTest

https://gerrit.wikimedia.org/r/469546

Finally got to respond here. Sorry for being late.

I want to split use-cases by user. For the primary use-cases, I want to target Wikipedians who are using the search UI to look for relevant judgments. I imagine use-cases for searching "damaging:true" with notes containing "#keymash" or something similar. It does seem like filtering based on judgment data is a key use-case.

I think that indexing on user-id is less interesting and should be relegated to a secondary feature. It would be primarily interesting from a sorting/filtering perspective for researchers. I'm assuming we'll still have secondary tables in the relational DB for linking users to their judgments so one can ask basic (non-search/ranking) questions like "What endorsements has this user made?" For the researcher use cases, I imagine that we'll have substantially more batch processing where researchers will create their own indexes for sorting and filtering. I don't expect that they will be using Elasticsearch very much since it'll be hard for them to describe consistently in their Methods discussions.

Does that make sense?

One more note. I do think that it would be interesting to answer the question:

"Which judgments of damaging:false contain "#keymash" in their notes and have an endorsement by someone using Special:Diff?"
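
As an illustration of what that question combines (a schema value, a term in the notes, and an endorsement origin), here is a small Python sketch that filters in-memory records. The record layout and the sample data are invented for the example; a real implementation would express this as a search query, not a Python loop.

```python
# Invented sample records; field names are assumptions, not the Jade schema.
JUDGMENTS = [
    {"id": 1, "schema": {"damaging": False}, "notes": "Just a #keymash test edit",
     "endorsements": [{"user": "A", "origin": "Special:Diff"}]},
    {"id": 2, "schema": {"damaging": True}, "notes": "#keymash",
     "endorsements": [{"user": "B", "origin": "Special:Diff"}]},
    {"id": 3, "schema": {"damaging": False}, "notes": "#keymash",
     "endorsements": [{"user": "C", "origin": "Huggle"}]},
]


def matches(judgment, damaging, tag, origin):
    """True if the schema value, notes tag, and endorsement origin all match."""
    return (
        judgment.get("schema", {}).get("damaging") is damaging
        and tag in judgment.get("notes", "")
        and any(e.get("origin") == origin
                for e in judgment.get("endorsements", []))
    )


hits = [j["id"] for j in JUDGMENTS
        if matches(j, damaging=False, tag="#keymash", origin="Special:Diff")]
print(hits)  # [1]
```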

Thanks for the helpful notes! I'm approaching this incrementally, so following your comments, I'll code the following indexes:

schema value (damaging, goodfaith, contentquality)
origin
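
A sketch of what extracting those two indexes could look like, written in Python for brevity. The input layout is an assumption, and the real extraction would presumably happen in PHP via getDataForSearchIndex, as mentioned in the task description.

```python
SCHEMA_NAMES = ("damaging", "goodfaith", "contentquality")


def index_fields(judgment):
    """Build a flat document holding the schema values and endorsement origins."""
    schema = judgment.get("schema", {})
    fields = {name: schema[name] for name in SCHEMA_NAMES if name in schema}
    # Collect each distinct origin once, in a stable order.
    fields["origin"] = sorted(
        {e["origin"] for e in judgment.get("endorsements", []) if "origin" in e}
    )
    return fields


example = {
    "schema": {"damaging": False, "goodfaith": True},
    "endorsements": [
        {"origin": "Special:Diff"},
        {"origin": "Huggle"},
        {"origin": "Special:Diff"},
    ],
}
print(index_fields(example))
```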

The "notes" field is already covered by an earlier patch which makes all wikitext fields full-text searchable. Or do you think we should also offer specific indexes to distinguish between judgment and endorsement notes?

For now, I think one index sounds totally reasonable.

Harej moved this task from Feature Requests to In Progress on the Jade board.Oct 25 2018, 8:44 PM

Change 470061 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] [WIP] Index some data extracted from judgment page content

https://gerrit.wikimedia.org/r/470061

awight mentioned this in rEJAD5f1ccab797cf: [WIP] Index some data extracted from judgment page content.Oct 26 2018, 8:37 PM

awight mentioned this in rEJAD96858688c135: [WIP] Index some data extracted from judgment page content.Oct 26 2018, 9:59 PM

I haven't found any examples of AdvancedSearch integration, so it might not be extensible yet. At least, the CirrusSearch fields are hardcoded into ext.advancedSearch.AdvancedOptionsConfig, probably as a shortcut while this feature is in beta.

Restricted Application added a project: TCB-Team (now WMDE-TechWish).Oct 26 2018, 11:11 PM

@Harej I came up with some use cases off-the-cuff, plus some input from @Halfak above, but I'd like to hand off the user-facing questions to you for the moment. It turns out that the technical side doesn't allow for my simplistic approach, yet, so I think we should more carefully define what our priorities are before asking for code changes in repos owned by other teams.

Take everything I've done so far with a grain of salt, obviously!

I don't want us to overthink CirrusSearch integration since I am not sure how much value there is for it. I imagine researchers would want to use more specialized tools, rather than pretending that the search engine is a relational database. At minimum, searching for key words used in judgments and being able to associate judgments with articles seem like the main requirements. Being able to do specialized searches like judgmentauthor:Harej would be cool, but I'm not convinced it's a priority.

@Harej It would be good to list the use cases still, for example, will editors want a workflow like, "browse all recent revisions where ORES predicted damaging which do not have a JADE judgment" or "all revisions where ORES predicts damaging, where there's a non-damaging judgment suggesting a false positive, but with no endorsements, I will confirm and endorse."

It's fine to leave this for later iterations, of course!

I should have made this task more granular. Wikitext content is now searchable, which is probably enough for our initial release. I'm moving this task to the backlog, and we can plan the next iteration.

awight removed a project: Patch-For-Review.Nov 1 2018, 5:21 PM

Either the search index is taking a while to rebuild, or the wikitext rendering change isn't working on Beta: https://en.wikipedia.beta.wmflabs.org/w/index.php?search=bluffing&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B810%5D%7D&ns810=1

awight moved this task from In Progress to Radar on the Jade board.Nov 14 2018, 10:56 PM

awight moved this task from Radar to Inbox on the Jade board.Nov 14 2018, 11:03 PM

awight moved this task from Inbox to Feature Requests on the Jade board.

awight renamed this task from Extract judgment data for end-user search indexing to Advanced field indexing to support search..Nov 14 2018, 11:06 PM

awight updated the task description.

Harej lowered the priority of this task from Medium to Low.Nov 14 2018, 11:39 PM

Ladsgroup raised the priority of this task from Low to Needs Triage.Nov 28 2018, 6:36 AM

Ladsgroup moved this task from Unsorted to New development on the Machine-Learning-Team board.

TJones moved this task from watching / waiting to making others happy on the Discovery-Search board.Jan 29 2019, 7:27 PM

awight unsubscribed.Mar 21 2019, 4:04 PM

Harej triaged this task as Medium priority.Mar 26 2019, 9:15 PM

In case this is of interest here: The extension AdvancedSearch now allows gadgets or other extensions to add keywords to the advancedParameters panel, see T217446.

Change 470061 abandoned by Ladsgroup:
[WIP] Index some data extracted from judgment page content

Reason:
This has a merge conflict and requires a lot of work to bring it to a useful state, according to the notes made by the search platform team. Also, this extension is being archived in favor of its clone "Jade". Feel free to cherry-pick it there if you want to work on it.

https://gerrit.wikimedia.org/r/470061

Harej unsubscribed.Jul 4 2019, 9:27 AM

Halfak renamed this task from Advanced field indexing to support search. to Implement search indexing for Jade entity pages.Aug 6 2019, 8:38 PM

Halfak added a parent task: T229973: Implement Jade Entity pages.Aug 6 2019, 8:41 PM

Halfak merged a task: T212388: Jade Implementation: Search integration.Aug 8 2019, 7:56 PM

Halfak added a subscriber: Harej.

Halfak added a parent task: T229974: Implement secondary Jade Integrations.Mar 16 2020, 4:55 PM

Halfak moved this task from New development to Ready to go on the Machine-Learning-Team board.Jun 3 2020, 1:50 PM

CBogen moved this task from making others happy to watching / waiting on the Discovery-Search board.Aug 27 2020, 8:52 PM