Wikimedia Phabricator search index missing Maniphest tasks
Closed, Resolved · Public

Description

The Wikimedia Phabricator search index is incomplete. It doesn't contain every public Maniphest task. This is a horrible bug that wastes my time regularly. Someone needs to fix the index to be complete. Count all the entries currently in the index, compare that to the number of public Maniphest tasks on the site, and you'll see a large discrepancy. Please fix that discrepancy so that every public Maniphest task is in the search index.

Event Timeline

Restricted Application added subscribers: TerraCodes, Aklapper. · Dec 18 2016, 8:15 PM
Paladox edited projects, added Phabricator; removed Phabricator (2017-01-25).

This command will rebuild the index (for tasks only) and fix the issue:

phabricator/ $ ./bin/search index --type task --background --force

I'd guess this will take 1-2 hours to finish (the longer indexing time, for --all, is when indexing all objects, including commits; commits likely account for the overwhelming majority of the indexing cost).

There is no downside to doing this: the index will be updated in-place object-by-object, so result quality will only improve -- the rebuild process does not start by destroying the existing index or anything like that.

(That rebuild can also safely be run while Phabricator is online, and against any version of Phabricator.)
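
To illustrate why an in-place, object-by-object rebuild cannot make results worse, here is a minimal sketch. It is a toy model, not Phabricator's indexer: the dictionary-backed `index`, the `rebuild_in_place` helper, and the sample task IDs are all hypothetical and exist only for this illustration.

```python
def rebuild_in_place(index, documents):
    """Upsert every document into the index without clearing it first.

    Hypothetical toy model: `index` and `documents` both map an ID to text.
    """
    for doc_id, text in documents.items():
        index[doc_id] = text  # overwrite a stale entry or add a missing one
        yield doc_id          # let the caller inspect the index between steps


# T1 is indexed but stale; T2 is missing from the index entirely.
index = {"T1": "stale title"}
documents = {"T1": "fresh title", "T2": "another title"}

sizes = []
for _ in rebuild_in_place(index, documents):
    sizes.append(len(index))  # the index stays queryable during the rebuild

print(sizes)
print(index == documents)
```

Because each step only overwrites or adds an entry, the number of searchable documents never drops while the rebuild is running, which is the property described above.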

@MZMcBride Do you have a list of example tasks that aren't in the index? Maybe we can see something in their history that explains why they weren't indexed.

scfc awarded a token. · Dec 18 2016, 11:32 PM
scfc added a subscriber: scfc.

An example task is T49137. I've specifically not mentioned or linked to this task per T151500#2885438.

Screenshots:

This open task, T49137, contains both of these phrases exactly. One phrase occurs in the task title and another phrase occurs in a comment on the task.

I think the search index is missing any task that hasn't been updated/touched in the past several months. Another easy example is T123028.

It's trivial to see that the phrase shown in the attached screenshot is in the task description of T123028. I'm just copying and pasting strings from older tasks and inputting them into the search form.

Expected behavior: I see Maniphest tasks that include the input phrases in the search results.

Actual behavior: I get a "no results found for this query" error.

This seems to be the behavior. What doesn't make sense, however, is that we have done a full re-index more than once. Perhaps the re-index doesn't actually succeed despite appearing to.

@epriestley: I've run that command more than once, but I'm going to try it once more. The other times I did not use --background.

@mmodell Ah! I didn't realize that, sorry.

If that command fails to resolve the "missing from index" issue this time, I think there's likely a significant bug in Phabricator somewhere which we probably need to fix upstream. The indexing code is largely shared between the MySQL and ElasticSearch indexes so I would anticipate that the issue won't be resolved by switching backends.

That said, I'm at a loss to guess what the issue might be -- the indexer itself clearly works, because mentioning tasks (which triggers a reindex) fixes the index. Let's see if the issue persists after this reindex, and I'll try to come up with some diagnostic steps or support tools if it isn't fixed.

@epriestley: Thanks! I've been meaning to build a diagnostic tool that can be used to examine the index, just haven't gotten that far yet.

reindexing is complete, now to see if it fixed the problem...

After reindex, there are 153054 body and title fields indexed:


select field, count(*) from search_documentfield where phidType='TASK' group by field;
+-------+----------+
| field | count(*) |
+-------+----------+
| !4rR  |    80113 |
| body  |   153054 |
| cmnt  |   742481 |
| titl  |   153054 |
+-------+----------+

(I have no idea what !4rR is...)

ah, that's a custom field... unrelated.
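
The per-field counts above show how many documents are indexed, but not which tasks are missing. A minimal sketch of an anti-join that would list them follows. It runs against an in-memory SQLite stand-in so it is self-contained. The `search_documentfield` table and its `phidType` and `field` columns come from the query above; the `maniphest_task` table, the `id` and `phid` columns, and the sample PHIDs are assumptions made for the illustration, and on a real install the two tables would live in separate MySQL databases.

```python
import sqlite3


def find_unindexed_tasks(conn):
    """Return IDs of tasks that have no 'titl' row in the search index."""
    rows = conn.execute(
        """
        SELECT t.id
        FROM maniphest_task AS t
        LEFT JOIN search_documentfield AS f
               ON f.phid = t.phid AND f.field = 'titl'
        WHERE f.phid IS NULL          -- no matching index row: task is missing
        ORDER BY t.id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Toy data: task 1 is fully indexed, task 2 has no index rows at all,
# and task 3 has a body row but no title row.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE maniphest_task (id INTEGER PRIMARY KEY, phid TEXT);
    CREATE TABLE search_documentfield (
        phid TEXT, phidType TEXT, field TEXT, corpus TEXT
    );
    INSERT INTO maniphest_task VALUES
        (1, 'PHID-TASK-aaa'), (2, 'PHID-TASK-bbb'), (3, 'PHID-TASK-ccc');
    INSERT INTO search_documentfield VALUES
        ('PHID-TASK-aaa', 'TASK', 'titl', 'first title'),
        ('PHID-TASK-aaa', 'TASK', 'body', 'first body'),
        ('PHID-TASK-ccc', 'TASK', 'body', 'third body');
    """
)

missing = find_unindexed_tasks(conn)
print(missing)  # [2, 3]
```

A list like this would also answer the earlier request for example tasks that are not in the index, without anyone having to guess search phrases by hand.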

F5097568 and F5097602 are better now. Thank you for that!

This search (F5097569) is still busted. It's very puzzling, as this exact phrase appears in both the task title and the task description of T49137. My current guess is that words such as "a" and "of" are getting stripped regardless of the presence of quotation marks. Searches such as +generate +list +pages, or even generate list pages, include T49137 in the results. Perhaps this (stripping out words in quotation marks) should be a separate task, though.

Perhaps just the word "a" is weird.

Bad results:

Good results:

These phrases are both copied directly from T108985.

In general, this Phabricator installation's search seems significantly better now. I think this task can probably be closed, unless we want to slightly expand its scope to include "re-run this index script every week" or something.

I think that may be an issue with the slightly less-sophisticated query parser that got deployed here to quickly fix the InnoDB "AND/OR" issue -- I can't immediately reproduce the issue locally on master, with the fancier parser that we eventually built in the upstream (I think this query is comparable):

@MZMcBride Hi, the word "a" will not be found, nor will any word shorter than 3 letters, due to a MySQL config. If you use Elasticsearch you will be able to find them.
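
To make the effect of a minimum word length concrete, here is a minimal sketch of how short words drop out of a query before it reaches the index. It is a simplified model, not MySQL's actual tokenizer: the `searchable_terms` helper is hypothetical, the threshold of 3 is taken from the comment above, and the sample phrase is built from the search terms quoted in this thread rather than copied from any task. On MySQL the relevant settings are most likely `innodb_ft_min_token_size` (InnoDB) or `ft_min_word_len` (MyISAM), plus the stopword list; which one applies here is an assumption.

```python
import re


def searchable_terms(query, min_token_size=3, stopwords=frozenset()):
    """Return the words of `query` that a length-limited index would keep."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return [w for w in words if len(w) >= min_token_size and w not in stopwords]


# Illustrative phrase only; not the actual title of any task.
phrase = "generate a list of pages"

kept = searchable_terms(phrase)
dropped = [w for w in phrase.split() if w not in kept]

print(kept)     # ['generate', 'list', 'pages']
print(dropped)  # ['a', 'of']
```

If the index never stored "a" or "of", an exact-phrase search that requires them cannot match, while a search for the three longer words still can. That would be consistent with the behavior reported earlier in this task.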

I have now deployed the latest upstream stable release and am in the process of re-indexing everything (this process takes several DAYS). In the meantime, you can toggle a preference under the "Developer" heading in your Phabricator user settings to switch on the Elasticsearch backend. Doing this may return better results, especially while the MySQL index is rebuilding. The drawback is that the Elasticsearch index is not kept updated in real time, so results for newly created tasks might be missing.

As we have now switched to the Elasticsearch backend by default, this and other related tasks can probably be resolved. However, I'd love to hear some feedback about how Elasticsearch is performing. @MZMcBride, care to comment? From what I have observed, search is much better now, but I can still optimize the queries further, so any feedback is welcome.

mmodell closed this task as Resolved. · Feb 6 2017, 1:08 PM
mmodell claimed this task.

Please reopen if you are still seeing issues.