Byte counts under-reported in search results
Closed, Resolved (Public)

Description

It seems that CirrusSearch (as of the version deployed on English Wikipedia) has a problem with byte counts. I assume they should reflect the wikitext size shown on the history page, but CirrusSearch reports sizes that are consistently smaller than that.

For https://en.wikipedia.org/wiki/Western_Star (833 bytes),

https://en.wikipedia.org/w/index.php?search=western+star&title=Special%3ASearch&fulltext=1&srbackend=CirrusSearch

> 741 B (116 words)

https://en.wikipedia.org/w/index.php?search=western+star&title=Special%3ASearch&fulltext=1&srbackend=LuceneSearch

> 833 B (115 words)


Version: master
Severity: normal

Details

Reference
bz70919

Event Timeline

bzimport raised the priority of this task from to Normal. Nov 22 2014, 3:48 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz70919.
bzimport added a subscriber: Unknown Object (MLST).
whym created this task. Sep 17 2014, 12:19 AM
demon added a comment. Sep 19 2014, 5:52 PM

If I had to guess I'd say we're reporting the pre-expansion size. I'll have a look at this next week.

Cirrus's byte count is just PHP's strlen function applied to the text field, which is probably wrong now that we're stripping out 'aux_text'. Should it be the length of the wikitext or of the rendered text? Yusuke Matsubara, what do you use the length for? I'm curious because that'll inform what it should be. When we started Cirrus we didn't think anyone really used the field, so we just took a guess at how to implement it and never wrote any regression tests for it.

You can see what Cirrus stores for the page here:
https://en.wikipedia.org/wiki/Western_Star?action=cirrusdump
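
To illustrate the suspected cause, here is a minimal Python sketch (not CirrusSearch code). It assumes the size is a UTF-8 byte count, as PHP's strlen would report for a UTF-8 string, and that it is measured after some auxiliary text has been removed. The names `byte_length` and `strip_aux`, and the sample text, are made up for the example; the real extraction of 'aux_text' is more involved.

```python
from typing import Iterable


def byte_length(text: str) -> int:
    """Count UTF-8 bytes, which is what PHP's strlen reports for a UTF-8 string."""
    return len(text.encode("utf-8"))


def strip_aux(text: str, aux_fragments: Iterable[str]) -> str:
    """Hypothetical stand-in for removing auxiliary text from the main text field."""
    for fragment in aux_fragments:
        text = text.replace(fragment, "")
    return text


# Made-up sample content, not the actual Western_Star page.
sample_text = (
    "Example page body.\n"
    "Auxiliary caption text.\n"
    "More body text with an accent: café.\n"
)
aux_fragments = ["Auxiliary caption text.\n"]

full_size = byte_length(sample_text)
stripped_size = byte_length(strip_aux(sample_text, aux_fragments))

# The stripped size is smaller by exactly the bytes that were removed.
print(full_size, stripped_size, full_size - stripped_size)
```

If the reported size is taken from the stripped field, every page with auxiliary text would come out smaller than the size shown in its history, which matches the pattern described in the report.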

demon added a comment. Sep 19 2014, 6:24 PM

I think it should be the size of the full post-expanded text. We should be able to fetch that from the Revision or Page objects we have on hand during indexing.

You want it with the html?

demon added a comment. Sep 19 2014, 6:43 PM

(In reply to Nik Everett from comment #4)

> You want it with the html?

Actually probably not. Revision stores it as the strlen() of the wikitext. We just need to get that length before stripping aux, like you said.
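
The ordering described here can be sketched roughly as follows, in Python rather than the extension's PHP. This is an illustration under assumptions, not the merged patch: `IndexDoc`, `build_index_doc`, and the field names are hypothetical, and the stripping step is a simple placeholder.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IndexDoc:
    """Hypothetical shape of a search document; field names are illustrative."""
    text: str
    text_bytes: int


def build_index_doc(wikitext: str, aux_fragments: Iterable[str]) -> IndexDoc:
    # Measure the size first, on the untouched wikitext (UTF-8 bytes,
    # matching what strlen would give for the stored revision text).
    size = len(wikitext.encode("utf-8"))

    # Only then strip auxiliary text for the searchable field.
    # Placeholder logic: the real stripping is more involved.
    text = wikitext
    for fragment in aux_fragments:
        text = text.replace(fragment, "")

    return IndexDoc(text=text, text_bytes=size)


# Made-up sample content for the sketch.
sample = "Body text.\nAuxiliary text.\nMore body text.\n"
doc = build_index_doc(sample, ["Auxiliary text.\n"])
print(doc.text_bytes, len(doc.text.encode("utf-8")))
```

Because the size no longer depends on what is stripped afterwards, a check like `doc.text_bytes == len(sample.encode("utf-8"))` could serve as the kind of regression test that was missing.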

Change 162627 had a related patch set uploaded by Chad:
Use proper page sizes

https://gerrit.wikimedia.org/r/162627

Change 162627 merged by jenkins-bot:
Use proper page sizes

https://gerrit.wikimedia.org/r/162627

demon added a comment. Sep 24 2014, 5:19 PM

Should be fixed; sizes will slowly correct as the patch goes out and pages are reindexed.