
Byte counts under-reported in search results
Closed, Resolved (Public)


It seems that CirrusSearch (as of the version deployed on English Wikipedia) has a problem with byte counts. I assume they should reflect the wikitext size shown on the history page, but CirrusSearch reports sizes consistently smaller than that.

For (833 bytes),

> 741 B (116 words)

> 833 B (115 words)

Version: master
Severity: normal



Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:48 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz70919.
bzimport added a subscriber: Unknown Object (MLST).

If I had to guess I'd say we're reporting the pre-expansion size. I'll have a look at this next week.

Cirrus's byte count is just PHP's strlen function applied to the text field, which is probably wrong now that we're stripping out 'aux_text'. Should it be the length of the wikitext or of the rendered text? Yusuke Matsubara, what do you use the length for? I'm curious because that will inform what it should be. When we started Cirrus we didn't think anyone really used the field, so we just took a guess at how to implement it and never wrote any regression tests for it.

You can see what Cirrus stores for the page here:

I think it should be the size of the full post-expanded text. We should be able to fetch that from the Revision or Page objects we have on hand during indexing.

(In reply to Nik Everett from comment #4)

You want it with the html?

Actually probably not. Revision stores it as the strlen() of the wikitext. We just need to get that length before stripping aux, like you said.
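
To illustrate the ordering issue discussed above, here is a minimal Python sketch, not the CirrusSearch code. It assumes the size should be the UTF-8 byte length of the wikitext (what PHP's strlen() returns for a UTF-8 string) and that it must be taken before any auxiliary text is removed. The helper names `byte_length`, `strip_aux`, and `build_document` are hypothetical, and the template-stripping regex is only a stand-in for whatever the real aux_text extraction does.

```python
import re


def byte_length(text: str) -> int:
    """UTF-8 byte count, analogous to PHP's strlen() on a UTF-8 string."""
    return len(text.encode("utf-8"))


def strip_aux(wikitext: str) -> str:
    """Hypothetical stand-in for aux-text stripping.

    Removes simple, non-nested {{...}} template calls. The real
    indexing pipeline is assumed to do something more involved.
    """
    return re.sub(r"\{\{[^{}]*\}\}", "", wikitext)


def build_document(wikitext: str) -> dict:
    """Build a hypothetical index document for one page."""
    # Measure first: the size reflects the full wikitext,
    # matching what the history page would report.
    size = byte_length(wikitext)
    # Strip afterwards: the searchable text may be shorter,
    # but the stored size is no longer affected.
    text = strip_aux(wikitext)
    return {"text": text, "text_bytes": size}
```

Measuring after `strip_aux` instead would under-report the size by however many bytes the stripped portion occupied, which is consistent with the smaller numbers described in the report.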

Change 162627 had a related patch set uploaded by Chad:
Use proper page sizes

Change 162627 merged by jenkins-bot:
Use proper page sizes

Should be fixed; sizes will slowly correct as the patch goes out and pages are reindexed.