proofread-page content model content is not indexed by CirrusSearch
Closed, Resolved · Public

Description

CirrusSearch does not seem to index "proofread-page" content model anymore. This content model is defined by the ProofreadPage extension and is used by the Page: namespace of Wikisource.

Problem first reported here (in French): https://fr.wikisource.org/wiki/Wikisource:Scriptorium/Mars_2019#La_recherche_sur_l'espace_%22page%22_n'aboutit_plus

Event Timeline

Could you provide links to pages that should be indexed but are not? That will help with tracking this down.

Very simple: all of them.

This search of the pages of [[Indice:Deledda - Amori moderni - Colomba, Roma, Voghera, 1907.djvu]] doesn't produce any results:
https://it.wikisource.org/w/index.php?search=amore&prefix=Pagina%3ADeledda+-+Amori+moderni+-+Colomba%2C+Roma%2C+Voghera%2C+1907.djvu&title=Speciale%3ARicerca&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns108=1

Manually editing the URL produces some results, but none of them are pages from the right Index page:
https://it.wikisource.org/w/index.php?title=Speciale:Ricerca&limit=500&offset=0&ns108=1&prefix=Pagina%3ADeledda&search=amore&advancedSearch-current={}

A browser search for the word "amore" in the ns0 text coming from that Index page returns 10 results.

API request to build a new search document and report the currently indexed document, using the example page from Tpt:

https://en.wikisource.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirrusbuilddoc%7Ccirrusdoc&titles=Page%3ADevonshire_Characters_and_Strange_Events.djvu%2F156
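For anyone who wants to run the same check outside the ApiSandbox, here is a minimal sketch that builds the equivalent request URL. The parameters are the ones from the sandbox link above; the `api.php` endpoint path and the helper name `cirrus_doc_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def cirrus_doc_url(api_base: str, title: str) -> str:
    """Build a query URL asking for both the freshly built search
    document (cirrusbuilddoc) and the currently indexed one (cirrusdoc)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "cirrusbuilddoc|cirrusdoc",
        "titles": title,
    }
    return f"{api_base}?{urlencode(params)}"


# Assumed endpoint: the usual /w/api.php path on the wiki.
url = cirrus_doc_url(
    "https://en.wikisource.org/w/api.php",
    "Page:Devonshire_Characters_and_Strange_Events.djvu/156",
)
print(url)
```

Comparing the two documents in the response shows whether the "text" field is missing only from the indexed copy or also from a freshly built one.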

EDIT: Removed snarky and unnecessary comment above.

Thank you for your help!

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/499801 was deployed a few hours ago.

A call to cirrusbuilddoc now returns the "text" field again. Do you know if this is enough to get Special:Search working again, or are some extra changes needed, maybe on the ProofreadPage side? I am not at all familiar with CirrusSearch.

Page: pages use their own content model, defined here: https://phabricator.wikimedia.org/diffusion/EPRP/browse/master/includes/Page/PageContent.php

Essentially if cirrusbuilddoc returns appropriate content that means that a null edit to a broken page should fix it. CirrusSearch has a background process that will automagically do this, but only about 1/8 of pages are done per week so there is significant lag time before all the pages would be fixed.

If there are a few tens of thousands of pages, we can generate a list of page IDs to be reindexed; otherwise I need to ponder how this might be fixable in less than 8 weeks.
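The 8-week figure follows directly from the 1/8-per-week rate of the background process. A tiny sketch of that arithmetic, assuming the process covers a fixed fraction of pages each week without repeats (the function name is hypothetical):

```python
import math
from fractions import Fraction


def weeks_for_full_pass(fraction_per_week: Fraction) -> int:
    """Weeks needed for a background pass to touch every page once,
    assuming a constant fraction of pages is processed per week."""
    return math.ceil(1 / fraction_per_week)


weeks = weeks_for_full_pass(Fraction(1, 8))
print(weeks)
```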

I'm not sure I believe this, but the following query gives an estimate of 1.8M pages affected on enwikisource, and 7.8M pages across all wikis. Does that seem reasonable?

{
    "size": 0,
    "_source": [],
    "query": {
        "bool": {
            "must": [
                {"term": {"text.word_count": 0}},
                {"match": {"content_model": "proofread-page"}}
            ]
        }
    }
}
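As a rough sketch of how the count can be read back from such a query: with "size": 0 the response carries no documents, only the hit total. The helper names and the sample response dicts below are made up for illustration; depending on the Elasticsearch version, `hits.total` is either a plain integer or an object with a `value` key, so the sketch handles both.

```python
import json


def broken_pages_query(content_model: str = "proofread-page") -> dict:
    """Same query as above: pages of the given content model
    whose indexed text has a word count of zero."""
    return {
        "size": 0,
        "_source": [],
        "query": {
            "bool": {
                "must": [
                    {"term": {"text.word_count": 0}},
                    {"match": {"content_model": content_model}},
                ]
            }
        },
    }


def total_hits(response: dict) -> int:
    """Extract the hit count from a search response (both total formats)."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


body = json.dumps(broken_pages_query())

# Hypothetical response shapes, not real cluster output.
sample_old = {"hits": {"total": 1800000, "hits": []}}
sample_new = {"hits": {"total": {"value": 1800000, "relation": "eq"}, "hits": []}}
print(total_hits(sample_old), total_hits(sample_new))
```

Re-running the same query later and checking that the total has dropped to 0 is a simple way to confirm the index has caught up.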

Does that seem reasonable?

Yes, that sounds like the correct number. If you are interested in more detailed numbers: https://tools.wmflabs.org/phetools/statistics.php

I suppose the part that surprises me is not that there are 8M proofread pages, but that there are 8M proofread pages without search content. Perhaps I'm making improper assumptions, and this isn't a new regression from 4 days ago but a long-standing problem that is just now getting fixed? Mostly what didn't seem right was that 8M pages would have been edited or updated by the cirrus background process in that timespan.

On the other hand, if it only took 4 days for all the pages to be edited and blanked in the search index, is there a reasonable chance they will all be edited again in the next 4 days, which would fix it?

Sorry for the misunderstanding. Search in the Page: namespace is not enabled by default on Wikisource, and I believe only a few people use it, mostly contributors doing curation tasks. So it's definitely possible that people missed a slow decrease in index coverage. Most Page: pages are very rarely edited because, unlike a Wikipedia article, there are very few reasons to change a proofread page.

I contribute daily to Tiraboschi, Storia della letteratura italiana, and I see that when searching for the very common word "poeta" in its pages, CirrusSearch only returns pages edited after 28.03.2019.

This is the url of search: search url

I observed that Page.touch() solves the issue; nevertheless, I now see that "untouched" pages are being indexed too, so I think that the problem is solved.

EBernhardson claimed this task.

Query from may 28 now reports 0 broken pages on wikisource. Calling this complete.