
Represent text in cirrus as an array of sections, rather than a flat string
Open, Needs Triage, Public

Description

Today in the search engine we capture the plain text of a wiki page as a single string with most whitespace simplified to spaces. We suspect that if this were changed to instead be a list of strings, split by section headings, it could make the cirrus dataset more useful for external use cases, and potentially solve a few edge cases in cirrus.

Cirrus Edge Cases:

  • Mostly phrase matching across section boundaries. In the current configuration the first token of a section is only +1 token position from the last token of the previous section, so anything that considers token proximity will treat the tokens as related, when they generally aren't.
  • Highlighting is potentially also improved, for the same reason: we would stop highlighting across section boundaries.
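
For context (not from the original task, but relevant): Elasticsearch's built-in mechanism for this is `position_increment_gap`. When a text field holds an array of strings, the token positions of each entry start a configurable gap (default 100) after the previous entry, so phrase and proximity queries stop matching across entries. A sketch of what the mapping might look like, assuming the field keeps its current name:

```json
{
    "text": {
        "type": "text",
        "position_increment_gap": 100
    }
}
```

With this mapping, a match_phrase query would not span two entries of the text array unless given a slop of 100 or more.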

External Use Cases (other teams, community):

  • While we don't have very concrete requirements, we've heard from multiple potential users that the text format from cirrus doesn't have enough structure and loses the boundaries between sections.
  • "In addition to section, paragraph boundary issue, section titles and the hierarchy will also be missing which is important to keep the context. I had looked into cirrusDump format but could not find useful for the embedding"

AC:

  • Determine what shape the data should be stored in to increase reusability
  • If changing the text field schema: update the schema at https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/cirrussearch/update_pipeline/update or do T366343 before
  • Implement in Cirrus
  • Update the job that dumps the index to a hive table. Needs a schema change (and rewriting current dumps to match updated schema), along with an addition to the normalization routine in the dump script.
  • Notify users of cirrusdumps and the cloudelastic replica that the shape of the text field is changing

Event Timeline

An initial, simple proposal would be to split the text field on section boundaries and retain the section title as a header. This would duplicate the headings (they would appear in both the heading and text fields), increasing the relative weight of heading content, but that is probably not a big deal.

Overall Structure

{
    "text": [
        "This is the opening content of the article, before the first section heading",
        "First Section Heading\n\ncontent found between the first and either second heading or end of document",
        "Second Section Heading\n\nsame as above"
    ]
}

Alternatively we could split them, but this feels awkward to me:

{
    "text": [
        "This is the opening content of the article, before the first section heading",
        "First Section Heading",
        "content found between the first and either second heading or end of document",
        "Second Section Heading",
        "same as above"
    ]
}
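
To make the first shape concrete, here is a minimal Python sketch (hypothetical; `build_text_field` and its inputs are illustrative, not actual cirrus code) of reshaping parser output into the proposed array, keeping each heading as a header on its own section:

```python
# Hypothetical sketch: reshape parser output into the proposed "text" array.
# opening  -- content before the first section heading
# sections -- (heading, body) pairs in document order
def build_text_field(opening: str, sections: list[tuple[str, str]]) -> list[str]:
    parts = [opening] if opening else []
    for heading, body in sections:
        # Keep the heading with its body, separated by a blank line.
        parts.append(f"{heading}\n\n{body}")
    return parts

doc = build_text_field(
    "This is the opening content of the article",
    [("First Section Heading", "content of the first section"),
     ("Second Section Heading", "content of the second section")],
)
```

Keeping heading and body in one string preserves a one-to-one alignment between array entries and sections, which is what makes the first shape less awkward than the split alternative.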

Per Section Structure

There is still the question of what is retained within the per-section strings. Historically the text content has been a single long line without any \n delimiting areas of the document. I suspect this could get quite complex depending on what is needed. A few possibilities:

  • We could try to inject \n after each <p>...</p> section, retaining the separation between paragraphs.
  • Similar for <li>...</li>? Does it matter whether the list is supposed to be rendered inline or vertically?
  • This seems like it has the opportunity to get very hairy. I suspect we should set a simple rule and live with the results, rather than crafting a complex decision tree of where to inject boundaries; but that requires the simple rule to still produce a useful end result.
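
As an illustration of the "simple rule" option, here is a hypothetical sketch using Python's stdlib `html.parser` (not actual cirrus code): emit text content and inject a single newline after each closing <p> or <li>, and nothing more elaborate:

```python
from html.parser import HTMLParser

# Hypothetical "simple rule": one newline after each closing <p> or <li>.
class SectionText(HTMLParser):
    BOUNDARY_TAGS = {"p", "li"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.BOUNDARY_TAGS:
            self.parts.append("\n")

    def text(self):
        # Collapse runs of whitespace within each line, as the current
        # flat text field already does, and drop empty lines.
        lines = "".join(self.parts).split("\n")
        return "\n".join(" ".join(l.split()) for l in lines if l.strip())

parser = SectionText()
parser.feed("<p>First paragraph.</p><p>Second paragraph.</p><ul><li>one</li><li>two</li></ul>")
print(parser.text())  # one paragraph / list item per line
```

This deliberately treats inline and vertical lists the same way, accepting slightly odd output for inline lists in exchange for a rule simple enough to live with.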

In an ideal world we might want something more expressive, like:

{
    "content": [
        {
            "text": "This is the opening content of the article, before the first section heading"
        },
        {
            "section": "First Section Heading",
            "text": "content found between the first and either second heading or end of document"
        }
    ]
}

I'm wondering if we could find a combination of copy_to params that would keep the indexed content in the existing text and heading fields, so that the transition would be seamless. This could allow future improvements, like adding the depth of the section.
I haven't tested this and it might not work well with highlighting, but it is perhaps worth testing?
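
Untested, as noted above, but the idea might look roughly like this mapping fragment, where the `content` object copies its parts back into the existing flat fields (field names are illustrative):

```json
{
    "content": {
        "properties": {
            "section": { "type": "text", "copy_to": "heading" },
            "text":    { "type": "text", "copy_to": "text" }
        }
    },
    "heading": { "type": "text" },
    "text":    { "type": "text" }
}
```

Note that copy_to has restrictions when the source field lives inside a nested type, so this likely only works if content stays a plain object field.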

Related question regarding flow of data, based on the comment from the thread you linked.

The wikitext -> html conversion happens inside the mediawiki application using the default mediawiki parser. I'm not sure what exactly happens under the hood; I expect it's a full php parser that runs in-process, but I haven't paid enough attention to exactly what they do. This is indeed quite expensive: we are running hundreds of pages a second through the parser. Part of the reason I suggest we could do this is that we already parse this flow of data. Even at this high rate, it still takes a long time to get through everything. We have a loop that re-renders everything even if not edited, but it works on 16-week cycles.

An html dataset (T360794) is a request to data engineering with a number of use cases, and it has been discussed in related phab tasks for years. The linked phab task is for an incremental html dataset, which is "the easier" part of an html dataset and will hopefully get prioritized soon. I have focused on that part to get something off the ground. The more challenging part is creating the html dataset of historical revisions (e.g. which mediawiki version to render with, what to do with templates, etc.).

  • Do I understand right that the full re-render loop taking 16 weeks is for the "current" content of all pages (i.e. not historical)? That is indeed a long time.
  • How is the cirrus index updated in "normal running" mode, i.e. what event triggers the update when a page is changed?
  • If you had an html page change stream, would it be preferable to use that instead of going through mediawiki?
  • Similarly, if you had a daily html dataset in the datalake of all current html content, would it be beneficial to use it for a full re-render loop instead of going through mediawiki?

I am asking as a data customer looking for a non-wikitext representation of wiki content. The improved structure for the cirrus dump you proposed above would be nice to have, but it is still an html-derived representation that, given an html dataset, would be straightforward to produce. So what I really want is the html dataset, and I am wondering whether that is the case for search too.


  • Do I understand right that the full re-render loop taking 16 weeks is for the "current" content of all pages (i.e. not historical)? That is indeed a long time.

Yes, this is the current version. At 16 weeks this works out to continuously rendering ~65 pages per second. It used to run faster, but we were asked to slow it down a few years ago due to resource constraints.
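
A quick sanity check of that figure (my arithmetic, not from the task):

```python
# Back-of-the-envelope check: 65 pages/second sustained over a 16-week cycle.
pages_per_second = 65
seconds_per_cycle = 16 * 7 * 24 * 60 * 60  # 16 weeks in seconds
pages_per_cycle = pages_per_second * seconds_per_cycle
print(f"{pages_per_cycle:,} pages per cycle")  # 628,992,000 pages per cycle
```

That is roughly 629 million renders per pass, i.e. the approximate number of page renders one full loop performs.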

  • How is the cirrus index updated in "normal running" mode, i.e. what event triggers the update when a page is changed?

A variety of events; we source from the following set (if you need the actual names I can look them up, but these are the configuration names we use):

article-topic-stream
draft-topic-stream
public-page-change-stream
private-page-change-stream
public-page-rerender-stream
private-page-rerender-stream
page-weighted-tags-change-stream
page-weighted-tags-change-legacy-stream
recommendation-create-stream

  • If you had an html page change stream, would it be preferable to use that instead of going through mediawiki?

I doubt it; we need a variety of information, and the text content is only one part of it. We would need everything returned by the cirrusdoc api: a variety of metadata from the wikitext parser, information sourced from the mediawiki database, and information that varies depending on the per-wiki configuration.

  • Similarly, if you had a daily html dataset in the datalake of all current html content, would it be beneficial to use it for a full re-render loop instead of going through mediawiki?

Same; the HTML isn't particularly useful to us. Additionally, I think it would be overall worse if real-time updates and the re-render loop used different methods of sourcing the data they send. I think it's a useful and desirable property that all methods of updating the search index go through the same update process.

I am asking as a data customer looking for a non-wikitext representation of wiki content. The improved structure for the cirrus dump you proposed above would be nice to have, but it is still an html-derived representation that, given an html dataset, would be straightforward to produce. So what I really want is the html dataset, and I am wondering whether that is the case for search too.

I suppose we could store the raw html in an unindexed field; I'm sure we have plenty of space for it. As unindexed content this would probably be fairly trivial to add; the only bit I can think of is some minor modifications to the streaming updater's internal schema.
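
For illustration (the field name and settings are my assumption, not a decision from this task), an unindexed stored field in the mapping could look like:

```json
{
    "raw_html": {
        "type": "text",
        "index": false
    }
}
```

With "index": false the content is still kept in _source (and would therefore appear in dumps), but no inverted index is built for it, so the cost is storage only.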