Today the search engine captures the plain text of a wiki page as a single string, with most whitespace collapsed to spaces. We suspect that changing this to a list of strings, split at section headings, would make the cirrus dataset more useful for external use cases and could resolve a few edge cases in cirrus.
Cirrus Edge Cases:
- Mostly phrase matching across section boundaries. In the current configuration the first token of a section sits only one token position after the last token of the previous section, so anything that considers token proximity treats those tokens as related when they generally aren't.
- Highlighting could also improve for the same reason: highlighted snippets would no longer span section boundaries.
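
To illustrate the proximity problem, here is a minimal Python sketch, not Cirrus code. It assumes whitespace tokenization and a fixed position gap between list elements. Elasticsearch text fields have a `position_increment_gap` setting that defaults to 100, but whether Cirrus would rely on that default is an assumption. The names `token_positions` and `phrase_matches` are hypothetical.

```python
from typing import List, Tuple

# Assumed gap inserted between list elements (illustrative value).
POSITION_GAP = 100


def token_positions(values: List[str], gap: int = POSITION_GAP) -> List[Tuple[str, int]]:
    """Assign a position to each whitespace token, adding `gap` between values."""
    positions = []
    next_pos = 0
    for value in values:
        tokens = value.split()
        for token in tokens:
            positions.append((token, next_pos))
            next_pos += 1
        if tokens:
            # Push the next value's first token `gap` positions further away.
            next_pos += gap
    return positions


def phrase_matches(positions: List[Tuple[str, int]], phrase: str) -> bool:
    """Return True if the phrase's words occur at consecutive positions."""
    words = phrase.split()
    by_pos = {pos: tok for tok, pos in positions}
    return any(
        all(by_pos.get(start + i) == word for i, word in enumerate(words))
        for tok, start in positions
        if tok == words[0]
    )


# One string: the end of one section runs straight into the next.
single = token_positions(["cats are mammals Diet fish and mice"])
# List of strings: one element per section.
split = token_positions(["cats are mammals", "Diet fish and mice"])

print(phrase_matches(single, "mammals Diet"))  # phrase spans the boundary
print(phrase_matches(split, "mammals Diet"))   # boundary now separates the tokens
```

With the single string, "mammals Diet" matches as a phrase even though the two words belong to different sections. With one string per section the gap keeps them apart, while phrases inside a section still match.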
External Use Cases (other teams, community):
- While we don't have super concrete info, we've heard from multiple potential users that the text format from cirrus doesn't have enough structure and loses the boundaries between sections.
- "In addition to section, paragraph boundary issue, section titles and the hierarchy will also be missing which is important to keep the context. I had looked into cirrusDump format but could not find useful for the embedding"
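
As a starting point for discussion, the sketch below shows one possible shape that keeps section titles and hierarchy alongside the text. Everything in it is an assumption: the `heading`/`level`/`text` keys, the `split_sections` name, and the use of wikitext-style `== Heading ==` markers as input (Cirrus extracts text differently).

```python
import re
from typing import Dict, List, Optional, Union

# Matches wikitext-style headings such as "== Diet ==" or "=== Wild cats ===".
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")

Section = Dict[str, Union[Optional[str], int]]


def split_sections(text: str) -> List[Section]:
    """Split text into sections, keeping each heading and its nesting level."""
    # The lead section has no heading; level 0 marks it as the page root.
    raw = [{"heading": None, "level": 0, "lines": []}]
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            raw.append({"heading": match.group(2),
                        "level": len(match.group(1)),
                        "lines": []})
        else:
            raw[-1]["lines"].append(line)
    return [
        {
            "heading": section["heading"],
            "level": section["level"],
            # Collapse whitespace, as the current text field does.
            "text": " ".join(" ".join(section["lines"]).split()),
        }
        for section in raw
    ]


page = "Cats are mammals.\n== Diet ==\nFish and mice.\n=== Wild cats ===\nBirds too."
for section in split_sections(page):
    print(section)
```

A flat list with a `level` field is only one option; a nested tree or a plain list of strings with a parallel list of headings would carry the same information.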
AC:
- Determine what shape the data should be stored in to increase reusability
- If changing the text field schema: update the schema at https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/cirrussearch/update_pipeline/update or complete T366343 first
- Implement in Cirrus
- Update the job that dumps the index to a Hive table. This needs a schema change (and a rewrite of existing dumps to match the updated schema), along with an addition to the normalization routine in the dump script.
- Notify users of cirrusdumps and the cloudelastic replica that the shape of the text field is changing
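
For the normalization step in the dump script, a minimal sketch of the idea follows. It is plain Python rather than the actual dump code, and it assumes the new shape is a list of strings. The function name `normalize_text_field` is hypothetical.

```python
from typing import List, Optional, Union


def normalize_text_field(value: Optional[Union[str, List[str]]]) -> List[str]:
    """Coerce an old-style or new-style text value to a list of strings."""
    if value is None:
        # Missing text becomes an empty list rather than a null.
        return []
    if isinstance(value, str):
        # Old shape: the whole page as one string becomes a one-element list.
        return [value]
    # New shape: already a list; drop any null entries defensively.
    return [item for item in value if item is not None]


print(normalize_text_field("whole page as one string"))
print(normalize_text_field(["lead section", "second section"]))
print(normalize_text_field(None))
```

Normalizing both shapes to a list would let old and new documents share one column type while the index is still being rebuilt.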