I'm creating this task to evaluate our indexing strategy.
It's also related to some tasks we are currently discussing.
Today we index a semi-structured document with a few fields:
- title
- redirect
- opening_text
- text
- auxiliary_text
- category
- template
- outgoing_link
- external_link
- source_text
But I think we lose important information when merging data into these fields. Some examples:
outgoing_link:
Maybe we should not merge all the outgoing links into the same field.
This is somewhat related to a few ongoing discussions:
- pagerank (see Fwd: Wikipedia PageRank thread in internal mailing lists)
- Wishlist : Improve Link Search
redirects:
Some redirects contain interesting information that we could re-use to display better suggestions:
See https://en.wikipedia.org/wiki/Template:R_from_misspelling or https://en.wikipedia.org/wiki/Template:R_from_incorrect_name
It's just an example to illustrate the idea; the impact may not be worth the effort...
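A minimal sketch of the redirect idea, assuming each redirect arrives with the list of templates used on the redirect page. The `kinds` attribute, the mapping table and both function names are hypothetical and only illustrate the shape of the data; nothing here reflects the current index mapping.

```python
# Hypothetical mapping from redirect templates to a short "kind" label.
# Only the two templates mentioned above are listed.
REDIRECT_KINDS = {
    "R from misspelling": "misspelling",
    "R from incorrect name": "incorrect_name",
}


def build_redirect_entries(redirects):
    """Turn (title, [template names]) pairs into per-redirect entries.

    Instead of flattening redirects into a list of titles, each entry keeps
    the kinds derived from its templates (assumed input format).
    """
    entries = []
    for title, templates in redirects:
        kinds = sorted({REDIRECT_KINDS[t] for t in templates if t in REDIRECT_KINDS})
        entries.append({"title": title, "kinds": kinds})
    return entries


def suggestion_text(entry, target_title):
    """Pick a suggestion message based on the redirect kind (illustrative only)."""
    if "misspelling" in entry["kinds"]:
        return f"Did you mean: {target_title}?"
    return f"{target_title} (redirected from {entry['title']})"
```

For example, a redirect tagged with "R from misspelling" could trigger a "Did you mean" message, while an untagged redirect would keep the usual "redirected from" display.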
categories:
Should we consider all the categories in the same way?
Some are displayed on the page and some are hidden. Why?
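One possible direction, sketched below, is to keep visible and hidden categories in separate fields so they can be weighted differently. The `hidden` flag on the input and the `hidden_category` field name are assumptions made for illustration, not existing fields.

```python
def split_categories(categories):
    """Split (name, hidden) pairs into two separate field values.

    `hidden_category` is a hypothetical field; today everything goes
    into `category`.
    """
    doc = {"category": [], "hidden_category": []}
    for name, hidden in categories:
        field = "hidden_category" if hidden else "category"
        doc[field].append(name)
    return doc
```

With such a split, a query could still search both fields but apply a lower weight to the hidden one.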
wikitext:
I don't have an example here, but I think we could extract interesting features by parsing the wikitext.
This is a very broad task and phab may not be the best place to discuss it, so feel free to delete it.