
EPIC: Evaluate the indexing strategy and try to make more benefits from the semi-structured content we have
Closed, Invalid · Public

Description

I'm creating this task to evaluate our indexing strategy.
It's also related to some tasks we are currently discussing.
Today we index a semi-structured document with only a few fields:

  • title
  • redirect
  • opening_text
  • text
  • auxiliary_text
  • category
  • template
  • outgoing_link
  • external_link
  • source_text
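
To make the flat shape concrete, here is a minimal Python sketch of such a document. The field names come from the list above; the helper `make_document`, the choice of which fields are multi-valued, and the sample values are all hypothetical and are not taken from the actual index mapping.

```python
# Field names taken from the list above; everything else is illustrative.
FIELDS = (
    "title", "redirect", "opening_text", "text", "auxiliary_text",
    "category", "template", "outgoing_link", "external_link", "source_text",
)

# Assumption: which fields hold several values; the task does not say.
MULTI_VALUED = {
    "redirect", "auxiliary_text", "category",
    "template", "outgoing_link", "external_link",
}


def make_document(**values):
    """Build a flat document, rejecting fields outside the known set."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    doc = {}
    for name in FIELDS:
        default = [] if name in MULTI_VALUED else ""
        doc[name] = values.get(name, default)
    return doc


doc = make_document(
    title="Example page",
    # Every outgoing link lands in one list, whatever its role on the page.
    outgoing_link=["Page A", "Page B"],
    # Visible and hidden categories are indistinguishable here.
    category=["Visible category", "Hidden category"],
)
```

The sketch only shows that, once merged, nothing in `outgoing_link` or `category` records where a value came from.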

But I think we lose important information when merging data into these fields. Examples:

outgoing_link:
We should maybe not merge all the outgoing links into the same field.
This is somewhat related to a few ongoing discussions:

  • pagerank (see Fwd: Wikipedia PageRank thread in internal mailing lists)
  • Wishlist : Improve Link Search

redirects:
Some redirects contain interesting information that we could re-use to display better suggestions:
See https://en.wikipedia.org/wiki/Template:R_from_misspelling or https://en.wikipedia.org/wiki/Template:R_from_incorrect_name
It's just an example to illustrate the idea; the impact may not be worth the effort...
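
A rough Python sketch of the redirect idea, assuming each redirect arrives with the names of the templates it uses. The two template names are the ones linked above; the kind labels and the functions `classify_redirect` and `split_redirects` are hypothetical and only illustrate keeping this information instead of merging it away.

```python
# Template names come from the two URLs above; the kind labels are hypothetical.
REDIRECT_KINDS = {
    "R from misspelling": "misspelling",
    "R from incorrect name": "incorrect_name",
}


def classify_redirect(templates):
    """Return a hypothetical kind for a redirect given its template names."""
    for name in templates:
        kind = REDIRECT_KINDS.get(name)
        if kind is not None:
            return kind
    return "other"


def split_redirects(redirects):
    """Group redirect titles by kind instead of keeping one flat list."""
    grouped = {}
    for title, templates in redirects:
        grouped.setdefault(classify_redirect(templates), []).append(title)
    return grouped


# Made-up input: (redirect title, templates used on the redirect page).
grouped = split_redirects([
    ("Exmaple", ["R from misspelling"]),
    ("Sample page", []),
])
```

With redirects grouped like this, a suggester could in principle treat a misspelling differently from an ordinary alternative title.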

categories:
Should we consider all the categories in the same way?
Some are displayed on the page and some are hidden. Why?
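
As one possible answer, visible and hidden categories could be kept apart at indexing time. The Python sketch below assumes the indexer already knows whether each category is hidden; the function `split_categories` and the `hidden_category` field name are hypothetical, not existing fields.

```python
def split_categories(categories):
    """Split (name, is_hidden) pairs into visible and hidden lists."""
    visible, hidden = [], []
    for name, is_hidden in categories:
        (hidden if is_hidden else visible).append(name)
    # `hidden_category` is a hypothetical field name, not an existing one.
    return {"category": visible, "hidden_category": hidden}


# Made-up input: (category name, whether it is hidden on the page).
fields = split_categories([
    ("Visible category", False),
    ("Maintenance category", True),
])
```

Whether the two groups should then be weighted differently is exactly the open question above.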

wikitext:
I don't have any example here but I think we can leverage interesting features from parsing wikitext.

This is a very broad task and phab is maybe not the best place to discuss it, so feel free to delete it.

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana triaged this task as Medium priority. May 12 2016, 10:15 PM
Deskana added a subscriber: Deskana.

@dcausse I'm triaging tickets, and I noticed this one. Could you bring some comments on how we can fit this into our annual plan to the offsite next week? It's pretty high-level right now, so it would be good for us to discuss it. Thanks!

Gehel lowered the priority of this task from Medium to Low. Sep 9 2020, 2:49 PM
CBogen added a subscriber: CBogen.

Closing because this is out of date. We can open a new task if we have work to do on this in the future.