EPIC: Evaluate the indexing strategy and try to get more benefit from the semi-structured content we have
Closed, Invalid
Public


I'm creating this task to evaluate our indexing strategy.
It's also related to some tasks we are currently discussing.
Today we index a semi-structured document with a few fields:

  • title
  • redirect
  • opening_text
  • text
  • auxiliary_text
  • category
  • template
  • outgoing_link
  • external_link
  • source_text
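
To make the current shape concrete, here is a minimal sketch of such a document as a Python dict. The field names are the ones listed above; the values, and whether each field holds a single string or a list of strings, are placeholders assumed for illustration and may not match the real mapping.

```python
from typing import Dict, List, Union

FieldValue = Union[str, List[str]]


def build_flat_doc() -> Dict[str, FieldValue]:
    """Return an illustrative flat document using the ten fields listed above.

    All values are made-up placeholders; the str-vs-list choice per field
    is an assumption, not the actual index mapping.
    """
    return {
        "title": "Example page",
        "redirect": ["Example redirect"],
        "opening_text": "First paragraph of the page.",
        "text": "Full rendered text of the page.",
        "auxiliary_text": ["Caption or infobox text."],
        "category": ["Example category"],
        "template": ["Template:Example"],
        "outgoing_link": ["Other_page", "Another_page"],
        "external_link": ["https://example.org/"],
        "source_text": "Raw '''wikitext''' of the page.",
    }


doc = build_flat_doc()
```

Each kind of data ends up in a single flat field, which is where the loss of information discussed below comes from.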

But I think we lose important information when merging data into these fields. Some examples:

We should maybe not merge all the outgoing links into the same field.
This is somewhat related to a few ongoing discussions:

  • pagerank (see Fwd: Wikipedia PageRank thread in internal mailing lists)
  • Wishlist : Improve Link Search
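
As a rough sketch of what "not merging" could look like, the snippet below groups outgoing links by where they come from instead of putting them all in one list. Everything here is hypothetical: the `split_outgoing_links` helper, the `origin` labels ("opening", "body") and the `outgoing_link.<origin>` field names are assumptions for illustration only, and choosing the right grouping is exactly what would need discussion.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_outgoing_links(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (target, origin) pairs into one field per origin.

    Hypothetical helper: both the origin labels and the resulting
    field names are illustrative assumptions, not an existing schema.
    """
    fields: Dict[str, List[str]] = defaultdict(list)
    for target, origin in links:
        fields[f"outgoing_link.{origin}"].append(target)
    return dict(fields)


# Made-up example data.
links = [("Paris", "opening"), ("France", "opening"), ("Seine", "body")]
split = split_outgoing_links(links)
```

With separate fields, a query or a ranking signal could treat links from the opening text differently from links in the rest of the page.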

Some redirects contain interesting information that we could re-use to display better suggestions:
See or
It's just an example to illustrate the idea; the impact may not be worth the effort...

Should we consider all the categories in the same way?
Some are displayed on the page and some are hidden. Why?

I don't have an example here, but I think we can extract interesting features by parsing wikitext.

This is a very broad task and Phabricator may not be the best place to discuss it, so feel free to delete it.

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana triaged this task as Medium priority. May 12 2016, 10:15 PM
Deskana added a subscriber: Deskana.

@dcausse I'm triaging tickets, and I noticed this one. Could you bring some comments on how we can fit this into our annual plan to the offsite next week? It's pretty high-level right now, so it would be good for us to discuss it. Thanks!

Gehel lowered the priority of this task from Medium to Low. Sep 9 2020, 2:49 PM
CBogen added a subscriber: CBogen.

Closing because this is out of date. We can open a new task if we have work to do on this in the future.