
caching of related articles is unhelpful on relatively new articles
Closed, Declined (Public)

Description

When I opened https://nl.wikipedia.org/wiki/Aanslagen_in_Brussel_op_22_maart_2016 several hours after it was created, I was greeted by three related-article suggestions that seemed entirely unrelated:

  • .tj
  • Segunda Liga
  • Phường

The API query run by the tool is apparently:
https://nl.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3AAanslagen_in_Brussel_op_22_maart_2016&gsrnamespace=0&gsrlimit=3

Resulting in:

{"batchcomplete":true,"continue":{"gsroffset":3,"continue":"gsroffset||"},"query":{"pages":[{"pageid":735084,"ns":0,"title":".tj","index":2},{"pageid":1146320,"ns":0,"title":"Segunda Liga","index":3},{"pageid":2306272,"ns":0,"title":"Phường","index":1}]}}
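For anyone wanting to reproduce this, here is a rough Python sketch that rebuilds the query string above and reads the response in rank order. The `pages` array is not returned in rank order; the `index` field on each page carries the search position. The function names `build_morelike_query` and `ranked_titles` are made up for illustration and are not part of the extension.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://nl.wikipedia.org/w/api.php"


def build_morelike_query(title, limit=3):
    """Rebuild the request URL quoted above (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "pageimages|pageterms",
        "piprop": "thumbnail",
        "pithumbsize": 80,
        "wbptterms": "description",
        "pilimit": limit,
        "generator": "search",
        "gsrsearch": "morelike:" + title,
        "gsrnamespace": 0,
        "gsrlimit": limit,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def ranked_titles(response_text):
    """Return page titles sorted by their search rank ("index")."""
    pages = json.loads(response_text)["query"]["pages"]
    return [page["title"] for page in sorted(pages, key=lambda p: p["index"])]


# The response pasted above, unchanged.
RESPONSE = (
    '{"batchcomplete":true,"continue":{"gsroffset":3,"continue":"gsroffset||"},'
    '"query":{"pages":[{"pageid":735084,"ns":0,"title":".tj","index":2},'
    '{"pageid":1146320,"ns":0,"title":"Segunda Liga","index":3},'
    '{"pageid":2306272,"ns":0,"title":"Phường","index":1}]}}'
)

print(build_morelike_query("Aanslagen_in_Brussel_op_22_maart_2016"))
print(ranked_titles(RESPONSE))
```

Sorting by `index` shows that Phường was the top-ranked suggestion, followed by .tj and Segunda Liga.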


A direct search returns somewhat more relevant articles:
https://nl.wikipedia.org/w/index.php?search=morelike%3AAanslagen+in+Brussel+op+22+maart+2016&cirrusPhraseBoost=1&title=Speciaal%3AZoeken&go=Artikel

  • Aanslagen in Brussel op 22 maart 2016/Kladpagina
  • Balie (advocatuur)
  • Brussel (stad)

As @dcausse notes:

maybe a cache issue, the morelike result cache was maybe populated for this page when it was just created (with a very small content). The result will be cached for 24h, I hope that it will be better tomorrow when the cache is invalidated.
agreed, we should maybe include page timestamps in the cache
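To illustrate the idea of including page timestamps in the cache, here is a minimal Python sketch. It is not how CirrusSearch actually caches morelike results; the class `MoreLikeCache`, its methods, and the choice of a (title, timestamp) tuple as the key are all assumptions made for the example. The only detail taken from the comment above is the 24h lifetime.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24h, as mentioned in the comment above


class MoreLikeCache:
    """Hypothetical in-memory cache keyed on title plus page timestamp."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def get(self, title, page_timestamp):
        # A new edit changes page_timestamp, so the old entry is never hit.
        entry = self._entries.get((title, page_timestamp))
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at > self._ttl:
            return None  # expired, caller should recompute
        return results

    def put(self, title, page_timestamp, results):
        self._entries[(title, page_timestamp)] = (self._clock(), results)


cache = MoreLikeCache()
title = "Aanslagen_in_Brussel_op_22_maart_2016"

# Results computed while the page was still a stub (timestamps are made up).
cache.put(title, "20160322090000", [".tj", "Segunda Liga", "Phường"])

# Same revision: served from cache.
print(cache.get(title, "20160322090000"))

# Page edited since: the key differs, so this is a miss and results get recomputed.
print(cache.get(title, "20160322140000"))
```

With a key like this, suggestions computed for a near-empty page would stop being served as soon as the page is edited, instead of lingering until the entry expires. The trade-off is more cache misses on frequently edited pages.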

Event Timeline

debt subscribed.

As the cache only lasts 24 hours, this looks like a pretty low priority, particularly as the problem also depends on the size of the page content.