
caching of related articles is unhelpful on relatively new articles
Closed, Declined · Public

Description

When I opened https://nl.wikipedia.org/wiki/Aanslagen_in_Brussel_op_22_maart_2016 several hours after it was created, I was greeted by three related-article suggestions that seemed completely unrelated.

  • .tj
  • Segunda Liga
  • Phường

The API query run by the tool is apparently:
https://nl.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3AAanslagen_in_Brussel_op_22_maart_2016&gsrnamespace=0&gsrlimit=3

Resulting in:

{"batchcomplete":true,"continue":{"gsroffset":3,"continue":"gsroffset||"},"query":{"pages":[{"pageid":735084,"ns":0,"title":".tj","index":2},{"pageid":1146320,"ns":0,"title":"Segunda Liga","index":3},{"pageid":2306272,"ns":0,"title":"Phường","index":1}]}}
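For anyone wanting to reproduce this, the request above can be rebuilt from its individual parameters, and the response can be read back in ranked order. This is only a sketch: the helper names `build_morelike_url` and `titles_by_rank` are made up for illustration, and it assumes the `index` field in each page object is the search rank (the `pages` array itself is not in rank order).

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://nl.wikipedia.org/w/api.php"


def build_morelike_url(title, limit=3, thumb_size=80):
    """Rebuild the related-articles API URL quoted in the description."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "pageimages|pageterms",
        "piprop": "thumbnail",
        "pithumbsize": thumb_size,
        "wbptterms": "description",
        "pilimit": limit,
        "generator": "search",
        "gsrsearch": "morelike:" + title,
        "gsrnamespace": 0,
        "gsrlimit": limit,
    }
    # urlencode percent-encodes "|" as %7C and ":" as %3A.
    return API_ENDPOINT + "?" + urlencode(params)


def titles_by_rank(response_text):
    """Return page titles sorted by their 'index' (assumed to be the rank)."""
    pages = json.loads(response_text)["query"]["pages"]
    return [page["title"] for page in sorted(pages, key=lambda p: p["index"])]


# The response quoted in the description, verbatim.
SAMPLE_RESPONSE = '{"batchcomplete":true,"continue":{"gsroffset":3,"continue":"gsroffset||"},"query":{"pages":[{"pageid":735084,"ns":0,"title":".tj","index":2},{"pageid":1146320,"ns":0,"title":"Segunda Liga","index":3},{"pageid":2306272,"ns":0,"title":"Phường","index":1}]}}'

if __name__ == "__main__":
    print(build_morelike_url("Aanslagen_in_Brussel_op_22_maart_2016"))
    print(titles_by_rank(SAMPLE_RESPONSE))
```

Under that assumption, the top-ranked suggestion in the cached response is Phường, followed by .tj and Segunda Liga.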


A direct query shows slightly more related articles:
https://nl.wikipedia.org/w/index.php?search=morelike%3AAanslagen+in+Brussel+op+22+maart+2016&cirrusPhraseBoost=1&title=Speciaal%3AZoeken&go=Artikel

  • Aanslagen in Brussel op 22 maart 2016/Kladpagina
  • Balie (advocatuur)
  • Brussel (stad)

As @dcausse notes:

maybe a cache issue, the morelike result cache was maybe populated for this page when it was just created (with a very small content). The result will be cached for 24h, I hope that it will be better tomorrow when the cache is invalidated.
agreed, we should maybe include page timestamps in the cache
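
The suggestion to include page timestamps in the cache could look roughly like the sketch below. It is not the actual CirrusSearch code: `morelike_cache_key`, `TTLCache` and the timestamps in the example are all hypothetical, and a plain in-memory dict stands in for whatever cache backend is really used. The idea is only that a key containing the page's last-modified timestamp stops matching as soon as the page is edited, so a result computed against a near-empty new page is not served for the full 24 hours.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24h lifetime mentioned above


def morelike_cache_key(title, last_modified=None):
    """Hypothetical key builder; with last_modified, every edit changes the key."""
    if last_modified is None:
        return "morelike:" + title
    return "morelike:" + title + ":" + last_modified


class TTLCache:
    """Minimal in-memory cache whose entries expire after a fixed lifetime."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)


if __name__ == "__main__":
    cache = TTLCache()
    title = "Aanslagen_in_Brussel_op_22_maart_2016"
    # Made-up timestamps: one for the freshly created stub, one after an edit.
    at_creation = "2016-03-22T09:00:00Z"
    after_edit = "2016-03-22T13:00:00Z"

    cache.set(morelike_cache_key(title, at_creation), [".tj", "Segunda Liga", "Phường"])
    print(cache.get(morelike_cache_key(title, at_creation)))  # cached stub result
    print(cache.get(morelike_cache_key(title, after_edit)))   # miss, forces a fresh query
```

The trade-off is that a frequently edited page would miss the cache on every edit, and superseded entries stay in the store until their TTL runs out.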

Event Timeline

TheDJ created this task. Mar 22 2016, 1:26 PM
Restricted Application added a subscriber: Aklapper. Mar 22 2016, 1:26 PM
Restricted Application added a project: Discovery. Mar 22 2016, 1:27 PM
Restricted Application added a project: Discovery-Search. Apr 11 2016, 2:42 AM
debt closed this task as Declined. Jun 15 2017, 5:24 PM
debt added a subscriber: debt.

As the cache only lasts 24 hours, this looks like pretty low priority, especially as the results also depend on the size of the page content.