Cleanup planet.wikimedia.org feeds database
Closed, Resolved, Public

Description

Many feeds are no longer updated or return a 404, across all languages.
For example, I picked four consecutive feeds at random:

  • many others: removed or updated where the URL changed

A big cleanup is needed.

How I think we can solve this:

  1. check the app logs for broken / 404 URLs
  2. write a small script that checks every feed

We can arbitrarily set a two-year cutoff: feeds must have been active in the last two years to be kept.
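
A minimal sketch of what such a script could look like, in Python. The names `http_status`, `verdict` and `MAX_AGE` are made up for illustration, and how the date of the newest post is obtained (for example from a feed parser) is left out. Treating a feed with no known post date as stale is also an assumption, not something decided in this task.

```python
from datetime import datetime, timedelta, timezone
import urllib.error
import urllib.request

# The two-year window proposed above (approximated as 730 days).
MAX_AGE = timedelta(days=2 * 365)


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    try:
        # urlopen follows redirects, so this is the status of the final URL.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx / 5xx answers still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, TLS error, timeout


def verdict(status, latest_post, now=None):
    """Classify one feed as 'keep', 'stale' or 'broken'.

    status is an HTTP status code or None; latest_post is a timezone-aware
    datetime of the newest entry, or None if it could not be determined.
    """
    now = now or datetime.now(timezone.utc)
    if status != 200:
        return "broken"
    # Assumption: a feed without any dated entry counts as stale.
    if latest_post is None or now - latest_post > MAX_AGE:
        return "stale"
    return "keep"
```

A single "broken" verdict only describes one run; it says nothing about whether the error is temporary.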

Event Timeline

@Nemo_bis I ping you because you are active on https://meta.wikimedia.org/wiki/Planet_Wikimedia.
I would like to work on this task, and the server logs would be very helpful. Do you know how I can see them? Logstash?

@Framawiki Hi, I'm your person for this. I have access to the VMs running planet in prod and maintain the planet service.

I have done what this task describes every once in a while in the past. But keeping up with all the changes is always low priority. I also always like fixing redirects and replacing http with https links where possible. So I'm happy that you made this ticket and want to help out with this.

Let me get some current logs and see what we have by type of error. Also see T166399.

404

ar-planet.log:ERROR:planet.runner:Error 404 while updating feed http://itwadi.com/taxonomy/term/68/0/feed
en-planet.log:ERROR:planet.runner:Error 404 while updating feed http://www.scrubnugget.com/categories/wikipedia/feed/
en-planet.log:ERROR:planet.runner:Error 404 while updating feed http://blog.anubite.co.uk/index.php/category/mediawiki/feed/
en-planet.log:ERROR:planet.runner:Error 404 while updating feed http://thingelstad.com/tag/mediawiki/feed/
ro-planet.log:ERROR:planet.runner:Error 404 while updating feed https://wpdist.blogspot.com/feeds/posts/default?alt=rss

503

en-planet.log:ERROR:planet.runner:Error 503 while updating feed http://saml.rilspace.org/taxonomy/term/69/0/feed
en-planet.log:ERROR:planet.runner:Error 503 while updating feed http://www.wikilove.in/feeds/posts/default
ro-planet.log:ERROR:planet.runner:Error 503 while updating feed http://wikilovesmonuments.ro/feed/
sr-planet.log:ERROR:planet.runner:Error 503 while updating feed http://blog.loshmir.org/feeds/posts/default/-/slobodno%20znanje
zh-planet.log:ERROR:planet.runner:Error 503 while updating feed http://www.onecorner.org/blog/category/%E7%B6%AD%E5%9F%BA%E8%A8%88%E5%88%92/feed/
zh-planet.log:ERROR:planet.runner:Error 503 while updating feed http://hawkman.geoidea.org/tag/%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91/feed/

500

de-planet.log:ERROR:planet.runner:Error 500 while updating feed http://wikipedistik.de/category/wikipedia-news/feed/
en-planet.log:ERROR:planet.runner:Error 500 while updating feed http://hexmode.com/category/wmf/feed/atom/
en-planet.log:ERROR:planet.runner:Error 500 while updating feed https://wllm.com/tag/wikipedia/feed/

wtf (cert issues)

WARNING: planet.runner:wtf {'feed': {}, 'bozo': 1, 'bozo_exception': CertificateError("hostname 'wllm.com' doesn't match either of '*.wordpress.com', 'wordpress.com'",), 'entries': []} (ERROR:planet.runner:Error 500 while updating feed https://wllm.com/tag/wikipedia/feed/)

WARNING:planet.runner:wtf {'feed': {}, 'bozo': 1, 'bozo_exception': URLError(SSLError(1, u'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)'),), 'entries': []}
(ERROR:planet.runner:Error 500 while updating feed http://hexmode.com/category/wmf/feed/atom/)

But this is all. I don't think it's even that much; I already cleaned up quite a bit in the past, and there are many feeds. There are always _some_ changes and temporary issues.
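
For reference, the grouping by error type shown above could be reproduced with something like the following Python sketch. It assumes the `<lang>-planet.log:ERROR:planet.runner:Error <code> while updating feed <url>` line shape seen in the excerpts; `LINE_RE` and `group_by_status` are illustrative names, not part of the planet software.

```python
import re
from collections import defaultdict

# Matches lines such as:
# en-planet.log:ERROR:planet.runner:Error 404 while updating feed http://...
LINE_RE = re.compile(
    r"^(?P<lang>[\w-]+)-planet\.log:ERROR:planet\.runner:"
    r"Error (?P<status>\d{3}) while updating feed (?P<url>\S+)$"
)


def group_by_status(lines):
    """Return {status_code: [(language, feed_url), ...]} for matching lines."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # warnings and other lines are skipped
            groups[int(match["status"])].append((match["lang"], match["url"]))
    return dict(groups)
```

Lines that do not match, such as the certificate warnings, are ignored and would still need a look by hand.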

@Framawiki So what I could use help with now is checking which of these seem permanently broken and which may just be temporary. Maybe contacting the webmasters of broken ones where it makes sense, asking them to fix it. And deciding which ones seem unfixable and should just be removed.

I agree with you, we need to find a way to deal with temporarily broken links.
My request also concerns blogs that haven't been updated in a long time. For example, I think that feeds that show a post from 2012 at the top have no reason to be present.

Well, but also not really a reason to be removed? People can still read it, and if there are no updates it will not cause anything negative like notifications, or push down anything that is newer. Personally I would just care about errors and warnings in the log files, but I have no strong opinion either way.

+1. It's nice to have a blogroll for the archives and being in the Planet is an incentive to resume blogging when the need arises.

fwiw: I think "Wikimedia-Site-requests" is an outdated term that is overly generic. None of that is actually a "site request" as it used to mean "somebody with shell changes things". We don't work that way anymore. Almost anything is just a code change suggestion that anyone can upload to Gerrit. Very few things are still site requests as it used to be in the past.

I think that at least a section "archive" is better than mix up to date / dead feeds in the sidebar.

Dereckson subscribed.

The last two work for me.

The first domain doesn't resolve, the second one has been taken over by a buy-old-active-domain-and-put-ads-company.

Change 592439 had a related patch set uploaded (by Dereckson; owner: Dereckson):
[operations/puppet@production] Prune non existing domains from Planet

https://gerrit.wikimedia.org/r/592439

(pending a script to validate them, let's already prune the two problematic domains)

Change 592439 merged by Dzahn:
[operations/puppet@production] planet: Prune 2 non existing domains

https://gerrit.wikimedia.org/r/592439

Is this for all planet languages or just for en.planet?

For some reason I have "HTTP Status" lines in the logs of the ar, de, fr, gmq, pt, ro and ru planets but NOT for the en planet.

The most recent log for en.planet actually says "Fetching 0 feeds using 0 threads" right now.

ar-planet.log-Feed: https://teqnia01.blogspot.com/feeds/posts/default/-/%D9%88%D9%8A%D9%83%D9%8A%D8%A8%D9%8A%D8%AF%D9%8A%D8%A7

ar-planet.log:HTTP Status: 404

ar-planet.log-Feed: http://itwadi.com/taxonomy/term/68/0/feed

ar-planet.log:HTTP Status: 404

de-planet.log-Feed: http://wikipedistik.de/category/wikipedia-news/feed/

de-planet.log:HTTP Status: 404

de-planet.log-Feed: http://flominator.ramselehof.de/rss.php?serendipity%5Btag%5D=wikipedia

de-planet.log:HTTP Status: 404

de-planet.log-Feed: https://unglaublich-was-auch-immer.blogspot.com/feeds/posts/default

de-planet.log:HTTP Status: 404

de-planet.log-Feed: https://festivalsommer.blogspot.com/feeds/posts/default/-/Planet

de-planet.log:HTTP Status: 404

fr-planet.log-Feed: http://littletony87.unblog.fr/feed/

fr-planet.log:HTTP Status: 404

fr-planet.log-Feed: http://compteurdedit.over-blog.com/rss

fr-planet.log:HTTP Status: 404

fr-planet.log-Feed: http://www.leconomiste-notes.fr/feed/tag/Wikip%C3%A9dia/atom

fr-planet.log:HTTP Status: 404

ro-planet.log-Feed: https://wpdist.blogspot.com/feeds/posts/default?alt=rss

ro-planet.log:HTTP Status: 404

ru-planet.log-Feed: http://wikimedia.ru/blog/feeds/latest/
ru-planet.log:HTTP Status: 404

But before just deleting all of them, keep in mind that sometimes these errors are temporary and later feeds are fixed and recover.
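
One way to avoid deleting feeds over a temporary error is to compare several log snapshots and only flag feeds that fail every time. A rough Python sketch, assuming the `Feed:` / `HTTP Status:` pairing shown above; `failing_feeds`, `persistent_failures` and the threshold of three runs are illustrative choices, not an existing tool.

```python
import re

FEED_RE = re.compile(r"^[\w-]+-planet\.log-Feed: (?P<url>\S+)$")
STATUS_RE = re.compile(r"^[\w-]+-planet\.log:HTTP Status: (?P<status>\d{3})$")


def failing_feeds(log_text):
    """Return {feed_url: status} for each Feed/HTTP Status pair with status >= 400."""
    failing = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        feed = FEED_RE.match(line)
        if feed:
            current = feed["url"]  # remember the feed until its status line
            continue
        status = STATUS_RE.match(line)
        if status and current is not None:
            code = int(status["status"])
            if code >= 400:
                failing[current] = code
            current = None
    return failing


def persistent_failures(runs, min_runs=3):
    """Return the URLs that failed in each of the last min_runs runs.

    runs is a list of collections of failing URLs, oldest first.
    """
    if len(runs) < min_runs:
        return set()  # not enough history to call anything permanent
    recent = [set(run) for run in runs[-min_runs:]]
    return set.intersection(*recent)
```

Feeding `failing_feeds` output from, say, three weekly snapshots into `persistent_failures` would leave only the feeds that never recovered in that period.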

Change 592614 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] planet: remove 3 German blogs that have been closed

https://gerrit.wikimedia.org/r/592614

Change 592614 merged by Dzahn:
[operations/puppet@production] planet: remove 3 German blogs that have been closed

https://gerrit.wikimedia.org/r/592614

@Dereckson

fr-planet.log-Feed: http://www.dereckson.be/blog/category/wikimedia-fr/feed/
fr-planet.log:HTTP Status: 500

Error establishing a database connection

updating en.planet seems to fail with: "unterminated entity reference Archeology Scotland".

Change 592649 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] planet: remove some more feeds that don't exist anymore

https://gerrit.wikimedia.org/r/592649

Change 592649 merged by Dzahn:
[operations/puppet@production] planet: remove some more feeds that don't exist anymore

https://gerrit.wikimedia.org/r/592649

Change 598701 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] planet: fix feeds with 404 that moved to https

https://gerrit.wikimedia.org/r/598701

The 2 unchecked boxes on this ticket are working again in the meantime. A good example of why deleting all feeds with an occasional 404 too quickly is not the best idea.

Meanwhile, there are other things to clean up, which I am doing in the patches above.

Change 598701 merged by Dzahn:
[operations/puppet@production] planet: fix feeds with 404 that moved to https

https://gerrit.wikimedia.org/r/598701

Change 598707 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] planet: remove some feeds that are gone, fix some links

https://gerrit.wikimedia.org/r/598707

I'm afraid this is one of those tickets that you can never really close, because at any given time there will be 1 or more feeds with errors. Some will be fixed later, some will go away, some will change URLs due to new software... It is just constant maintenance, and one has to check the logs every once in a while. I am not sure I want to keep a ticket open forever though, so at some point it will have to be "good enough for now".

Change 598707 merged by Dzahn:
[operations/puppet@production] planet: remove some feeds that are gone, fix some links

https://gerrit.wikimedia.org/r/598707

Perhaps close this one off, then every so often do "Audit/Cleanup of feed database (date)" tasks.

Yeah, I wonder if maybe it could be automated with a Herald rule or something, to either keep reopening this ticket or make a new one on a regular schedule.

I don't think such Herald triggers exist for specific individual tickets.

Change 602733 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] planet: upgrade feeds from http to https detected by script

https://gerrit.wikimedia.org/r/602733

Change 602733 merged by Dzahn:
[operations/puppet@production] planet: upgrade feeds from http to https detected by script

https://gerrit.wikimedia.org/r/602733
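
The script used for this change is not shown in this task. As an illustration only, detecting http feeds that also answer on https could look like the Python sketch below. `https_candidate` and `upgradable` are hypothetical names, and `is_ok` stands for whatever check decides that a URL serves a working feed.

```python
from urllib.parse import urlsplit, urlunsplit


def https_candidate(url):
    """Return the https version of an http URL, or None for any other scheme."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return None
    # Only the scheme changes; host, path and query are kept as they are.
    return urlunsplit(parts._replace(scheme="https"))


def upgradable(urls, is_ok):
    """Map each http URL to its https version when is_ok(https_url) is true.

    is_ok is supplied by the caller, e.g. a function that fetches the URL
    and checks for a 200 response with a parseable feed.
    """
    result = {}
    for url in urls:
        candidate = https_candidate(url)
        if candidate is not None and is_ok(candidate):
            result[url] = candidate
    return result
```

Keeping the network check behind `is_ok` makes it possible to try the rewrite logic without touching any live feed.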

Change 608974 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] planet: remove broken feeds, update feed URLs

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608974

Change 608974 merged by Dzahn:
[operations/puppet@production] planet: remove broken feeds, update feed URLs

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608974

Change 609565 had a related patch set uploaded (by Dzahn; owner: Amire80):
[operations/puppet@production] Remove three entries from the Russian Planet

https://gerrit.wikimedia.org/r/609565

Change 609565 merged by Dzahn:
[operations/puppet@production] Remove three entries from the Russian Planet

https://gerrit.wikimedia.org/r/609565

Change 609564 had a related patch set uploaded (by Dzahn; owner: Amire80):
[operations/puppet@production] Remove englishwikisource.tumblr.com from Planet Wikimedia

https://gerrit.wikimedia.org/r/609564

Change 609564 merged by Dzahn:
[operations/puppet@production] Remove englishwikisource.tumblr.com from Planet Wikimedia

https://gerrit.wikimedia.org/r/609564

Change 609871 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] planet: removing a few more broken feed URLs

https://gerrit.wikimedia.org/r/609871

Change 609871 merged by Dzahn:
[operations/puppet@production] planet: removing a few more broken feed URLs

https://gerrit.wikimedia.org/r/609871

I am declaring this resolved at this point. There was a major cleanup (see all the patches above), and I checked the logs of all languages for issues. Only very few remain, and these look temporary.

Cleaning up the feed database is just a constant effort that should be repeated periodically.

Also of course it would be nice to add some new and working feeds, so please keep those coming if you know of any.