
A tool queries urwiki recentchanges 6 times per second
Closed, Resolved, Public

Description

In the wiki production logs I spotted a tool that queries the API for recent changes at a high rate. The requests look like:

https://ur.wikipedia.org/w/api.php?action=query&format=json&list=recentchanges&meta=siteinfo&rcnamespace=0&rcprop=title&rcshow=!redirect&rctype=new&siprop=statistics

By my count, there are about 6 requests per second, which seems excessive.

The tool currently runs on tools-exec-1430.tools.eqiad.wmflabs

Event Timeline

hashar created this task. May 29 2017, 10:31 PM
Restricted Application added a project: Cloud-Services. May 29 2017, 10:31 PM
Restricted Application added a subscriber: Aklapper.
hashar renamed this task from "tools queries urwiki recentchanges 8 times per second" to "A tool queries urwiki recentchanges 6 times per second". May 29 2017, 10:31 PM

At this moment, these tools are running on this host:

tools-exec-1430 Load: 112.25% Memory: 19.2% Free vmem: 22.6G
4129429 	celery 		algo-news 	Continuous / Running 	2017-04-20 09:04:20 	2h34m 		1/0(peak 1.4G)
4975529 	news 		shuaib 		Task / Running 			2017-05-15 01:05:05 	101h3m 	77/0
5306038 	WP-PWB 	jackbot 		Task / Running 			2017-05-24 00:30:17 	44s 		172/0(peak 194.1M)
5509610 	mlr-daily 	jjmc89-bot 	Continuous / Running 	2017-05-29 00:08:28 	7m6s 		328/0
5548379 	BOTSISTER 	botsister 	Task / Running 			2017-05-29 22:22:03 	2h25m 		979/0(peak 1003.1M)

Probably http://tools.wmflabs.org/?tool=shuaib : Home for hosting Urdu Wikipedia tools and bots

/data/project/shuaib/public_html/watch.py :

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import requests
import codecs

fname = 'creations.log'


def fetch_latest():
    print('Fetching...')
    r = requests.get('https://ur.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&list=recentchanges&rctype=new&rcnamespace=0&rcprop=title&rcshow=!redirect&format=json')
    data = r.json()
    return [str(data['query']['statistics']['articles']), data['query']['recentchanges'][0]['title']]


last = ['0', '']
while True:
    latest = fetch_latest()
    if latest != last:
        print(latest)
        with codecs.open(fname, mode='a', encoding = 'utf8') as f:
            f.write(u': '.join(latest) + u'\n')
        last = latest

I am impressed by how fast you managed to find the script!

If at all possible, can you hack the script so it sleeps whenever nothing has changed?

--- watch-orig.py	2017-05-30 09:32:14.447895045 +0200
+++ watch.py	2017-05-30 09:33:39.696292469 +0200
@@ -4,8 +4,10 @@
 import os
 import requests
 import codecs
+import time
 
 fname = 'creations.log'
+throttle = '10'  # seconds
 
 
 def fetch_latest():
@@ -18,7 +20,9 @@
 last = ['0', '']
 while True:
     latest = fetch_latest()
-    if latest != last:
+    if latest == last:
+        time.sleep(throttle)
+    else:
         print(latest)
         with codecs.open(fname, mode='a', encoding = 'utf8') as f:
             f.write(u': '.join(latest) + u'\n')

Quick and dirty.


Hitting the API in a tight loop is a bad pattern. A much better approach is to have recent changes events pushed to the client as they happen. EventStreams is meant for exactly that, and provides a Python example: https://wikitech.wikimedia.org/wiki/EventStreams#Python :]
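For illustration, here is a rough sketch of what a push-based version of watch.py could look like. It is not the example from the wikitech page: it uses only the standard library and a hand-rolled parser for the server-sent events format. The helper names (iter_sse_data, is_new_article, watch) and the User-Agent string are made up for this sketch. It also does not reproduce the rcshow=!redirect filter or the article count from siteinfo, so treat it as a starting point only.

```python
import json
import urllib.request

# Public EventStreams endpoint for recent changes on all wikis.
STREAM_URL = 'https://stream.wikimedia.org/v2/stream/recentchange'


def iter_sse_data(lines):
    """Yield the decoded JSON payload of each server-sent event.

    `lines` is any iterable of text lines. Only `data:` fields are used;
    comments and `event:` / `id:` fields are ignored.
    """
    buf = []
    for line in lines:
        line = line.rstrip('\r\n')
        if line.startswith('data:'):
            payload = line[5:]
            if payload.startswith(' '):
                payload = payload[1:]
            buf.append(payload)
        elif line == '' and buf:
            # A blank line terminates the event.
            try:
                yield json.loads('\n'.join(buf))
            except ValueError:
                pass  # skip payloads that are not valid JSON
            buf = []


def is_new_article(event, wiki='urwiki'):
    """True for page creations in the main namespace of the given wiki."""
    return (event.get('wiki') == wiki
            and event.get('type') == 'new'
            and event.get('namespace') == 0)


def watch(fname='creations.log'):
    """Append the title of each newly created article to `fname`."""
    req = urllib.request.Request(STREAM_URL, headers={
        'Accept': 'text/event-stream',
        # Hypothetical value; replace with contact details for the tool.
        'User-Agent': 'creations-watcher-sketch/0.1',
    })
    with urllib.request.urlopen(req) as resp:
        lines = (raw.decode('utf-8') for raw in resp)
        for event in iter_sse_data(lines):
            if is_new_article(event):
                with open(fname, 'a', encoding='utf8') as f:
                    f.write(event['title'] + '\n')
```

Calling watch() would hold one long-lived connection open instead of issuing several requests per second. A real tool would also want to reconnect when the stream drops, which this sketch leaves out.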

Framawiki triaged this task as High priority. May 30 2017, 4:52 PM
Framawiki added a subscriber: yuvipanda.

I'm just a user of Tool Labs; I can read the files but can't edit them.
The job is still running.
Perhaps @yuvipanda or another admin can stop this job? As a first step, just qdel 4975529 please.

Andrew added a subscriber: Andrew. May 30 2017, 6:27 PM

I have applied hashar's patch, and restarted the tool. I will also email the maintainer and direct them to this discussion.

I've asked the maintainer to verify the change and then close this ticket.

Framawiki lowered the priority of this task from High to Normal. May 30 2017, 6:59 PM

Perhaps the tool owner can take a look at continuous jobs, which would be more appropriate here.

Sorry, my code was wrong. time.sleep() does not accept a string; it requires an int or a float. So we have to drop the single quotes around the throttle value:

- throttle = '10'  # seconds
+ throttle = 10  # seconds
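To make the corrected behaviour easier to check, the patched loop can be sketched as a small function. This is not the code deployed on the tool: the poll name and its fetch, on_change, sleep and max_polls parameters are invented here so the loop can be exercised without touching the API. Like the patch, it sleeps only when nothing has changed.

```python
import time

THROTTLE = 10  # seconds; a number, because time.sleep() rejects strings


def poll(fetch, on_change, throttle=THROTTLE, sleep=time.sleep,
         max_polls=None):
    """Call fetch() repeatedly and report each new value to on_change().

    Sleeps `throttle` seconds whenever the fetched value is unchanged.
    `max_polls` bounds the loop for testing; None means run forever.
    """
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        latest = fetch()
        polls += 1
        if latest == last:
            sleep(throttle)
        else:
            on_change(latest)
            last = latest
```

In watch.py this would be called roughly as poll(fetch_latest, log_line), where log_line stands for a hypothetical helper that appends to creations.log. Sleeping unconditionally on every iteration would bound the request rate even while articles are being created in quick succession, which the patch as written does not do.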

Note the spam is still happening.

Please remove the file or quit the job. I do not want to maintain this job, Thanks.

hashar closed this task as Resolved. Sep 21 2018, 8:15 PM
hashar claimed this task.

Patched to throttle. The script is probably no longer running.