Sat, Nov 18
Fri, Nov 17
Thu, Nov 16
The first query essentially has to go through all 38 million items and sort sitelink values for them. I don't think this can be done efficiently.
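For illustration, my reconstruction of the rough shape of such a query (not the exact query from the report) would be:

```
# Sorting all items by sitelink count: Blazegraph has no index for this,
# so it has to scan every item and aggregate before it can sort anything.
PREFIX schema: <http://schema.org/>
SELECT ?item (COUNT(?sitelink) AS ?links) WHERE {
  ?sitelink schema:about ?item .
}
GROUP BY ?item
ORDER BY DESC(?links)
```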
Reload is now done.
Wed, Nov 15
Should be fine now (requires new edit or reindex for missed ones, of course).
Seems to be working ok on test.wikidata.org, so closing. If it fails on wikidata when the train moves on, please reopen.
@thiemowmde I think it should fix it but can't check since it's not deployed yet (not even on test). So I'd like to keep it open until we can verify the problem is indeed gone.
I believe a 1.5% slowdown is acceptable for the functionality. I will review the code soon and add my comments.
Tue, Nov 14
make-wmf-branch's config.json requires manual updating
do you have a list of bots that you've created or some that you know of?
From the first graph I conclude that internal traffic is about 0.2-0.5 of the external traffic, which means three or even two servers could serve it without much trouble, maybe even if we relax the timeouts a bit. We also get some LDF traffic, but not much, so either keeping the LDF server where it is or adding a separate internal LDF server would be OK; LDF is not very popular with internal clients, and running it off a single server is fine.
@dcausse can it be that we need to add noop hints for the description field too?
Mon, Nov 13
Also see https://wiki.blazegraph.com/wiki/index.php/QueryHints (esp. runFirst and runLast) for how to control when the service runs (you probably want runLast).
Probably can be done as a service:
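For example, a minimal sketch of a SERVICE call combined with the runLast hint from the QueryHints page above - the service URI and parameter predicate here are made up for illustration:

```
PREFIX hint: <http://www.bigdata.com/queryHints#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?out WHERE {
  ?item wdt:P31 wd:Q146 .                     # ordinary query part
  SERVICE <http://example.com/my-service> {   # hypothetical service URI
    ?item <http://example.com/result> ?out .  # hypothetical parameter predicate
  }
  hint:Prior hint:runLast "true" .            # ask Blazegraph to run the service last
}
```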
Sat, Nov 11
HOME=/root sudo depool seems to be working.
Fri, Nov 10
@Dzahn it asks me for a password then.
Doesn't look like it is working:
I suspect something other than data growth is to blame here. As I said, the size changed by less than 1.5% since last week, but the dump time grew by 21 hours. The previous dump started on Monday at 23:00 and finished Wednesday at 00:56 - 26 hours. The last one started, again, on Monday at 23:00 and finished Wednesday at 22:19 - 47 hours. That's 80% growth in time for 1.5% growth in size, so I don't think size is the reason.
Supporting +00:00 would be easy. Supporting other TZs would be a bit harder, I imagine.
Thu, Nov 9
Thank you @TJones I think this is exactly what I needed.
@TJones this is something you may want to look at I think :)
The dump finally finished at 08-Nov-2017 22:19 for .gz, vs. 01-Nov-2017 00:56 for the previous week. That's 21 hours slower than the previous week, with a size of 27577365639 vs. 27211647006 bytes - less than a 1.5% increase. Something has definitely changed.
Wed, Nov 8
So it's basically the same data size - the difference is less than 1% in size. However, the generation takes much longer than before - something is not working the way it should.
Tue, Nov 7
Mon, Nov 6
I think for now limiting it by IP should probably work? I think IP ranges from production hosts, labs and outside are segregated?
Sun, Nov 5
The benchmark looks pretty good. I'll review the patches a bit later (I have a cold right now and it's hard for me to concentrate) and if the code is OK I see no reason not to merge it. We could also do another xhprof run before that to see if there are any other things we could improve.
Sat, Nov 4
Fri, Nov 3
Looks like I found the reason: we use proxy_intercept_errors, which sends 429s to a separate location block, and that block does not have the logstash access log configured.
From my tests on wdqs-test, it looks like 429 is reported to /var/log/nginx/access.log but not to logstash or the access log that is inside the SPARQL location clause. Not sure why that is.
it doesn’t even open the output file until it’s done converting
@Gehel, do you have any thoughts on this?
Thu, Nov 2
There's a bit of a problem here, because technically the WDQS dataset is not the same as any dump. It is live-updated, unlike a dump, which is updated once a week; the WDQS dataset could have been loaded from a dump months ago and live-updated since then, so WDQS has no way of knowing whether any dumps happened after that or where they can be found. We could just import https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf, but I am not sure how useful that would be, or what the difference is between that and just having it available for download. Is it just that we could query it over SPARQL if we load it?
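If we did load it, the metadata could be queried with something like the query below - a sketch assuming the standard DCAT vocabulary that dcatap.rdf uses; the exact shape of the data may differ:

```
# List the available dump distributions and their download URLs.
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?distribution ?url WHERE {
  ?dataset a dcat:Dataset ;
           dcat:distribution ?distribution .
  ?distribution dcat:downloadURL ?url .
}
```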