Fri, Jan 10
I could really use some help trying to resolve this issue. The webservice is somewhat more stable now that I am running it with the Kubernetes backend instead of on the grid, but this just means that it fails every few hours instead of every few minutes, on average.
Thu, Jan 2
This issue is no longer relevant, since I have been able to start the webservice under Kubernetes and no longer rely on the grid servers.
Sun, Dec 29
The webservice for dplbot now rarely stays up for more than 10-15 minutes at a time. There are still no error messages whatsoever in the logs giving any clue as to the cause of the outages. Note that, as reported originally, the lighttpd process continues to run during these outages, but all HTTP requests return a 503 error. (I have tried using webservice start --backend=kubernetes, but this results in all requests returning a 502 error.) I have also reopened T218915, a related issue.
Tue, Dec 24
This issue has resurfaced. /data/project/dplbot/service.log shows that the last attempted restart was at 2019-12-22T06:24:17.778139; it is now Tue Dec 24 15:49:50 and the webservice has been down for over 48 hours. I can restart it manually, but it will inevitably crash again sometime in the near future.
Oct 15 2019
Mar 24 2019
no longer an issue
Mar 22 2019
It's done as far as s51290 is concerned; thanks!
As of this morning, all of the jobs listed in my previous comment are now gone.
As of this morning I was able to start the webservice again.
Mar 21 2019
Mar 19 2019
I have four unkillable jobs stuck on the stretch grid for two different tools:
Mar 1 2019
@jcrespo please restore the table s51290__dpl_p.dab_hof ; thanks in advance!
Feb 19 2019
Also, please try
Feb 18 2019
Thanks. The tables that I'd like to recover from database s51290__dpl_p if possible are as follows:
Although I can probably recreate most of the tables in s51290__dpl_p from scratch, there are a few tables I’d like to recover if they turn out to be recoverable. (I’m mobile now but can come up with a list of tables sometime within the next 24 hours).
Feb 16 2019
Jul 5 2018
All working now.
Jul 2 2018
Sorry for the delay in responding to this. I was able to update the list manually for you.
Feb 26 2018
Jan 2 2018
Nov 16 2017
No one contacted me during this whole process, until after everything had been "resolved". I won't run the job with MEMORY tables again, so you can remove the filters.
Nov 3 2017
To answer your question: We cannot ensure that titles in a user database are unique unless we can set a UNIQUE KEY on the entire column; two titles might be identical in the first 254 characters but differ in the 255th (in the most extreme possible case).
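The extreme case described above can be sketched in Python. This only illustrates the string comparison (two titles that agree on their first 254 characters and differ at the 255th); it does not model how MariaDB itself handles prefix indexes, and the helper name `unique_under_prefix` is hypothetical.

```python
def unique_under_prefix(titles, prefix_len):
    """Return True if no two titles share the same first prefix_len characters."""
    seen = set()
    for title in titles:
        key = title[:prefix_len]  # compare only the leading prefix_len characters
        if key in seen:
            return False
        seen.add(key)
    return True


# Two distinct 255-character titles that differ only in the last character.
base = "A" * 254
titles = [base + "x", base + "y"]

full_unique = len(set(titles)) == len(titles)     # compares the entire title
prefix_unique = unique_under_prefix(titles, 254)  # compares the first 254 characters only
```

Comparing whole titles treats the two as distinct, while comparing only the 254-character prefix treats them as the same key, which is why uniqueness has to be checked on the entire column.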
Nov 2 2017
Oct 25 2017
I can't reproduce the issue now, and the bot seems to be logging in as before.
Oct 16 2017
User error :-) please disregard.
I have found an apparent replication issue on one of the new servers:
Sep 22 2017
Is it possible to give s51290 read-only access to labsdb1001 until I have a chance to update the troublesome scripts?
Sep 4 2017
What is the specific query that seems to be causing the problem now?
Aug 24 2017
More detail ... in the commons family file (families/commons_family.py), the values in self.category_redirect_templates aren't being picked up when the Family object's .category_redirects() method is called.
Aug 10 2017
ENGINE=MEMORY changed to ENGINE=InnoDB for all dplbot jobs.
Aug 9 2017
Well, this came as quite a surprise, but I've gone ahead and converted all of dplbot's user databases to use InnoDB. (Hard to believe we were causing all this lag, since the lag was continuing to be a problem even when none of dplbot's jobs were running, but whatever.)
May 25 2017
@jcrespo: Is there (or will there be) a way to access and create user databases on the new servers (as per https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#User_databases)?
Jan 26 2017
Still not working; same error message as previously reported (ERROR 1045).
Jan 25 2017
Nov 5 2016
MariaDB [enwiki_p]> select page_title from page join pagelinks on pl_from = page_id where page_namespace=0 and pl_namespace = 0 and pl_title="Will_Johnson";
+----------------------------------+
| page_title                       |
+----------------------------------+
| Denver_Broncos                   |
| List_of_current_AFC_team_rosters |
| Will_Johnson_(disambiguation)    |
| William_Johnson_(rugby_player)   |
| Will_Johnson_(American_football) |
| Johnson,_Will                    |
| 2016_Denver_Broncos_season       |
+----------------------------------+
7 rows in set (0.01 sec)
Is there any update on the status of the database? The erroneous entries reported back in May are still an issue. It has been nearly six months; is the reimport complete yet? If not, is there an ETA?
Sep 8 2016
It is down again and not restarting. It just went down at approx. 10:20 GMT today (Thursday).
Aug 6 2016
I've had to restart it manually at least once a day for the past several days, although not today (so far).
Jul 22 2016
Jul 21 2016
It is currently down again. Shell shows the following:
Jul 19 2016
It is down at the moment. "webservice status" says it is running, but "qstat" shows no server process running.
Yes, I had to manually restart it twice today. The automatic webservice restarter is not working.
Reopening; same symptoms are occurring again.
Jun 1 2016
Sorry; it appears that I must have stopped reading before the end of the sentence. :-*>
May 31 2016
Is there any update on the status of this? On 23 May, the revision table was in progress and was expected to take ~12 hours. The pagelinks table is about 3X larger and so might be expected to take ~36 hours. But this was eight days ago, so even if the estimates were off by a factor of 2 (or 4), the process could have completed by now.
May 22 2016
This has been happening multiple times per month, sometimes more than once in a week. When it happens, it can be fixed simply by re-running the nightly job on the Tool Labs pywikibot account. However, this has to be done manually. It would be much better if the script could be modified to detect failures and start over.
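As a rough sketch of what "detect failures and start over" could look like, assuming the nightly job can be wrapped in a Python callable that raises an exception when it fails: the function name `run_with_retries` and its parameters are hypothetical and are not part of pywikibot or the existing script.

```python
import time


def run_with_retries(job, max_attempts=3, delay_seconds=0.0):
    """Call job() until it succeeds or max_attempts is reached.

    job is any zero-argument callable that raises an exception on failure.
    Returns the result of job(); raises RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:  # assumption: any exception means the run failed
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before starting over
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error
```

A wrapper of this kind would cover only failures that surface as exceptions; a run that exits cleanly but produces incomplete output would need a separate check.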
May 2 2016
Jan 19 2016
Dec 30 2015
No cron jobs have run for approximately the last six hours. The error messages I've been getting vary quite a bit; the most recent one was:
Oct 21 2015
Oct 11 2015
I'm not entirely sure because I have been mostly offline myself for several days, but I noticed it definitely on Saturday 10 Oct.