Documented the "new" section at https://wikitech.wikimedia.org/wiki/Add_a_wiki#MediaWiki_configuration
As a note, the list and the section name don't have to match, so there is nothing technically incorrect, but it is confusing (using the same name to refer to the same thing is something I recommend).
Sorry, Andrew, I am insisting because I may have been misunderstood. You have done step 2. Step 1, for clarity, would be to edit the db-eqiad.php and db-codfw.php structure (which I wouldn't recommend doing on a Friday; please add us as reviewers so we are aware of it, as we edit that file heavily).
section defined or named
Thu, Apr 19
If I have to guess, I would say it is the combination of the stretch version + high load (whether it is network, cpu or io, I cannot say)- I think the enwiki API hosts are ones with lots of ongoing connections/traffic. We should ask Traffic if they have any large-traffic server with stretch.
Wed, Apr 18
For example, as a procedure, could activity on the port be checked before it is disabled, to confirm the host is down/moved away?
Okay, I feel we should check what went wrong (was it the clarity of the communication, was it a one-time mistake that is unlikely to happen again, was it the extended downtime on icinga that made the issue not be immediately apparent)?
Were the right interfaces disabled after the revert?
Dzahn, the reason this should not be set to high (and could almost be public) is that there is already a workaround in place/the other vulnerability does not affect us. So according to multiple people (Moritz, Alex), this should not be a priority.
Tue, Apr 17
firstname.lastname@example.org[enwiki_p]> SELECT count(*) FROM recentchanges WHERE rc_timestamp > '201804151148';
1 row in set (10.98 sec)
@Rduran Do you think you can take care of this? There is a prototype at https://gerrit.wikimedia.org/r/280947 but all the other Remote Calling methods should be dropped and cumin used instead ( https://wikitech.wikimedia.org/wiki/Cumin ). Sadly, Cumin is python2 only for now.
I am going to set up s1 on dbstore1001.
Done as T192349
# Check the revision table definition on every s1 host (s1.hosts lists one "host port" pair per line):
while read host port; do ./mysql.py -h "$host:$port" enwiki -e "SHOW CREATE TABLE revision\G"; done < s1.hosts
root@neodymium:~$ ./mysql.py -h db1052 enwiki
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2447007520
Server version: 10.0.28-MariaDB MariaDB Server
Things to do:
Mon, Apr 16
commonswiki errors due to deadlocks on INSERT IGNORE INTO wbc_entity_usage seem to be common (not too worrying, but one of the most common database errors), could the code be optimized to avoid those? I am guessing that the same row is written many times (once per change on the same item), and maybe that could be simplified somehow. INSERT IGNORE is a bit of a bad trick here, and we may be writing the same data multiple times without need. Given that the changes are done by the job queue and arrive in any order, maybe transaction serialization can be relaxed?
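For illustration only, a minimal sketch of the kind of write pattern I have in mind (the column names and the unique key are my assumptions about the wbc_entity_usage schema, not verified here): do a plain, non-locking read first and only attempt the insert when the row is missing, so repeated notifications for the same item become no-op reads instead of conflicting INSERT IGNOREs.

-- Hypothetical check-before-write; assumed unique key (eu_entity_id, eu_aspect, eu_page_id).
SELECT 1
  FROM wbc_entity_usage
 WHERE eu_entity_id = 'Q42' AND eu_aspect = 'S' AND eu_page_id = 12345;

-- Only when the SELECT above returns no row:
INSERT INTO wbc_entity_usage (eu_entity_id, eu_aspect, eu_page_id)
VALUES ('Q42', 'S', 12345);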
Fri, Apr 13
Adding the tag to reflect work done at network layer.
Thu, Apr 12
Giuseppe mentioned some test stretch patches on beta; it may be unrelated, but this way he is aware of the ongoing issues.
I found T86530, which may be outdated, but may help with giving more options.
The reset that a previous ticket suggested was T191977#4123270 (racadm reset).
@Papaul you are now free to handle the server- it is up, but depooled and with all services down. I would try the reset I proposed earlier first, and if that doesn't work, maybe check the bios/admin config?
Now that I have a way to test it, we can proceed, depooling:
Not now, I will have to depool it. Give me 5 minutes.
It probably crashed today at 2018-04-12 13:31:20; the hardware logs should be checked.
That would explain the disconnections- too many connections leads to heartbeat check failures, which lead to disconnections.
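(To make the "too many connections" hypothesis easy to confirm or discard, this is the kind of quick check I mean on the affected host; I am not asserting anything about the exact limits configured there:)

-- Current and peak connection counts vs. the configured limit.
SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected', 'Max_used_connections', 'Aborted_connects');
SHOW GLOBAL VARIABLES LIKE 'max_connections';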
Mmm, so API queries timing out get killed? That could be it. But aren't those connection errors? Needs more research.
The errors would be consistent with the 10-interval in which the connections happen (bursts of high activity), but not so large that I would think it is a hardware error.
No issue or locking or strangeness of any kind on any server?
We are fans of caching! :-)
My suggestion for the future, for new code/new views (not for regular things like dropping views or adding new wikis), would be to test extensively on a depooled host, to avoid bugs and security issues (maybe you did that already, I didn't follow all the details).
@Anomie, you are the best!
Wed, Apr 11
We know it is mediawiki; I discovered it through the application logs on logstash.
T150160 suggests racadm reset may fix it.
@EddieGP No queries will be lost, but if they pile up and block the wiki's activity, it will be a worse issue (an actual outage or an edit outage).
@EBjune The largest issue right now, from the reporter's point of view, that would threaten the stability of the site is some database-related work. #DBAs want to take care of that, but it may need some code maintenance. Is that something that your team could help with? It should be a one-time thing, as far as the database bugs are concerned.
Tue, Apr 10
I am going to close this ticket, as the initial report, "Deletion not working", was resolved as soon as the maintenance finished. We hope that T191892 will mitigate the issues in the future, but we can only test whether that is true after that change is deployed. We will monitor and reopen if we gather more information/the mitigation doesn't work. Feel free to use the Incident talk page for further questions and comments rather than this ticket.
I believe this has been happening for some time now, but this incident made it more visible (happening not only for large deletes, but for small ones, too): https://logstash.wikimedia.org/goto/9facbbd99d63704f215285470b16d6f5
I agree with everything you said; my comment was a quick sketch, and what you proposed is what I really wanted. I created T191892 to handle that there.
CC @Anomie this is not directly related- maintenance was the direct cause, but I believe the new comment model may be creating worse locking patterns on deletion, with queries like:
SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_content_format,rev_content_model FROM `revision` LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) WHERE rev_page = 'X' FOR UPDATE
Could the SELECT ... FOR UPDATE be restricted to the revision table, with the comment selected in a second query, without locking? My thesis is that the extra locking could affect deletions, as all of them will try to take an exclusive lock on the same "deletion reason comment", which creates higher contention. So something like:
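(A sketch only; the simplified column list is taken from the query above, and whether MediaWiki can actually split it like this is for the developers to judge.)

-- 1) Lock only the revision rows being deleted:
SELECT rev_id, rev_page, rev_text_id, rev_timestamp, rev_minor_edit, rev_deleted,
       rev_len, rev_parent_id, rev_sha1, rev_comment, rev_user, rev_user_text,
       rev_content_format, rev_content_model
  FROM revision
 WHERE rev_page = 'X'
   FOR UPDATE;

-- 2) Fetch the comments separately, without FOR UPDATE, so the shared
--    comment rows are never locked exclusively:
SELECT temp_rev_comment.revcomment_rev AS rev_id,
       comment_rev_comment.comment_text, comment_rev_comment.comment_data,
       comment_rev_comment.comment_id
  FROM revision_comment_temp temp_rev_comment
  JOIN comment comment_rev_comment
    ON comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id
 WHERE temp_rev_comment.revcomment_rev IN (/* rev_ids from query 1 */);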
These were the queries ongoing at that time:
What I saw was INSERTs into the table being altered blocked due to metadata locking, but that would not make sense except at the start of the command, or the command would fail in 30 seconds. Maybe it requires a second metadata lock under certain conditions?
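(For reference, and not assuming anything about the specific ALTER involved, this is the kind of check I mean to see which sessions are waiting on a metadata lock while it runs:)

-- Sessions currently waiting on a table metadata lock, and what they are running.
SELECT ID, USER, DB, TIME, STATE, INFO
  FROM information_schema.PROCESSLIST
 WHERE STATE LIKE '%metadata lock%'
 ORDER BY TIME DESC;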
Normal, as the incident should be solved; we now have to research what actually happened.
I would do the second.
I would honestly move the x1 replica (or the master directly), probably in a logical way, somewhere else- we don't want to serve the whole service from the same row, and x1 is like s4 and s8- it is not really that easy to put in read only because of cross-wiki dependencies. The x1 hosts will have to be moved anyway, but we can serve it for some time with a single host.
Yes, we could do this on the masters even with a table reconstruction- but we should first check whether a table reconstruction is actually needed for a definition-only change.
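(One way to check it, as a sketch only and with a placeholder column change rather than the actual one under discussion: force the in-place algorithm so the server errors out immediately instead of silently copying the whole table.)

-- Hypothetical definition change; ALGORITHM=INPLACE, LOCK=NONE makes the ALTER
-- fail fast if a full table rebuild would be required, so it doubles as a check.
ALTER TABLE revision
  ALTER COLUMN rev_minor_edit SET DEFAULT 0,
  ALGORITHM=INPLACE, LOCK=NONE;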
Mon, Apr 9
@TTO I think the explain output is likely to change when the data changes, and this is probably caused by a user with lots of change tags/logs, which may not show up on your local installation. It could also change based on the DBMS version used. Normally here I copy the results from production, so they are quite accurate- if not, we would not have received an error report in the first place.
I am not seeing linter issues lately; I will open a new ticket if they come back.
I've changed the title to better reflect that I don't want to remove this; in fact, what I want is better support for it, as the current state is affecting me.
That last suggestion looks like a blocker to me; at the very least it should be checked before doing anything.
Fri, Apr 6
The Wikidata team will test it on the test host and create a task for production deployment with all the requested changes/full strategy. Until then, there is nothing for us to do here. If you need help, re-add us with a specific request.
Yes, resolved on our side.
On another note, cronjobs still reference silver on production; shouldn't that change too? Can you comment on why this was closed as invalid?