In T172410#4965217, @mpopov wrote: I just noticed that the tables related to the Echo extension are (surprisingly) not yet available in the enwiki shard (s1-analytics-replica.eqiad.wmnet), but are in analytics-store.eqiad.wmnet. Is there a page we can refer to to check on parity/status of data availability?
Feb 19 2019
Nothing has arrived since the restart without debug, so I think we are good
Marostegui closed T216273: New cronspam from db clusters, a subtask of T132324: Tracking and Reducing cron-spam to root@ , as Resolved.
Marostegui added a comment to T149077: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki.
Could this be another case of MariaDB getting the optimizer fixed with a new version as it doesn't happen on 10.1.36 or 10.1.37 for the original query?
root@db1070.eqiad.wmnet[dewiki]> EXPLAIN SELECT /* ApiQueryRecentChanges::run */ rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>='20161024013525') AND rc_namespace IN ('0', '120') AND rc_type IN ('0', '1', '3', '6') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: range
possible_keys: rc_timestamp,rc_ns_usertext,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp
          key: rc_timestamp
      key_len: 16
          ref: NULL
         rows: 518658
        Extra: Using index condition; Using where
1 row in set (0.00 sec)
Reducing priority: the errors on dbstore1002 are no longer very important, as this host shouldn't be used anymore and everything using it should migrate to the new hosts (T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5]).
For what it's worth, dbstore1002 is now lagging 7 days behind on s8 (wikidatawiki) and the lag keeps growing; I doubt it will ever catch up.
Yesterday the staging database was migrated to dbstore1003-1005 (T210478#4963411), so everyone should start using that one as soon as possible, especially after seeing so many crashes, lag that will never recover, and corrupted data (due to the above crashes).
Marostegui closed T174802: Archive and drop education program (ep_*) tables on all wikis as Resolved.
This is all done.
The only pending follow-up is removing the views, which has its own task: T216481: Remove views on ep_* tables on the wikireplicas hosts
Marostegui updated the task description for T174802: Archive and drop education program (ep_*) tables on all wikis.
Marostegui triaged T216481: Remove views on ep_* tables on the wikireplicas hosts as Medium priority.
Marostegui updated the task description for T174802: Archive and drop education program (ep_*) tables on all wikis.
Marostegui added a comment to T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092.
db1106 has been rebooted (and kernel was upgraded)
Marostegui updated the task description for T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092.
I have rebooted db1106; I will give it some time to confirm the spam is gone before closing this task.
Marostegui updated the task description for T174802: Archive and drop education program (ep_*) tables on all wikis.
Marostegui changed the status of T216444: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed from Open to Stalled.
Marostegui changed the status of T216444: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed, a subtask of T169440: Pending global renames in need of sysadmin supervision (tracking), from Open to Stalled.
Marostegui updated the task description for T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5].
Marostegui added a comment to T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5].
The migration finished. These are the times in UTC from 18th Feb 2019:
Feb 18 2019
Marostegui added a comment to T216444: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed.
This should wait until T215107 is unblocked and resolved (see T215107#4962933).
Marostegui added a comment to T216441: Evaluate transferring the non-replicated tables to the new toolsdb server.
Of course :-). Just mentioning this as an idea to Cloud Team
Marostegui added a comment to T216441: Evaluate transferring the non-replicated tables to the new toolsdb server.
Just saying: we have a testing host where we could try to import those databases from labsdb1005 and see if they fail or what fails during the import process.
Let me know if I can help with this.
Marostegui added a comment to T215107: Global rename of The_Photographer → Wilfredor: supervision needed.
There have been no retries from what I can see on: https://logstash.wikimedia.org/goto/65afdb88fef30982130c53e40a644b06
Marostegui added a comment to T215107: Global rename of The_Photographer → Wilfredor: supervision needed.
It timed out on Commonswiki:
https://logstash.wikimedia.org/goto/34de73560ce6692f0012e846f7a4de0c
Maybe @Legoktm can help to unblock it?
Marostegui added a comment to T215107: Global rename of The_Photographer → Wilfredor: supervision needed.
Go for it!
Marostegui updated the task description for T174802: Archive and drop education program (ep_*) tables on all wikis.
Marostegui added a comment to T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020.
I have been talking to @aborrero about the new instance on clouddb1001 - and I have been taking a general look.
While comparing the grants, I realised that clouddb1001 is missing a grant for user s52716 (the grant exists on labsdb1005); it could be a new user. I can easily copy that grant over to clouddb1001, but I want the green light from @Bstorm first, just in case this has something to do with maintain-dbusers or something :-)
Marostegui updated the task description for T174802: Archive and drop education program (ep_*) tables on all wikis.
I will take care of db1106 as I need to depool it anyways today or tomorrow.
Marostegui updated the task description for T174802: Archive and drop education program (ep_*) tables on all wikis.
s1 eqiad progress
- labsdb1011
- labsdb1010
- labsdb1009
- dbstore1003
- dbstore1002
- dbstore1001
- db1124
- db1119
- db1118
- db1106
- db1105
- db1099
- db1089
- db1083
- db1080
- db1067 T210713#4967984
I think we can close this, we already have the tasks in place:
T216142
T216138
T216137
T214069
T214066
Marostegui updated the task description for T174802: Archive and drop education program (ep_*) tables on all wikis.
db2085 has been rebooted - let's see if that stops the amount of emails.
Marostegui added a comment to T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092.
I have rebooted db2085 without the debug option on the kernel as part of T216273, and I have taken the opportunity to upgrade its kernel too.
Marostegui updated the task description for T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092.
Feb 17 2019
Marostegui edited projects for T216353: toolsdb: firewalling changes for new setup (temporal mysql replication), added: User-Marostegui; removed DBA.
Feb 16 2019
We probably just need to reboot them without the kernel running in debug mode, as discussed on Friday.
Feb 15 2019
Ah, never mind my comment, you decided to completely move away from dbstore1002 :-)
Thanks!
Another staging database where? Just to clarify: dbstore1002 will be fully read-only after the migration (MySQL doesn't allow setting read-only at the database level; it is a global flag).
For what it's worth, I do have a Herald rule that automatically subscribes me to any degraded RAID ticket for the databases, and that has proved to be a good way to get my attention, as otherwise monitoring the Operations queue is hard and it is easy to miss things.
Marostegui added a comment to T216213: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues.
I don't know what the situation was last night, as I wasn't present during the troubleshooting. However, we had pretty much the same issue at around 6AM UTC today, and from what I could see, your user wasn't among the ones creating issues (see T216208#4956626).
For what it's worth, the server has looked stable for one hour now, since I enabled the global max_user_connections. It might prevent some tools from working if they require more than 20 connections, but at least the rest of the tools/users do not suffer the outage.
As per my conversation with @Bstorm, this is a temporary mitigation to get the server under control again. If we eventually go for a per-user limit, we should look at the individual cases where we will need to increase the connection limit, as we do with the wikireplicas.
Marostegui edited projects for T216223: Degraded RAID on labsdb1005, added: Toolforge; removed Data-Services.
cloud-services-team: I would suggest you coordinate with @Cmjohnson to get this disk replaced.
Marostegui added projects to T216223: Degraded RAID on labsdb1005: cloud-services-team, Data-Services.
Marostegui lowered the priority of T216183: Special:ProtectedPages times out on enwiki for Module namespace from High to Medium.
Another brilliant analysis from @Anomie :-)
(Decreasing priority as this doesn't seem to happen very often as per: https://logstash.wikimedia.org/goto/4854d6d92b272ad88d23696570c7dad6)
Cross posting from the main track task as an emergency mitigation: T216208#4956634
I have restarted the server with max_user_connections = 20 to try to mitigate this; the server was unusable anyway.
The server is hitting "too many connections" again:
root@labsdb1005:~# mysql --skip-ssl information_schema -e "select user, count(*) as count FROM information_Schema.processlist GROUP BY user ORDER BY count DESC limit 10"
+----------+-------+
| user     | count |
+----------+-------+
| u2815    |   169 |
| s52552   |   151 |
| watchdog |   121 |
| s53098   |    87 |
| s51344   |    78 |
| s52524   |    45 |
| s53213   |    40 |
| s51434   |    23 |
| s52585   |    22 |
| s52680   |    20 |
+----------+-------+
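To get a feel for what a limit of 20 connections per user would mean for the accounts in that output, here is a minimal Python sketch. It only parses the table as pasted above; the helper names (`parse_counts`, `over_limit`) are made up for illustration and are not part of any existing tooling.

```python
import re

# Processlist summary copied from the comment above.
PROCESSLIST = """
+----------+-------+
| user     | count |
+----------+-------+
| u2815    |   169 |
| s52552   |   151 |
| watchdog |   121 |
| s53098   |    87 |
| s51344   |    78 |
| s52524   |    45 |
| s53213   |    40 |
| s51434   |    23 |
| s52585   |    22 |
| s52680   |    20 |
+----------+-------+
"""

# Matches data rows only; the header row fails on the numeric column
# and the border rows do not start with "|".
ROW = re.compile(r"^\|\s*(\S+)\s*\|\s*(\d+)\s*\|$")


def parse_counts(text):
    """Return {user: connection_count} from the mysql client's tabular output."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def over_limit(counts, limit=20):
    """Users holding more connections than the limit, busiest first."""
    users = [user for user, count in counts.items() if count > limit]
    return sorted(users, key=lambda user: -counts[user])


counts = parse_counts(PROCESSLIST)
print("total connections in top 10:", sum(counts.values()))
print("users above 20 connections:", over_limit(counts))
```

A user sitting exactly at 20 is not flagged here, since the sketch treats the limit as the maximum allowed value.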
Marostegui added a comment to T216213: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues.
That can be a consequence and not really the cause. If the server is too overloaded, it might not be able to create the database, and the code might keep retrying, and as we don't have a per-user limit...
Both things should probably be fixed: 1) the code, and 2) establishing a per-user limit.
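On the code side, one possible shape of the fix is to issue the `CREATE DATABASE IF NOT EXISTS` statement once per process rather than alongside every query. The sketch below is only an illustration under assumptions: `execute` stands in for whatever function the tool uses to run SQL, `ensure_database` is a hypothetical helper, and the database name is assumed to be trusted.

```python
# Names of databases this process has already ensured.
_ensured = set()


def ensure_database(execute, name):
    """Run CREATE DATABASE IF NOT EXISTS at most once per process for `name`.

    `execute` is a hypothetical callable that sends one SQL statement to the
    server. Returns True if the statement was sent, False if it was skipped.
    """
    if name in _ensured:
        return False
    # If execute() raises (for example because the server is overloaded),
    # the name is not recorded, so a later call will try again.
    execute(f"CREATE DATABASE IF NOT EXISTS `{name}`")
    _ensured.add(name)
    return True


# Usage with a stand-in for the real connection: record statements in a list.
sent = []
for _ in range(3):
    ensure_database(sent.append, "s52481__stats_global")
print(sent)
```

With this pattern the three calls in the usage example result in a single statement being sent.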
Feb 14 2019
And also very very old hardware.
Marostegui edited projects for T216183: Special:ProtectedPages times out on enwiki for Module namespace, added: User-Marostegui; removed DBA.
I believe this is the query, which is too slow and gets killed by the query killer:
root@db1106.eqiad.wmnet[enwiki]> explain SELECT /* IndexPager::buildQueryInfo (ProtectedPagesPager) */ pr_id, page_namespace, page_title, page_len, pr_type, pr_level, pr_expiry, pr_cascade, log_timestamp, log_deleted, comment_log_comment.comment_text AS `log_comment_text`, comment_log_comment.comment_data AS `log_comment_data`, comment_log_comment.comment_id AS `log_comment_cid`, log_user, log_user_text, NULL AS `log_actor` FROM `page`, `page_restrictions` LEFT JOIN `log_search` ON (ls_field = 'pr_id' AND (ls_value = pr_id)) LEFT JOIN (`logging` JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id))) ON ((ls_log_id = log_id)) WHERE (pr_expiry > '20190214194211' OR pr_expiry IS NULL) AND (page_id=pr_page) AND (pr_type='edit') AND (page_namespace='828') ORDER BY pr_id LIMIT 101 ;
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+--------+---------------------------------+
| id   | select_type | table               | type   | possible_keys                | key        | key_len | ref                           | rows   | Extra                           |
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+--------+---------------------------------+
|    1 | SIMPLE      | page                | ref    | PRIMARY,name_title           | name_title | 4       | const                         |  25438 | Using temporary; Using filesort |
|    1 | SIMPLE      | page_restrictions   | eq_ref | PRIMARY,pr_page,pr_typelevel | PRIMARY    | 261     | enwiki.page.page_id,const     |      1 | Using where                     |
|    1 | SIMPLE      | log_search          | ref    | PRIMARY                      | PRIMARY    | 34      | const                         | 293280 | Using where; Using index        |
|    1 | SIMPLE      | logging             | eq_ref | PRIMARY                      | PRIMARY    | 4       | enwiki.log_search.ls_log_id   |      1 | Using where                     |
|    1 | SIMPLE      | comment_log_comment | eq_ref | PRIMARY                      | PRIMARY    | 8       | enwiki.logging.log_comment_id |      1 |                                 |
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+--------+---------------------------------+
5 rows in set (0.00 sec)
From what I can see, there is no connection limit on labsdb1005; maybe we need to establish a per-user connection limit similar to what we have on the replicas. Better to "break" a tool than the whole server.
We can probably also take a look at those specific tools that might need more than X connections (where X is the limit we decide to set).
Sure - go ahead :-)
All eqiad servers from the same batch as db1106 are running 4.9.0-8 already
db1096-db1106
Reboot tests with db2085 4.9.0-8 after getting the BIOS and FW upgraded by Papaul (T214840#4954418)
Thank you! I will delete the idrac logs and start testing
@Papaul thanks - I am going to put it down now. Will ping you on IRC once it is down
Thanks!
Marostegui closed Restricted Task, a subtask of T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5], as Resolved.
In T216067#4953498, @ArielGlenn wrote: @Marostegui /srv/sqldata has 39G on it on db04, presumably that's pretty close to the amount of data on the master.
Marostegui moved T215616: Improve interlingual links across wikis through Wikidata IDs from Triage to Blocked external/Not db team on the DBA board.
After the FW and BIOS were upgraded, I rebooted db1106 a number of times with 4.9.0-8 and this is the result:
As per
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Starting crash recovery from checkpoint LSN=407832716048
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [ERROR] InnoDB: checksum mismatch in tablespace ./enwiki/logging.ibd (table enwiki/logging)
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size:1024 Pages to analyze:64
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size: 1024, Possible space_id count:0
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size:2048 Pages to analyze:64
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size: 2048, Possible space_id count:0
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size:4096 Pages to analyze:64
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size: 4096, Possible space_id count:0
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size:8192 Pages to analyze:64
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size: 8192, Possible space_id count:0
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size:16384 Pages to analyze:64
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:1 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:2 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:3 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:4 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:5 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:6 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:7 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:8 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:9 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:10 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:11 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:12 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:13 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:14 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:15 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:16 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:17 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:18 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:19 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:20 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:21 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:22 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:23 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:24 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:25 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:26 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:27 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:28 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:29 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:30 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:31 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:32 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:33 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:34 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:35 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:36 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:37 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:38 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:39 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:40 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:41 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:42 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:43 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:44 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:45 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:46 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:47 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:48 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:49 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:50 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:51 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:52 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:53 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:54 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:55 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:56 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:57 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:58 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:59 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:60 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:61 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:62 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: VALID: space:24829 page_no:63 page_size:16384
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size: 16384, Possible space_id count:1
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: space_id:24829, Number of pages matched: 63/63 (16384)
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Chosen space:24829
Feb 13 18:12:54 deployment-db04 mysqld[1483]:
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Restoring page 0 of tablespace 24829
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Warning] InnoDB: Doublewrite does not have page_no=0 of space: 24829
Feb 13 18:12:54 deployment-db04 mysqld[1483]: 2019-02-13 18:12:54 7f238e13e780 InnoDB: Operating system error number 2 in a file operation.
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: The error means the system cannot find the path specified.
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: If you are installing InnoDB, remember that you must create
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: directories yourself, InnoDB does not create them.
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: Error: could not open single-table tablespace file ./enwiki/logging.ibd
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: We do not continue the crash recovery, because the table may become
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: corrupt if we cannot apply the log records in the InnoDB log to it.
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: To fix the problem and start mysqld:
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: 1) If there is a permission problem in the file and mysqld cannot
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: open the file, you should modify the permissions.
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: 2) If the table is not needed, or you can restore it from a backup,
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: then you can remove the .ibd file, and InnoDB will do a normal
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: crash recovery and ignore that table.
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: 3) If the file system or the disk is broken, and you cannot remove
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
Feb 13 18:12:54 deployment-db04 mysqld[1483]: InnoDB: and force InnoDB to continue crash recovery here.
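With a log this noisy, the interesting lines (the single [ERROR] and the affected tablespace file) are easy to miss among the page scan notes. The following Python sketch shows one way to summarise such a log; the sample lines are copied from the output above, while the function name `summarize` and the regular expressions are assumptions made for illustration.

```python
import re
from collections import Counter

# A few lines copied from the log above.
LOG_LINES = [
    "Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Starting crash recovery from checkpoint LSN=407832716048",
    "Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [ERROR] InnoDB: checksum mismatch in tablespace ./enwiki/logging.ibd (table enwiki/logging)",
    "Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Note] InnoDB: Page size: 16384, Possible space_id count:1",
    "Feb 13 18:12:54 deployment-db04 mysqld[1483]: 190213 18:12:54 [Warning] InnoDB: Doublewrite does not have page_no=0 of space: 24829",
]

LEVEL = re.compile(r"\[(Note|Warning|ERROR)\]")
IBD_FILE = re.compile(r"\./\S+\.ibd")


def summarize(lines):
    """Count messages per severity and collect .ibd files named in [ERROR] lines."""
    levels = Counter()
    files = set()
    for line in lines:
        match = LEVEL.search(line)
        if not match:
            continue  # continuation lines carry no severity tag
        levels[match.group(1)] += 1
        if match.group(1) == "ERROR":
            files.update(IBD_FILE.findall(line))
    return levels, files


levels, files = summarize(LOG_LINES)
print(dict(levels))
print(sorted(files))
```

Untagged continuation lines, such as the block of recovery advice at the end of the log, are skipped by this sketch.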
Probably the best approach would be to re-clone that new instance directly from the master - how big is the data size on the master?
@Cmjohnson should we also try to exchange the DIMM modules listed at T214720#4937872 and see if they fail again?
In T214840#4953022, @Papaul wrote: @Marostegui in most cases the "CPU1/CPU2 Machine check error detected" message is caused by an outdated BIOS. I recommend that we first update the BIOS. The system BIOS right now is at 2.4.3 and there is a new version out (2.9.1) from 11/02/2019. After this we can check some settings in the BIOS under BIOS profile.
Feb 13 2019
Chris has upgraded FW/BIOS on db1106 (thanks!) - so tomorrow I will do a few more reboots to keep debugging this.
In T215902#4950653, @Lucas_Werkmeister_WMDE wrote: term_type, small table holding the types of strings to be indexed in the db, right now this would be labels, descriptions and aliases, but this would scale to allowing more similar terms into the index (if desired). It might be the case that not having a table here would be better and just keep INT ids in code.
There may be little point in normalizing the term_type for example. This thing only has 3 rows. (languages is also pretty small)
We could also turn term types and languages into short IDs via a hash function: as far as I’m aware, Wikibase only needs the string→ID direction (hash function), and if we need the ID→string direction (e. g. during manual investigation) we can hash all the known term types / language codes and look for the value we have.
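The hashing idea quoted above can be sketched in a few lines of Python. This is only an illustration under assumptions: the choice of SHA-256 truncated to 32 bits, the function names, and the exact term type strings ("label", "description", "alias") are not from the task.

```python
import hashlib


def short_id(value, bits=32):
    """String -> integer ID direction: a truncated SHA-256 digest (assumed hash)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def build_reverse_index(known_values):
    """ID -> string direction: hash every known value and index it by its ID."""
    index = {}
    for value in known_values:
        key = short_id(value)
        if key in index and index[key] != value:
            # A collision would make the IDs ambiguous, so fail loudly.
            raise ValueError(f"hash collision between {index[key]!r} and {value!r}")
        index[key] = value
    return index


# Hypothetical spellings of the three term types mentioned in the task.
KNOWN_TERM_TYPES = ["label", "description", "alias"]
reverse = build_reverse_index(KNOWN_TERM_TYPES)

some_id = short_id("description")
print(some_id, "->", reverse[some_id])
```

The reverse lookup only works for values that are already known, which matches the manual investigation use case described in the comment.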
Why not combine strings, langstring and langstringtype in the same table?
For certain common types of items – especially people, but also e. g. cities – it is common to have the same label in a lot of different languages (see also T188992#4026839), so I think a strings table without a language code should help a lot. I’m not sure about the distinction between langstring and langstringtype though.
I guess the easiest approach is to migrate things while writing to both and, once both sets of tables are in sync, switch the writes to the new tables only?
Do the DB servers have enough storage space for this?
In T215902#4950293, @Addshore wrote: My first investigation into table normalization went for full normalization:
db1106 with 4.9.0-8 with debug enabled on the kernel, reboots sequence:
After power cycling db2085, this is what happened:
db2085 got stuck when booting up on:
[ 0.560579] x86: Booting SMP configuration:
[ 0.565246] .... node #1, CPUs: #1
[ 0.674090] .... node #0, CPUs: #2
db2085 reboots with 4.9.0-8 with debug enabled:
db2085 reboots with 4.9.0-7 with debug enabled - all fine:
db2085: debug added to the kernel boot, to see if we catch something
linux /boot/vmlinuz-4.9.0-7-amd64 root=UUID=63e5ddbd-3c18-4bf5-ad22-88458ec175b7 ro ixgbe.allow_unsupported_sfp=1 console=ttyS1,115200n8 elevator=deadline debug
db2085 with kernel 4.9.0-7-amd64 reboots, another FAIL at the 6th and 7th reboot (similar pattern as with kernel -9 at T214840#4948016):
db2085:
So I can confirm that the BIOS setting for Serial Communication is being sent to COM2 (which is ttyS1).
Which is the same as:
linux /boot/vmlinuz-4.9.0-7-amd64 root=UUID=63e5ddbd-3c18-4bf5-ad22-88458ec175b7 ro ixgbe.allow_unsupported_sfp=1 console=ttyS1,115200n8 elevator=deadline
Feb 12 2019
@MoritzMuehlenhoff has removed -8 kernel from db2085 and I have rebooted it 8 times with -7 now
After restarting with the previous kernel 4.9.0-7-amd64 on db2085, the first time it didn't boot up, the second time it did.
@MoritzMuehlenhoff has installed 4.9.144-3 on db2085.
Out of 8 reboots, two of them got stuck (in a row).
1st reboot by @MoritzMuehlenhoff OK
2nd reboot by @MoritzMuehlenhoff OK
3rd reboot by @Marostegui OK
4th reboot by @Marostegui OK
5th reboot by @Marostegui OK
6th reboot by @Marostegui FAIL
7th reboot by @Marostegui FAIL
8th reboot by @Marostegui OK
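The reboot series above can be tallied with a short Python sketch, which makes the "two out of eight, in a row" observation explicit. The result list is transcribed from the eight entries above; the function names are made up for illustration.

```python
from itertools import groupby

# Outcomes of the eight reboots listed above, in order.
RESULTS = ["OK", "OK", "OK", "OK", "OK", "FAIL", "FAIL", "OK"]


def failure_rate(results):
    """Fraction of reboots that got stuck."""
    return results.count("FAIL") / len(results)


def longest_fail_streak(results):
    """Length of the longest run of consecutive failed reboots."""
    return max(
        (len(list(group)) for outcome, group in groupby(results) if outcome == "FAIL"),
        default=0,
    )


print("failure rate:", failure_rate(RESULTS))
print("longest streak of failures:", longest_fail_streak(RESULTS))
```

Tracking the streak separately from the rate is a small design choice: consecutive failures suggest a state that persists across reboots, not independent random hangs.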
Read only time would be around 16h (T210478#4942371)