In T215107#4985384, @Tgr wrote: Oh, right, this is T188882: Attachment method should be preserved through global rename, let's follow up there. The accounts should be usable, this is more of a display bug.
Feb 27 2019
Marostegui closed T215107: Global rename of The_Photographer → Wilfredor: supervision needed as Resolved.
Marostegui reassigned T215107: Global rename of The_Photographer → Wilfredor: supervision needed from MarcoAurelio to Tgr.
Marostegui changed the status of T216444: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed from Stalled to Open.
When do you want to do this?
Marostegui changed the status of T216444: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed, a subtask of T169440: Pending global renames in need of sysadmin supervision (tracking), from Stalled to Open.
Feb 26 2019
s2 eqiad progress
- labsdb1011
- labsdb1010
- labsdb1009
- dbstore1004
- dbstore1002
- db1125
- db1122
- db1105
- db1103
- db1095
- db1090
- db1076
- db1074
- db1066
In T187295#4983856, @Daimona wrote: @Marostegui Thanks, that's nice to hear! Given that queries like that one are pretty common, this is surely a huge performance boost.
jcrespo awarded T187295: Apply AbuseFilter patch-fix-index a Love token.
Marostegui changed the status of T210713: Drop change_tag.ct_tag column in production from Open to Stalled.
Stalling this until we have failed over s1 master, as it is impossible to alter that host whilst it is active.
Marostegui changed the status of T210713: Drop change_tag.ct_tag column in production, a subtask of T194163: Drop change_tag.ct_tag column, from Open to Stalled.
Marostegui moved T86342: Dropping page.page_no_title_convert on wmf databases from Backlog to In progress on the Schema-change-in-production board.
Marostegui moved T215107: Global rename of The_Photographer → Wilfredor: supervision needed from Blocked external/Not db team to Done on the DBA board.
Marostegui updated the task description for T86342: Dropping page.page_no_title_convert on wmf databases.
Marostegui updated the task description for T86342: Dropping page.page_no_title_convert on wmf databases.
Marostegui added a comment to T215107: Global rename of The_Photographer → Wilfredor: supervision needed.
Thanks @Tgr!
@Wilfredor can you try to log-in now?
s1 eqiad progress
- labsdb1011
- labsdb1010
- labsdb1009
- dbstore1003
- dbstore1002
- dbstore1001
- db1124
- db1119
- db1118
- db1106
- db1105
- db1099
- db1089
- db1083
- db1080
- db1067
I used the following query on db1083 to measure the impact of the index change (I executed the query twice to make sure it was "warm"):
SELECT /* IndexPager::buildQueryInfo (AbuseLogPager) xx */ * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE afl_filter = '423' AND ((afl_deleted = '0') OR (afl_deleted IS NULL)) ORDER BY afl_timestamp DESC LIMIT 51;
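The "run it twice so the second run is warm" measurement described above can be sketched roughly as follows. This is only an illustration under loud assumptions: it uses an in-memory SQLite table as a stand-in for the production replica, the row data is made up, and `time_query` is a hypothetical helper, not a tool used on db1083.

```python
import sqlite3
import time


def time_query(conn, sql, runs=2):
    """Run `sql` `runs` times and return the elapsed seconds of each run.

    The first run pays for cold caches; later runs are the "warm" numbers.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in data (assumption): a tiny abuse_filter_log lookalike in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE abuse_filter_log "
    "(afl_filter TEXT, afl_deleted INTEGER, afl_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO abuse_filter_log VALUES (?, ?, ?)",
    [(str(i % 500), 0, f"2019{i:010d}") for i in range(20000)],
)

QUERY = (
    "SELECT * FROM abuse_filter_log "
    "WHERE afl_filter = '423' AND (afl_deleted = 0 OR afl_deleted IS NULL) "
    "ORDER BY afl_timestamp DESC LIMIT 51"
)

cold, warm = time_query(conn, QUERY)
print(f"first run: {cold:.6f}s, second (warm) run: {warm:.6f}s")
```

Comparing the warm timing before and after an index change, on the same host, keeps cache effects out of the comparison.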
Marostegui moved T217073: Clean up orphaned echo_event rows again from Triage to Blocked external/Not db team on the DBA board.
Thanks for the heads up, that works for me!
Marostegui added a comment to T215107: Global rename of The_Photographer → Wilfredor: supervision needed.
@MarcoAurelio green light from us to get the job rescheduled.
Feb 25 2019
Excellent! Thank you Chris!
Yeah, I will ask for that once we have more replicas altered, otherwise you might sometimes reach one that is altered and another time one that is not :)
Thank you!
In T187295#4981381, @Daimona wrote: Nothing weird seems to have happened during the last 6 hours, and this is good. However, I guess that any possible trouble won't be visible before reaching enwiki...
Marostegui moved T72005: Apply enum changes to (img|oi|fa)_major_mime on production from Backlog to Pending comment on the DBA board.
Marostegui moved T71127: Discrepancies with logging table on different wikis from Backlog to Pending comment on the DBA board.
Marostegui moved T205626: Document clearly the mariadb backup and recovery setup from Backlog to Pending comment on the DBA board.
Marostegui moved T86342: Dropping page.page_no_title_convert on wmf databases from Pending comment to In progress on the DBA board.
@Daimona the following wikis are now fully altered, I am going to give it some hours before continuing and will monitor tendril:
s2:
bgwiki
bgwiktionary
cswiki
enwikiquote
enwiktionary
eowiki
fiwiki
idwiki
itwiki
nlwiki
nowiki
plwiki
ptwiki
svwiki
thwiki
trwiki
zhwiki
s2 eqiad progress
- labsdb1011
- labsdb1010
- labsdb1009
- dbstore1004
- dbstore1002
- db1125
- db1122
- db1105
- db1103
- db1095
- db1090
- db1076
- db1074
- db1066
In T187295#4979911, @Daimona wrote:
In T187295#4979679, @Marostegui wrote: Thanks @Daimona!
Is this an example of a slow query?
SELECT /* IndexPager::buildQueryInfo (AbuseLogPager) xxx */ * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE afl_filter = '550' AND ((afl_deleted = '0') OR (afl_deleted IS NULL)) ORDER BY afl_timestamp LIMIT 51 /*
Uhm, has it been reported as slow? I couldn't spot it, if so. Yes, this query being slow could be a side-effect of changing indexes.
I have increased the TTL back to 30 days.
Going to monitor the graphs for a few days before closing this.
s6 eqiad progress
- labsdb1011
- labsdb1010
- labsdb1009
- dbstore1005
- dbstore1002
- dbstore1001
- db1125
- db1113
- db1098
- db1096
- db1093
- db1088
- db1085
- db1061
Thanks @Daimona!
Is this an example of a slow query?
SELECT /* IndexPager::buildQueryInfo (AbuseLogPager) xxx */ * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE afl_filter = '550' AND ((afl_deleted = '0') OR (afl_deleted IS NULL)) ORDER BY afl_timestamp LIMIT 51 /*
Marostegui updated the task description for T86342: Dropping page.page_no_title_convert on wmf databases.
Marostegui closed T197486: prop=revisions API timing out for a specific user and pages they edited as Resolved.
All the core replicas that receive this query are now running > 10.1.36, which doesn't have this optimizer "bug".
The masters aren't running those versions, but they are not receiving (or shouldn't be) these queries, so this is pretty much solved.
s5 eqiad progress
- labsdb1011
- labsdb1010
- labsdb1009
- dbstore1003
- dbstore1002
- db1124
- db1113
- db1110
- db1102
- db1100
- db1097
- db1096
- db1082
- db1070
Feb 24 2019
mysql crashed last night:
Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:247(my_print_stacktrace)[0xbdd6ee]
sql/signal_handler.cc:153(handle_fatal_signal)[0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f77a198a330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f77a079ec37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f77a07a2028]
srv/srv0srv.cc:2200(srv_error_monitor_thread)[0x9870aa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f77a1982184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f77a086603d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
190224 03:20:23 mysqld_safe Number of processes running now: 0
190224 03:20:23 mysqld_safe mysqld restarted
@Cmjohnson db1114 crashed again with the same memory errors on the same slots, so it looks like the mainboard memory slots aren't healthy?
Record: 1
Date/Time: 02/21/2019 19:30:12
Source: system
Severity: Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record: 2
Date/Time: 02/23/2019 21:25:36
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record: 3
Date/Time: 02/23/2019 21:25:37
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record: 4
Date/Time: 02/23/2019 21:25:58
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
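To check whether the same slots keep showing up across crashes, a hardware event log like the one above can be tallied per DIMM. The following is a minimal sketch, assuming the log keeps the `Record: / Date/Time: / Source: / Severity: / Description:` layout shown here; `parse_sel` and `dimm_error_counts` are hypothetical helper names, not part of any vendor tool.

```python
import re
from collections import Counter

# One record: the five labelled fields, ending at a dashed separator or EOF.
SEL_RECORD = re.compile(
    r"Record:\s*(?P<id>\d+)\s+"
    r"Date/Time:\s*(?P<when>\S+ \S+)\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.*?)\s*(?=-{5,}|$)",
    re.DOTALL,
)
DIMM = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Return one dict per log record found in `text`."""
    return [m.groupdict() for m in SEL_RECORD.finditer(text)]


def dimm_error_counts(records):
    """Count how many records mention each DIMM slot."""
    counts = Counter()
    for rec in records:
        for slot in DIMM.findall(rec["description"]):
            counts[slot] += 1
    return counts


SEP = " " + "-" * 79 + " "
SAMPLE = SEP.join([
    "Record: 1 Date/Time: 02/21/2019 19:30:12 Source: system "
    "Severity: Ok Description: Log cleared.",
    "Record: 2 Date/Time: 02/23/2019 21:25:36 Source: system "
    "Severity: Non-Critical Description: Correctable memory error rate "
    "exceeded for DIMM_B7.",
    "Record: 3 Date/Time: 02/23/2019 21:25:37 Source: system "
    "Severity: Non-Critical Description: Correctable memory error rate "
    "exceeded for DIMM_B3.",
    "Record: 4 Date/Time: 02/23/2019 21:25:58 Source: system "
    "Severity: Critical Description: Correctable memory error rate "
    "exceeded for DIMM_B7.",
]) + SEP

records = parse_sel(SAMPLE)
for slot, n in sorted(dimm_error_counts(records).items()):
    print(slot, n)
```

Running the same tally over the logs from each crash makes it easier to see whether errors follow the slots or the DIMMs.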
Marostegui moved T214720: db1114 crashed (HW memory issues) from Blocked external/Not db team to In progress on the DBA board.
Feb 22 2019
Marostegui removed a subtask for T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5]: Unknown Object (Task).
Marostegui added a subtask for T172410: Replace the current multisource analytics-store setup: Unknown Object (Task).
MySQL will be stopped on the 4th of March as the final part of the deprecation of this host.
It has been read-only since 18th Feb anyway. The data should not be trusted anymore, as it is very corrupted as a result of all the crashes it has had.
I am closing this as this host will no longer have support. If MySQL crashes, it will get restarted and replication will start automatically in idempotent mode (T213670#4934489).
Marostegui closed T213670: dbstore1002 Mysql errors, a subtask of T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5], as Resolved.
I have run the ALTER on the codfw DC for the s5 section:
cebwiki dewiki enwikivoyage mgwiktionary shwiki srwiki
Marostegui updated the task description for T86342: Dropping page.page_no_title_convert on wmf databases.
Marostegui updated the task description for T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5].
This host crashed today again:
-------------------------------------------------------------------------------
Record: 40
Date/Time: 02/22/2019 06:10:16
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record: 74
Date/Time: 02/22/2019 06:10:18
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 75
Date/Time: 02/22/2019 06:12:12
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record: 76
Date/Time: 02/22/2019 06:14:41
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
No, those are all the queries involving the logging table that were logged on sys; that's why I said I didn't think it would be too useful :(
In T216170#4973223, @Bstorm wrote: @Marostegui I see no subscribes or triggers on a quick pass in puppet, so if I'm not wrong I can change the config with puppet without auto-reloading or puppet restarting the server, right?
All good now, thank you!
logicaldrive 1 (3.3 TB, RAID 1+0, OK)
Feb 21 2019
@jcrespo maybe we can leave mydumper running 24x7 in a loop for days on that host: dumping everything, deleting the backup files, dumping everything again, and so forth.
Marostegui added a comment to T216656: API problem with usercontribs using `rev_user_text` rather than `rev_user`: Only use 'contributions' replica if querying by user ID.
Thanks for checking, although now that I think about it, it is pretty much the same thing: it will time out anyway (as we have seen) :-)
Marostegui added a comment to T216656: API problem with usercontribs using `rev_user_text` rather than `rev_user`: Only use 'contributions' replica if querying by user ID.
In T216656#4972232, @gerritbot wrote: Change 491993 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] ApiQueryUserContribs: Only use 'contributions' replica if querying by user ID
Thanks!
logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 2% complete)
Marostegui renamed T214720: db1114 crashed (HW memory issues) from db1114 crashed to db1114 crashed (HW memory issues).
Marostegui moved T187295: Apply AbuseFilter patch-fix-index from Backlog to In progress on the Schema-change-in-production board.
Marostegui removed a project from T215616: Improve interlingual links across wikis through Wikidata IDs: DBA.
Going to remove the DBA tag from here as there are not really any actionables (yet) for the DBAs and we already provided some input here (T215616#4946564) and there is not much we can do about this at the moment.
I am leaving the MediaWiki-libs-Rdbms tag in case you want to discuss queries or even schema changes (then I would suggest you add Schema-change once you have some thoughts or proposals about it).
Lastly, I will remain subscribed to this task in case you need further help from us!
Thank you!
Marostegui moved T187295: Apply AbuseFilter patch-fix-index from Pending comment to In progress on the DBA board.
@Daimona I have done a quick grep on mediawiki-extensions-AbuseFilter and on mediawiki-core repo to make sure there are no FORCE INDEX on any of the following ones:
afl_filter afl_user afl_namespace afl_ip
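The quick grep mentioned above could look roughly like the sketch below: scan source text line by line and report any `FORCE INDEX` hint that names one of the listed indexes. The sample source lines are made up for illustration, and `find_forced_indexes` is a hypothetical helper; the actual check was a plain grep over the two repositories.

```python
import re

# The index names listed in the comment above.
INDEXES = ["afl_filter", "afl_user", "afl_namespace", "afl_ip"]


def find_forced_indexes(text, index_names):
    """Return (line_number, index_name) for FORCE INDEX hints naming an index.

    A line counts as a hit when it contains FORCE INDEX (case-insensitive)
    and one of `index_names` appears on the same line as a whole word.
    """
    hint = re.compile(r"FORCE\s+INDEX", re.IGNORECASE)
    names = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in index_names) + r")\b"
    )
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if hint.search(line):
            for name in sorted(set(names.findall(line))):
                hits.append((lineno, name))
    return hits


# Made-up sample lines (assumption), only to exercise the function.
SAMPLE_SOURCE = """\
$opts = [ 'FORCE INDEX' => [ 'abuse_filter_log' => 'afl_filter' ] ];
$conds = [ 'afl_user' => $id ];
SELECT * FROM abuse_filter_log FORCE INDEX (afl_ip) WHERE afl_ip = '';
"""

for lineno, name in find_forced_indexes(SAMPLE_SOURCE, INDEXES):
    print(f"line {lineno}: FORCE INDEX on {name}")
```

An empty result is the desired outcome before dropping or redefining any of these indexes, since a query that forces a missing index would fail outright.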
As this drift has already created some issues, I will try to work on this as a background task, fixing hosts slowly but steadily.
Now that we can use ADD KEY IF NOT EXISTS and DROP KEY IF EXISTS it will be slightly easier, however from a first glance there are lots of drifts even between hosts on the same section.
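As a rough illustration of why the `IF NOT EXISTS` / `IF EXISTS` clauses help with drift, the sketch below generates ALTER statements that are safe to run on every host in a section, whether or not a given host already has the right indexes. The table and index names are hypothetical, and `drift_statements` is an illustrative helper, not an existing tool.

```python
def drift_statements(table, expected, actual):
    """Build idempotent ALTERs that move `actual` indexes toward `expected`.

    Both arguments map index name -> tuple of column definitions.
    Indexes that are unexpected or defined differently are dropped, and
    indexes that are missing or defined differently are (re)added.
    """
    statements = []
    for name in sorted(actual):
        if expected.get(name) != actual[name]:
            statements.append(
                f"ALTER TABLE `{table}` DROP KEY IF EXISTS `{name}`;"
            )
    for name in sorted(expected):
        if actual.get(name) != expected[name]:
            cols = ", ".join(expected[name])
            statements.append(
                f"ALTER TABLE `{table}` ADD KEY IF NOT EXISTS `{name}` ({cols});"
            )
    return statements


# Hypothetical names, for illustration only.
expected = {"idx_a": ("col_a",), "idx_b": ("col_b", "col_c")}
actual = {"idx_a": ("col_a",), "idx_old": ("col_d",)}

for stmt in drift_statements("example_table", expected, actual):
    print(stmt)
```

Because each statement is a no-op on a host that is already correct, the same batch can be replayed across hosts that have drifted in different ways.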
All the hosts are done except db1067 (s1 master, T210713#4967984), which I will try a few more times before stalling this until we do a failover.
There is no significant increase visible on the graphs, but 2 days might also be too short to notice anything.
In a couple of days it will be a month since I switched the TTL from 22 days to 24. There have not been any issues with this, so on Monday I think I will go from 24 to 30 as planned, unless someone has any objection.
Thanks!
Marostegui added a comment to T216656: API problem with usercontribs using `rev_user_text` rather than `rev_user`: Only use 'contributions' replica if querying by user ID.
For s2 we can probably decrease the main traffic weight for the rc replicas (db1103 and db1105), as I think the other hosts will have no problem absorbing the traffic, but this is another case where "special" slaves are a snowflake and bite us :-(
I don't think this is too useful https://phabricator.wikimedia.org/P8114 :(
Let's get the disk changed @Papaul - thanks!
Let's replace only the one that has FAILED, not the ones with predictive failure, those are being tracked at T208323: Predictive failures on disk S.M.A.R.T. status
Feb 20 2019
Marostegui added a comment to T216635: MySQL database on deployment-db03 does not start due to InnoDB issue.
Data looks very corrupted. At this point the best option is to rebuild that host from the slave.
Marostegui added a comment to T216635: MySQL database on deployment-db03 does not start due to InnoDB issue.
Anything on dmesg?
Can you do a touch /srv/test?
It is certainly being used for some queries, I can see this counter increasing:
root@db1089.eqiad.wmnet[sys]> select rows_selected,select_latency from x$schema_index_statistics where table_name='logging' and index_name like 'log_title_%';
+---------------+----------------+
| rows_selected | select_latency |
+---------------+----------------+
|         14478 |  2739429620928 |
|           225 |    98277884848 |
+---------------+----------------+
2 rows in set (0.05 sec)
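The raw counters above are easier to read once converted. The sketch below assumes that `select_latency` in the `x$` views is reported in picoseconds, which is how the unformatted sys schema views normally expose timings; treat that unit as an assumption and verify it against the server documentation. `avg_latency_us` is a hypothetical helper.

```python
PS_PER_SECOND = 10**12
PS_PER_MICROSECOND = 10**6


def avg_latency_us(rows_selected, select_latency_ps):
    """Average select latency per row, in microseconds (0.0 if no rows)."""
    if rows_selected == 0:
        return 0.0
    return select_latency_ps / rows_selected / PS_PER_MICROSECOND


# The two rows returned by the query above.
rows = [(14478, 2739429620928), (225, 98277884848)]

for rows_selected, latency_ps in rows:
    total_s = latency_ps / PS_PER_SECOND
    per_row = avg_latency_us(rows_selected, latency_ps)
    print(f"{rows_selected} rows, {total_s:.3f}s total, {per_row:.1f}us per row")
```

Sampling these counters at two points in time and comparing them gives the usage over that interval, which is more telling than the absolute values.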
Marostegui added a comment to T216635: MySQL database on deployment-db03 does not start due to InnoDB issue.
Broken storage?:
Feb 18 13:24:54 mysqld[837]: InnoDB: Error number 5 means 'Input/output error'.
Just for the record
db1069
s3 eqiad
- labsdb1011
- labsdb1010
- labsdb1009
- dbstore1004
- dbstore1002
- db1124
- db1123
- db1095
- db1077
- db1075
- db1078
db1067 (s1 master) has too much concurrency to let the ALTER go through. I will try a few more times before giving up on it and leaving it for when we fail over either the master or the DC.
Marostegui updated the task description for T86342: Dropping page.page_no_title_convert on wmf databases.
I have been taking a look at these indexes on enwiki, and we have two indexes in production that are not on tables.sql:
KEY `log_title_time` (`log_title`(16),`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),
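A comparison like this one can be sketched as: extract the `KEY` definitions from `SHOW CREATE TABLE` output and report those whose names are absent from the reference schema. In the example, `SCHEMA_KEYS` is a hand-written stand-in for whatever tables.sql defines, and the `type_time` entry is included purely for illustration; `parse_keys` and `keys_not_in_schema` are hypothetical helpers.

```python
import re

# Matches: KEY `name` (columns), allowing prefix lengths such as `col`(16).
KEY_DEF = re.compile(r"KEY\s+`(\w+)`\s*\(((?:[^()]|\(\d+\))*)\)")


def parse_keys(show_create_text):
    """Map index name -> column list string for every KEY definition."""
    return {name: cols for name, cols in KEY_DEF.findall(show_create_text)}


def keys_not_in_schema(show_create_text, schema_key_names):
    """Return the production keys whose names are not in the reference set."""
    keys = parse_keys(show_create_text)
    return {n: c for n, c in keys.items() if n not in schema_key_names}


PRODUCTION = """
  KEY `type_time` (`log_type`,`log_timestamp`),
  KEY `log_title_time` (`log_title`(16),`log_timestamp`),
  KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),
"""
SCHEMA_KEYS = {"type_time"}  # stand-in for the names defined in tables.sql

for name, cols in keys_not_in_schema(PRODUCTION, SCHEMA_KEYS).items():
    print(f"{name}: ({cols})")
```

Comparing by name only is a simplification; a stricter check would also compare the column lists, since drift can leave the same index name with different columns.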
Marostegui moved T187295: Apply AbuseFilter patch-fix-index from Backlog to Pending comment on the DBA board.