
Change the MySQL passwords
Closed, Resolved · Public

Description

The MySQL root password has not been changed since before I got shell access in 2003. The replication password has been the same since replication was first set up in 2004.

According to the SAL, I changed the wikiuser and wikiadmin passwords in 2007. They probably haven't been changed since then. The procedure I used to change the wikiuser password was (a rough SQL sketch of these steps follows below the list):

  • Create a new user "wikiuser2" with the new password
  • Deploy the MW configuration change to use wikiuser2
  • Verify that no wikiuser connections remain
  • Change the password for wikiuser
  • Deploy the MW configuration change to use wikiuser with the new password
Accounts covered:

  • wikiadmin T232106
  • wikiuser
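
A minimal SQL sketch of those steps, assuming MariaDB; the host pattern, grant list, and password below are illustrative placeholders, not the production values:

-- Hypothetical host pattern and grants; production values differ per section.
CREATE USER 'wikiuser2'@'10.%' IDENTIFIED BY '<new-password>';
GRANT SELECT, INSERT, UPDATE, DELETE ON `enwiki`.* TO 'wikiuser2'@'10.%';
-- Deploy the MW config change, then verify no old connections remain:
SELECT id, user, host FROM information_schema.processlist WHERE user = 'wikiuser';
-- Once the list is empty, rotate the old account:
SET PASSWORD FOR 'wikiuser'@'10.%' = PASSWORD('<new-password>');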

Event Timeline

tstarling triaged this task as Medium priority. Aug 10 2018, 1:37 AM
tstarling created this task.
tstarling created this object with visibility "Custom Policy".

The wikiadmin password is being changed as part of {T232106}.

jcrespo added a subscriber: ArielGlenn.

@Marostegui Given the update on T234014, I would like to 1) verify no leak is present and 2) schedule the rollover ASAP, with your help as well as someone from Service Ops and, if possible, someone with MediaWiki config experience (Security or CPT). We should also document the procedure properly and/or automate it (based on Tim's guidance above) to make future changes easier and more regular. Dumps support (@ArielGlenn) would be helpful, too.

Given the long-running queries on dumps, cron, and the jobqueue, among others, this could take more than a few hours.

[Note that to fully close this ticket, we should also change the production root and replication passwords.]

For minimal disruption on my side, we'd want the xml/sql dumps to be either in the 7z compression phase or idle. That will happen around the 17th, I think. If we have to go earlier, then it can be any time between now and then, as all times are equally likely to be disruptive. Just give me a heads-up so that I can keep an eye out.

We'll probably take a hit on the wikidata entity dumps no matter what, since they are on a weekly schedule that won't coincide with the xml/sql ones, but the failed bits will rerun.

+1 to change the password, but I would suggest we resume this conversation in January, after the break.

> +1 to change the password, but I would suggest we resume this conversation in January, after the break.

Two Januaries have passed since then :D

We talked about this a bit yesterday, and I think with the automation of grant reporting it's now easier to roll out new users (I estimate it to be a couple of days of work at most now). That being said, wikiuser currently has 26 distinct grants across the cluster, of which 11 are for wikitech (and wikiadmin has 10 more). Maybe we should roll this out once T292707 is done, so we can simplify our grants? Maybe we could push for that?
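
For reference, a minimal sketch of how one might list those grants on a single host, assuming MariaDB and direct access to information_schema (aggregating across the cluster would go through the reporting automation):

-- Schema-level privileges held by any wikiuser account on this host:
SELECT grantee, table_schema, privilege_type
FROM information_schema.schema_privileges
WHERE grantee LIKE "'wikiuser'%";
-- Or, for one concrete account (the host pattern here is illustrative):
SHOW GRANTS FOR 'wikiuser'@'10.%';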

Ladsgroup moved this task from Ready to In progress on the DBA board.

So I duplicated wikiuser and its grants everywhere the grant reporting system could reach. The new user is wikiuser2022 (so hopefully we will change it next year). I will run the report again tomorrow to catch any dbs that were missed (in case they were shut down today, though I can't find any), and then deploy MW with the new user.

Next week we should start dropping the wikiuser user and its grants.

I'm seeing a user without grants on some hosts, called wikiuser@localhost.localdomain. I simply won't move this forward to wikiuser2022.

> I'm seeing a user without grants on some hosts, called wikiuser@localhost.localdomain. I simply won't move this forward to wikiuser2022.

This can be deleted

I just deployed wikiuser2022 in appservers everywhere. No errors so far.

Created this patch to make it easier to change the wikiuser username in puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/770890

Going to remove all wikiuser users and grants everywhere in an hour or so. I'll keep a way to roll it back if needed.
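
The removal itself is a single statement per account; a sketch assuming MariaDB, with an illustrative host pattern:

-- DROP USER removes the account together with all of its grants.
DROP USER IF EXISTS 'wikiuser'@'10.%';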

Changing the replication user implies running a CHANGE MASTER TO on all the hosts.
Given how limited the replication user is, I wouldn't include it here: it is a pretty risky operation.
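
For context, the rotation would mean running something like the following on every replica (a sketch assuming MariaDB; the user name and password are placeholders):

STOP SLAVE;
-- Point replication at the new credentials; the master coordinates are kept.
CHANGE MASTER TO MASTER_USER = 'repl_new', MASTER_PASSWORD = '<new-password>';
START SLAVE;
-- Confirm replication resumed cleanly:
SHOW SLAVE STATUS\G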

I just removed wikiuser from every host we have. This was automated with the code I've written for grant handling. The previous grants are stored in my home directory on cumin under omg_with_old_wikiuser.json; from that, one could easily write a script to bring back the old wikiuser and its grants if needed.

But the error logs are clean, so I think it's safe to call it done. I agree with Manuel that the replication user is a different beast with different needs and MO. I'll create a follow-up and call this done.

I'll leave it to the security team to decide whether to make this public or not.

Ladsgroup updated the task description. (Show Details)
Ladsgroup moved this task from In progress to Done on the DBA board.

Created {T303920} for repl user.

I need to document the process of changing the password so we can do it for wikiadmin and, next year, for wikiuser. For that, I need to get the omg code published somewhere. It's on my todo list.

I double-checked with git blame: the root db password was NOT changed at T104900#3569900, only the clouddb password was separated (not touching the production one!).

Filed T303930: Change mysql root password. Will get to it later.

Working on any security ticket feels like fighting a hydra: closing one causes two new ones to show up.

The replication user has also been rotated. Now only one user with a rather old password is left: ops. Let me take a look at that first; it should probably be dropped.

ops is the user under which the query-killer events & logs run. If you drop it, events will fail and dbs will be overloaded, as usually happens when the events for a db haven't been loaded properly.

Well, it's not in the puppet grants file, and it's missing from roughly a third of the hosts. It's also sometimes on localhost and sometimes on '%', which at the very least needs to be cleaned up.
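
A quick way to see which variants of the account exist on a given host (a sketch, assuming direct access to the mysql.user table):

SELECT user, host FROM mysql.user WHERE user = 'ops';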

If I remember correctly, the ops user was migrated to root (I haven't checked, but I recall something like that). The database is still ops, but I believe the definer was migrated to root. Otherwise, if the user is missing on so many hosts, that's definitely something we need to look at.

@Ladsgroup can you give me a list of hosts without that user?

This is one set I found:

db1163 (s1)
db1179 (x1)
db1216:3320 (x1)
db1220 (x1)
db1225:3312 (s2)
db1225:3320 (x1)
db1237 (x1)
db1240:3311 (s1)
dbstore1009:3320 (x1)
es1026 (es2)
es1027 (es1)
es1028 (es3)
es1029 (es1)
es1030 (es2)
es1031 (es3)
es1032 (es1)
es1033 (es2)
es1034 (es3)
es1035 (es7)
es1036 (es6)
es1037 (es6)
es1038 (es6)
es1039 (es7)
es1040 (es7)
es2026 (es2)
es2027 (es3)
es2028 (es1)
es2029 (es3)
es2030 (es1)
es2031 (es2)
es2032 (es1)
es2033 (es2)
es2034 (es3)
es2035 (es6)
es2036 (es6)
es2037 (es6)
es2038 (es7)
es2039 (es7)
es2040 (es7)
pc1011 (pc1)
pc1012 (pc2)
pc1013 (pc3)
pc1014 (pc1)
pc1015 (pc4)
pc1016 (pc4)
pc2011 (pc1)
pc2012 (pc2)
pc2013 (pc3)
pc2014 (pc1)
pc2015 (pc4)
pc2016 (pc4)

Looking at the events file we run every time we do a switchover, it is indeed from root:

drop event if exists wmf_slave_wikiuser_sleep;;

create definer='root'@'localhost' event wmf_slave_wikiuser_sleep
on schedule every 30 second starts date(now()) + interval 5 second
do begin
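
One can also confirm the definer live on a host; a sketch using information_schema (the definer column shows which account each event runs as):

SELECT event_schema, event_name, definer FROM information_schema.events;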

Yes, I can see events being killed on hosts which don't have the ops user.

cumin2024@db1163.eqiad.wmnet[ops]> select stamp from event_log order by stamp desc limit 1;
+---------------------+
| stamp               |
+---------------------+
| 2024-06-04 04:54:33 |
+---------------------+
1 row in set (0.001 sec)

cumin2024@db1163.eqiad.wmnet[ops]> select user from mysql.user;
+---------------+
| User          |
+---------------+
| wikiadmin2023 |
| wikiuser2023  |
| repl2024      |
| cumin2024     |
| repl2024      |
| cumin2024     |
| orchestrator  |
| mariadb.sys   |
| nagios        |
| prometheus    |
| root          |
+---------------+
11 rows in set (0.002 sec)

Then I mistook the ops user for the ops db, sorry.

Yes, the ops database is where the events are installed and that is not being touched.

Yes, I think we can. Let's do it in a controlled way just in case: drop it on all of codfw, check the event_log in a few hours to make sure events are still being killed, and if so, drop it on eqiad.

es2022 (es4), es2021 (es4), es2020 (es4), es1022 (es4), es1021 (es4), es1020 (es4), db2215 (x1), db2201:3320 (x1), db2197:3320 (x1), db2196 (x1), db2191 (x1), db2131 (x1), db2115 (x1), db2097:3320 (x1), db2144 (x2), db2143 (x2), db2142 (x2), es2025 (es5), es2024 (es5), es2023 (es5), es1025 (es5), es1024 (es5), es1023 (es5), dbstore1008:3311 (s1), db2216 (s1), db2212 (s1), db2203 (s1), db2188 (s1), db2186:3311 (s1), db2176 (s1), db2174 (s1), db2173 (s1), db2170 (s1), db2153 (s1), db2146 (s1), db2145 (s1), db2141:3311 (s1), db2130 (s1), db2116 (s1), dbstore1007:3312 (s2), db2207 (s2), db2204 (s2), db2197:3312 (s2), db2189 (s2), db2187:3312 (s2), db2175 (s2), db2148 (s2), db2138 (s2), db2126 (s2), db2125 (s2), db2097:3312 (s2), dbstore1007:3313 (s3), db2209 (s3), db2205 (s3), db2194 (s3), db2190 (s3), db2186:3313 (s3), db2177 (s3), db2156 (s3), db2149 (s3), db2139:3313 (s3), db2127 (s3), dbstore1007:3314 (s4), db2219 (s4), db2210 (s4), db2206 (s4), db2199:3314 (s4), db2187:3314 (s4), db2179 (s4), db2172 (s4), db2155 (s4), db2147 (s4), db2140 (s4), db2139:3314 (s4), db2137 (s4), db2136 (s4), db2099:3314 (s4), dbstore1008:3315 (s5), db2213 (s5), db2211 (s5), db2192 (s5), db2186:3315 (s5), db2178 (s5), db2171 (s5), db2157 (s5), db2128 (s5), db2123 (s5), dbstore1009:3316 (s6), db2217 (s6), db2214 (s6), db2197:3316 (s6), db2193 (s6), db2187:3316 (s6), db2180 (s6), db2169 (s6), db2158 (s6), db2151 (s6), db2129 (s6), db2124 (s6), db2114 (s6), db2097:3316 (s6), dbstore1008:3317 (s7), db2220 (s7), db2218 (s7), db2208 (s7), db2187:3317 (s7), db2182 (s7), db2168 (s7), db2159 (s7), db2150 (s7), db2122 (s7), db2121 (s7), dbstore1009:3318 (s8), db2200:3318 (s8), db2198:3318 (s8), db2195 (s8), db2186:3318 (s8), db2181 (s8), db2167 (s8), db2166 (s8), db2165 (s8), db2164 (s8), db2163 (s8), db2162 (s8), db2161 (s8), db2154 (s8), db2152 (s8), db2098:3318 (s8)

Dropping it on these hosts ^
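
The drop itself amounts to the following per host (a sketch assuming MariaDB; both host variants are included since, as noted above, they vary):

DROP USER IF EXISTS 'ops'@'localhost';
DROP USER IF EXISTS 'ops'@'%';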

It seems to be working:

cumin2024@db2166.codfw.wmnet[ops]> select stamp from event_log order by stamp desc limit 1;
+---------------------+
| stamp               |
+---------------------+
| 2024-06-10 08:00:35 |
+---------------------+
1 row in set (0.032 sec)

cumin2024@db2166.codfw.wmnet[ops]> select user from mysql.user;
+---------------+
| User          |
+---------------+
| wikiadmin2023 |
| wikiuser2023  |
| repl2024      |
| cumin2024     |
| repl2024      |
| cumin2024     |
| orchestrator  |
| mariadb.sys   |
| nagios        |
| prometheus    |
| root          |
+---------------+
11 rows in set (0.033 sec)

Dropping it everywhere now.

Dropped everywhere. We can now make these tickets public if @Marostegui thinks it's fine: {T364985}, T303930: Change mysql root password, and this ticket. cc @sbassett (T364985#9813011)

Any concerns, @Marostegui? If not, happy to make these tasks public now.

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".
sbassett added a project: SecTeam-Processed.
sbassett removed subscribers: Kormat, chasemp.