[05:32:33] i am starting mysql on all the hosts that had the pdu maintenance yesterday
[07:22:36] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 92.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[07:26:35] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[07:28:23] jynus: reminder that there will be maintenance on row B and C for some racks today, i haven't touched backup related hosts (not even sources) just in case you want to check/run/wait for backups to finish
[07:28:49] jynus: i need to stop mysql on some es codfw hosts, could you double check if it is ok?
[07:29:02] yeah, they finished today, but I have to check to stop them
[07:29:09] es2030, es2029, es2025, es2032, es2031
[07:29:23] yeah, es backups finished. you can check it on the dashboard
[07:31:18] https://phabricator.wikimedia.org/P32198
[07:34:22] you can also see servers at a particular rack by going to /servers/?search=codfw+B3
[07:37:01] do you have the ssh tunnel command somewhere?
[07:37:28] ssh -L 8000:localhost:8000 backupmon1001.eqiad.wmnet
[07:37:35] thanks :*
[07:39:04] let me do a quick patch to implement it on the instance list, too, so you can do e.g. "core codfw B3"
[07:39:41] Stuff like http://localhost:8000/servers/?search=codfw+B5 is SUPER useful
[07:40:01] yeah, but I realized that it will be on instances too
[07:40:31] at first I didn't because instances live on servers and servers live on racks, but I can do both at the same time, give me one sec
[07:44:23] also I have a json api, and you can use it from the command line, too
[07:52:13] I've done https://gerrit.wikimedia.org/r/c/operations/software/pampinus/+/820073 and you can now do: http://localhost:8000/instances/?search=codfw+B5 too
[08:06:09] I will shutdown db2098, do you need help with the others, marostegui?
[08:06:31] thanks jynus, it is all done!
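A minimal sketch of the tunnel plus command-line usage discussed above, assuming the same search endpoints answer over the forwarded port; the JSON path/parameter in the last command is a guess, the log only says that a JSON API exists:

    # open the tunnel to the backup monitoring host (command quoted above)
    ssh -L 8000:localhost:8000 backupmon1001.eqiad.wmnet
    # in another terminal, query the rack search through the tunnel (URL from the log)
    curl -s 'http://localhost:8000/instances/?search=codfw+B5'
    # the JSON API mentioned above is assumed to hang off the same app; the
    # exact path/format parameter is an assumption, check the pampinus source
    curl -s 'http://localhost:8000/instances/?search=codfw+B5&format=json' | jq .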
[08:15:32] I will have to also check later the backups that will be failing this week
[08:15:41] as in, bacula backups
[08:38:23] marostegui: while reviewing servers that will be affected to make sure I didn't miss anything, I saw 2 of yours that were still up: es2021 and db2124 - maybe that is intended but FYI
[08:39:27] yeah, I got distracted with apt-get
[08:39:49] let me stop mysql there
[08:40:23] ok, mentioned it because you said you were done :-P
[08:40:28] yeah
[08:46:06] PROBLEM - MariaDB sustained replica lag on s7 on db2159 is CRITICAL: 1.278e+04 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2159&var-port=9104
[08:49:49] newly setup host, I guess
[08:59:30] no, it was one of the hosts that got stopped yesterday
[08:59:35] it is still catching up and downtime expired
[09:27:42] RECOVERY - MariaDB sustained replica lag on s7 on db2159 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2159&var-port=9104
[11:52:38] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 10.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[11:54:40] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[13:20:15] hi all, i notice that the puppet certificate for dbstore2001.codfw.wmnet is due to expire. you should be able to use the sre.puppet.renew-cert cookbook to renew it
[13:20:50] jbond: dbstore2001?
[13:21:04] That host doesn't exist
[13:22:05] marostegui: did it get removed recently?
[13:22:19] jbond: probably years ago
[13:22:27] No, quite long ago: https://phabricator.wikimedia.org/T220002
[13:23:23] maybe there was a discrepancy when dcops decommed it - probably it wasn't decommed back then
[13:23:30] *automated
[13:23:42] ahh ok thanks, I'll look a bit more at the check and see if i can clear out old hosts
[13:24:24] actually, I can see the automation was working as expected: T220002#5574262
[13:24:26] T220002: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002
[13:24:30] so maybe something else?
[13:24:40] e.g. it got recreated or something
[13:25:00] ack thanks
[13:41:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db2177:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:41:36] ^ me
[13:43:02] I wonder how/if we could integrate downtime with grafana/alertmanager
[13:43:51] e.g. the downtime script also affecting alertmanager, maybe?
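A rough sketch of the downtime/alertmanager integration being mused about here (the wish is picked up again right below), using amtool to add a silence alongside the usual downtime; the alertmanager URL, author and matcher values are assumptions modelled on the db2177:9104 alert above, not production settings:

    # hedged sketch: silence the exporter alert in alertmanager for the
    # maintenance window; URL and matcher are placeholders
    amtool silence add \
        --alertmanager.url='http://alertmanager.example.wmnet:9093' \
        --author='jynus' \
        --comment='db2177 maintenance (PDU swap)' \
        --duration='4h' \
        instance='db2177:9104'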
[14:07:40] jynus: it would be nice if the downtime cookbook did so
[14:21:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (db2177:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[15:45:09] b3 databases back up and replicating
[16:05:04] b6 databases back up and replicating
[17:19:46] sigh, I didn't even need to do C2 today it turns out, but ms-be2055 is refusing to come up with its disks in the right order (it was a PITA last time too) :-(
[17:24:55] b8 databases back up and replicating
[17:47:17] 12 reboots and counting :(
[17:52:20] at last.
[18:16:37] is there a place where I can see the current maintenance progress? do you know it?
[18:17:27] jynus: the pdu one?
[18:17:33] yes
[18:17:50] jynus: nope, just following irc conversations :(
[18:18:16] b3 and b3 are up again, at least?
[18:18:19] *b6
[18:18:26] jynus: everything is up
[18:18:28] as far as I know
[18:18:45] ok, because apparently something got rescheduled
[18:19:45] I think C1 is happening now and C2 tomorrow, but unsure
[18:19:58] [20:14:54] PDU swap complete for today, thanks to all, going to lunch, will be back to double check all the servers are up and happy
[18:20:02] ok
[18:20:12] thanks, I think you started up db2098 for me!
[18:20:18] https://phabricator.wikimedia.org/T310145
[18:20:22] jynus: yep :)
[18:20:25] thank you
[18:20:37] that's the C row scheduled for tomorrow ^
[18:20:38] I will then handle the probable missing backups on codfw
[18:22:19] all hosts back and started, I am off
[18:22:24] bye
[18:22:35] will go soon too as soon as I revert backup state
[18:58:11] I think backups are not back to normal, I will see them start in 3 minutes and leave for the day
[19:05:52] all backups on codfw failed, but not sure why
[19:07:15] cumin is busted on cumin2002
[19:24:11] I think it got fixed now, backups running
[20:17:47] Amir1: btw labweb1001/1002 are now out of service, feel free to remove the labswiki grants from them
[20:18:44] taavi: Thanks. Can you file a bug and assign it to me? I'll do it tomorrow. Also please include the IPs, that's what I see on the db side
[20:18:51] will do, thank you
[20:21:35] T314528
[20:21:36] T314528: Revoke MariaDB grants for labweb1001/1002 - https://phabricator.wikimedia.org/T314528
[20:36:37] thanks
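A hedged sketch of the grant cleanup requested above; the account name and IP addresses are placeholders (the real ones belong in T314528), so this only shows the shape of the operation, not the exact statements to run:

    # list the accounts tied to the labweb hosts by IP (placeholders, see T314528)
    sudo mysql -e "SELECT user, host FROM mysql.user WHERE host IN ('<labweb1001-ip>', '<labweb1002-ip>');"
    # dropping the accounts removes their grants as well
    sudo mysql -e "DROP USER '<user>'@'<labweb1001-ip>', '<user>'@'<labweb1002-ip>';"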