[05:32:33] i am starting mysql on all the hosts that had the pdu maintenance yesterday
[07:22:36] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 92.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[07:26:35] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[07:28:23] jynus: reminder that there will be maintenance on row B and C for some racks today, i haven't touched backup related hosts (not even sources) just in case you want to check/run/wait for backups to finish
[07:28:49] jynus: i need to stop mysql on some es codfw hosts, could you double check if it is ok?
[07:29:02] yeah, they finished today, but I have to check to stop them
[07:29:09] es2030, es2029, es2025, es2032, es2031
[07:29:23] yeah, es backups finished. you can check it on the dashboard
[07:31:18] https://phabricator.wikimedia.org/P32198
[07:34:22] you can also see servers at a particular rack by going to /servers/?search=codfw+B3
[07:37:01] do you have the ssh tunnel command somewhere?
[07:37:28] ssh -L 8000:localhost:8000 backupmon1001.eqiad.wmnet
[07:37:35] thanks :*
[07:39:04] let me do a quick patch to implement it on the instance list, too, so you can do e.g. "core codfw B3"
[07:39:41] Stuff like http://localhost:8000/servers/?search=codfw+B5 is SUPER useful
[07:40:01] yeah, but I realized that it will be on instances too
[07:40:31] at first I didn't because instances live on servers and servers live on racks, but I can do both at the same time, give me one sec
[07:44:23] also I have a json api, and you can use it from the command line, too
[07:52:13] I've done https://gerrit.wikimedia.org/r/c/operations/software/pampinus/+/820073 and you can now do: http://localhost:8000/instances/?search=codfw+B5 too
[08:06:09] I will shutdown db2098, do you need help with the others, marostegui?
[08:06:31] thanks jynus, it is all done!
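A minimal sketch of the tunnel plus command-line usage discussed above, assuming the same search endpoints answer over the forwarded port; the JSON path/parameter in the last command is a guess, the log only says that a JSON API exists:

    # open the tunnel to the backup monitoring host (command quoted above)
    ssh -L 8000:localhost:8000 backupmon1001.eqiad.wmnet
    # in another terminal, query the rack search through the tunnel (URL from the log)
    curl -s 'http://localhost:8000/instances/?search=codfw+B5'
    # the JSON API mentioned above is assumed to hang off the same app; the
    # exact path/format parameter is an assumption, check the pampinus source
    curl -s 'http://localhost:8000/instances/?search=codfw+B5&format=json' | jq .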
[08:15:32] I will have to also check later the backups that will be failing this week
[08:15:41] as in, bacula backups
[08:38:23] marostegui: while reviewing servers that will be affected to make sure I didn't miss anything, I saw 2 of yours that were still up: es2021 and db2124 - maybe that is intended but FYI
[08:39:27] yeah, I got distracted with apt-get
[08:39:49] let me stop mysql there
[08:40:23] ok, mentioned it because you said you were done :-P
[08:40:28] yeah
[08:46:06] PROBLEM - MariaDB sustained replica lag on s7 on db2159 is CRITICAL: 1.278e+04 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2159&var-port=9104
[08:49:49] newly setup host, I guess
[08:59:30] no, it was one of the hosts that got stopped yesterday
[08:59:35] it is still catching up and downtime expired
[09:27:42] RECOVERY - MariaDB sustained replica lag on s7 on db2159 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2159&var-port=9104
[11:52:38] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 10.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[11:54:40] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[13:20:15] hi all, i notice that the puppet certificate for dbstore2001.codfw.wmnet is due to expire. you should be able to use the sre.puppet.renew-cert cookbook to renew it
[13:20:50] jbond: dbstore2001?
[13:21:04] That host doesn't exist
[13:22:05] marostegui: did it get removed recently?
[13:22:19] jbond: probably years ago
[13:22:27] No, quite long ago: https://phabricator.wikimedia.org/T220002
[13:23:23] maybe there was a discrepancy when dcops decommed it - probably it wasn't decommed back then
[13:23:30] *automated
[13:23:42] ahh ok thanks, I'll look a bit more at the check and see if i can clear out old hosts
[13:24:24] actually, I can see the automation was working as expected: T220002#5574262
[13:24:26] T220002: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002
[13:24:30] so maybe something else?
[13:24:40] e.g. it got recreated or something
[13:25:00] ack thanks
[13:41:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db2177:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:41:36] ^ me
[13:43:02] I wonder how/if we could integrate downtime with grafana/alertmanager
[13:43:51] e.g. the downtime script also affecting alertmanager, maybe?
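A rough sketch of the downtime/alertmanager integration being mused about here (the wish is picked up again right below), using amtool to add a silence alongside the usual downtime; the alertmanager URL, author and matcher values are assumptions modelled on the db2177:9104 alert above, not production settings:

    # hedged sketch: silence the exporter alert in alertmanager for the
    # maintenance window; URL and matcher are placeholders
    amtool silence add \
        --alertmanager.url='http://alertmanager.example.wmnet:9093' \
        --author='jynus' \
        --comment='db2177 maintenance (PDU swap)' \
        --duration='4h' \
        instance='db2177:9104'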
[14:07:40] jynus: it would be nice if the downtime cookbook did so
[14:21:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (db2177:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[15:45:09] b3 databases back up and replicating
[16:05:04] b6 databases back up and replicating
[17:19:46] sigh, I didn't even need to do C2 today it turns out, but ms-be2055 is refusing to come up with its disks in the right order (it was a PITA last time too) :-(
[17:24:55] b8 databases back up and replicating
[17:47:17] 12 reboots and counting :(
[17:52:20] at last.
[18:16:37] is there a place where I can see the current maintenance progress? do you know it?
[18:17:27] jynus: the pdu one?
[18:17:33] yes
[18:17:50] jynus: nope, just following irc conversations :(
[18:18:16] b3 and b3 are up again, at least?
[18:18:19] *b6
[18:18:26] jynus: everything is up
[18:18:28] as far as I know
[18:18:45] ok, because apparently something got rescheduled
[18:19:45] I think C1 is happening now and C2 tomorrow, but unsure
[18:19:58] [20:14:54] PDU swap complete for today, thanks to all, going to lunch, will be back to double check all the servers are up and happy
[18:20:02] ok
[18:20:12] thanks, I think you started up db2098 for me!
[18:20:18] https://phabricator.wikimedia.org/T310145
[18:20:22] jynus: yep :)
[18:20:25] thank you
[18:20:37] that's the C row scheduled for tomorrow ^
[18:20:38] I will then handle the probable missing backups on codfw
[18:22:19] all hosts back and started, I am off
[18:22:24] bye
[18:22:35] will go soon too as soon as I revert backup state
[18:58:11] I think backups are not back to normal, I will see them start in 3 minutes and leave for the day
[19:05:52] all backups on codfw failed, but not sure why
[19:07:15] cumin is busted on cumin2002
[19:24:11] I think it got fixed now, backups running
[20:17:47] Amir1: btw labweb1001/1002 are now out of service, feel free to remove the labswiki grants from them
[20:18:44] taavi: Thanks. Can you file a bug and assign it to me? I'll do it tomorrow. Also please include the IPs, that's what I see on the db side
[20:18:51] will do, thank you
[20:21:35] T314528
[20:21:36] T314528: Revoke MariaDB grants for labweb1001/1002 - https://phabricator.wikimedia.org/T314528
[20:36:37] thanks
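A hedged sketch of the grant cleanup requested above; the account name and IP addresses are placeholders (the real ones belong in T314528), so this only shows the shape of the operation, not the exact statements to run:

    # list the accounts tied to the labweb hosts by IP (placeholders, see T314528)
    sudo mysql -e "SELECT user, host FROM mysql.user WHERE host IN ('<labweb1001-ip>', '<labweb1002-ip>');"
    # dropping the accounts removes their grants as well
    sudo mysql -e "DROP USER '<user>'@'<labweb1001-ip>', '<user>'@'<labweb1002-ip>';"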