[00:07:04] PROBLEM - MariaDB sustained replica lag on s1 on db2216 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2216&var-port=9104
[00:07:04] PROBLEM - MariaDB sustained replica lag on s1 on db2188 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2188&var-port=9104
[00:08:04] RECOVERY - MariaDB sustained replica lag on s1 on db2216 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2216&var-port=9104
[00:08:04] RECOVERY - MariaDB sustained replica lag on s1 on db2188 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2188&var-port=9104
[05:03:08] I am going to disable writes on es5
[07:26:07] I am switching s7 codfw
[07:26:27] mmm, not yet
[09:02:18] Starting s7 codfw switchover
[09:49:05] if you see anything with m5: https://phabricator.wikimedia.org/T303930#9752341
[09:49:17] thanks
[09:54:23] are s7 and s4 soon to be completed?
[09:54:41] s7 yes
[09:54:45] codfw is done actually
[09:54:58] and I will finish eqiad probably tomorrow or Wed
[09:55:05] thanks, then switching backups this week
[09:55:10] great, thank you
[11:01:27] (SystemdUnitFailed) firing: prometheus-mysqld-exporter.service on es1038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:01:36] that's expected
[13:03:53] arnaudb: did you just switch s4 codfw?
[13:04:02] I did
[13:04:09] the old master is still serving API
[13:04:11] please remove it from there
[13:04:13] oh
[14:52:30] performing s6 codfw switchover T363713
[14:52:30] T363713: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T363713
[15:07:41] hm, I've had an issue at runtime with the cookbook
[15:08:10] for future ref: https://www.irccloud.com/pastebin/4PG3Chpn/
[15:11:31] Amir1: marostegui I don't want to break something by taking any uninformed action
[15:11:48] given the situation, and the topology that orchestrator sees
[15:11:58] I'm not sure what the right move is here
[15:17:16] I'll start replication again to avoid taking too much lag
[15:21:28] re-enable GTID too, if that log shows it was disabled
[15:28:43] I've depooled db2151, which is still lagging. GTID re-enabled on db2151 and db2129
[15:30:24] node is back on track, will leave things as they are until marostegui and/or Amir1 are back here, I think the situation is stabilized
[15:48:00] will retry the script upon advice
[15:55:34] arnaudb: if the host is back in sync you can probably repool and retry. It may have just timed out
[15:55:38] Sometimes it happens
[16:02:28] no idea what went wrong as I had no more information than the provided log, haha, but the switchover has been resumed thanks to jynus' and Amir1's help! - T363713
[16:02:29] T363713: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T363713
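For future reference, here is a minimal sketch of the manual recovery described above: re-enable GTID-based replication on the replica, restart replication, and wait for lag to drain before repooling. This is not the switchover cookbook that failed here; the hostname, credentials, and lag threshold are illustrative assumptions, and in production access would go through the standard tooling rather than a direct client connection.

```python
#!/usr/bin/env python3
"""Sketch of the manual recovery steps: re-enable GTID on a MariaDB replica,
restart replication, and wait for lag to drain before repooling.
Hostname, credentials, and threshold are placeholders, not values from the log."""
import time

import pymysql  # assumes the pymysql client library is available

HOST = "db2151.codfw.wmnet"   # the lagging replica mentioned above (placeholder FQDN)
LAG_OK_SECONDS = 1            # assumed threshold, mirroring the (W)1 warning level


def reenable_gtid_and_wait(conn):
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        # Switch the replica to GTID positioning (MariaDB syntax); replication
        # must be stopped while CHANGE MASTER TO is applied.
        cur.execute("STOP SLAVE")
        cur.execute("CHANGE MASTER TO MASTER_USE_GTID = slave_pos")
        cur.execute("START SLAVE")

        # Poll until the replica has caught up enough to be repooled.
        while True:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            lag = status["Seconds_Behind_Master"]
            running = (status["Slave_IO_Running"], status["Slave_SQL_Running"])
            print(f"Using_Gtid={status['Using_Gtid']} lag={lag} running={running}")
            if lag is not None and lag <= LAG_OK_SECONDS and running == ("Yes", "Yes"):
                return
            time.sleep(5)


if __name__ == "__main__":
    # Credentials are placeholders; real runs would use the normal secret handling.
    connection = pymysql.connect(host=HOST, user="repl_admin", password="CHANGEME")
    try:
        reenable_gtid_and_wait(connection)
    finally:
        connection.close()
```

The 1-second threshold simply mirrors the (W)1 warning level from the lag alerts at the top of the log; repooling itself would still be done with the usual tooling.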
[16:49:45] Emperor: T358830 is amusing
[16:49:46] T358830: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830
[17:12:48] good, isn't it? :)
[18:17:44] milimetric: Are you able to see https://phabricator.wikimedia.org/T363633? And if so, can you plan to take a look sometime soon?
[20:24:41] done @andrewbogott, left a comment
[20:24:52] thanks!