[04:43:59] Going to start with s6 switchover
[06:54:05] PROBLEM - MariaDB sustained replica lag on s5 on db1213 is CRITICAL: 34.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[06:55:05] RECOVERY - MariaDB sustained replica lag on s5 on db1213 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[08:18:21] arnaudb: marostegui I've pinpointed the issue to a mail delivery issue, so my monitoring worked, but the emails were lost, see: https://phabricator.wikimedia.org/T369253 Nothing for you to do, but I asked IF to comment. Can be closed if the mail issue has been fixed.
[08:18:39] ack, thanks jynus!
[09:22:49] schema change on old master of s6 happening, it's not pooled, I'll repool it once done
[14:26:45] fyi I've broken prometheus-mysqld-exporter on clouddb2002-dev
[14:26:51] reverting to the previous version
[14:29:54] (does not fix anything, i'm downtiming it and will open a ticket)
[14:32:13] all things considered, i'm not sure it was working to begin with 🤔 https://grafana.wikimedia.org/goto/oEi__zlIg?orgId=1
[16:05:42] I don't remember what clouddb2002-dev is for, and I'm not finding any references on wikitech
[16:06:00] it's a -dev server anyway so it should not be important
[16:06:48] I don't know if clouddb1* were affected, but the exporter seems to be working fine on those
[16:13:12] I caught up in private with arnaudb -- you can ignore the above
[16:23:03] so, in the end, thanks dhinus for digging up that ticket T229559 that does not give us any answer but is a nice cliffhanger for this host!
fyi I've bumped prometheus-mysqld-exporter to its backport version on clouddb1018, will do the same tomorrow on the rest of the few hosts that still have the old version
[16:23:04] (https://debmonitor.wikimedia.org/packages/prometheus-mysqld-exporter) because I'll need to enable pt-heartbeat-utc which does not appear before 0.13
[16:23:04] T229559: CloudVPS: codfw1dev: database backup for clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T229559
[16:26:33] dhinus: clouddb2002-dev has the labtestwiki but apart from that I don't know what else it has or is used for
[16:26:38] I created T369308 to find out what clouddb2002-dev is for and/or decommission it
[16:26:39] T369308: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308
[16:27:07] not urgent but at least it will come up if somebody searches for that host name
[16:27:42] marostegui: labtestwiki rings a bell, I'm not entirely sure it's still in there but it's possible
[16:28:08] It is there
[16:28:18] Whether it is used...that I don't know
[16:28:31] But we still have to apply schema changes there
[16:28:55] ok thanks!
[16:29:39] a.ndrew and b.d808 will probably know if/when we can migrate that somewhere else (hopefully a VM?)
[16:30:23] * dhinus offline
[16:41:48] you got another perfect host for testing the recovery procedures :D
[17:32:23] arnaudb: I've updated https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1043753 with some refactor and started adding the tests, will complete them on monday. In the meanwhile if you have time please start reviewing the changes since your last PS.