[01:09:22] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 10.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:13:06] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[11:19:11] hey there
[11:19:30] the NetOps folks will perform an upgrade of the eqiad row A switches later today, see
[11:19:34] T329073
[11:19:35] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073
[11:19:59] there are a few wiki-replica-db proxies affected, clouddb[1013-1014,1021]
[11:20:39] we are not sure of the best course of action, please advise
[11:22:17] mmmm I don't think they are the proxies... it seems they are actual DBs
[11:32:51] arturo: you should probably depool them, yes
[11:33:58] worst case the proxy will send people to the other host, but better if you can depool: better experience for them
[11:56:51] thanks
[13:46:27] does this patch to depool clouddb look good? https://gerrit.wikimedia.org/r/c/operations/puppet/+/895217
[13:46:46] only for 1013-1014, I don't think 1021 can be depooled?
[13:51:26] no
[13:51:33] it is a server for analytics
[13:51:43] it cannot be depooled
[13:58:09] I'll merge that patch as netops is going to upgrade the switches soon, and hopefully it will make it less disruptive
[14:24:54] o/
[14:29:04] ugh, smartmontools takes >90s to start on ms-be2070 so systemd gets bored and kills it :(
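A note on that smartmontools start timeout: systemd's default start timeout (DefaultTimeoutStartSec) is 90s, which matches the ">90s" observation above. A common fix, sketched below purely as an illustration, is a per-unit drop-in raising TimeoutStartSec; the unit name, drop-in path and 300s value are assumptions, not what was actually done on ms-be2070.

    # hypothetical drop-in, e.g. created with `systemctl edit smartmontools`
    # /etc/systemd/system/smartmontools.service.d/timeout.conf
    [Service]
    # raise the start timeout past the 90s default so a slow startup
    # doesn't get the daemon killed; 300 is only an example value
    TimeoutStartSec=300

`systemctl edit` reloads the unit definition when you save; a drop-in created by hand needs a `systemctl daemon-reload` before the next start.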
[15:30:49] PROBLEM - MariaDB sustained replica lag on es5 on es1023 is CRITICAL: 3616 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1023&var-port=9104
[15:31:17] dhinus: you probably need to reload dbproxy1019
[15:32:45] marostegui: I merged that patch but then never applied it because puppet was disabled on that host
[15:33:05] and now that maintenance is over and puppet is enabled, I don't need the patch anymore :)
[15:33:27] I created a "revert" patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/895286
[15:36:15] RECOVERY - MariaDB sustained replica lag on es5 on es1023 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1023&var-port=9104
[15:37:05] dhinus: right, but you still need to reload the proxy :)
[15:37:09] as the hosts are reported as down
[15:38:53] hmm is there a firing alert?
[15:41:02] yep
[15:41:03] on icinga
[15:41:49] the haproxy failover one?
[15:42:16] ok now I see it in alerts.w.org as well
[15:42:31] I was filtering for team:wmcs
[15:43:02] shouldn't it auto-recover? I'm confused as to why it's firing
[15:44:00] I ran 'systemctl reload haproxy' anyway
[15:44:17] Don't know how you have set up your proxies; for us, masters won't recover automatically and require a reload :)
[15:45:38] how, now after the reload, it depooled 1013-1014
[15:45:40] *hah
[15:45:51] because the revert patch is not merged yet
[15:46:18] could you give it a +1? https://gerrit.wikimedia.org/r/c/operations/puppet/+/895286
[15:46:54] and the alert is still firing :/
[15:48:12] this looks fine though '/usr/local/lib/nagios/plugins/check_haproxy --check=failover'
[15:50:38] dhinus: done
[15:52:48] thanks, merged
[15:53:46] and reloaded haproxy
[15:54:19] the alert is now clear
[15:54:36] great
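For context on the depool/repool sequence above: dbproxy1019 is an haproxy instance whose backend configuration is managed by puppet, so a pooling change only takes effect once the puppet change is merged and applied and haproxy is reloaded. The stanza below is purely illustrative, not the actual wiki-replicas proxy config; the hostnames come from the log, but the section name, port, domain suffix and health-check user are invented for the example.

    # illustrative backend stanza only, not the real dbproxy1019 config
    listen mariadb-web
        bind 0.0.0.0:3306                       # port is an assumption
        mode tcp
        option mysql-check user haproxy_check   # hypothetical check user
        # depooling clouddb1013/1014 means puppet removes (or comments out)
        # their server lines; the running haproxy only notices on reload
        server clouddb1013 clouddb1013.eqiad.wmnet:3306 check
        server clouddb1014 clouddb1014.eqiad.wmnet:3306 check

That is why merging the revert alone was not enough: the 'systemctl reload haproxy' at 15:53:46 is what actually repooled clouddb1013/1014, and the failover alert cleared shortly after.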