[01:09:41] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 5.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:13] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[07:58:14] mmm why https://gerrit.wikimedia.org/r/c/operations/puppet/+/913662 didn't disable notifications on db1132? this is weird
[08:53:55] did it trigger a page?
[08:54:13] maybe it avoids page but irc note stays
[08:54:22] (just thinking out loud)
[08:55:55] no, i just saw it
[08:56:04] but notifications should've been disabled with that patch you sent
[08:56:16] I don't know :/
[09:00:57] I think the issue is that they were disabled manually first: Apr 30 00:48:50 alert1001 icinga: EXTERNAL COMMAND: DISABLE_HOST_SVC_NOTIFICATIONS;db1132
[09:00:57] Apr 30 00:48:50 alert1001 icinga: EXTERNAL COMMAND: DISABLE_HOST_NOTIFICATIONS;db1132
[09:01:11] so maybe they got disabled with an expiration time, then the puppet patch came and...I don't know XD
[09:01:41] I am going to open a task with o11y to see if we can find what happened and if it is a race condition or what
[09:16:07] see this: https://icinga.wikimedia.org/cgi-bin/icinga/config.cgi?type=services&item_name=db1132^Check+for+large+files+in+client+bucket
[09:16:23] Enable Notifications = No
[09:28:50] I just created https://phabricator.wikimedia.org/T335937
[09:29:13] jynus: but the host did alert on icinga earlier today when I was running the check tables
[09:29:16] For lag
[09:29:26] yes, not doubting that
[09:29:42] Anyways, let's see what o11y says!
[09:29:44] just I think the fact that it is disabled but not really I thought was relevant
[13:24:59] jynus: btw I checked and all hosts are using GTID, which reminded me of: https://phabricator.wikimedia.org/T315642
[13:25:33] yeah
[13:25:53] I may have messed up when setting up backup1-codfw
[13:26:06] that alert would have been nice to have
[13:26:53] (not a big deal, as we have comprehensive backups, and mediawiki hosts are handled automatically)
[13:27:38] plus I have not checked, but I am sure no data was lost- will recheck once the hw issues are solved
[13:35:35] I'll be afk for a bit
[15:01:28] Can I get a review for https://gerrit.wikimedia.org/r/c/operations/dns/+/915695
[15:23:10] thanks!
[15:23:30] I was distracted with the staff meeting
[16:36:57] I need to extend backup*003 fs