[00:16:27] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:15:35] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:12:21] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:22:49] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:21:17] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:20:13] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:48:53] PROBLEM - pt-heartbeat-wikimedia service on db2134 is CRITICAL: CRITICAL - Expecting active but unit pt-heartbeat-wikimedia is failed https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[08:49:12] ^ probably me, checking
[08:50:17] should be fixed now
[08:51:09] RECOVERY - pt-heartbeat-wikimedia service on db2134 is OK: OK - pt-heartbeat-wikimedia is active https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[08:53:06] Finished rebooting all codfw hosts \o/
[08:55:13] yay
[09:02:13] PROBLEM - pt-heartbeat-wikimedia service on db2096 is CRITICAL: CRITICAL - Expecting active but unit pt-heartbeat-wikimedia is failed https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[09:04:29] RECOVERY - pt-heartbeat-wikimedia service on db2096 is OK: OK - pt-heartbeat-wikimedia is active https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[09:05:11] PROBLEM - MariaDB sustained replica lag on x1 on db2096 is CRITICAL: 893 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2096&var-port=9104
[09:09:39] RECOVERY - MariaDB sustained replica lag on x1 on db2096 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2096&var-port=9104
[09:48:39] I am now re-enabling bacula backups of es, now that they are finishing
[09:48:47] waiting for the last one to recover
[11:07:05] jynus: ok to reboot es1021 and es1024 (es4 and es5)?
[11:07:18] one sec to check it had finished
[11:07:30] yeah, no rush, I can do other db* hosts if not
[11:07:33] not a problem
[11:07:34] yeah, all finished now
[11:07:41] go ahead
[11:07:46] thanks!
[12:17:45] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:24:49] jynus: backup2002's primary IPv6 address doesn't have a DNS name, while all the other backup hosts seem to have one. Do you know if it's intentional or just an oversight, and whether we can add the DNS name to it?
[12:24:53] https://netbox.wikimedia.org/ipam/ip-addresses/3405/
[12:25:59] cannot say, because 1002/2002 are specifically mysql clients
[12:26:36] so maybe the error is that 1002 has it and it is suffering from ipv6 timeouts
[12:27:00] the other backup hosts should have it, that is not an issue
[12:28:45] volans: give me some time to check, probably we can add it, but I would like to confirm there are no timeouts
[12:29:09] clients or servers? if they are clients, having a dns name on their IP doesn't change anything about how they connect to other servers
[12:29:23] sure, take your time, nothing time-sensitive here ;)
[12:29:24] I recently had network issues on those hosts specifically
[12:29:30] yeah, but it is complicated
[12:29:58] because in the past I had wrongly configured servers
[12:32:45] it is a question of being consistent across the service, because otherwise I go crazy
[12:34:33] I think 1002 was the one that I reimaged first, and it gave me huge headaches because of buster driver compatibility
[12:34:59] and it could just be a mistake from manual tampering
[12:35:25] e.g. the problems I mentioned to you with "losing" the host after it failed reimaging and not being able to boot it
[12:36:51] I don't see how that can be connected with having an AAAA record in the DNS or not
[12:37:06] that's the only difference
[12:41:10] weird: https://gerrit.wikimedia.org/r/c/operations/dns/+/593596
[12:41:30] was that imported to netbox?
[12:45:32] we keep 2y of changelog; to dig back further than that we would need to restore a backup and check it
[12:45:53] https://netbox.wikimedia.org/dcim/devices/2630/changelog/
[12:46:13] that's 1002
[12:46:19] ah
[12:46:32] both changelogs are empty
[12:46:37] (device and ip)
[12:47:21] yeah, then I think it was a mistake: https://gerrit.wikimedia.org/r/c/operations/dns/+/586431
[12:50:13] can I add the dns name on netbox and run the dns cookbook then? :)
[12:54:23] yes
[12:54:36] wait
[12:54:40] will it restart the network?
[12:54:50] I am running a backup there right now
[12:54:58] a very long one
[13:01:39] not at all, it's just an additional dns record on our authdns hosts
[13:03:46] ok, then go ahead
[13:04:24] well, I guess it could end up in puppet facts and the firewall, forcing a ferm reload?
[13:04:39] but that should not affect ongoing connections
[13:04:40] so all ok
[13:04:59] ok great
[13:05:00] thanks
[13:05:23] thanks to you for spotting it
[13:06:16] sorry for being so careful, I had bad experiences in the past - I think of the network as quite fragile, and that matters when 25+ hour backups cannot continue after a failure
[13:07:03] there is no hurry, we can totally do it later if that makes you more comfortable ;)
[13:07:21] it should be ok, no worries, I was just giving context
[13:16:57] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:00:19] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:13:28] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:00:07] marostegui: https://phabricator.wikimedia.org/T292143#8021972
[19:00:28] the flow of queries returning more than 10K rows is cut in half now
[20:12:39] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
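A quick way to sanity-check the AAAA record change discussed above, once the name has been added in Netbox and the dns cookbook has run, is to confirm that the host's IPv6 address resolves both forward and reverse. The sketch below is only an illustration using the Python standard library: the hostname backup2002.codfw.wmnet is an assumption for the example (the log never states the FQDN), and in practice a plain dig or host query against the authdns servers does the same job.

#!/usr/bin/env python3
# Illustrative sketch only: check that a host's IPv6 address has matching
# forward (AAAA) and reverse (PTR) DNS records. The default hostname below
# is an assumption for the example, not taken from the log.
import socket
import sys


def check_v6_dns(hostname: str) -> int:
    # Forward lookup: collect every AAAA record returned for the host.
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET6)
    except socket.gaierror as exc:
        print(f"{hostname}: no AAAA record ({exc})")
        return 1

    status = 0
    for addr in sorted({info[4][0] for info in infos}):
        # Reverse lookup: the PTR record should point back at the same name.
        try:
            ptr_name, _aliases, _addrs = socket.gethostbyaddr(addr)
        except socket.herror:
            print(f"{addr}: AAAA exists but no PTR record")
            status = 1
            continue
        ok = ptr_name.rstrip(".") == hostname.rstrip(".")
        print(f"{addr} -> {ptr_name} [{'OK' if ok else 'MISMATCH'}]")
        if not ok:
            status = 1
    return status


if __name__ == "__main__":
    # Pass the host you actually changed in Netbox as the first argument.
    sys.exit(check_v6_dns(sys.argv[1] if len(sys.argv) > 1 else "backup2002.codfw.wmnet"))

As noted in the log itself, adding the record only changes what the authdns hosts answer; it does not touch the host's own network configuration, so ongoing connections such as a long-running backup are unaffected.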