[00:16:27] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:15:35] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:12:21] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:22:49] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:21:17] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:20:13] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:48:53] PROBLEM - pt-heartbeat-wikimedia service on db2134 is CRITICAL: CRITICAL - Expecting active but unit pt-heartbeat-wikimedia is failed https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[08:49:12] ^ probably me, checking
[08:50:17] should be fixed now
[08:51:09] RECOVERY - pt-heartbeat-wikimedia service on db2134 is OK: OK - pt-heartbeat-wikimedia is active https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[08:53:06] Finished rebooting all codfw hosts \o/
[08:55:13] yay
[09:02:13] PROBLEM - pt-heartbeat-wikimedia service on db2096 is CRITICAL: CRITICAL - Expecting active but unit pt-heartbeat-wikimedia is failed https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[09:04:29] RECOVERY - pt-heartbeat-wikimedia service on db2096 is OK: OK - pt-heartbeat-wikimedia is active https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[09:05:11] PROBLEM - MariaDB sustained replica lag on x1 on db2096 is CRITICAL: 893 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2096&var-port=9104
[09:09:39] RECOVERY - MariaDB sustained replica lag on x1 on db2096 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2096&var-port=9104
[09:48:39] I am now re-enabling bacula backups of es, now that they are finishing
[09:48:47] waiting for the last one to recover
[11:07:05] jynus: ok to reboot es1021 and es1024 (es4 and es5)?
[11:07:18] one sec to check it had finished
[11:07:30] yeah, no rush, I can do other db* hosts if not
[11:07:33] not a problem
[11:07:34] yeah, all finished now
[11:07:41] go ahead
[11:07:46] thanks!
[12:17:45] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:24:49] jynus: backup2002's primary IPv6 address doesn't have a DNS name, while all the other backup hosts seem to have one. Do you know if it's intentional or just an oversight, and whether we can add the DNS name to it?
[12:24:53] https://netbox.wikimedia.org/ipam/ip-addresses/3405/
[12:25:59] cannot say, because 1002/2002 are specifically mysql clients
[12:26:36] so maybe the error is that 1002 has it and it is suffering from ipv6 timeouts
[12:27:00] the other backup hosts should have it, that is not an issue
[12:28:45] volans: give me some time to check, probably we can add it, but I would like to confirm there are no timeouts
[12:29:09] clients or servers? if they are clients, having a dns name on their IP doesn't change anything about how they connect to other servers
[12:29:23] sure, take your time, nothing time-sensitive here ;)
[12:29:24] I recently had network issues on those hosts specifically
[12:29:30] yeah, but it is complicated
[12:29:58] because in the past I had wrongly configured servers
[12:32:45] it is a question of being consistent across the service, because otherwise I go crazy
[12:34:33] I think 1002 was the one that I reimaged first, and it gave me huge headaches because of buster driver compatibility
[12:34:59] and it could just be a mistake from manual tampering
[12:35:25] e.g. the problems I mentioned to you with "losing" the host after it failed reimaging and not being able to boot it
[12:36:51] I don't see how that can be connected with having an AAAA record in the DNS or not
[12:37:06] that's the only difference
[12:41:10] weird: https://gerrit.wikimedia.org/r/c/operations/dns/+/593596
[12:41:30] was that imported to netbox?
[12:45:32] we keep 2y of changelog; to dig back further than that we would need to restore a backup and check it
[12:45:53] https://netbox.wikimedia.org/dcim/devices/2630/changelog/
[12:46:13] that's 1002
[12:46:19] ah
[12:46:32] both changelogs are empty
[12:46:37] (device and ip)
[12:47:21] yeah, then I think it was a mistake: https://gerrit.wikimedia.org/r/c/operations/dns/+/586431
[12:50:13] can I add the dns name on netbox and run the dns cookbook then? :)
[12:54:23] yes
[12:54:36] wait
[12:54:40] will it restart the network?
[12:54:50] I am running a backup there right now
[12:54:58] a very long one
[13:01:39] not at all, it's just an additional dns record on our authdns hosts
[13:03:46] ok, then go ahead
[13:04:24] well, I guess it could end up in puppet facts and the firewall, forcing a ferm reload?
[13:04:39] but that should not affect ongoing connections
[13:04:40] so all ok
[13:04:59] ok great
[13:05:00] thanks
[13:05:23] thanks to you for spotting it
[13:06:16] sorry for being so careful, I had bad experiences in the past - I think of the network as quite fragile, and that matters when 25+ hour backups cannot continue after a failure
[13:07:03] there is no hurry, we can totally do it later if that makes you more comfortable ;)
[13:07:21] it should be ok, no worries, I was just giving context
[13:16:57] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:00:19] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:13:28] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:00:07] marostegui: https://phabricator.wikimedia.org/T292143#8021972
[19:00:28] the flow of queries returning more than 10K rows is cut in half now
[20:12:39] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
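A quick way to sanity-check the AAAA record change discussed above, once the name has been added in Netbox and the dns cookbook has run, is to confirm that the host's IPv6 address resolves both forward and reverse. The sketch below is only an illustration using the Python standard library: the hostname backup2002.codfw.wmnet is an assumption for the example (the log never states the FQDN), and in practice a plain dig or host query against the authdns servers does the same job.

#!/usr/bin/env python3
# Illustrative sketch only: check that a host's IPv6 address has matching
# forward (AAAA) and reverse (PTR) DNS records. The default hostname below
# is an assumption for the example, not taken from the log.
import socket
import sys


def check_v6_dns(hostname: str) -> int:
    # Forward lookup: collect every AAAA record returned for the host.
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET6)
    except socket.gaierror as exc:
        print(f"{hostname}: no AAAA record ({exc})")
        return 1

    status = 0
    for addr in sorted({info[4][0] for info in infos}):
        # Reverse lookup: the PTR record should point back at the same name.
        try:
            ptr_name, _aliases, _addrs = socket.gethostbyaddr(addr)
        except socket.herror:
            print(f"{addr}: AAAA exists but no PTR record")
            status = 1
            continue
        ok = ptr_name.rstrip(".") == hostname.rstrip(".")
        print(f"{addr} -> {ptr_name} [{'OK' if ok else 'MISMATCH'}]")
        if not ok:
            status = 1
    return status


if __name__ == "__main__":
    # Pass the host you actually changed in Netbox as the first argument.
    sys.exit(check_v6_dns(sys.argv[1] if len(sys.argv) > 1 else "backup2002.codfw.wmnet"))

As noted in the log itself, adding the record only changes what the authdns hosts answer; it does not touch the host's own network configuration, so ongoing connections such as a long-running backup are unaffected.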