[03:40:17] (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:40:17] (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:04] the above is getting 403s for downloading GeoIP2-* files
[09:10:16] (also a bit annoying to me that alerts every 4h)
[09:10:44] are we the owners of geoip or should be re-routed to another team?
[09:15:13] volans: what does git blame say? :)
[09:16:23] mixture of people :D
[09:17:57] volans: call it the geoip task force and assign it to them :)
[09:20:16] CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[09:20:44] rotfl
[09:27:11] CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[09:27:35] CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff) Open→Resolved This is complete
[09:55:14] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) Stalled→Resolved Automation is up and running. Doc updated: https://wikitech.wikimedia.org/w/in...
[10:01:58] netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) It was mostly a refresher because too much time has passed and do...
[10:03:55] netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) File or directory is the same- but **it has to have an exact nam...
[10:09:43] netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Perfect. Last question, should we keep it compressed or could mak...
[10:15:36] netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) If the idea is to create full backups, it should be compressed-...
[11:23:55] (SystemdUnitFailed) resolved: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:27:12] (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:53] ^^^ I don't want to silence 6 alerts manually, asking o11y how to proceed
[11:32:03] 6 alerts for a single failing unit
[13:44:50] netops, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (fgiunchedi)
[13:54:09] netops, Infrastructure-Foundations, ops-codfw: cr2-codfw:xe-1/0/1:1 down - https://phabricator.wikimedia.org/T353256 (ayounsi) p:Triage→High
[14:04:06] netops, Ganeti, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (ayounsi) Thanks for finding the issue! The host lost its IP in favor of a SLAAC IP ` ganeti5007:~$ ip -6 addr 1: lo: mtu 65536 state...
[14:07:54] folks I fat fingered something in netbox-next and deleted a switch again :(
[14:08:16] gonna restore from recent backup unless anyone has work they are doing there?
[14:08:24] volans, XioNoX: fyi
[14:11:32] the good news is I discovered a bug in our find_tor() function for provisioning which occurs if there is no switch at all in a rack :)
[14:12:04] topranks: -next can be overwritten anytime unless someone is doing anything special
[14:12:12] so you can restore it from prod for me
[14:12:44] volans: cool thanks - yep just checking :)
[15:27:12] (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:42:13] (DiskSpace) firing: Disk space krb1001:9100:/ 5.541% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:27:13] (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:13] (DiskSpace) resolved: Disk space krb1001:9100:/ 1.473% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:48:09] I cleared a bit of disk space on krb1001, one of the other servers is spamming the log. I'm not really sure how Kerberos deals with a full disk, so I truncated one of the logs. There's a backup in /srv.
It's a little late for debugging the actual Kerberos bug, but we should be good on disk until tomorrow
[22:22:40] thanks slyngs
[23:27:13] (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed