[03:40:17] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:40:17] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:04] <volans>	 the above is getting 403s when downloading GeoIP2-* files
[09:10:16] <volans>	 (also a bit annoying to me that it alerts every 4h)
[09:10:44] <volans>	 are we the owners of geoip or should it be re-routed to another team?
[09:15:13] <XioNoX>	 volans: what does git blame say? :)
[09:16:23] <volans>	 mixture of people :D
[09:17:57] <XioNoX>	 volans: call it the geoip task force and assign it to them :)
[09:20:16] <wikibugs>	 CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[09:20:44] <volans>	 rotfl
[09:27:11] <wikibugs>	 CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[09:27:35] <wikibugs>	 CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff) Open→Resolved This is complete
[09:55:14] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) Stalled→Resolved Automation is up and running. Doc updated: https://wikitech.wikimedia.org/w/in...
[10:01:58] <wikibugs>	 netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) It was mostly a refresher because too much time has passed and do...
[10:03:55] <wikibugs>	 netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) File or directory is the same- but **it has to have an exact nam...
[10:09:43] <wikibugs>	 netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Perfect. Last question, should we keep it compressed or could mak...
[10:15:36] <wikibugs>	 netbox, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) If the idea is to create full backups, it should be compressed-...
[11:23:55] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:27:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:53] <volans>	 ^^^ I don't want to silence 6 alerts manually, asking o11y how to proceed
[11:32:03] <volans>	 6 alerts for a single failing unit
[13:44:50] <wikibugs>	 netops, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (fgiunchedi)
[13:54:09] <wikibugs>	 netops, Infrastructure-Foundations, ops-codfw: cr2-codfw:xe-1/0/1:1 down - https://phabricator.wikimedia.org/T353256 (ayounsi) p:Triage→High
[14:04:06] <wikibugs>	 netops, Ganeti, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (ayounsi) Thanks for finding the issue! The host lost its IP in favor of a SLAAC IP  ` ganeti5007:~$ ip -6 addr 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state...
[14:07:54] <topranks>	 folks I fat fingered something in netbox-next and deleted a switch again :(
[14:08:16] <topranks>	 gonna restore from recent backup unless anyone has work they are doing there?
[14:08:24] <topranks>	 volans, XioNoX: fyi 
[14:11:32] <topranks>	 the good news is I discovered a bug in our find_tor() function for provisioning which occurs if there is no switch at all in a rack :)
[14:12:04] <volans>	 topranks: -next can be overwritten anytime unless someone is doing anything special
[14:12:12] <volans>	 so as far as I'm concerned you can restore it from prod
[14:12:44] <topranks>	 volans: cool thanks - yep just checking :)
[15:27:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:42:13] <jinxer-wm>	 (DiskSpace) firing: Disk space krb1001:9100:/ 5.541% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:27:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:13] <jinxer-wm>	 (DiskSpace) resolved: Disk space krb1001:9100:/ 1.473% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:48:09] <slyngs>	 I cleared a bit of disk space on krb1001, one of the other servers is spamming the log. I'm not really sure how Kerberos deals with a full disk, so I truncated one of the logs. There's a backup in /srv. It's a little late for debugging the actual Kerberos bug, but we should be good on disk until tomorrow
[22:22:40] <jhathaway>	 thanks slyngs 
[23:27:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed