[01:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:58] SRE-tools, Infrastructure-Foundations, serviceops-deprecated: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737005 (JMeybohm) Resolved→Open This additional confirmation thing is making bigger reboots pretty annoying since one has to come back and...
[08:47:21] SRE-tools, ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737009 (JMeybohm)
[09:01:05] SRE-tools, ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737043 (MLechvien-WMF) Good point. IMO it feels more intuitive/predictable to have the careful version as the default, and add a `--force` flag which bypasses all confirmation. If it's...
[09:04:54] SRE-tools, ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737062 (JMeybohm) I'm not a huge 'confirmation-fan' in general, but sgtm.
When you're at it you could also make the cookbooks that call 'pool-depool-node' call it with `--force`
[09:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:00] taavi: fwiw you were correct about the Anycast patch you linked yesterday being responsible for the internet routing change :)
[09:45:05] my brain wasn't working right, that does just affect our internal routing, but the reason it was added was so that the aggregates get created and announced, so once the local internal route was preferred we began creating the aggregate, and announcing to peers/transit
[10:38:32] netbox, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11737597 (Volans) That's what's in puppetdb and what's reported by facter on the host though: ` $ sudo facter -p...
[11:29:14] SRE-tools, Cumin, Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360#11737762 (Volans) p:Triage→Medium a:Volans
[12:21:48] FIRING: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:08:47] o/ still looking for reviews for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1212097 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211650
[13:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:23:25] taavi: hey I know jess e had been looking at the nftables one, I'd been trying to find time
[13:23:31] thanks for that it seems like great work <3
[13:24:07] I'd not feel confident about the wmflib one but I'll mention these at our team meeting today see if anyone can take a look
[14:21:37] taavi: I'll have the nftables review finished today, sorry for the long wait
[15:09:09] netops, Infrastructure-Foundations, SRE: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11738741 (cmooney) Open→Resolved a:cmooney Ok this should no longer be an issue after updating the `wikimedia6` prefix...
[16:13:47] Hello.
Here's an interesting puppet failure that you might want to know about: https://puppetboard.wikimedia.org/report/an-worker1172.eqiad.wmnet/ff9b1f89512c32b72c9deb0490e0e29dc0b33a96
[16:15:09] It seems to be related to the removal of the Puppet 5 root CA, but it's still mentioned here: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/profile/base/certificates.yaml#L8
[16:15:21] It's a freshly reimaged host.
[16:15:42] Not urgent, from my seide.
[16:15:47] *side.
[16:19:22] 5~
[16:22:03] FIRING: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:32:01] paravoid: all good with you I hope!
[16:36:48] RESOLVED: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:38:46] netops, Infrastructure-Foundations, SRE: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975 (cmooney) NEW p:Triage→Medium
[17:40:26] netops, Infrastructure-Foundations, SRE: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975#11739860 (cmooney)
[18:48:54] netops, Infrastructure-Foundations, SRE: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740224 (ssingh) Thanks for all the work here @cmooney and for mentioning this, something that I had most certainly overlooked at least. I will think a bit...
[19:18:46] netops, Infrastructure-Foundations, SRE: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740353 (cmooney) Thanks @ssingh. I think a cookbook that takes down doh and durum simultaneously at a site (I assume by changing bird?) would solve this p...
[20:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:19:55] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:28:07] hi IF, I'm messing around with mirrors.wikimedia.org since it's not responding? if anyone's around who knows more than me, lmk :) otherwise I'll continue fiddling
[22:37:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:42:31] ^ that wasn't me, looks like we're back :) not doing anything in that case