[00:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:03:55] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:51:45] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:06:45] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:53:26] netops, Infrastructure-Foundations, SRE: magru network setup - https://phabricator.wikimedia.org/T362421#9807801 (ayounsi) Before advertising ns2, we need to do some traffic engineering. Telxius being part of Spain's main ISP, Telefonica ES prefers magru to drmrs: See https://w.wiki/A6qH {F53575207}...
[08:04:41] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9807815 (MoritzMuehlenhoff)
[08:51:50] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:26:24] netops, Infrastructure-Foundations, SRE: magru network setup - https://phabricator.wikimedia.org/T362421#9808055 (cmooney) +1 sounds like a good idea. Nice we have some limited scope to experiment with the DoH ranges before pulling the plug on ns2. FWIW I think these would be the ones to use with E...
[09:27:06] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9808056 (MoritzMuehlenhoff)
[10:06:45] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:55] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:32:51] netops, Infrastructure-Foundations, ops-codfw, SRE: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9808187 (cmooney) Pcap of DHCP request from contint2002 here: {F53586857}
[10:35:19] netops, Infrastructure-Foundations, SRE: magru network setup - https://phabricator.wikimedia.org/T362421#9808194 (ayounsi) Cogent is a bit surprising, from EU or the US they route to magru. `lines=15 Fri May 17 10:29:23.898 UTC BGP routing table entry for 185.71.138.0/24 Versions: Process...
[10:42:19] netops, Infrastructure-Foundations, ops-codfw, SRE: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9808229 (cmooney) One observation is that the NAKs are unique insofar as they are sent from 208.80.153.33 (Switch IRB int IP) to 255.255.25...
[10:49:32] netops, Infrastructure-Foundations, ops-codfw, SRE: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9808246 (cmooney) Also I didn't see in the dhcpd docs any way to constrain the generation of NAKs in response to invalid REQUEST messages. [...
[10:59:46] netops, Infrastructure-Foundations, SRE: magru network setup - https://phabricator.wikimedia.org/T362421#9808254 (cmooney) >>! In T362421#9808194, @ayounsi wrote: > They might prefer going through EdgeUno once we add the prepending to Novvacore, so the same change would be needed there as well. It's...
[11:21:58] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9808280 (MoritzMuehlenhoff)
[12:51:45] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:17:13] netops, Infrastructure-Foundations, SRE: magru network setup - https://phabricator.wikimedia.org/T362421#9808627 (ayounsi) The Telxius community doesn't seem to be of any effect so far, I'll wait for their reply, maybe they changed or need to be enabled on their side first. I'll look at the other pro...
[13:24:56] netops, Infrastructure-Foundations, SRE: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579#9808649 (cmooney) Just a note on this, I only discovered this document after the task: https://www.juniper.net/documentation/us/en/software/nce/nce-216-evpn-...
[14:20:44] hello folks
[14:21:04] I checked the golang-cfssl version that we have (at least on pki1001), and it is 1.6.1
[14:21:22] from https://github.com/cloudflare/cfssl/tags it seems that we may want to upgrade in the future, we are lagging a little behind
[14:21:44] they don't release often, and the changelog doesn't look horrible
[14:21:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:24] it would take quite some time, not sure if we ever attempted an upgrade before
[14:26:30] Would definitely be nice to stay up to date, so we should probably start some type of cadence for upgrades
[14:26:45] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:28:02] yep I agree, the first time it may take more but afterwards it should be easier
[14:28:22] yup
[14:28:27] the fact that it is a go binary seems to ease the work a lot, the testing part is what scares me
[14:28:34] now that a ton of things use cfssl
[14:35:08] yeah, I agree we need to do that sooner rather than later
[14:35:17] putting it off is only going to make it scarier
[14:42:58] The scary part of this reminded me of this recent post, https://devblogs.microsoft.com/oldnewthing/20240513-00/?p=109750, perhaps we should try a noop "upgrade" first, just to make sure we know all the parts to monitor
[14:47:51] that's interesting
[14:47:58] I love the idea but what do you think it would entail in this case?
[14:48:34] pontoon would probably be a good starting point for the testing part
[14:48:58] ye
[14:49:00] something like build a new package, as we would expect to build the new version, deploy the new package and test that our monitoring is alerting us correctly
[14:49:24] elukey: cfssl also works in dcl, you get a pki server in the base setup, so that is another option for testing
[14:49:54] ah yes right! I need to work with dcl, never checked up to now
[14:50:19] jhathaway: have you thought about writing external 'integration tests' that wrap around a dcl environment?
[14:50:58] yes, that is essentially what I have done for the postfix work, https://gitlab.wikimedia.org/jhathaway/mx-tests
[14:51:11] *beware scary bash lurks in that link* :)
[14:51:19] scary bash is my favorite
[14:54:19] jhathaway: hey this is a lot less scary than I imagined
[14:54:58] sounds like you do like scary bash ;)
[14:57:45] TIL dist-upgrade.sh, thanks jhathaway
[14:58:11] (I am trying to upgrade the pki nodes in cloud)
[14:58:19] jhathaway: is `poll` a bats-ism?
[14:58:40] (of course I did it manually the first time without knowing the dist-upgrade script and I didn't clean the puppet cached facts)
[14:59:08] elukey: glad you found it helpful
[15:00:02] cdanis: no, its definition is in the test_helper/common-setup.bash
[15:00:54] ah!
[15:01:02] reading is a useful skill
[15:21:07] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9809120 (cmooney) Re-reading the man page for dhcpd.conf it seems that potentially changing the 'authoritative' stateme...
[15:46:53] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9809234 (cmooney) From what I can tell the 'authoritative' statement only controls NAK generation. I think we're hittin...
[15:48:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:14:44] netops, Infrastructure-Foundations, ops-eqiad: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289 (CDanis) NEW
[19:14:56] netops, Infrastructure-Foundations, ops-eqiad: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9810003 (CDanis) p:Triage→High
[19:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed