[09:41:25] FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:43:57] ^ puppetserver2004 is being set up, that will resolve once all patches are deployed
[09:56:25] FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:41:25] FIRING: [2x] SystemdUnitFailed: puppetserver.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:47:28] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10682142 (cmooney) This has been stable ever since the replacement fwiw, no drops or errors etc. So I think we are good to close it {F58929621 width=600}
[11:06:25] FIRING: [2x] SystemdUnitFailed: puppetserver.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:07:48] FIRING: PuppetFailure: Puppet has failed on puppetserver2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:11:25] FIRING: [2x] SystemdUnitFailed: puppetserver.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:56] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10682999 (aborrero) there was a major network outage as a result of the operations that affected all WMCS systems, including Ceph and Toolforge kubernetes.
[13:37:59] volans: hello!
[13:38:18] remember the fun hardware configuration check discussion we had :)
[13:38:31] ChrisDobbins901_ and I wanted to talk about that to take it further and see the best path forward
[13:38:44] is it fine to set up a call sometime for next week to talk about that?
[13:38:51] or we can do phabricator as well
[13:38:52] thoughts :)
[14:09:47] sukhe: hey, could the tooling and automation office hours on Wed. 5pm UTC work?
[14:09:58] oh yes, that's perfect
[14:10:00] let's do that?
[14:10:07] you can add your items to the agenda here: https://docs.google.com/document/d/1lwgiSgrbFapRjFvQlqeU8Zk0LVljTI90SJGG-TFg5gE/edit?tab=t.0
[14:10:21] checking
[14:10:47] thanks, added
[14:10:49] see you then
[14:10:50] ChrisDobbins901_: ^
[14:10:51] usually also pap.aul attends them
[14:11:05] that's even better then
[14:11:25] the event is on the staff calendar in case you're looking for it
[14:11:50] yep, added to the notes and I see it
[14:15:46] thanks! perfect
[14:36:51] netops, Infrastructure-Foundations, ops-drmrs: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10683332 (RobH) Open→Resolved Awesome, I've updated the ticket to Interxion so they can close it.
[14:58:43] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10683425 (cmooney) Just to confirm the timeline of events: (IMPORTANT) Mar 27 13:06:57: IBGP configuration committed on all 4 cloudsw, enabling IBGP in the...
[15:26:54] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10683573 (Jhancock.wm)
[15:33:08] netops, Infrastructure-Foundations, SRE, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10683594 (cmooney) @aborrero @taavi one thing we could maybe try, if we wanted to make progress sooner (i.e. without replicating the setup elsewhere): * Add...
[16:10:02] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10683734 (cmooney) >>! In T388641#10678444, @ayounsi wrote: > @cmooney another question is that if the service is not present...
[16:10:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:50:48] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10684167 (RobH)
[20:41:24] volans: (or whoever might know) is there a way to pass the mgmt password to the reimage cookbook? our `sre.elasticsearch.rolling-operation` cookbook will be calling the reimage cookbook as we upgrade clusters from elasticsearch to opensearch, and it'd be great if the operator didn't have to input the password for each invocation
[20:46:00] here's some context on what we're doing: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1131446/5/cookbooks/sre/elasticsearch/rolling-operation.py . If we could supply the mgmt password once it could speed up the operation