[00:14:23] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:41] FIRING: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:23] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:41] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:23] FIRING: [7x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:23] FIRING: [7x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:23] FIRING: [2x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:39:23] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:25] elukey: you're not in the changelog, but your fix seems to be there https://github.com/netbox-community/pynetbox/releases/tag/v7.4.0
[07:26:51] netbox, Infrastructure-Foundations: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890#10050137 (ayounsi) Fix released: https://github.com/netbox-community/pynetbox/releases/tag/v7.4.0
[07:59:52] SRE-tools, Infrastructure-Foundations, SRE: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10050175 (hashar) That is still happening and is a bit annoying. The reason for the warning is `su` is invoked with `-` which starts the shell as a login shell...
[08:06:53] netops, Infrastructure-Foundations, SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10050198 (dcaro) That's to be expected when moving lots of data around, we can try to be smarter and/or limit the backfill throughput (and/or add QoS, coming soon!), but it's...
[08:07:01] XioNoX: ah nice another release!
[08:11:17] slyngs: o/ - I'd need to restart tomcat on idp nodes to pick up the new openjdk, already done in test and everything seems good. Do we have a procedure or can I proceed with a simple systemctl restart tomcat9?
[08:11:50] tomcat9 for the old ones and tomcat10 for the new ones
[08:12:40] But yes, the procedure is just to restart tomcat, the downtime isn't long enough that it makes sense to do a failover
[08:13:21] all right doing it :)
[08:14:58] slyngs: afaics on idp[12]003 there is only tomcat9
[08:15:16] Correct, those are the old CAS 6, which I need to decommission
[08:15:44] [12]004 should have only tomcat10
[08:17:19] ahhh they use openjdk 21
[08:17:22] okok makes sense
[08:17:27] I didn't have them in my list for 11/17
[08:18:00] Well then you can restart all you want without affecting production :-)
[08:24:23] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:55] netbox, Infrastructure-Foundations, Patch-For-Review: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653#10050286 (ayounsi) Thanks, it's fixed for those 2 hosts. The trigger was a previous run that generates logs with no associated `object`, so all following re-image...
[08:39:54] hashar: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1060759/ thx for the fix, the parent's CRs are now passing CI
[10:29:06] +1 :)
[10:29:14] Andrew Bogott had the same issue in another repo
[10:30:40] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060453
[10:31:12] so yeah stuff is not supporting python 3.12 :]
[10:31:22] congrats on the patch and rebasing your series on top of it
[11:39:23] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:40:41] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:58:40] netops, Infrastructure-Foundations, SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10050758 (cmooney) >>! In T371879#10049699, @Dzahn wrote: > We got paged at 20:19 UTC for "primary outbound port utilisation over 80%" on both cloudsw1-d5 and cloudsw1-f4 toda...
[12:37:35] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10050973 (ayounsi) This went very well until it didn't. Changes fully rolled back. The cookbook chang...
[12:43:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10050996 (fgiunchedi) Thank you @ayounsi for the write up! I agree with your preferred option, and arg...
[13:17:56] netbox, Infrastructure-Foundations, Patch-For-Review: Decom Netbox 3 servers - https://phabricator.wikimedia.org/T371957#10051133 (ayounsi) Open→Resolved Cleaned up.
[14:08:24] CFSSL-PKI, Infrastructure-Foundations: Establish a process to periodically upgrade the CFSSL infrastructure - https://phabricator.wikimedia.org/T365361#10051346 (CDanis) Thanks to @JMeybohm for giving us a good head start on this: > I pushed a branch (wmf-v1.6.5) to our gitlab cfssl repo with the 1.6.5 v...
[14:10:26] CFSSL-PKI, Infrastructure-Foundations: Establish a process to periodically upgrade the CFSSL infrastructure - https://phabricator.wikimedia.org/T365361#10051358 (JMeybohm) For {T337928} I tried to vendor our v1.6.1 which fails badly as etcd seemed to have renamed a bunch of their libraries at some time....
[14:25:05] netops, Infrastructure-Foundations, SRE: Add link from cloudsw1-e4-eqiad to cloudsw1-f4-eiqad - https://phabricator.wikimedia.org/T372061 (cmooney) NEW p:Triage→Low
[14:32:11] netops, Infrastructure-Foundations, SRE: Add link from cloudsw1-e4-eqiad to cloudsw1-f4-eiqad - https://phabricator.wikimedia.org/T372061#10051491 (cmooney)
[14:46:27] netops, Infrastructure-Foundations, SRE: Add link from cloudsw1-e4-eqiad to cloudsw1-f4-eiqad - https://phabricator.wikimedia.org/T372061#10051545 (cmooney)