[03:04:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:50] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843#10072005 (ayounsi) Open→Resolved a:ayounsi After discussions we decided to go down the first path of the list. I couldn't replicate the issue since...
[07:50:43] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:45] I opened https://phabricator.wikimedia.org/T372728 for the issue above, with a possible fix.
[11:29:35] netbox, Infrastructure-Foundations: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590#10072831 (ayounsi) Open→Resolved All tested and deployed.
[11:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:07:55] Mail, Infrastructure-Foundations: Updating forwarding rules for Jimmy@wikipedia.org. - https://phabricator.wikimedia.org/T371884#10072913 (Ladsgroup) Open→Resolved Done now.
[13:58:27] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10073333 (Papaul) @Dwisehaupt all working thank you.
[14:21:13] netbox, Infrastructure-Foundations: Upgrade Netbox to 4.1 - https://phabricator.wikimedia.org/T371889#10073447 (joanna_borun) p:Triage→Low
[14:21:47] netops, Infrastructure-Foundations: Apply egress Source Address Validation on the Wikimedia core routers - https://phabricator.wikimedia.org/T372158#10073453 (joanna_borun) p:Triage→Low
[14:22:02] netops, Infrastructure-Foundations: Apply egress Source Address Validation on the Wikimedia core routers - https://phabricator.wikimedia.org/T372158#10073459 (joanna_borun) a:Southparkfan
[14:22:30] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr1-esams) - https://phabricator.wikimedia.org/T372248#10073461 (ayounsi) p:Triage→Low
[14:23:25] netops, Infrastructure-Foundations: Publish, and maintain ASPA records for valid AS14907 upstreams - https://phabricator.wikimedia.org/T372161#10073465 (joanna_borun) p:Triage→Low a:Southparkfan
[14:30:10] SRE-tools, Infrastructure-Foundations, Spicerack: sre.hosts.reimage failing due to mkfs.ext4 taking to long - https://phabricator.wikimedia.org/T372648#10073512 (SLyngshede-WMF) p:Triage→Medium a:SLyngshede-WMF It's probably enough to bump the default timeout as a quick fix. I'll take a l...
[14:43:24] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr1-esams) - https://phabricator.wikimedia.org/T372248#10073647 (ayounsi) Open→Resolved Peer removed.
[14:47:22] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781 (ayounsi) NEW p:Triage→High
[15:41:33] hello I/F folks - would anyone be interested in reviewing two patches related to the creation of a new apt repo component? (for php 8.1 migration)
[15:49:40] swfrench-wmf: for stuff like that IMO just dump the links in the channel when you ask :D
[15:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:00:32] cdanis: ah, that's a good idea :)
[16:00:32] * https://gerrit.wikimedia.org/r/c/operations/puppet/+/1062753
[16:00:32] * https://gerrit.wikimedia.org/r/c/operations/puppet/+/1062754
[16:06:14] +1'd
[16:19:42] cdanis: thank you! one follow-up question, I see manual actions documented for component removal, but nothing for component addition. is that indeed entirely automated?
[16:20:55] swfrench-wmf: I think so
[16:23:52] awesome, that's what I figured after (a) seeing nothing documented and (b) snooping on some tasks related to adding components and not seeing anyone SAL'ing a command to do this
[19:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed