[00:03:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[06:45:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[06:55:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[09:44:24] netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10457073 (cmooney) >>! In T371501#10453986, @dcaro wrote: > We still have to restart all the osd daemon processes to pick up the config chan...
[09:59:23] netops, Infrastructure-Foundations: Publish, and maintain ASPA records for valid AS14907 upstreams - https://phabricator.wikimedia.org/T372161#10457124 (cmooney) Thanks for keeping up to date on this @Southparkfan! >>! In T372161#10449848, @Southparkfan wrote: > I understand hosting our own CA setup is...
[11:25:39] netops, Infrastructure-Foundations: Multiple unreachable hosts in eqiad - https://phabricator.wikimedia.org/T382772#10457362 (cmooney) Open→Resolved I've tried to work out what went on here but wasn't really able to find anything. The common factor is //cr1-eqiad//, which connects to cloudsw...
[11:36:35] netops, Hiddenparma, Infrastructure-Foundations, Prod-Kubernetes, Kubernetes: Allow reaching services on the aux k8s cluster bypassing the CDN - https://phabricator.wikimedia.org/T382269#10457395 (cmooney) >>! In T382269#10421893, @akosiaris wrote: > Calico Open Source version doesn't support...
[14:06:13] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10457893 (elukey) Some tests to see if JBOD could be forced directly from the OS without rebooting into...
[15:24:31] netops, Hiddenparma, Infrastructure-Foundations, Prod-Kubernetes, Kubernetes: Allow reaching services on the aux k8s cluster bypassing the CDN - https://phabricator.wikimedia.org/T382269#10458292 (CDanis) I am wondering if we really need the ability to expose aux services with public IPs. In...
[15:42:08] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10458416 (elukey) I fear that this SAS controller doesn't support JBOD unless it is configured via BIOS...
[15:55:54] netbox, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2024/2025-Q3): Port netbox reports checks to Prometheus/Alertmanager - https://phabricator.wikimedia.org/T374823#10458523 (lmata)
[15:59:42] Puppet, MW-on-K8s, Observability-Alerting, SRE Observability (FY2024/2025-Q3): Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#10458592 (lmata)
[16:00:04] Packaging, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2024/2025-Q3): upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#10458597 (lmata)
[18:20:49] netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10459488 (dcaro) Just finished restarting all the osd daemons, all the traffic should now being tagged correctly 👍
[19:17:14] Puppet, Data-Engineering-Radar, SRE: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104#10459746 (Ottomata)
[19:17:43] Puppet, Data-Engineering-Radar, SRE: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104#10459748 (Ottomata) Data-Engineering no longer operates udp2log. SRE should feel free to decline this task at will.
[20:04:29] Mail, Infrastructure-Foundations, SRE, Patch-For-Review: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271#10460171 (jhathaway) a:jhathaway
[20:08:56] Mail, Infrastructure-Foundations, SRE, Patch-For-Review: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271#10460184 (jhathaway) Open→Resolved @DSeyfert_WMF this appears to be a regression in our mail servers when migrating from Exim to Postfix. Exim had a defa...
[21:50:14] netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10460515 (cmooney) >>! In T371501#10459488, @dcaro wrote: > Just finished restarting all the osd daemons, all the traffic should now being t...
[22:18:32] netops, Infrastructure-Foundations, observability, Observability-Alerting, SRE: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350#10460558 (cmooney) Open→Resolved a:cmooney >>! In T335350#10456238, @andrea.denisse wrote: > Hi @cmooney,...