[00:52:54] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12032143 (Papaul) @BCornwall @MoritzMuehlenhoff thanks to all of you getting this done
[08:58:52] Traffic, Infrastructure-Foundations, SRE: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12032690 (MLechvien-WMF) That's correct, we could not prioritize Sophroid remaining work so far and our immediate/Q1 capacity is limited, but we're inte...
[09:00:44] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12032722 (MLechvien-WMF)
[09:02:25] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12032735 (MoritzMuehlenhoff) >>! In T429175#12032690, @MLechvien-WMF wrote: > That's correct, we could not prioritize Sophroid remai...
[09:57:06] netops, Infrastructure-Foundations, SRE: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12032936 (cmooney) >>! In T429488#12029612, @ayounsi wrote: > Not sure if it's worth adding something temporary to setup BGP on th...
[10:01:09] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12032963 (ops-monitoring-bot) Deployed hiddenparma to alert[1002,2002].wikimedia.org with reason: Change provenance var context...
[10:08:38] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12032974 (Fabfur)
[10:30:37] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12033025 (Volans) Do you need anything added to the [[ https://wikitech.wikimedia.org/wiki/Logs/Runbook#Superset_dashboards | su...
[10:55:12] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12033148 (Fabfur)
[12:51:13] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12033490 (Fabfur)
[12:56:15] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12033502 (ops-monitoring-bot) VM prometheus5003.eqsin.wmnet switching disk type to drbd
[13:16:01] netops, Cloud-VPS, Infrastructure-Foundations, tools-infrastructure-team, cloud-services-team (FY2025/2026-Q3-Q4): Establish a blackbox network probe vantage point into cloud realm - https://phabricator.wikimedia.org/T429451#12033583 (fgiunchedi) >>! In T429451#12030754, @cmooney wrote: > @fg...
[13:18:29] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12033588 (Fabfur)
[13:18:58] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12033590 (Fabfur) >>! In T427068#12033025, @Volans wrote: > Do you need anything added to the [[ https://wikitech.wikimedia.org/...
[13:28:42] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12033635 (Fabfur) Checking with @Ahoelzl and @JAllemandou on where to go from there
[13:32:13] netops, Cloud-VPS, Infrastructure-Foundations, tools-infrastructure-team, cloud-services-team (FY2025/2026-Q3-Q4): Establish a blackbox network probe vantage point into cloud realm - https://phabricator.wikimedia.org/T429451#12033677 (cmooney) >>! In T429451#12033583, @fgiunchedi wrote: > I'm...
[13:33:52] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12033696 (ops-monitoring-bot) VM prometheus5003.eqsin.wmnet switching disk type to drbd
[14:04:21] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12033926 (ssingh) >>! In T429175#12032690, @MLechvien-WMF wrote: > That's correct, we could not prioritize Sophroid remaining work s...
[14:06:54] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12033942 (ssingh) Thanks for the work on this @Fabfur! Andreas, is there anything else that needs to be done at our end other than adding this to webr...
[15:19:38] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12034446 (JAllemandou) I have checked the kafka streams, the new field is present in both `webrequest_frontend_text` and `webrequest_sampled`. The bal...
[15:41:21] is something up with dns7002 ? The switch is alerting about BFD not being up on v4 and v6: https://alerts.wikimedia.org/?q=scope%3Dnetwork&q=instance%3Dasw1-b4-magru%3A9804&q=alertname%3DBFDdown
[15:45:00] XioNoX: cjd91 and sukhe are in process of upgrading it to trixie
[15:45:17] ok cool, can you downtime the alert with the relevant task?
[15:52:02] XioNoX: thanks, we will. sorry, we thought the downtime would be enough
[15:56:28] there is no easy way to tie a switch side BFD alert to a server :(
[16:05:27] yeah, so manual downtime it is :)
[16:05:35] we will reimage it today again and hopefully bring it up
[17:28:59] Traffic, DNS, SRE, Patch-For-Review: new CNAME record for WikiLearn - https://phabricator.wikimedia.org/T429628#12034949 (CDobbins) Open→In progress a:CDobbins
[17:37:27] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#12034974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host dns7002.wikimedia.org with OS bookworm
[18:25:01] Traffic, DNS, SRE, Patch-For-Review: new CNAME record for WikiLearn - https://phabricator.wikimedia.org/T429628#12035061 (ssingh) @Asaf: It seems like the record already exists and is being served: ` dig _e8216d92d36158dd2198ac46e3739de7.learn.wiki +short _58bdabc6b3bcd7a4a822c4b55d531e26.tjxr...
[18:54:21] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#12035127 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host dns7002.wikimedia.org with OS bookworm completed: - dns7002 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabl...
[18:58:18] Traffic, Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106#12035135 (BCornwall) Due to budgetary restraints, we're going to experiment with seeing how much worse off we'd be with single NVMe drives. To that end, we're planning on expe...
[19:32:27] FIRING: SystemdUnitCrashLoop: acme-chief.service crashloop on acmechief2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:34:25] FIRING: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:37:00] looking
[19:39:09] consistently erroring on non-canonical-redirect-33 ?
[19:39:46] It's also consistently the first prevalidated
[19:39:55] brett: Jun 18 18:28:08 acmechief2002 acme-chief-backend[600]: Failed to perform DNS zone update for certificate non-canonical-redirect-33 / ec-prime256v1
[19:40:03] this happened an hour ago and it hasn't retried the DNS zone updates
[19:40:28] (guessing from the logs)
[19:42:15] aha
[19:42:21] but
[19:42:26] there's also ongoing letsencrypt issues, anyway
[19:42:27] RESOLVED: SystemdUnitCrashLoop: acme-chief.service crashloop on acmechief2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:42:47] lies, it just crashed a minute ago lol
[19:42:57] https://letsencrypt.status.io/
[19:44:25] RESOLVED: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:45:31] eh?
[19:45:45] how'd that happen?
[19:45:59] yeah, not sure, systemd has (for now) given up on restarting the service unit
[19:46:00] it looks pretty failed to me
[19:46:02] but it's absolutely still failed
[19:46:14] everything I know is a lie
[19:46:25] FIRING: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:54:58] FIRING: SystemdUnitCrashLoop: acme-chief.service crashloop on acmechief2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:56:25] RESOLVED: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:25] FIRING: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:58] RESOLVED: SystemdUnitCrashLoop: acme-chief.service crashloop on acmechief2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:23:25] RESOLVED: SystemdUnitFailed: acme-chief.service on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed