[00:03:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87912 and previous config saved to /var/cache/conftool/dbconfig/20260125-000321-marostegui.json [00:03:27] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:03:28] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:03:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [00:03:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87913 and previous config saved to /var/cache/conftool/dbconfig/20260125-000345-marostegui.json [00:09:13] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11551495 (10Aklapper) @SEgt-WMF: Please reply or otherwise this request will get declined. Thanks. [00:27:27] 06SRE, 06Traffic, 07Documentation: TLS 1.2 on Wikimedia DNS DoH resolver not working - https://phabricator.wikimedia.org/T415449#11551498 (10Naruse_shiroha) Okay, updated it in https://meta.wikimedia.org/w/index.php?title=Wikimedia_DNS&diff=prev&oldid=29976804. [00:29:44] 06SRE, 06Traffic, 07Documentation: Documentation error about TLS 1.2 on Wikimedia DNS DoH on metawiki - https://phabricator.wikimedia.org/T415449#11551499 (10Naruse_shiroha) 05Open→03Resolved [00:40:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1232000 [00:40:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1232000 (owner: 10TrainBranchBot) [00:52:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1232000 (owner: 10TrainBranchBot) [01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:01] 06SRE, 06Traffic: TCP Fast Open not working since at least December 2025 - https://phabricator.wikimedia.org/T415454 (10Cuthead) 03NEW [01:04:57] 06SRE, 06Traffic: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11551516 (10Cuthead) [01:10:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1232025 [01:10:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1232025 (owner: 10TrainBranchBot) [01:13:40] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 59s) [01:27:17] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:29:40] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:09] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1232025 (owner: 10TrainBranchBot) [02:35:17] FIRING: ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:55:17] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:29:40] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:23:37] (03CR) 10Peterxy12: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: 10Stang) [06:55:32] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:43:38] 06SRE, 06Traffic: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11551605 (10Bewfip) There are some mentions of TCP Fast Open in operations/puppet: https://codesearch.wmcloud.org/puppet/?q=tcp_fastopen . Though I don't know how the network stack here works. [07:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:55:34] 06SRE, 06Traffic: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11551607 (10Cuthead) Just confirmed TFO on 3 AuthDNS is OK. ` 208.80.153.231 age 179.516sec fo_mss 1024 fo_cookie 2961b389e2ec5b93 source 192.168.1.141 208.80.154.238 age 236.896sec fo_mss 1024... [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260125T0800) [08:43:35] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11551645 (10Ladsgroup) The URL to our thumbnails is not an stable API‌ and shouldn't be treated as such. The actual APIs return URL to... [09:13:05] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11551665 (10Tacsipacsi) I see. However, people //have// been treating it as a stable API, and there are also just too many places wher... [09:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:54:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:55:32] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:17] FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:34:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:10] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:39:52] 10ops-eqiad, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T415466 (10phaultfinder) 03NEW [17:10:17] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87914 and previous config saved to /var/cache/conftool/dbconfig/20260125-172658-marostegui.json [17:27:06] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:27:06] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P87915 and previous config saved to /var/cache/conftool/dbconfig/20260125-173707-marostegui.json [17:47:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P87916 and previous config saved to /var/cache/conftool/dbconfig/20260125-174716-marostegui.json [17:57:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87917 and previous config saved to /var/cache/conftool/dbconfig/20260125-175724-marostegui.json [17:57:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:57:31] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:57:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [17:57:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87918 and previous config saved to /var/cache/conftool/dbconfig/20260125-175749-marostegui.json [19:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:39:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:20:17] FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:17] FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:35:17] FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:17] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:50:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87919 and previous config saved to /var/cache/conftool/dbconfig/20260125-215410-marostegui.json [21:54:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:54:17] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:04:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P87920 and previous config saved to /var/cache/conftool/dbconfig/20260125-220418-marostegui.json [22:14:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P87921 and previous config saved to /var/cache/conftool/dbconfig/20260125-221427-marostegui.json [22:24:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87922 and previous config saved to /var/cache/conftool/dbconfig/20260125-222435-marostegui.json [22:24:42] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:24:42] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:24:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [22:25:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1193 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87923 and previous config saved to /var/cache/conftool/dbconfig/20260125-222500-marostegui.json [22:30:11] 06SRE, 10DNS, 06Traffic, 06Abstract Wikipedia team (26Q3 (Jan–Mar)), and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11552111 (10MLechvien-WMF) [23:17:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [23:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:32:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [23:35:46] PROBLEM - Host cp3068 is DOWN: PING CRITICAL - Packet loss = 100% [23:35:52] PROBLEM - Host cp3077 is DOWN: PING CRITICAL - Packet loss = 100% [23:35:52] PROBLEM - Host cp3072 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:02] PROBLEM - Host cp3075 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:04] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:08] PROBLEM - Host lvs3008 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:10] PROBLEM - Host lvs3009 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:10] PROBLEM - Host ganeti3005 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:10] PROBLEM - Host lvs3010 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:10] PROBLEM - Host ganeti3008 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:14] PROBLEM - Host cp3079 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:14] PROBLEM - Host cp3080 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:14] PROBLEM - Host cp3073 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:14] PROBLEM - Host cp3081 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:14] PROBLEM - Host cp3070 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:14] PROBLEM - Host cp3074 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:14] PROBLEM - Host cp3067 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:15] PROBLEM - Host cp3076 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:15] PROBLEM - Host cp3069 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:16] PROBLEM - Host cp3066 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:16] PROBLEM - Host cp3078 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:17] PROBLEM - Host cp3071 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:18] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:36:32] PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:58] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:37:04] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:37:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:37:10] FIRING: [2x] BFDdown: BFD session down between cr2-drmrs and 185.15.58.147 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:37:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-drmrs (185.15.58.146) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:39:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:40:30] FIRING: LibericaEtcdErrors: Liberica instance lvs3010:3003 is experiencing etcd issues - https://wikitech.wikimedia.org/wiki/Liberica#LibericaEtcdErrors - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&var-site=esams&var-instance=lvs3010&viewPanel=11 - https://alerts.wikimedia.org/?q=alertname%3DLibericaEtcdErrors [23:40:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:42:10] FIRING: [6x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:42:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-drmrs (185.15.58.146) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:43:11] * brett here [23:43:49] Hi all [23:43:50] Request from 88.97.96.89 via cp3071.esams.wmnet, ATS/9.2.11 [23:43:50] Error: 502, Broken pipe at 2026-01-25 23:40:55 GMT [23:43:58] I am based in the United Kingdom [23:44:12] ShakespeareFan00: This is known, thank you! We're working on it now [23:44:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:45:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:45:30] RESOLVED: LibericaEtcdErrors: Liberica instance lvs3010:3003 is experiencing etcd issues - https://wikitech.wikimedia.org/wiki/Liberica#LibericaEtcdErrors - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&var-site=esams&var-instance=lvs3010&viewPanel=11 - https://alerts.wikimedia.org/?q=alertname%3DLibericaEtcdErrors [23:46:03] 10ops-esams, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 07Wikimedia-production-error: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473 (10AlexisJazz) 03NEW [23:46:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:54:41] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: no reason specified, ] [23:55:08] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams [reason: no reason specified, ] [23:59:06] 10ops-esams, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552154 (10AlexisJazz)