[00:00:08] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552156 (ssingh) Thank for reporting. Can you please try again from your home connection (without VPN) and let us know if that works? We depooled...
[00:00:24] RECOVERY - Host cp3066 is UP: PING OK - Packet loss = 0%, RTA = 80.96 ms
[00:00:24] RECOVERY - Host cp3070 is UP: PING OK - Packet loss = 0%, RTA = 80.93 ms
[00:00:24] RECOVERY - Host cp3068 is UP: PING OK - Packet loss = 0%, RTA = 80.98 ms
[00:00:26] RECOVERY - Host cp3079 is UP: PING OK - Packet loss = 0%, RTA = 80.95 ms
[00:00:26] RECOVERY - Host cp3071 is UP: PING OK - Packet loss = 0%, RTA = 80.93 ms
[00:00:26] RECOVERY - Host cp3074 is UP: PING OK - Packet loss = 0%, RTA = 81.01 ms
[00:00:26] RECOVERY - Host cp3078 is UP: PING OK - Packet loss = 0%, RTA = 80.91 ms
[00:00:26] RECOVERY - Host cp3081 is UP: PING OK - Packet loss = 0%, RTA = 80.92 ms
[00:00:26] RECOVERY - Host cp3076 is UP: PING OK - Packet loss = 0%, RTA = 80.85 ms
[00:00:27] RECOVERY - Host cp3075 is UP: PING OK - Packet loss = 0%, RTA = 80.91 ms
[00:00:27] RECOVERY - Host cp3072 is UP: PING OK - Packet loss = 0%, RTA = 80.89 ms
[00:00:28] RECOVERY - Host cp3067 is UP: PING OK - Packet loss = 0%, RTA = 82.67 ms
[00:00:28] RECOVERY - Host cp3077 is UP: PING OK - Packet loss = 0%, RTA = 80.93 ms
[00:00:29] RECOVERY - Host cp3069 is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms
[00:00:29] RECOVERY - Host cp3073 is UP: PING OK - Packet loss = 0%, RTA = 82.67 ms
[00:00:30] RECOVERY - Host cp3080 is UP: PING OK - Packet loss = 0%, RTA = 82.59 ms
[00:00:34] huh
[00:00:38] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:00:40] RECOVERY - Host lvs3008 is UP: PING OK - Packet loss = 0%, RTA = 80.81 ms
[00:00:40] RECOVERY - Host lvs3010 is UP: PING OK - Packet loss = 0%, RTA = 80.75 ms
[00:00:40] RECOVERY - Host ganeti3005 is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms
[00:00:40] RECOVERY - Host lvs3009 is UP: PING OK - Packet loss = 0%, RTA = 81.97 ms
[00:00:40] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.92 ms
[00:00:41] RECOVERY - Host ganeti3008 is UP: PING OK - Packet loss = 0%, RTA = 82.58 ms
[00:00:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:01:00] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:02] RECOVERY - Host ganeti3007 is UP: PING OK - Packet loss = 0%, RTA = 80.94 ms
[00:01:04] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:14] Clearing temporarily for me..
[00:01:18] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:28] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552158 (AlexisJazz) Works now, thanks!
[00:02:10] FIRING: [6x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:02:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-drmrs (185.15.58.146) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:05:28] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:07:10] RESOLVED: [6x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:07:36] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552161 (AlexisJazz) {F71607077} There was a visible dip in editing and surge in error responses.
[00:07:39] RESOLVED: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-drmrs (185.15.58.146) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:13:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: Stang)
[00:19:37] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: repooling esams; link issues resolved, T415473]
[00:19:42] T415473: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473
[00:19:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: repooling esams; link issues resolved, T415473]
[00:24:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:34:23] !log remove static routes for esams ranges on cr1-eqiad
[00:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1232830
[00:40:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1232830 (owner: TrainBranchBot)
[00:50:32] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:54:38] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1232830 (owner: TrainBranchBot)
[00:55:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:05] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1232841
[01:10:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1232841 (owner: TrainBranchBot)
[01:13:24] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 43s)
[01:19:27] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552172 (ssingh) We had a transient link failure between eqiad and esams that resulted in this issue. It should be resolved now, and esams is pool...
[01:26:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:38] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1232841 (owner: TrainBranchBot)
[01:33:42] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552173 (Peachey88)
[02:15:05] (CR) Brouberol: [C:+1] "I wonder if we should instead..." [puppet] - https://gerrit.wikimedia.org/r/1230547 (owner: Ladsgroup)
[02:20:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:20:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:39:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:55:32] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:57:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:02:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:26:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:10] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:02:16] PROBLEM - Ensure acme-chief-api is running on acmechief2002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[06:02:16] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[06:03:16] RECOVERY - Ensure acme-chief-api is running on acmechief2002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[06:03:16] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[06:47:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:02:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T0800).
[08:00:05] samwilson, koi, and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:17] (Abandoned) Aqu: Allow connections to eventgates from Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1228599 (https://phabricator.wikimedia.org/T411989) (owner: Aqu)
[08:03:46] (CR) TrainBranchBot: [C:+2] "Approved by samwilson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1229946 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[08:04:09] o/
[08:04:52] (Merged) jenkins-bot: Enable watchlist labels on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1229946 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[08:05:40] !log samwilson@deploy2002 Started scap sync-world: Backport for [[gerrit:1229946|Enable watchlist labels on testwiki (T413967)]]
[08:05:45] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:10:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:15:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:28:44] !log samwilson@deploy2002 samwilson: Backport for [[gerrit:1229946|Enable watchlist labels on testwiki (T413967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:28:50] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:30:12] !log samwilson@deploy2002 samwilson: Continuing with sync
[08:32:49] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1232979
[08:43:19] !log samwilson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229946|Enable watchlist labels on testwiki (T413967)]] (duration: 37m 38s)
[08:43:23] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:45:07] (PS1) Marostegui: db1264: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1233097 (https://phabricator.wikimedia.org/T415358)
[08:47:03] (CR) Marostegui: [C:+2] db1264: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1233097 (https://phabricator.wikimedia.org/T415358) (owner: Marostegui)
[08:48:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1264 T415358', diff saved to https://phabricator.wikimedia.org/P87925 and previous config saved to /var/cache/conftool/dbconfig/20260126-084852-marostegui.json
[08:48:57] T415358: Migrate 1P db* to Debian Trixie - https://phabricator.wikimedia.org/T415358
[08:49:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1264.eqiad.wmnet with reason: reimage to Trixie
[08:51:24] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1264.eqiad.wmnet with OS trixie
[09:02:15] (CR) Joal: "I think this patch can be abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1229524" [deployment-charts] - https://gerrit.wikimedia.org/r/1228599 (https://phabricator.wikimedia.org/T411989) (owner: Aqu)
[09:03:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1264.eqiad.wmnet with reason: host reimage
[09:08:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1264.eqiad.wmnet with reason: host reimage
[09:18:48] (PS1) Marostegui: Revert "db1264: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1233111
[09:25:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1224.eqiad.wmnet onto db1264.eqiad.wmnet
[09:25:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1224 - Depool db1224.eqiad.wmnet to then clone it to db1264.eqiad.wmnet - marostegui@cumin1003
[09:26:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1224 - Depool db1224.eqiad.wmnet to then clone it to db1264.eqiad.wmnet - marostegui@cumin1003
[09:26:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:27:37] (CR) Marostegui: [C:-2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/1233111 (owner: Marostegui)
[09:28:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1264.eqiad.wmnet with OS trixie
[09:28:21] (PS1) Marostegui: installserver: Do not format /srv on db1264 [puppet] - https://gerrit.wikimedia.org/r/1233114
[09:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:24] (CR) Marostegui: [C:+2] installserver: Do not format /srv on db1264 [puppet] - https://gerrit.wikimedia.org/r/1233114 (owner: Marostegui)
[09:35:05] (Abandoned) Thiemo Kreuz (WMDE): [beta] Start using Cite's Community Configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1133119 (https://phabricator.wikimedia.org/T385597) (owner: Thiemo Kreuz (WMDE))
[09:39:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:40:32] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:43:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2198.codfw.wmnet with reason: schema change
[09:55:36] (PS1) Sergio Gimeno: fix: avoid logging traffic from overridden experiment users [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233116 (https://phabricator.wikimedia.org/T415294)
[09:55:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233116 (https://phabricator.wikimedia.org/T415294) (owner: Sergio Gimeno)
[10:05:30] (CR) BCornwall: [C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1231057 (owner: Ncmonitor)
[10:06:13] (PS1) Sergio Gimeno: fix: avoid displaying incorrect additional userpage link [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233117 (https://phabricator.wikimedia.org/T415291)
[10:06:15] !log brett@dns1006 START - running authdns-update
[10:06:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233117 (https://phabricator.wikimedia.org/T415291) (owner: Sergio Gimeno)
[10:07:51] !log brett@dns1006 END - running authdns-update
[10:08:43] (CR) BCornwall: [C:+1] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1229527 (https://phabricator.wikimedia.org/T415171) (owner: Gerrit maintenance bot)
[10:12:07] (CR) Brouberol: [C:+2] Update dse-k8s-eqiad airflow values [deployment-charts] - https://gerrit.wikimedia.org/r/1229524 (https://phabricator.wikimedia.org/T411989) (owner: Joal)
[10:14:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[10:15:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[10:16:19] (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231058 (owner: Ncmonitor)
[10:18:40] (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1231058 (owner: Ncmonitor)
[10:18:42] (CR) BCornwall: [C:+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231058 (owner: Ncmonitor)
[10:22:00] (PS2) Ladsgroup: kerberos: Add a space after period in MOTD [puppet] - https://gerrit.wikimedia.org/r/1230547
[10:22:09] (PS3) Ladsgroup: kerberos: Add a space after period in MOTD [puppet] - https://gerrit.wikimedia.org/r/1230547
[10:22:12] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231059 (owner: Ncmonitor)
[10:22:21] (CR) Ladsgroup: [V:+2 C:+2] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1230547 (owner: Ladsgroup)