[00:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:11] 06SRE, 10DNS, 06serviceops, 06Traffic, 13Patch-For-Review: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10846254 (10Scott_French) I just chatted with @jasmine_, who is interested in helping to deploy this change. Many thanks for preparing a patch, @Dzahn! [00:07:07] (03CR) 10Scott French: "Swapping myself in for Reuven." [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [00:08:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1148986 [00:08:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1148986 (owner: 10TrainBranchBot) [00:10:38] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:22:48] (03CR) 10Cwhite: [C:03+1] "Looks like the right thing to do given that graphite is RO." [puppet] - 10https://gerrit.wikimedia.org/r/1148825 (owner: 10Majavah) [00:23:54] (03CR) 10Cwhite: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148915 (https://phabricator.wikimedia.org/T394470) (owner: 10Andrea Denisse) [00:44:12] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:47:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [01:07:12] (03PS1) 10Andrew Bogott: codfw1dev: allow public access to openstack APIs for a bit [puppet] - 10https://gerrit.wikimedia.org/r/1148990 [01:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:08:51] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: allow public access to openstack APIs for a bit [puppet] - 10https://gerrit.wikimedia.org/r/1148990 (owner: 10Andrew Bogott) [01:09:22] (03PS1) 10Andrew Bogott: Revert "codfw1dev: allow public access to openstack APIs for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/1148991 [01:09:40] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1148986 (owner: 10TrainBranchBot) [01:51:08] (03CR) 10Andrew Bogott: [C:03+2] Revert "codfw1dev: allow public access to openstack APIs for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/1148991 (owner: 10Andrew Bogott) [02:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:23:32] 
FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [03:17:38] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:32:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [03:53:27] 06SRE, 10Observability-Alerting: when servers are about to run out of disk, monitoring should notify the owners - https://phabricator.wikimedia.org/T394955#10846372 (10Aklapper) [03:55:11] (03CR) 10Jdlrobson: [C:04-1] bookmark: Fix click event not working [extensions/ReadingLists] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148939 (https://phabricator.wikimedia.org/T394736) (owner: 10Jdlrobson) [04:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:12:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[04:13:06] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 139, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:14:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:19:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:27:37] 06SRE, 10Legalpad, 10Phabricator: Allow aklapper to view/edit L3 - https://phabricator.wikimedia.org/T394966 (10Aklapper) 03NEW p:05Triage→03Low [04:46:56] (03PS1) 10Marostegui: mariadb: Productionize pc2018 [puppet] - 10https://gerrit.wikimedia.org/r/1148998 (https://phabricator.wikimedia.org/T394260) [04:47:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1018.eqiad.wmnet with reason: Maintenance [04:47:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2018.codfw.wmnet with reason: Maintenance [04:49:07] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize pc2018 [puppet] - 10https://gerrit.wikimedia.org/r/1148998 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [05:06:21] (03PS1) 10Marostegui: instances.yaml: Add pc1018,pc2018 [puppet] - 10https://gerrit.wikimedia.org/r/1148999 (https://phabricator.wikimedia.org/T394260) [05:06:42] FIRING: JobUnavailable: 
Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:08:08] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add pc1018,pc2018 [puppet] - 10https://gerrit.wikimedia.org/r/1148999 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [05:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:31] (03PS1) 10Marostegui: instance.schema: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1149000 (https://phabricator.wikimedia.org/T394260) [05:16:51] (03CR) 10Marostegui: [C:03+2] instance.schema: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1149000 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [05:26:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc1018 and pc2018 to dbctl depooled T394260', diff saved to https://phabricator.wikimedia.org/P76372 and previous config saved to /var/cache/conftool/dbconfig/20250522-052649-marostegui.json [05:26:53] T394260: Productionize pc8 - https://phabricator.wikimedia.org/T394260 [05:28:45] (03CR) 10Arnaudb: [C:03+1] lists: include nftables throttling profile [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn) [05:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P76373 and previous config saved to /var/cache/conftool/dbconfig/20250522-053938-marostegui.json [05:39:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance [05:47:31] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76374 and previous config saved to /var/cache/conftool/dbconfig/20250522-054730-root.json
[05:51:12] (03PS1) 10Marostegui: pc1018,pc2018: Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/1149003
[05:52:38] (03CR) 10Marostegui: [C:03+2] pc1018,pc2018: Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/1149003 (owner: 10Marostegui)
[05:58:14] 06SRE: Training checklist runbook review (Sprint Week 2023-03) - https://phabricator.wikimedia.org/T332391#10846619 (10LSobanski) 05Open→03Declined
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T0600)
[06:00:04] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T0600).
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:02:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76375 and previous config saved to /var/cache/conftool/dbconfig/20250522-060236-root.json [06:05:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1188.eqiad.wmnet with reason: Maintenance [06:05:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1188', diff saved to https://phabricator.wikimedia.org/P76376 and previous config saved to /var/cache/conftool/dbconfig/20250522-060556-marostegui.json [06:07:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76377 and previous config saved to /var/cache/conftool/dbconfig/20250522-060745-root.json [06:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:17:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76378 and previous config saved to /var/cache/conftool/dbconfig/20250522-061742-root.json [06:22:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76379 and previous config saved to /var/cache/conftool/dbconfig/20250522-062251-root.json [06:23:32] FIRING: [6x] ConfdResourceFailed: confd resource 
_srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:26:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [06:28:40] (03PS1) 10Brouberol: airflow: relax timeout after which DAGs are deleted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) [06:31:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [06:31:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [06:32:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76380 and previous config saved to /var/cache/conftool/dbconfig/20250522-063248-root.json [06:37:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76381 and previous config saved to /var/cache/conftool/dbconfig/20250522-063756-root.json [06:39:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) (owner: 10Tchanders) [06:40:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [06:43:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [06:43:38] (03PS1) 10Brouberol: deployment_server: stop pinning airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1149206 (https://phabricator.wikimedia.org/T393998) [06:44:47] (03CR) 10CI reject: [V:04-1] 
deployment_server: stop pinning airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1149206 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [06:47:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76382 and previous config saved to /var/cache/conftool/dbconfig/20250522-064754-root.json [06:49:34] (03PS2) 10Brouberol: deployment_server: stop pinning airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1149206 (https://phabricator.wikimedia.org/T393998) [06:50:40] (03CR) 10CI reject: [V:04-1] deployment_server: stop pinning airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1149206 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [06:51:21] (03PS3) 10Brouberol: deployment_server: stop pinning airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1149206 (https://phabricator.wikimedia.org/T393998) [06:51:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [06:53:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76383 and previous config saved to /var/cache/conftool/dbconfig/20250522-065302-root.json [06:53:37] (03CR) 10Slyngshede: [C:03+2] VueJS Permissions App [software/bitu] - 10https://gerrit.wikimedia.org/r/1140498 (owner: 10Slyngshede) [06:55:28] (03PS2) 10Brouberol: airflow: relax timeout after which DAGs are deleted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) [06:55:53] (03CR) 10Brouberol: [C:03+2] deployment_server: stop pinning airflow-devenv [puppet] - 10https://gerrit.wikimedia.org/r/1149206 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [06:55:57] (03Merged) 10jenkins-bot: VueJS Permissions App [software/bitu] - 10https://gerrit.wikimedia.org/r/1140498 (owner: 10Slyngshede) 
[06:56:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T0700). [07:00:05] Tran: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] 👋 [07:00:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, I don't believe I've used these metrics for anything. We can possibly also remove pluginsync= and report= from the puppetised " [puppet] - 10https://gerrit.wikimedia.org/r/1148825 (owner: 10Majavah) [07:02:46] (03PS1) 10Slyngshede: Permissions: Rebuild VueJS app [software/bitu] - 10https://gerrit.wikimedia.org/r/1149209 [07:03:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76384 and previous config saved to /var/cache/conftool/dbconfig/20250522-070259-root.json [07:05:02] If no deployer is around, I should be able to deploy and QA my config change myself [07:05:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [07:06:35] (03CR) 10Slyngshede: [C:03+2] Permissions: Rebuild VueJS app [software/bitu] - 10https://gerrit.wikimedia.org/r/1149209 (owner: 10Slyngshede) [07:07:38] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.10 [07:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76385 and previous config saved to /var/cache/conftool/dbconfig/20250522-070807-root.json [07:08:56] (03Merged) 10jenkins-bot: Permissions: Rebuild VueJS app [software/bitu] - 
10https://gerrit.wikimedia.org/r/1149209 (owner: 10Slyngshede) [07:09:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [07:09:21] (03PS2) 10Vgutierrez: cache:haproxy: Pass X-Analytics when X-Wikimedia-Debug is active [puppet] - 10https://gerrit.wikimedia.org/r/1129774 (https://phabricator.wikimedia.org/T305794) [07:13:18] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1129774 (https://phabricator.wikimedia.org/T305794) (owner: 10Vgutierrez) [07:13:42] (03CR) 10Vgutierrez: [C:03+2] cache:haproxy: Pass X-Analytics when X-Wikimedia-Debug is active [puppet] - 10https://gerrit.wikimedia.org/r/1129774 (https://phabricator.wikimedia.org/T305794) (owner: 10Vgutierrez) [07:13:44] (03PS2) 10Elukey: kubernetes: add maps-test codfw as external service [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) [07:14:55] 06SRE, 06Traffic, 10WikimediaDebug, 07Developer Productivity, 13Patch-For-Review: Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794#10846779 (10Vgutierrez) 05In progress→03Resolved CR got merged now, give it the usual ~30 minutes for pupp... 
[07:15:05] (03CR) 10Elukey: kubernetes: add maps-test codfw as external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [07:15:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.10 [07:18:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [07:18:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76386 and previous config saved to /var/cache/conftool/dbconfig/20250522-071805-root.json [07:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:18:40] (03PS1) 10Elukey: conftool-data: remove ml-serve1002 from its pool [puppet] - 10https://gerrit.wikimedia.org/r/1149212 (https://phabricator.wikimedia.org/T387854) [07:18:41] (03PS1) 10Elukey: role::ml_k8s::worker: set ml-serve1002 for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1149213 (https://phabricator.wikimedia.org/T387854) [07:18:44] (03PS1) 10Elukey: conftool-data: re-add ml-serve1002 to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1149214 (https://phabricator.wikimedia.org/T387854) [07:20:03] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10846788 (10MatthewVernon) a:05MatthewVernon→03None [07:20:03] Okay I'm going to deploy 1148827. 
I can't QA it right now as I don't have access to the userrights right on prod but it'll be QAed as part of our migration for this user group and isn't going to degrade anything until then if it fails QA at that point.
[07:20:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) (owner: 10Tchanders)
[07:21:03] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5651/" [puppet] - 10https://gerrit.wikimedia.org/r/1149213 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[07:21:32] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to 17.10
[07:22:05] (03Merged) 10jenkins-bot: Temp accounts: Set group requirements for IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) (owner: 10Tchanders)
[07:22:43] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1148827|Temp accounts: Set group requirements for IP reveal group (T393615)]]
[07:22:47] T393615: Impose technical restrictions on granting the `temporary-account-viewer` group - https://phabricator.wikimedia.org/T393615
[07:23:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76387 and previous config saved to /var/cache/conftool/dbconfig/20250522-072313-root.json
[07:23:25] (03CR) 10Arnaudb: [C:03+1] "looks good to me!"
[puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [07:25:02] !log stran@deploy1003 tchanders, stran: Backport for [[gerrit:1148827|Temp accounts: Set group requirements for IP reveal group (T393615)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:25:15] (03CR) 10Arnaudb: [C:03+1] role: delete requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [07:25:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [07:25:52] !log stran@deploy1003 tchanders, stran: Continuing with sync [07:26:09] (03CR) 10Tiziano Fogli: [C:03+1] grafana: Disable dashboard sync to ugprade Grafana version [puppet] - 10https://gerrit.wikimedia.org/r/1148915 (https://phabricator.wikimedia.org/T394470) (owner: 10Andrea Denisse) [07:26:12] (03CR) 10Majavah: [C:03+2] puppet_statsd: Uninstall now that statsd is read-only [puppet] - 10https://gerrit.wikimedia.org/r/1148825 (owner: 10Majavah) [07:26:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [07:29:49] (03PS1) 10Elukey: conftool-data: remove elastic1058 to fix confd sanity checks [puppet] - 10https://gerrit.wikimedia.org/r/1149289 [07:31:37] !log cleanup (wdqs-internal lvs teardown) - `elukey@config-master1001:/var/run/confd-template$ sudo rm _srv_config-master_pybal_codfw_wdqs-internal.err _srv_config-master_pybal_eqiad_wdqs-internal.err` [07:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:42] cc: ryankemper: --^ [07:32:45] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148827|Temp accounts: Set group requirements for IP reveal group (T393615)]] (duration: 10m 01s) [07:32:45] (03CR) 10Elukey: [C:03+2] conftool-data: remove 
elastic1058 to fix confd sanity checks [puppet] - 10https://gerrit.wikimedia.org/r/1149289 (owner: 10Elukey) [07:32:49] T393615: Impose technical restrictions on granting the `temporary-account-viewer` group - https://phabricator.wikimedia.org/T393615 [07:33:01] (03CR) 10Jelto: [C:04-1] "my first approach was also just coping the Gerrit code for the abusers. But I'd slightly favor a more generic approach. This is the reason" [puppet] - 10https://gerrit.wikimedia.org/r/1148433 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn) [07:34:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [07:35:49] !log cleanup - `elukey@config-master2001:/var/run/confd-template$ sudo rm _srv_config-master_pybal_codfw_wdqs-internal.err _srv_config-master_pybal_eqiad_wdqs-internal.err` [07:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:22] cc: bking --^ [07:36:49] all confd alerts cleared :) [07:38:32] RESOLVED: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:40:49] (03CR) 10Volans: "We're adding support for multiple remotes for homer's private and those needs to be identified, so `origin` there is not an option." [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [07:40:58] (03PS4) 10Volans: git::clone: set given remote name on initial cloning [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [07:42:18] (03PS3) 10Volans: setup.py: add support up to Python 3.13 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 [07:42:34] (03CR) 10Jelto: [V:03+1] "For gerrit we are blocking a lot of IPs. I'm not sure if we want to do that by default on all ouf our machines which use throttling. 
In th" [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519) (owner: 10Jelto) [07:42:42] (03CR) 10Volans: setup.py: add support up to Python 3.13 (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 (owner: 10Volans) [07:42:58] (03CR) 10Arnaudb: [C:03+2] gerrit/nftables_throttling: make abusers more generic [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519) (owner: 10Jelto) [07:43:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [07:43:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 (owner: 10Volans) [07:43:58] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1002.eqiad.wmnet [07:44:38] (03CR) 10Muehlenhoff: [C:03+2] snapshot: Remove now unused Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1148751 (https://phabricator.wikimedia.org/T394647) (owner: 10Muehlenhoff) [07:44:53] (03CR) 10Klausman: [C:03+2] conftool-data: remove ml-serve1002 from its pool [puppet] - 10https://gerrit.wikimedia.org/r/1149212 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:44:56] (03CR) 10Klausman: [C:03+1] role::ml_k8s::worker: set ml-serve1002 for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1149213 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:45:00] (03CR) 10Klausman: [C:03+1] conftool-data: re-add ml-serve1002 to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1149214 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:45:15] (03CR) 10Volans: homer: make private repo support multiple peers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [07:46:20] (03CR) 10Volans: [C:03+2] setup.py: add 
support up to Python 3.13 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 (owner: 10Volans) [07:47:54] (03CR) 10Muehlenhoff: [C:03+2] Add a variant of the test role which is Kerberos-enabled [puppet] - 10https://gerrit.wikimedia.org/r/1148360 (owner: 10Muehlenhoff) [07:48:15] (03Merged) 10jenkins-bot: setup.py: add support up to Python 3.13 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 (owner: 10Volans) [07:49:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1002.eqiad.wmnet [07:51:17] (03PS1) 10Slyngshede: Permissions: Hide the request button for existing permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1149321 [07:51:44] (03PS1) 10Muehlenhoff: Switch testvm2006 to test_krb role [puppet] - 10https://gerrit.wikimedia.org/r/1149322 [07:53:57] (03PS1) 10Hashar: gerrit: ban old Internet Explorer [puppet] - 10https://gerrit.wikimedia.org/r/1149324 (https://phabricator.wikimedia.org/T392467) [07:54:48] (03CR) 10Slyngshede: [C:03+2] Permissions: Hide the request button for existing permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1149321 (owner: 10Slyngshede) [07:54:59] (03CR) 10JMeybohm: [C:03+1] kubernetes: add maps-test codfw as external service [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [07:55:20] (03PS1) 10Arnaudb: nftables: add abuser to lists abuser list [puppet] - 10https://gerrit.wikimedia.org/r/1149323 (https://phabricator.wikimedia.org/T394519) [07:55:33] (03CR) 10Elukey: [V:03+1 C:03+2] role::ml_k8s::worker: set ml-serve1002 for containerd [puppet] - 10https://gerrit.wikimedia.org/r/1149213 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:56:27] (03CR) 10Effie Mouzeli: memcached: add option to switch to the performance cpu governor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148900 
(https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [07:56:28] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [07:56:47] (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1149324 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [07:57:36] (03Merged) 10jenkins-bot: Permissions: Hide the request button for existing permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1149321 (owner: 10Slyngshede) [07:59:19] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5652/co" [puppet] - 10https://gerrit.wikimedia.org/r/1149323 (https://phabricator.wikimedia.org/T394519) (owner: 10Arnaudb) [08:00:05] andre and jnuche: OwO what's this, a deployment window?? MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T0800). nyaa~ [08:00:10] (03PS2) 10Volans: CHANGELOG: add changelogs for release v0.4.1 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148367 [08:00:10] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1002.eqiad.wmnet with OS bookworm [08:00:29] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve1002 [08:00:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve1002 [08:00:43] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1149323 (https://phabricator.wikimedia.org/T394519) (owner: 10Arnaudb) [08:02:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:02:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:03:13] (03PS6) 10Effie Mouzeli: memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) [08:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:41] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [08:03:42] (03CR) 10Arnaudb: [C:03+2] nftables: add abuser to lists abuser list [puppet] - 10https://gerrit.wikimedia.org/r/1149323 (https://phabricator.wikimedia.org/T394519) (owner: 10Arnaudb) [08:04:20] (03CR) 10CI reject: [V:04-1] memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [08:05:20] (03PS7) 10Effie Mouzeli: memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) [08:07:28] (03CR) 10Jelto: [C:04-1] "one comment regarding the regex. I just done a quick search in logstash. 
Maybe there are others Trident user agents then let me know and s" [puppet] - 10https://gerrit.wikimedia.org/r/1149324 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [08:08:11] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149327 (https://phabricator.wikimedia.org/T392172) [08:08:13] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149327 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot) [08:08:21] (03PS1) 10Majavah: P:toolforge::proxy: Enable IPv6 monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1149328 (https://phabricator.wikimedia.org/T211575) [08:09:13] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149327 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot) [08:12:14] (03PS3) 10Effie Mouzeli: mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881) [08:12:25] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [08:17:14] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [08:17:39] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Enable IPv6 monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1149328 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah) [08:18:59] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.2 refs T392172 [08:19:02] T392172: 1.45.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T392172 [08:19:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:20:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [08:22:20] (03CR) 10Muehlenhoff: [C:03+2] Switch testvm2006 to test_krb role [puppet] - 10https://gerrit.wikimedia.org/r/1149322 (owner: 10Muehlenhoff) [08:27:30] (03PS7) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [08:33:20] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [08:33:53] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [08:35:12] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [08:38:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1002.eqiad.wmnet with OS bookworm [08:38:25] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v0.4.1 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148367 (owner: 10Volans) [08:40:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 140, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:37] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.4.1 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148367 (owner: 10Volans) [08:40:54] (03CR) 10Hashar: gerrit: ban old Internet Explorer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149324 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [08:41:03] (03PS2) 10Hashar: gerrit: ban old Internet Explorer [puppet] - 10https://gerrit.wikimedia.org/r/1149324 (https://phabricator.wikimedia.org/T392467) [08:43:10] (03CR) 10Elukey: [C:03+2] conftool-data: re-add ml-serve1002 to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1149214 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:44:08] (03PS1) 10Jcrespo: dbbackups: Check for newly created x3 backups at icinga [puppet] - 10https://gerrit.wikimedia.org/r/1149329 (https://phabricator.wikimedia.org/T384274) [08:44:11] (03CR) 10Vgutierrez: haproxy: use maxmind lua bindings to lookup client ISP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [08:44:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:45:28] (03CR) 10Jcrespo: [C:03+2] dbbackups: Check for newly created x3 backups at icinga [puppet] - 10https://gerrit.wikimedia.org/r/1149329 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [08:46:28] (03CR) 10Vgutierrez: haproxy: use maxmind lua bindings to lookup client ISP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [08:47:29] (03CR) 10Jelto: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1149324 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [08:47:45] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1002.eqiad.wmnet [08:47:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1002.eqiad.wmnet [08:48:07] (03PS1) 10Volans: Upstream release v0.4.1 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1149331 [08:50:20] (03CR) 10Elukey: [C:03+2] kubernetes: add maps-test codfw as external service [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:51:06] (03CR) 10Giuseppe Lavagetto: [C:04-1] "A few minor notes but overall LGTM. We will need to tweak this in any case once we've deployed a first version to ensure we can sustain th" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [08:51:17] (03CR) 10Volans: [C:03+2] Upstream release v0.4.1 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1149331 (owner: 10Volans) [08:53:01] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10847120 (10Jelto) >>! In T378922#10843403, @MatthewVernon wrote: > @Jelto both buckets deleted. Thanks a lot for the help with `r... 
[08:53:38] (03Merged) 10jenkins-bot: Upstream release v0.4.1 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1149331 (owner: 10Volans) [08:53:41] (03PS3) 10Muehlenhoff: Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) [08:55:04] (03CR) 10Vgutierrez: haproxy: use maxmind lua bindings to lookup client ISP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:00:33] !log btullis@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [09:04:51] PROBLEM - dump of x3 in codfw on backupmon1001 is CRITICAL: We could not find any completed dump for x3 at codfw https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:04:51] PROBLEM - dump of x3 in eqiad on backupmon1001 is CRITICAL: We could not find any completed dump for x3 at eqiad https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:07:44] 06SRE, 06Infrastructure-Foundations, 10netops: Invesitgate requirement for 'session-mode auatomatic' on EVPN iBGP peerings - https://phabricator.wikimedia.org/T332295#10847194 (10cmooney) 05Open→03Declined Didn't get time to work on this, it's not doing any harm for now so closign. 
[09:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:27] (03CR) 10Jelto: [C:03+2] gerrit: ban old Internet Explorer [puppet] - 10https://gerrit.wikimedia.org/r/1149324 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [09:09:49] ACKNOWLEDGEMENT - dump of x3 in codfw on backupmon1001 is CRITICAL: We could not find any completed dump for x3 at codfw Jcrespo new backup setup https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:09:49] ACKNOWLEDGEMENT - dump of x3 in eqiad on backupmon1001 is CRITICAL: We could not find any completed dump for x3 at eqiad Jcrespo new backup setup https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:10:30] (03PS4) 10Klausman: hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 [09:10:30] (03CR) 10Klausman: "While I am sure this (or rather it's actual-pricate-repo equivalent) is necessary, I am not sure it is sufficient to make the private file" [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [09:21:08] !log btullis@cumin1002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:dse-k8s-worker [09:21:56] 06SRE, 06Data-Persistence, 06serviceops, 07Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907#10847269 (10Clement_Goubert) a:05Clement_Goubert→03None [09:22:15] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1002.eqiad.wmnet [09:29:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet [09:29:47] (03PS1) 10Muehlenhoff: Include profile::base::cuminunpriv in 
test_krb [puppet] - 10https://gerrit.wikimedia.org/r/1149335 (https://phabricator.wikimedia.org/T390863) [09:31:22] (03PS1) 10Clément Goubert: mw::maintenance: Remove timeout from continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1149336 (https://phabricator.wikimedia.org/T385799) [09:32:34] !log btullis@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [09:34:08] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade), 13Patch-For-Review: create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10847319 (10LSobanski) a:03Dzahn [09:36:28] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10847339 (10MatthewVernon) Ah, the bucket is gone from `eqiad`, but `codfw` is still catching up: ` root@moss-be2001:/# radosgw-adm... [09:40:09] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate requirement for 'session-mode auatomatic' on EVPN iBGP peerings - https://phabricator.wikimedia.org/T332295#10847371 (10Aklapper) [09:40:11] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: Remove timeout from continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1149336 (https://phabricator.wikimedia.org/T385799) (owner: 10Clément Goubert) [09:40:18] (03CR) 10Klausman: [C:03+2] admin_ng/LiftWing: add edit-check namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148803 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [09:40:46] (03PS9) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [09:40:48] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Remove timeout from continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1149336 
(https://phabricator.wikimedia.org/T385799) (owner: 10Clément Goubert) [09:42:39] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 4 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [09:47:07] (03Merged) 10jenkins-bot: admin_ng/LiftWing: add edit-check namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148803 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [09:48:13] (03CR) 10CI reject: [V:04-1] sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [09:49:54] (03PS8) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [09:50:05] (03CR) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:50:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc8 T394260', diff saved to https://phabricator.wikimedia.org/P76390 and previous config saved to /var/cache/conftool/dbconfig/20250522-095017-marostegui.json [09:50:21] T394260: Productionize pc8 - https://phabricator.wikimedia.org/T394260 [09:50:30] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[09:50:35] !log dbmaint codfw eqiad Pool pc8 new section T394260 [09:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:21] (03PS1) 10Marostegui: pc1018,pc2018: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1149337 (https://phabricator.wikimedia.org/T394260) [09:52:17] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:53:33] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:54:04] (03CR) 10Marostegui: [C:03+2] pc1018,pc2018: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1149337 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [09:54:45] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:55:22] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, after merging please add some docs to https://wikitech.wikimedia.org/wiki/IDM as well" [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 (owner: 10Slyngshede) [09:55:36] (03PS1) 10Jcrespo: dbbackups: Update backup ports for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1149338 (https://phabricator.wikimedia.org/T384274) [09:56:07] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for alertmanager-irc-relay [puppet] - 10https://gerrit.wikimedia.org/r/1145815 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:56:16] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:57:41] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10847417 (10MoritzMuehlenhoff) [09:58:00] (03PS2) 10Jcrespo: dbbackups: Update backup ports for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1149338 (https://phabricator.wikimedia.org/T384274) [09:58:45] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:58:56] 
!log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1000) [10:03:10] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1149339 [10:03:32] !log installing spamassassin bugfix updates from Bookworm point release [10:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:09] (03CR) 10Jcrespo: [C:03+2] dbbackups: Update backup ports for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1149338 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [10:04:57] (03PS1) 10Btullis: Airflow: Enable the LocalExecutor for the analytics_test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149340 (https://phabricator.wikimedia.org/T394398) [10:04:58] (03PS1) 10Btullis: Airflow: Allow the scheduler to reach out to Hadoop on analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149341 (https://phabricator.wikimedia.org/T394398) [10:04:59] (03PS1) 10Btullis: Airflow: increase resources to the analytics_test scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149342 (https://phabricator.wikimedia.org/T394398) [10:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:09:53] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate requirement for 'session-mode automatic' on EVPN iBGP peerings - https://phabricator.wikimedia.org/T332295#10847440 (10Aklapper) [10:11:31] jouncebot: nowandnext [10:11:31] For the next 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1000) [10:11:31] In 1 hour(s) and 48 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1200) [10:11:35] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate all remaining growthexperiments jobs [puppet] - 10https://gerrit.wikimedia.org/r/1148914 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:11:42] (03PS1) 10Ladsgroup: parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) [10:12:22] (03PS2) 10Ladsgroup: parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) [10:13:27] (03CR) 10CI reject: [V:04-1] parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:13:29] (03CR) 10Vgutierrez: [C:03+1] haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [10:13:34] (03PS3) 10Ladsgroup: parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) [10:14:05] !log btullis@cumin1002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:dse-k8s-worker [10:14:41] (03CR) 10CI reject: [V:04-1] parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:15:05] 06SRE, 06Project-Admins: Disable #acl*sre_team workboard and update its project description - https://phabricator.wikimedia.org/T394654#10847452 (10LSobanski) 05Open→03Resolved a:03LSobanski Done. [10:17:40] (03CR) 10Hnowlan: [C:03+1] "Thanks for taking care of this!" 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1149339 (owner: 10Muehlenhoff) [10:18:01] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1149339 (owner: 10Muehlenhoff) [10:18:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [10:18:28] (03PS4) 10Ladsgroup: parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) [10:19:18] (03PS1) 10Michael Große: stats(SuggestedEdits): avoid tracking negative tti durations [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149345 (https://phabricator.wikimedia.org/T394289) [10:19:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149345 (https://phabricator.wikimedia.org/T394289) (owner: 10Michael Große) [10:20:32] (03CR) 10CI reject: [V:04-1] parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:24:13] 06SRE, 06Project-Admins: Disable #acl*sre_team workboard and update its project description - https://phabricator.wikimedia.org/T394654#10847477 (10Aklapper) Thanks!! 
[10:24:36] (03PS2) 10Michael Große: stats(SuggestedEdits): avoid tracking negative tti durations [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149345 (https://phabricator.wikimedia.org/T394289) [10:27:15] (03PS5) 10Ladsgroup: parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) [10:29:49] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1004.eqiad.wmnet [10:31:12] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:31:52] !log installing imagemagick security updates [10:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:45] (03PS1) 10Slyngshede: Permissions: Load loading spinner to permission page [software/bitu] - 10https://gerrit.wikimedia.org/r/1149347 [10:33:30] (03CR) 10Muehlenhoff: [C:03+2] Include profile::base::cuminunpriv in test_krb [puppet] - 10https://gerrit.wikimedia.org/r/1149335 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [10:36:14] (03CR) 10Ladsgroup: "NOOP: https://puppet-compiler.wmflabs.org/output/1149344/6494/" [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:36:34] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet [10:37:14] (03CR) 10Elukey: hiera: Add pseudosecrets for MT Thanos-Swift access (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [10:37:41] (03PS7) 10Clément Goubert: mw:maintenance: Adapt generatecaptcha for mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1148861 (https://phabricator.wikimedia.org/T388531) [10:37:53] (03PS2) 10Clément Goubert: mediawiki: Add fancycaptcha wordlists to mw-cron [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1149346 (https://phabricator.wikimedia.org/T388531) [10:37:53] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149348 [10:39:08] (03CR) 10Slyngshede: [C:03+2] Permissions: Load loading spinner to permission page [software/bitu] - 10https://gerrit.wikimedia.org/r/1149347 (owner: 10Slyngshede) [10:42:12] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2a00:1188:5:e::4) [10:43:43] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [10:44:21] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1005.eqiad.wmnet [10:44:34] (03CR) 10Muehlenhoff: [C:03+1] Permissions: Load loading spinner to permission page [software/bitu] - 10https://gerrit.wikimedia.org/r/1149347 (owner: 10Slyngshede) [10:45:29] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10847542 (10MoritzMuehlenhoff) I created a new keytab for testvm2006 on krb1002 and everything worked fine. 
[10:45:30] (03CR) 10Alexandros Kosiaris: [C:03+1] parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:46:29] (03PS10) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [10:46:32] (03PS6) 10Ladsgroup: parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) [10:46:38] (03CR) 10Ladsgroup: [V:03+2 C:03+2] parsercachepurging: Use for loop [puppet] - 10https://gerrit.wikimedia.org/r/1149344 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:46:43] jelto@cumin1002 upgrade (PID 216775) is awaiting input [10:47:23] (03CR) 10Slyngshede: [C:03+2] Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 (owner: 10Slyngshede) [10:47:57] (03Merged) 10jenkins-bot: Permissions: Load loading spinner to permission page [software/bitu] - 10https://gerrit.wikimedia.org/r/1149347 (owner: 10Slyngshede) [10:50:41] (03PS1) 10Ladsgroup: parsercachepurging: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1149353 (https://phabricator.wikimedia.org/T394260) [10:51:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet [10:51:58] (03Merged) 10jenkins-bot: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 (owner: 10Slyngshede) [10:52:18] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 88.57 ms [10:52:38] (03PS2) 10Ladsgroup: parsercachepurging: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1149353 (https://phabricator.wikimedia.org/T394260) [10:53:48] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1006.eqiad.wmnet [10:54:10] (03CR) 10Joal: 
[C:03+1] Airflow: increase resources to the analytics_test scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149342 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [10:54:35] (03CR) 10Joal: [C:03+1] Airflow: Allow the scheduler to reach out to Hadoop on analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149341 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [10:54:51] (03CR) 10Joal: [C:03+1] Airflow: Enable the LocalExecutor for the analytics_test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149340 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [10:56:10] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149353 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [10:56:18] (03CR) 10Hnowlan: [C:03+1] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149348 (owner: 10Muehlenhoff) [11:00:14] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/output/1149353/6496/deploy1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1149353 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [11:00:33] (03CR) 10Slyngshede: [C:03+2] "See: https://wikitech.wikimedia.org/wiki/IDM#User_is_not_getting_signup_email" [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 (owner: 10Slyngshede) [11:01:02] (03CR) 10Marostegui: "Let's test on db2186 or db2187" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [11:01:15] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1006.eqiad.wmnet [11:03:32] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2a00:1188:5:e::4) [11:03:47] (03PS8) 10Clément Goubert: mw:maintenance: Adapt generatecaptcha for mw-cron [puppet] - 
10https://gerrit.wikimedia.org/r/1148861 (https://phabricator.wikimedia.org/T388531) [11:05:40] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:06:05] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:06:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to 17.10 [11:06:55] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:55] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1007.eqiad.wmnet [11:11:55] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:25] (03PS1) 10Marostegui: es1036: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1149356 (https://phabricator.wikimedia.org/T394469) [11:13:38] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 89.03 ms [11:13:52] (03CR) 10Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [11:14:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1036 T394469', diff saved to https://phabricator.wikimedia.org/P76391 and previous config saved to /var/cache/conftool/dbconfig/20250522-111422-marostegui.json [11:14:27] T394469: Migrate es6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T394469 
[11:14:50] (03CR) 10Marostegui: [C:03+2] es1036: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1149356 (https://phabricator.wikimedia.org/T394469) (owner: 10Marostegui) [11:14:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1036.eqiad.wmnet with reason: Maintenance [11:15:13] !log Migrate es1036 es6 eqiad dbmaint to MariaDB 10.11 T394469 [11:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:24] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149348 (owner: 10Muehlenhoff) [11:15:45] !log uploaded debmonitor-client_0.4.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia,trixie-wikimedia [11:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] (03CR) 10Marostegui: [C:03+1] parsercachepurging: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1149353 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [11:16:41] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [11:16:49] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:17:33] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10847659 (10MoritzMuehlenhoff) [11:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:19:13] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1007.eqiad.wmnet [11:19:14] (03PS3) 10Ladsgroup: parsercachepurging: Add pc8 [puppet] - 
10https://gerrit.wikimedia.org/r/1149353 (https://phabricator.wikimedia.org/T394260) [11:19:20] (03CR) 10Ladsgroup: [V:03+2 C:03+2] parsercachepurging: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1149353 (https://phabricator.wikimedia.org/T394260) (owner: 10Ladsgroup) [11:21:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76392 and previous config saved to /var/cache/conftool/dbconfig/20250522-112152-root.json [11:22:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1183 T394507', diff saved to https://phabricator.wikimedia.org/P76393 and previous config saved to /var/cache/conftool/dbconfig/20250522-112245-marostegui.json [11:22:49] T394507: decommission db1183 - https://phabricator.wikimedia.org/T394507 [11:24:00] (03PS1) 10Marostegui: db1183: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1149358 (https://phabricator.wikimedia.org/T394507) [11:26:31] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:27:36] (03CR) 10Marostegui: [C:03+2] db1183: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1149358 (https://phabricator.wikimedia.org/T394507) (owner: 10Marostegui) [11:28:10] (03CR) 10Hnowlan: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [11:28:31] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate db_lag_stats_reporter to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [11:28:44] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149359 [11:29:22] (03PS3) 10Clément Goubert: mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149351 (https://phabricator.wikimedia.org/T388531) [11:29:25] !log jmm@deploy1003 
helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:31:10] (03CR) 10Muehlenhoff: [C:03+2] Remove krb1001 from list of KDCs [puppet] - 10https://gerrit.wikimedia.org/r/1145884 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [11:32:14] !log ladsgroup@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:32:29] (03PS1) 10Gkyziridis: ml-services: edit-check model deployment on prod under edit-check ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149361 (https://phabricator.wikimedia.org/T394779) [11:33:17] !log ladsgroup@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:36:58] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM, nice work!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149361 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [11:36:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76394 and previous config saved to /var/cache/conftool/dbconfig/20250522-113657-root.json [11:37:21] (03PS1) 10Jelto: sre.gitlab.upgrade: wait some time before deleting downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) [11:37:25] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:37:53] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:41:06] (03CR) 10Gkyziridis: [C:03+2] ml-services: edit-check model deployment on prod under edit-check ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149361 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [11:44:17] (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade: wait some time before deleting downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) (owner: 10Jelto) [11:45:05] (03PS2) 10Jelto: sre.gitlab.upgrade: wait some time before deleting 
downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) [11:45:32] (03CR) 10Arnaudb: [C:03+1] "thanks for the patch, I've added a comment about the future, but LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) (owner: 10Jelto) [11:47:08] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:49:47] (03PS1) 10Clément Goubert: mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149366 (https://phabricator.wikimedia.org/T388538) [11:50:41] (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade: wait some time before deleting downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) (owner: 10Jelto) [11:50:59] (03PS1) 10Hnowlan: mw::maintenance: migrate checkuser and securepoll jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149367 (https://phabricator.wikimedia.org/T388542) [11:51:27] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:51:49] (03PS2) 10Hnowlan: mw::maintenance: migrate checkuser and securepoll jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149367 (https://phabricator.wikimedia.org/T388542) [11:51:57] (03PS3) 10Jelto: sre.gitlab.upgrade: wait some time before deleting downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) [11:52:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76398 and previous config saved to /var/cache/conftool/dbconfig/20250522-115203-root.json [12:00:02] 06SRE, 06Infrastructure-Foundations, 07LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699#10847811 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [12:00:05] Deploy window 
Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1200) [12:00:23] 07Puppet, 10Infrastructure Security, 06Infrastructure-Foundations: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162#10847812 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [12:00:33] 07Puppet, 10Infrastructure Security, 06Infrastructure-Foundations: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162#10847816 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [12:01:45] (03CR) 10Hnowlan: [C:03+1] mw:maintenance: Adapt generatecaptcha for mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1148861 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [12:03:30] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: migrate checkuser and securepoll jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149367 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [12:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:52] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1008.eqiad.wmnet [12:05:12] (03PS1) 10Muehlenhoff: Remove unused option to enable host-based auth [puppet] - 10https://gerrit.wikimedia.org/r/1149371 (https://phabricator.wikimedia.org/T393762) [12:07:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76400 and previous config saved to /var/cache/conftool/dbconfig/20250522-120709-root.json [12:07:24] (03CR) 10Clément Goubert: [C:03+2] mw:maintenance: Adapt generatecaptcha for mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1148861 
(https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [12:09:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149371 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:09:13] (03PS1) 10Klausman: hiera: add k8s deploy config for editcheck [puppet] - 10https://gerrit.wikimedia.org/r/1149369 (https://phabricator.wikimedia.org/T394779) [12:12:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1008.eqiad.wmnet [12:13:38] (03PS4) 10Clément Goubert: mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149351 (https://phabricator.wikimedia.org/T388531) [12:13:52] (03CR) 10CI reject: [V:04-1] mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149351 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [12:13:55] (03PS3) 10Clément Goubert: mw::maintenance: Migrate cirrus_build_completion_indices to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149368 (https://phabricator.wikimedia.org/T388538) [12:14:32] !log installing nodejs security updates [12:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:46] (03PS5) 10Clément Goubert: mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149351 (https://phabricator.wikimedia.org/T388531) [12:15:12] (03CR) 10Gkyziridis: "Thank you for the fast patch on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1149369 (https://phabricator.wikimedia.org/T394779) (owner: 10Klausman) [12:17:17] (03PS2) 10Klausman: hiera: add k8s deploy config for editcheck [puppet] - 10https://gerrit.wikimedia.org/r/1149369 (https://phabricator.wikimedia.org/T394779) [12:17:21] (03CR) 10Klausman: hiera: add k8s deploy config for editcheck (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149369 (https://phabricator.wikimedia.org/T394779) (owner: 10Klausman) [12:18:28] I am going to upgrade Gerrit (3.10.4 to 3.10.6) so that is rather minor [12:19:41] (03CR) 10Hashar: [C:03+2] Gerrit 3.10.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148868 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar) [12:20:21] (03Merged) 10jenkins-bot: Gerrit 3.10.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148868 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar) [12:21:15] (03CR) 10Federico Ceratto: "I updated the script to filter by hostname as needed for T394884" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:21:25] (03PS1) 10Gkyziridis: api-gateway: switch the api gw to edit-check prod model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1149375 (https://phabricator.wikimedia.org/T394779) [12:21:33] !log hashar@deploy1003 Started deploy [gerrit/gerrit@facd6ee]: Gerrit to 3.10.6 on gerrit2002 - T390666 [12:21:37] T390666: Upgrade to Gerrit 3.10.6 - https://phabricator.wikimedia.org/T390666 [12:21:43] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@facd6ee]: Gerrit to 3.10.6 on gerrit2002 - T390666 (duration: 00m 10s) [12:22:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76404 and previous config saved to /var/cache/conftool/dbconfig/20250522-122215-root.json [12:22:19] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1149369 (https://phabricator.wikimedia.org/T394779) (owner: 10Klausman) [12:24:39] !log hashar@deploy1003 Started deploy [gerrit/gerrit@facd6ee]: Gerrit to 3.10.6 on gerrit1003 - T390666 [12:24:48] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@facd6ee]: Gerrit to 3.10.6 on gerrit1003 - T390666 (duration: 00m 09s) [12:25:05] (03CR) 10AOkoth: [C:03+1] sre.gitlab.upgrade: wait some time before deleting downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) (owner: 10Jelto) [12:25:32] !log Stopping Gerrit for upgrade [12:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:10] !log Gerrit is back and was upgraded from 3.10.4 to 3.10.6 | T390666 [12:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:14] T390666: Upgrade to Gerrit 3.10.6 - https://phabricator.wikimedia.org/T390666 [12:28:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:34:07] (03CR) 10Jelto: [C:03+2] gitlab: also exclude artifacts from partial backups [puppet] - 10https://gerrit.wikimedia.org/r/1148804 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [12:34:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: sync [12:34:45] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: sync [12:37:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76405 and previous config saved to /var/cache/conftool/dbconfig/20250522-123720-root.json [12:40:09] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: wait some time before deleting downtimes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) (owner: 10Jelto) [12:42:12] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[12:42:27] (03PS5) 10Klausman: hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 [12:42:46] (03CR) 10Klausman: hiera: Add pseudosecrets for MT Thanos-Swift access (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [12:45:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:45:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:46:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:46:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:47:17] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: wait some time before deleting downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/1149363 (https://phabricator.wikimedia.org/T395013) (owner: 10Jelto) [12:47:41] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:47:59] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1009.eqiad.wmnet [12:49:41] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:51:19] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [12:51:32] (03PS1) 10Kosta Harlan: ComputedUserImpactLookup: Use logging table for approximate created articles count [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 
10https://gerrit.wikimedia.org/r/1149394 (https://phabricator.wikimedia.org/T394785) [12:51:36] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99) [12:52:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149394 (https://phabricator.wikimedia.org/T394785) (owner: 10Kosta Harlan) [12:53:39] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:53:48] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1009.eqiad.wmnet [12:56:44] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10848007 (10Jelto) >>! In T378922#10847339, @MatthewVernon wrote: > Ah, the bucket is gone from `eqiad`, but `codfw` is still catch... [12:56:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:56:52] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:57:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:58:08] (03CR) 10Elukey: [C:03+1] hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1300). 
[13:00:05] isaranto, gmodena, stephanebisson, MichaelG_WMF, and mszabo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] o/ [13:00:16] o/ [13:00:21] o/ [13:00:21] o/ [13:00:26] o/ [13:00:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:00:50] is it ok if I deploy first? we are in a meeting with a couple of folks to do it together so that we can check other things as well [13:01:20] fine with me, no idea about the others [13:01:26] fine with me [13:01:46] sure [13:02:41] sure [13:02:42] starting then! [13:03:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by isaranto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [13:04:46] (03Merged) 10jenkins-bot: ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [13:04:59] !log isaranto@deploy1003 Started scap sync-world: Backport for [[gerrit:1144526|ores-extension: enable ores extention for rrla without the UI (T382171)]] [13:05:03] T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171 [13:05:06] (03PS4) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 [13:05:12] (03PS1) 10Muehlenhoff: Remove obsolete jobrunner cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1149398 (https://phabricator.wikimedia.org/T360636) [13:06:17] (03CR) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [13:06:37] 06SRE, 10SRE-swift-storage, 
10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10848027 (10jcrespo) I am working on setting up the dedicated gitlab/gerrit storage host, but at the moment //not yet on a specific... [13:07:01] !log isaranto@deploy1003 isaranto: Backport for [[gerrit:1144526|ores-extension: enable ores extention for rrla without the UI (T382171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:02] (03CR) 10Elukey: [C:03+1] Move Kartotherian/staging to the new Bookworm nodes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:09:43] (03CR) 10Klausman: [V:03+2 C:03+2] hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [13:10:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:10:17] !log isaranto@deploy1003 isaranto: Continuing with sync [13:11:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:14:05] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10848066 (10tappof) Hi @wiki_willy, based on the breaker alerts currently configured for the Sentry4 model, I’ve set up the sam... 
[13:14:29] !log installing Java 11 security updates [13:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:45] (03PS5) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) [13:16:39] (03CR) 10Volans: [C:03+1] "LGTM. Given the change in the downtime cookbook consider checking with o11y too as the owner of it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:17:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [13:17:16] !log isaranto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144526|ores-extension: enable ores extention for rrla without the UI (T382171)]] (duration: 12m 17s) [13:17:18] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [13:17:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [13:17:20] T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171 [13:20:27] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1064 to cirrussearch1064 [13:20:40] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:21:36] all good on our side! next deployer can go! 
[13:21:58] gmodena: mszabo stephanebisson MichaelG_WMF [13:22:11] (03CR) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:22:18] I can go next [13:22:28] we are just gonna run a maintenance script to backfill a table for the ores extension on idwiki [13:22:47] it shouldn't affect anything [13:23:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149394 (https://phabricator.wikimedia.org/T394785) (owner: 10Kosta Harlan) [13:23:33] Mine is a total no-op, code search confirms that config var is not used any more. There is nothing to test. It can be sync along with other things. [13:23:58] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1064 to cirrussearch1064 - bking@cumin2002" [13:24:18] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1064 to cirrussearch1064 - bking@cumin2002" [13:24:19] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:19] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1064 on all recursors [13:24:22] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1064 on all recursors [13:24:23] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1064 [13:24:47] (03PS1) 10Jgiannelos: pcs: Add missing headers for MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149403 [13:25:02] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1010.eqiad.wmnet [13:25:16] (03Merged) 
10jenkins-bot: ComputedUserImpactLookup: Use logging table for approximate created articles count [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149394 (https://phabricator.wikimedia.org/T394785) (owner: 10Kosta Harlan) [13:25:28] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1149394|ComputedUserImpactLookup: Use logging table for approximate created articles count (T394785)]] [13:25:32] T394785: Slow queries on finding articles created by a given user in GrowthExperiments - https://phabricator.wikimedia.org/T394785 [13:26:53] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1064 [13:27:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1064 to cirrussearch1064 [13:27:38] !log mszabo@deploy1003 mszabo, kharlan: Backport for [[gerrit:1149394|ComputedUserImpactLookup: Use logging table for approximate created articles count (T394785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:28:57] !log mszabo@deploy1003 mszabo, kharlan: Continuing with sync [13:30:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1006.eqiad.wmnet with OS bullseye [13:30:04] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1007.eqiad.wmnet with OS bullseye [13:30:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10848146 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1... 
[13:30:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10848145 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1006.eqiad.wmnet with OS bull... [13:30:32] mine fixes a bug that will be hard to test with mwdebug, but we should be able to see errors stopping once it has actually been rolled out [13:31:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1010.eqiad.wmnet [13:31:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1009.eqiad.wmnet with OS bullseye [13:31:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1008.eqiad.wmnet with OS bullseye [13:31:38] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10848154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1009.eqiad.wmnet with OS bull... [13:31:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10848155 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1008.eqiad.wmnet with OS bull... 
[13:31:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be1008.eqiad.wmnet with OS bullseye [13:32:01] (03CR) 10Majavah: [C:03+2] openstack: wikireplica_dns: Point x3 records to new VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [13:32:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10848156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host thanos-be1008.eqiad.wmnet with OS bullseye... [13:32:48] (03PS1) 10DDesouza: Design Research survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149404 (https://phabricator.wikimedia.org/T394315) [13:33:02] (03CR) 10Jforrester: [C:03+1] "Do you want to wait for the train, or just land this now? The default and the no-op values are the same?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [13:33:16] (03PS1) 10Vgutierrez: varnish: Allow setting WMF-Uniq cookie for WMCS domains [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) [13:34:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:35:39] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [13:36:00] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149394|ComputedUserImpactLookup: Use logging table for approximate created articles count (T394785)]] (duration: 10m 32s) [13:36:04] T394785: Slow queries on finding articles created by a given user in GrowthExperiments - https://phabricator.wikimedia.org/T394785 [13:36:24] MichaelG_WMF go ahead :) 
[13:36:40] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1011.eqiad.wmnet [13:36:53] I'm not a deployer 😳 [13:36:53] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [13:37:06] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitarium_restart (exit_code=97) [13:37:11] (though I guess maybe I should finally do that training...) [13:37:23] (03PS2) 10Vgutierrez: varnish: Allow setting WMF-Uniq cookie for WMCS domains [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) [13:37:31] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [13:37:45] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99) [13:38:09] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:38:24] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:38:33] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [13:38:54] (03PS3) 10Vgutierrez: varnish: Allow setting WMF-Uniq cookie for WMCS domains [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) [13:39:09] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99) [13:39:16] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [13:39:19] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99) [13:39:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [13:39:33] MichaelG_WMF I can deploy yours [13:39:39] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[13:39:42] just say the word [13:39:45] stephanebisson: awesome thanks! [13:40:09] (03PS4) 10Vgutierrez: varnish: Allow setting WMF-Uniq cookie for WMCS domains [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) [13:40:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149345 (https://phabricator.wikimedia.org/T394289) (owner: 10Michael Große) [13:40:40] I'm trying the new UI [13:40:48] (03PS1) 10Andrew Bogott: Add cloudcephosd200[567] to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1149406 (https://phabricator.wikimedia.org/T393614) [13:42:20] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1011.eqiad.wmnet [13:44:03] (03CR) 10Andrew Bogott: [C:03+2] Add cloudcephosd200[567] to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1149406 (https://phabricator.wikimedia.org/T393614) (owner: 10Andrew Bogott) [13:44:49] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10848217 (10Andrew) a:05Andrew→03None [13:45:01] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1064.eqiad.wmnet with OS bullseye [13:45:05] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1064 [13:45:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1064 [13:45:28] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727#10848224 (10Andrew) a:05Andrew→03None [13:46:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149405 
(https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:46:43] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:46:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1007.eqiad.wmnet with reason: host reimage [13:46:59] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1006.eqiad.wmnet with reason: host reimage [13:47:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1008.eqiad.wmnet with OS bullseye [13:47:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10848229 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1008.eqiad.wmnet with OS bull... [13:48:24] (03CR) 10Ssingh: [C:03+1] varnish: Allow setting WMF-Uniq cookie for WMCS domains [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:49:40] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10848232 (10Jhancock.wm) 05Open→03Resolved [13:49:45] stephanebisson holler when ready :). Mine is config change to EventStreamConfig [13:49:51] Spiderpig rocks! 
<3 [13:50:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [13:50:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1007.eqiad.wmnet with reason: host reimage [13:50:59] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1063 to cirrussearch1063 [13:51:12] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:52:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1009.eqiad.wmnet with reason: host reimage [13:52:51] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable ores extention UI in idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149407 (https://phabricator.wikimedia.org/T382171) [13:54:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1006.eqiad.wmnet with reason: host reimage [13:54:05] (03PS11) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [13:54:24] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1063 to cirrussearch1063 - bking@cumin2002" [13:54:42] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1063 to cirrussearch1063 - bking@cumin2002" [13:54:43] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:43] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1063 on all recursors [13:54:46] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1063 on all recursors [13:54:47] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1063 [13:55:22] !log fceratto@cumin1002 END (PASS) - Cookbook 
sre.mysql.sanitarium_restart (exit_code=0) [13:55:37] jouncebot next [13:55:38] In 1 hour(s) and 4 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1500) [13:56:30] That GrowthExp patch is taking a while to merge but there is nothing right after this window so we have some room [13:56:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1009.eqiad.wmnet with reason: host reimage [13:56:37] the window is almost closing. Can I still deploy my patch? [13:56:47] ah! [13:56:51] stephanebisson ack [13:57:31] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1063 [13:57:56] (03Merged) 10jenkins-bot: stats(SuggestedEdits): avoid tracking negative tti durations [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149345 (https://phabricator.wikimedia.org/T394289) (owner: 10Michael Große) [13:58:00] (03CR) 10Vgutierrez: [C:03+2] varnish: Allow setting WMF-Uniq cookie for WMCS domains [puppet] - 10https://gerrit.wikimedia.org/r/1149405 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:58:09] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1149345|stats(SuggestedEdits): avoid tracking negative tti durations (T394289 T394701)]] [13:58:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1063 to cirrussearch1063 [13:58:14] T394289: Invalid timing value for mediawiki_GrowthExperiments_suggested_edits_server_tti_seconds - https://phabricator.wikimedia.org/T394289 [13:58:14] T394701: Eventgate Error: '.error_context['serverDuration']' should be string, '.error_context['GEHomepageStartTime']' should be string' - https://phabricator.wikimedia.org/T394701 [13:58:45] (03CR) 10MSantos: [C:03+1] pcs: Add missing headers for MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149403 (owner: 
10Jgiannelos) [13:59:07] (03CR) 10DCausse: [C:03+1] mw::maintenance: Migrate cirrus_build_completion_indices to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149368 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [13:59:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1103.eqiad.wmnet with OS bullseye [14:00:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149407 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [14:00:07] (03PS2) 10Jgiannelos: pcs: Add missing headers for MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149403 [14:00:25] !log sbisson@deploy1003 sbisson, migr: Backport for [[gerrit:1149345|stats(SuggestedEdits): avoid tracking negative tti durations (T394289 T394701)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:00:31] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149359 (owner: 10PipelineBot) [14:00:34] * MichaelG_WMF looks [14:02:03] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149359 (owner: 10PipelineBot) [14:02:04] stephanebisson: I'm not seeing any errors, so we're probably good. 
[14:02:10] MichaelG_WMF ok [14:02:15] !log sbisson@deploy1003 sbisson, migr: Continuing with sync [14:02:25] (03CR) 10Federico Ceratto: "Restart tested on db2186" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [14:02:27] (03CR) 10Jgiannelos: [C:03+2] pcs: Add missing headers for MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149403 (owner: 10Jgiannelos) [14:02:57] (03CR) 10Kgraessle: [C:03+1] ores-extension: enable ores extention UI in idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149407 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [14:03:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1064.eqiad.wmnet with reason: host reimage [14:03:07] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [14:03:27] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1008.eqiad.wmnet with reason: host reimage [14:03:41] (03CR) 10Kgraessle: [C:03+1] Add AutoModerator to eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147881 (https://phabricator.wikimedia.org/T391248) (owner: 10Scardenasmolinar) [14:04:08] (03Merged) 10jenkins-bot: pcs: Add missing headers for MW requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149403 (owner: 10Jgiannelos) [14:04:29] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [14:04:54] !log 
jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:05:03] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:05:15] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:05:42] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:06:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1064.eqiad.wmnet with reason: host reimage [14:07:05] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [14:07:12] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [14:07:23] (03CR) 10DCausse: [C:03+1] Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) (owner: 10SD0001) [14:08:18] (03CR) 10Hnowlan: [C:03+1] mediawiki: Add fancycaptcha wordlists to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149346 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [14:08:20] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:35] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149351 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [14:09:06] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[14:09:14] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149345|stats(SuggestedEdits): avoid tracking negative tti durations (T394289 T394701)]] (duration: 11m 04s) [14:09:16] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate checkuser and securepoll jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149367 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [14:09:18] T394289: Invalid timing value for mediawiki_GrowthExperiments_suggested_edits_server_tti_seconds - https://phabricator.wikimedia.org/T394289 [14:09:19] T394701: Eventgate Error: '.error_context['serverDuration']' should be string, '.error_context['GEHomepageStartTime']' should be string' - https://phabricator.wikimedia.org/T394701 [14:09:32] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [14:09:34] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:09:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1008.eqiad.wmnet with reason: host reimage [14:09:44] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [14:09:59] gmodena do you want to go ahead with your patch? 
[14:10:06] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1063.eqiad.wmnet with OS bullseye [14:10:10] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1063 [14:10:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1063 [14:10:19] stephanebisson sure [14:10:36] I'll do mine after [14:11:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gmodena@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena) [14:12:28] (03Merged) 10jenkins-bot: EventStreamConfig: add staging page_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena) [14:12:42] !log gmodena@deploy1003 Started scap sync-world: Backport for [[gerrit:1148818|EventStreamConfig: add staging page_change stream (T394899)]] [14:12:46] T394899: Testing the domain event refactoring with production data - https://phabricator.wikimedia.org/T394899 [14:12:52] stephanebisson thanks! [14:13:00] woah! Spiderpig <3 [14:13:31] right? ui haters gonna hate :D [14:14:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [14:14:36] !log gmodena@deploy1003 gmodena: Backport for [[gerrit:1148818|EventStreamConfig: add staging page_change stream (T394899)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:14:36] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [14:14:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [14:15:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-videoscaler releases routed via main (k8s) 1.536s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-videoscaler&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [14:15:17] (03PS1) 10Jgiannelos: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 [14:15:22] checking [14:15:42] (03CR) 10Arlolra: "No need to wait for the train, we can land it now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [14:16:09] (03PS2) 10Jgiannelos: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 [14:18:55] bking@cumin2002 reimage (PID 4087988) is awaiting input [14:19:02] (03CR) 10Marostegui: "> Let's test on db2186 or db2187" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [14:20:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-videoscaler releases routed via main (k8s) 1.536s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-videoscaler&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExc [14:21:39] wat [14:21:47] !log gmodena@deploy1003 gmodena: Continuing with sync [14:22:01] (03PS3) 
10Jgiannelos: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 [14:23:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.265s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:24:36] is the deploying causing latency on parsoid? [14:26:02] nothing too alarming if it's just a spike [14:26:37] (03PS4) 10Jgiannelos: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 [14:27:07] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1063.eqiad.wmnet with OS bullseye [14:28:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.265s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:28:16] (03PS1) 10DDesouza: miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149415 (https://phabricator.wikimedia.org/T344471) [14:28:40] !log gmodena@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148818|EventStreamConfig: add staging page_change stream (T394899)]] (duration: 15m 58s) [14:28:44] T394899: Testing the domain event refactoring with production data - https://phabricator.wikimedia.org/T394899 [14:30:02] stephanebisson i'm done [14:30:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [14:30:20] gmodena thanks [14:30:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [14:31:04] (03Merged) 10jenkins-bot: Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [14:31:20] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1146020|Remove unused wgContentTranslationEnableSectionTranslation (T389970)]] [14:31:23] T389970: Remove access to old dashboard and related code - https://phabricator.wikimedia.org/T389970 [14:31:47] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149415 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [14:32:07] (03CR) 10Federico Ceratto: "😂😂 Fixed now." [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [14:33:16] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1146020|Remove unused wgContentTranslationEnableSectionTranslation (T389970)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:33:40] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149415 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [14:33:58] !log sbisson@deploy1003 sbisson: Continuing with sync [14:34:29] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [14:34:35] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [14:34:49] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:34:50] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:35:10] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:35:11] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:35:38] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:36:16] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [14:36:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1064.eqiad.wmnet with OS bullseye [14:37:57] (03PS5) 10Jgiannelos: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 (https://phabricator.wikimedia.org/T394896) [14:38:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, 
wikikube-worker1101.eqiad.wmnet, wikikube-worker1121.eqiad. [14:38:17] ikikube-worker1116.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1025.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worke [14:38:17] iad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:38:32] oh dear. looking [14:38:33] oh boy [14:38:45] (03PS3) 10Cathal Mooney: pdus: add pro4x breaker alerts [alerts] - 10https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [14:38:45] (03CR) 10Cathal Mooney: "LGTM though I am not an expert on the PDUs nor alertmanager rules." 
[alerts] - 10https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [14:38:59] sigh, there are 6 instances of mw-videoscaler running in parallel [14:39:03] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:16] acking [14:39:37] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker1144.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1281.eqiad. 
[14:39:37] ikikube-worker1007.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1251.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worke [14:39:37] iad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal [14:39:44] thanks [14:39:49] (03CR) 10Xcollazo: [C:03+1] "CC @mforns@wikimedia.org" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) (owner: 10Brouberol) [14:39:51] at least now the shellbox diff should be clean :) [14:39:57] so we can clean upsize [14:40:42] not sure what 6 means, if too few or too many [14:40:57] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146020|Remove unused wgContentTranslationEnableSectionTranslation (T389970)]] (duration: 09m 36s) [14:41:01] T389970: Remove access to old dashboard and related code - https://phabricator.wikimedia.org/T389970 [14:41:10] jynus: too many [14:41:14] ties up workers [14:43:52] for clarity, are you handling videoscalers, or should I call someone else? [14:44:06] hnowlan ^ [14:44:14] jynus: I am handling it [14:44:18] ok, thanks [14:44:36] please speak up if need help, reviews, etc. [14:44:39] (03PS6) 10Effie Mouzeli: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [14:47:08] might see recovery in a little bit [14:47:30] thanks a lot [14:47:39] hnowlan: suppose I should wait before deploying a mw chart change? 
[14:47:42] :p [14:47:52] (03CR) 10Ladsgroup: [C:03+1] pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [14:48:31] claime: maybe just a few minutes :P [14:48:38] ;) [14:48:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:06] nice [14:49:37] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:50:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:50:36] (03PS1) 10Hnowlan: mw-videoscaler: drop job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149416 [14:52:17] short-term fix there, if anyone wants to review ^ [14:52:18] (03CR) 10Clément Goubert: [C:03+1] mw-videoscaler: drop job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149416 (owner: 10Hnowlan) [14:52:24] quick on the draw <3 [14:53:14] thank you a lot for the quick response, hnowlan! 
[14:53:14] (03CR) 10Hnowlan: [C:03+2] mw-videoscaler: drop job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149416 (owner: 10Hnowlan) [14:53:32] checking the patch [14:53:35] jynus: np, sorry for the noise - we need to rethink things a little [14:53:44] ah, claime beat me to it [14:53:54] and I trust him more :-D [14:54:32] https://media1.tenor.com/m/75Wzyfp17XQAAAAd/monumentale-erreur-last-action-hero-monumentale-erreur.gif [14:54:35] hnowlan: please feel free to provoke easy-to-fix p*ges any time, I prefer those over the hard-to-fix kind :-D [14:54:36] (03Merged) 10jenkins-bot: mw-videoscaler: drop job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149416 (owner: 10Hnowlan) [14:54:54] claime: LOL [14:55:54] (03CR) 10Ilias Sarantopoulos: [C:03+1] "nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149375 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [14:57:13] claime: just need to wait a couple of minutes for the free workers to increase a little [14:57:20] yeah no rush [14:57:41] I'll merge the cirrus cronjob in the meantime [14:58:18] (03PS12) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [14:59:21] claime: my change from earlier is actually ready to be merged, I can roll all out at the same time if you'd like [14:59:35] hnowlan: sure, go ahead [14:59:56] (03CR) 10Dzahn: "tbh I got a bit confused here because it sounded like on this patch you said you prefer a more generic approach but on the patch with the " [puppet] - 10https://gerrit.wikimedia.org/r/1148433 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn) [15:00:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1149368 if you want to submit the patch [15:00:05] andre and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1500). [15:00:33] (03CR) 10Gkyziridis: [C:03+2] api-gateway: switch the api gw to edit-check prod model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149375 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [15:01:11] (03CR) 10Effie Mouzeli: [C:03+2] pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [15:01:15] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: Migrate cirrus_build_completion_indices to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149368 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [15:02:39] (03Merged) 10jenkins-bot: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149414 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [15:03:10] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:03:15] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:03:24] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:03:31] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:03:50] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:04:03] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:04:51] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:05:01] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:05:37] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1062 to cirrussearch1062 [15:05:51] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:06:42] FIRING: JobUnavailable: Reduced 
availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:21] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:08:35] !log installing mariadb security updates (as packaged in Debian, not the wmf-mariadb packages) [15:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:48] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:09:15] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1062 to cirrussearch1062 - bking@cumin2002" [15:09:32] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1062 to cirrussearch1062 - bking@cumin2002" [15:09:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:09:33] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1062 on all recursors [15:09:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1062 on all recursors [15:09:37] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1062 [15:11:15] claime: you should be good to go ahead [15:11:24] hnowlan: <3 thanks [15:12:26] also the cirrus jobs have been migrated [15:12:27] (03PS1) 10Clément Goubert: mw::maintenance: Migrate purge_old_cx_drafts to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149420 (https://phabricator.wikimedia.org/T388539) [15:12:36] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Add fancycaptcha wordlists to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149346 (https://phabricator.wikimedia.org/T388531) 
(owner: 10Clément Goubert) [15:12:43] bking@cumin2002 rename (PID 4115275) is awaiting input [15:13:34] (03CR) 10CI reject: [V:04-1] mw::maintenance: Migrate purge_old_cx_drafts to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149420 (https://phabricator.wikimedia.org/T388539) (owner: 10Clément Goubert) [15:14:28] (03PS2) 10Clément Goubert: mw::maintenance: Migrate purge_old_cx_drafts to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149420 (https://phabricator.wikimedia.org/T388539) [15:15:21] (03Merged) 10jenkins-bot: mediawiki: Add fancycaptcha wordlists to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149346 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [15:16:23] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173 [15:16:26] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [15:16:32] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1062 [15:16:43] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10848905 (10dancy) I want to attempt to make p... 
[15:17:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1062 to cirrussearch1062 [15:17:48] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1062.eqiad.wmnet with OS bullseye [15:17:52] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1062 [15:17:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1062 [15:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:21:18] (03PS1) 10Clément Goubert: mediawiki: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149421 [15:22:01] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: Migrate purge_old_cx_drafts to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149420 (https://phabricator.wikimedia.org/T388539) (owner: 10Clément Goubert) [15:22:27] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Migrate purge_old_cx_drafts to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149420 (https://phabricator.wikimedia.org/T388539) (owner: 10Clément Goubert) [15:23:52] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10848955 (10dancy) I ended up adding an entry... 
[15:24:49] (03PS1) 10Hnowlan: mw::maintenance: migrate abusefilteripdata job to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149422 (https://phabricator.wikimedia.org/T388542) [15:25:55] (03CR) 10CI reject: [V:04-1] mw::maintenance: migrate abusefilteripdata job to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149422 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [15:26:38] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cirrussearch2110.codfw.wmnet with reason: firmware update cookbook [15:26:41] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10848993 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7389739c-a74e-4785-99a6-55575ec65f75) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their servi... [15:28:10] (03PS1) 10Jgiannelos: Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149423 [15:28:12] !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2110.codfw.wmnet [15:28:17] (03PS1) 10Jgiannelos: Revert "pcs: Add missing headers for MW requests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149424 [15:28:21] (03PS2) 10Hnowlan: mw::maintenance: migrate abusefilteripdata job to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149422 (https://phabricator.wikimedia.org/T388542) [15:28:25] (03CR) 10CI reject: [V:04-1] Revert "pcs: Add missing headers for MW requests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149424 (owner: 10Jgiannelos) [15:28:28] (03PS1) 10Jgiannelos: Revert "pcs: Default to use http client with service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149425 [15:28:42] !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cirrussearch2110.codfw.wmnet 
[15:29:00] (Abandoned) Jdlrobson: bookmark: Fix click event not working [extensions/ReadingLists] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1148939 (https://phabricator.wikimedia.org/T394736) (owner: Jdlrobson) [15:29:16] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:30:18] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:30:43] (CR) Clément Goubert: [C:+2] mediawiki: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1149421 (owner: Clément Goubert) [15:33:01] (Merged) jenkins-bot: mediawiki: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1149421 (owner: Clément Goubert) [15:33:21] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:33:44] SRE-swift-storage, MediaWiki-Uploading, Wikimedia-production-error: UploadChunkFileException: Error storing file in '{chunkPath}': backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T395049 (brennen) NEW [15:35:01] (CR) Effie Mouzeli: [C:+1] Revert "pcs: Default to use http client with service mesh" [deployment-charts] - https://gerrit.wikimedia.org/r/1149425 (owner: Jgiannelos) [15:35:15] (CR) Effie Mouzeli: [C:+1] Revert "mobileapps: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1149423 (owner: Jgiannelos) [15:35:20] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye [15:35:48] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1062.eqiad.wmnet with reason: host reimage [15:36:02] (PS1) Hnowlan: mw::maintenance: migrate remaining translation notifications jobs [puppet] - https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) [15:36:17] !log cgoubert@deploy1003 Started scap sync-world: 1149346: mediawiki: Add
fancycaptcha wordlists to mw-cron - T388531 [15:36:21] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [15:38:27] !log cgoubert@deploy1003 Finished scap sync-world: 1149346: mediawiki: Add fancycaptcha wordlists to mw-cron - T388531 (duration: 02m 28s) [15:39:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1062.eqiad.wmnet with reason: host reimage [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:47] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Migrate generatecaptcha to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149351 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [15:41:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:43:18] I just had a gitlab CI runner fail for a "no space left on device" meta-failure [15:43:21] related above? [15:43:36] bblack: Job link? [15:43:47] dancy: https://gitlab.wikimedia.org/repos/sre/libvmod-wmfuniq/-/jobs/517608 [15:43:53] thx [15:44:15] I'll get that fixed up. [15:44:20] ok thanks! [15:44:26] (03PS1) 10Majavah: sre: kubernetes: Remove stray quote [alerts] - 10https://gerrit.wikimedia.org/r/1149428 [15:46:10] (03CR) 10Clément Goubert: [C:03+1] sre: kubernetes: Remove stray quote [alerts] - 10https://gerrit.wikimedia.org/r/1149428 (owner: 10Majavah) [15:46:29] bblack: Cleaned. [15:47:51] dancy: thanks! re-running! 
[15:47:55] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10849133 (10dancy) Notes: `modules/role/manif... [15:48:17] (03CR) 10Jgiannelos: [C:03+2] Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149423 (owner: 10Jgiannelos) [15:48:21] (03CR) 10Jgiannelos: [C:03+2] Revert "pcs: Default to use http client with service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149425 (owner: 10Jgiannelos) [15:48:32] FIRING: HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:48:35] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:48:44] (03CR) 10Majavah: [C:03+2] sre: kubernetes: Remove stray quote [alerts] - 10https://gerrit.wikimedia.org/r/1149428 (owner: 10Majavah) [15:48:57] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:49:16] (03CR) 10Muehlenhoff: "The PCC failures are unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/1149371 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [15:49:49] (03Merged) 10jenkins-bot: Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149423 (owner: 10Jgiannelos) [15:49:57] (03Merged) 10jenkins-bot: Revert "pcs: Default to use http client with service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149425 (owner: 10Jgiannelos) [15:50:04] (03CR) 10JHathaway: [C:03+1] Remove unused option to enable 
host-based auth [puppet] - https://gerrit.wikimedia.org/r/1149371 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [15:50:09] (Merged) jenkins-bot: sre: kubernetes: Remove stray quote [alerts] - https://gerrit.wikimedia.org/r/1149428 (owner: Majavah) [15:50:23] (PS2) Jgiannelos: Revert "pcs: Add missing headers for MW requests" [deployment-charts] - https://gerrit.wikimedia.org/r/1149424 [15:50:54] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10849147 (Jhancock.wm) [15:51:02] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10849151 (Jhancock.wm) a:Jhancock.wm [15:51:39] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2a00:1188:5:e::4) [15:55:56] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [15:56:00] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [15:56:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [15:58:11] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10849187 (Jhancock.wm) @BCornwall we got the servers in today, probably racking them tomorrow or next week. Is there any preference on where they are being racked, or... [15:58:44] !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2110.codfw.wmnet [15:59:00] SRE, SRE-Access-Requests, collaboration-services, Continuous-Integration-Infrastructure (Zuul upgrade), Patch-For-Review: create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10849192 (thcipriani) Re-using `contint-roots` makes sense here. I note that that also p...
[15:59:06] !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2110.codfw.wmnet [16:00:05] jhathaway and moritzm: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:45] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 88.94 ms [16:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:21] !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2110.codfw.wmnet [16:08:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1062.eqiad.wmnet with OS bullseye [16:11:30] volans@cumin1003 upgrade-firmware (PID 2740799) is awaiting input [16:12:47] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cirrussearch2110.codfw.wmnet [16:14:21] (03CR) 10Hnowlan: [C:03+1] Remove obsolete jobrunner cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1149398 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [16:15:47] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [16:17:40] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10849264 (10dancy) I ran: ` $ for n in 12 13 1... 
[16:19:31] jouncebot: nowandnext [16:19:31] For the next 0 hour(s) and 40 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1600) [16:19:31] In 0 hour(s) and 40 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1700) [16:19:31] In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1700) [16:22:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [16:22:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [16:23:25] !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2110.codfw.wmnet [16:23:25] !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cirrussearch2110.codfw.wmnet [16:25:52] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:26:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:28:32] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:29:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10849301 (10dcaro) For easy reading :), here's the racking info [from the parent task](https://phabricator.wikimedia.org/T38985... 
[16:32:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [16:33:32] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:33:43] (03CR) 10Effie Mouzeli: [C:03+2] Revert "pcs: Add missing headers for MW requests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149424 (owner: 10Jgiannelos) [16:35:18] (03Merged) 10jenkins-bot: Revert "pcs: Add missing headers for MW requests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149424 (owner: 10Jgiannelos) [16:36:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [16:36:07] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10849344 (10Volans) @bking Yeah, no need to go offtopic for something almost 3y old. I have indeed forgot about the `Re: Request for NIC firmware update advice... 
[16:38:32] RESOLVED: [2x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:41:41] (03PS1) 10Clément Goubert: Revert "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149438 (https://phabricator.wikimedia.org/T388531) [16:42:51] (03CR) 10Hnowlan: [C:03+1] Revert "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149438 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [16:46:11] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [16:48:22] (03CR) 10Clément Goubert: [C:03+2] Revert "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149438 (https://phabricator.wikimedia.org/T388531) (owner: 10Clément Goubert) [16:48:38] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:50:44] (03CR) 10Clément Goubert: "Couple missing `ttlsecondsafterfinish`, otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [16:50:56] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: migrate abusefilteripdata job to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1149422 (https://phabricator.wikimedia.org/T388542) (owner: 10Hnowlan) [16:51:29] (03Abandoned) 10Clément Goubert: interactive: Ring the bell by default in ask_input [software/pywmflib] - 
https://gerrit.wikimedia.org/r/1069136 (owner: Clément Goubert) [16:53:56] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [16:56:07] (PS2) Clément Goubert: mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1149366 (https://phabricator.wikimedia.org/T388538) [16:56:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [16:56:58] (PS2) Hnowlan: mw::maintenance: migrate remaining translation notifications jobs [puppet] - https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) [16:57:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [16:57:26] (CR) Hnowlan: [C:+2] mw::maintenance: migrate abusefilteripdata job to Kubernetes [puppet] - https://gerrit.wikimedia.org/r/1149422 (https://phabricator.wikimedia.org/T388542) (owner: Hnowlan) [16:57:57] (CR) CI reject: [V:-1] mw::maintenance: migrate remaining translation notifications jobs [puppet] - https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan) [16:58:20] (CR) Clément Goubert: [C:+1] mw::maintenance: migrate remaining translation notifications jobs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) (owner: Hnowlan) [16:59:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1149404 (https://phabricator.wikimedia.org/T394315) (owner: DDesouza) [16:59:58] (PS3) Hnowlan: mw::maintenance: migrate remaining translation notifications jobs [puppet] - https://gerrit.wikimedia.org/r/1149426
(https://phabricator.wikimedia.org/T388539) [17:00:04] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1700). [17:00:04] swfrench-wmf: That opportune time for a MediaWiki infrastructure (UTC late) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T1700). [17:01:03] (03CR) 10Scott French: [C:03+1] "LGTM as a direct translation to k8s! The two bits I'm not sure about are (1) how well this might keep up with the 1m interval and (2) whet" [puppet] - 10https://gerrit.wikimedia.org/r/1149366 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [17:01:45] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1149366 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [17:01:45] o/ [17:02:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [17:02:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10849468 (10RobH) Ok, reverted as the system is done testing and needs to go back to 10G. [17:03:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [17:03:58] no developer portal deploy for me this week [17:04:06] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10849470 (10wiki_willy) ++ @Papaul & @RobH - are one of you guys able to review the patch for Tiziano? >>! In T387231#1084806... 
[17:05:19] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:05:48] waiting for a bit on my changes while some mw-cron migrations complete [17:05:53] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:06:57] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [17:07:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [17:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:52] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:11:25] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:11:54] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: migrate remaining translation notifications jobs [puppet] - 10https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [17:13:41] (03PS1) 10DDesouza: miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149442 (https://phabricator.wikimedia.org/T344471) [17:14:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:14:46] (03PS1) 10Clément Goubert: Revert "mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149447 [17:14:51] !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to pick up mw-script / mw-cron logging changes - T378479 [17:14:55] T378479: Allow using 
helper scripts inside of mwscript-k8s - https://phabricator.wikimedia.org/T378479 [17:15:47] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149442 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [17:15:58] !log swfrench@deploy1003 Stopping before sync operations [17:16:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye [17:16:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:54] (03CR) 10RobH: [C:03+1] pdus: add pro4x breaker alerts [alerts] - 10https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [17:17:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [17:18:03] (03CR) 10Clément Goubert: [C:03+2] Revert "mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149447 (owner: 10Clément Goubert) [17:18:04] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149442 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [17:18:08] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10849525 (10RobH) I've given it a 15minute review and a +1. I can give it a more in depth review but if we roll this and it do... 
[17:23:33] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:23:53] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:24:13] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:24:27] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:24:29] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:24:46] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:24:48] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:25:08] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:26:08] !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to stop building and publishing PHP 7.4 images - T391057 [17:26:11] T391057: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057 [17:27:09] !log swfrench@deploy1003 Stopping before sync operations [17:30:56] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [17:32:08] !log swfrench@deploy1003 Started scap sync-world: Deployment clear no-op image diffs [17:33:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [17:34:30] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[17:38:32] RESOLVED: HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:40:23] !log swfrench@deploy1003 Finished scap sync-world: Deployment clear no-op image diffs (duration: 09m 20s) [17:43:42] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [17:43:53] I'm done with the infra window [17:44:23] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10849733 (10ssingh) (Adding @Fabfur who will lead this from Traffic with Brett.) [17:48:37] FIRING: [4x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:49:25] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1061 to cirrussearch1061 [17:49:39] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:53:20] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1061 to cirrussearch1061 - bking@cumin2002" [17:53:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1061 to cirrussearch1061 - bking@cumin2002" [17:53:38] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:38] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1061 on all recursors [17:53:41] !log bking@cumin2002 END 
(PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1061 on all recursors [17:53:42] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1061 [17:56:59] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1061 [17:57:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1061 to cirrussearch1061 [17:59:16] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1061.eqiad.wmnet with OS bullseye [17:59:21] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1061 [17:59:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1061 [18:00:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10849847 (10RobH) 05Open→03Resolved I should have closed this, we went with ordering a second rack and putting it into the new expansion space. This allows... [18:02:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10849856 (10RobH) The solution selected was ordering and installing a second new rack in the ML expansion row for frack usage. So no non-frack migrations... 
[18:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:58] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync [18:09:28] !log mforns@deploy1003 Started deploy [analytics/refinery@98f8a96] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@98f8a96a] [18:09:46] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync [18:10:51] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync [18:11:36] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync [18:13:07] !log mforns@deploy1003 Finished deploy [analytics/refinery@98f8a96] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@98f8a96a] (duration: 03m 39s) [18:14:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:14:21] (03PS1) 10Bvibber: Enable Chart extension on phase 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149454 (https://phabricator.wikimedia.org/T393519) [18:15:18] if nobody's busy ill go ahead and deploy that with spiderpig :D [18:16:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149454 
(https://phabricator.wikimedia.org/T393519) (owner: 10Bvibber) [18:16:43] !log mforns@deploy1003 Started deploy [analytics/refinery@98f8a96]: Regular analytics weekly train [analytics/refinery@98f8a96a] [18:17:13] (03CR) 10CI reject: [V:04-1] Enable Chart extension on phase 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149454 (https://phabricator.wikimedia.org/T393519) (owner: 10Bvibber) [18:18:28] wtf [18:18:34] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10849989 (10Jhancock.wm) [18:18:52] (03CR) 10Bvibber: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149454 (https://phabricator.wikimedia.org/T393519) (owner: 10Bvibber) [18:18:56] !log mforns@deploy1003 Finished deploy [analytics/refinery@98f8a96]: Regular analytics weekly train [analytics/refinery@98f8a96a] (duration: 02m 12s) [18:19:21] !log mforns@deploy1003 Started deploy [analytics/refinery@98f8a96] (thin): Regular analytics weekly train THIN [analytics/refinery@98f8a96a] [18:19:47] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1061.eqiad.wmnet with reason: host reimage [18:20:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149454 (https://phabricator.wikimedia.org/T393519) (owner: 10Bvibber) [18:20:31] !log mforns@deploy1003 Finished deploy [analytics/refinery@98f8a96] (thin): Regular analytics weekly train THIN [analytics/refinery@98f8a96a] (duration: 01m 09s) [18:20:32] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10850006 (10Jhancock.wm) [18:21:01] (03Merged) 10jenkins-bot: Enable Chart extension on phase 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149454 (https://phabricator.wikimedia.org/T393519) (owner: 10Bvibber) [18:21:18] !log 
bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1149454|Enable Chart extension on phase 3 wikis (T393519)]] [18:21:21] T393519: Enable Charts for Phase 3 wikis - https://phabricator.wikimedia.org/T393519 [18:23:10] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1149454|Enable Chart extension on phase 3 wikis (T393519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:23:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1061.eqiad.wmnet with reason: host reimage [18:23:48] !log bvibber@deploy1003 bvibber: Continuing with sync [18:29:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [18:30:52] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149454|Enable Chart extension on phase 3 wikis (T393519)]] (duration: 09m 34s) [18:30:57] T393519: Enable Charts for Phase 3 wikis - https://phabricator.wikimedia.org/T393519 [18:32:29] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1060 to cirrussearch1060 [18:32:37] jhancock@cumin2002 provision (PID 15264) is awaiting input [18:32:42] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:35:39] (03PS2) 10Ryan Kemper: wdqs: nuke previously absented pyrra update lag [puppet] - 10https://gerrit.wikimedia.org/r/1148979 (https://phabricator.wikimedia.org/T393966) [18:35:42] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148979 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [18:38:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [18:38:17] bking@cumin2002 rename (PID 16406) is awaiting input 
[18:39:20] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10850151 (10RobH) Please note that we took special care to not use BOSS or NVMe backplane for the OS SSDs so this technically should be able to leverage UEFI. However,... [18:39:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [18:39:35] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10850152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm [18:41:18] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1060 to cirrussearch1060 - bking@cumin2002" [18:44:23] bking@cumin2002 rename (PID 16406) is awaiting input [18:44:47] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1060 to cirrussearch1060 - bking@cumin2002" [18:44:47] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1060 on all recursors [18:44:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1060 on all recursors [18:44:52] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1060 [18:46:04] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1060 [18:46:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10850182 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host thanos-be1006.eqiad.wmnet with OS bullseye... [18:46:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10850184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host thanos-be1007.... [18:46:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10850185 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host thanos-be1008.eqiad.wmnet with OS bullseye... [18:46:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10850186 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host thanos-be1009.eqiad.wmnet with OS bullseye... [18:46:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1060 to cirrussearch1060 [18:48:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10850191 (10Jclark-ctr) [18:49:16] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10850193 (10BCornwall) >>! In T392851#10849187, @Jhancock.wm wrote: > @BCornwall we got the servers in today, probably racking them tomorrow or next week. Is there any p... 
[18:49:39] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10850194 (10Jclark-ctr) @MatthewVernon these have imaged but fail to finish puppet [18:54:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1061.eqiad.wmnet with OS bullseye [18:55:29] (03PS1) 10Aqu: airflow-analytics-test: Temporarily Disable DataHub plugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149456 [18:56:24] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade), 13Patch-For-Review: create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10850226 (10Dzahn) Thanks for this confirmation and the chat in the meeting today. In thi... [18:56:31] (03CR) 10Dzahn: [C:03+2] role: delete requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [18:57:49] (03CR) 10Dzahn: [C:03+1] "just because the bot added me. lgtm, I saw the other patch that already disabled this." [puppet] - 10https://gerrit.wikimedia.org/r/1148436 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [18:59:09] (03CR) 10Dzahn: zuul: create role/profile for new zuul main servers, install docker.io (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [18:59:11] (03CR) 10Dzahn: [C:03+2] zuul: create role/profile for new zuul main servers, install docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [19:05:43] (03CR) 10AOkoth: "Ack. Let me set that up. Though I hadn't deployed to codfw yet so I'll have to sort that out first." 
[dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [19:05:48] (03PS3) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) [19:13:25] (03PS5) 10Scott French: deployment_server: Call into the mwscript helper from mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148490 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [19:13:25] (03CR) 10Scott French: "Thanks in advance for the review! I've tested this on the deployment host to confirm the expected behavior. More context on this follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/1148490 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [19:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:20:51] !log aokoth@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [19:21:22] !log aokoth@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:23:12] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [19:23:36] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[19:23:41] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [19:23:44] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [19:24:42] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10850368 (10Jclark-ctr) @MatthewVernon @wiki_willy I am on hold for this untill some servers get Decom in rows a-d we are out of power at this time in 10... [19:27:28] (03PS1) 10Amire80: Make functionall identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 [19:28:16] (03PS2) 10Amire80: Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 [19:29:18] (03PS1) 10Bartosz Dziewoński: Revert "In ParserAfterTidy use the new ParserOptions::isMessage" [extensions/DiscussionTools] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149466 (https://phabricator.wikimedia.org/T395034) [19:29:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/DiscussionTools] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149466 (https://phabricator.wikimedia.org/T395034) (owner: 10Bartosz Dziewoński) [19:33:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10850395 (10Jclark-ctr) a:03VRiley-WMF [19:38:21] (03CR) 10Bking: "The comment mentions 'file, filetypes and filerevisions', but I don't see 'filerevisions' in the CR?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1139115 (https://phabricator.wikimedia.org/T389800) (owner: 10Mforns) [19:41:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [19:46:24] (03CR) 10Mforns: "Yes, the filerevisions table has been added to MediaWiki, but we do not needed in the data lake for our calculation. Will update the readm" [puppet] - 10https://gerrit.wikimedia.org/r/1139115 (https://phabricator.wikimedia.org/T389800) (owner: 10Mforns) [19:46:41] (03PS2) 10Mforns: Add file and filetypes tables to the mediawiki-not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1139115 (https://phabricator.wikimedia.org/T389800) [19:48:31] (03PS1) 10AOkoth: doc: swap doc1003 with doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1149469 [19:50:24] (03CR) 10Bking: [C:03+2] Add file and filetypes tables to the mediawiki-not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1139115 (https://phabricator.wikimedia.org/T389800) (owner: 10Mforns) [19:55:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx) [19:56:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) (owner: 10ZhaoFJx) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T2000). 
[20:00:04] kgraessle, danisztls, MatmaRex, and ZhaoFJx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] here [20:00:16] hi [20:00:37] o/ [20:01:21] o/ [20:01:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1060.eqiad.wmnet with OS bullseye [20:01:38] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1060 [20:01:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1060 [20:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:33] 06SRE, 10Observability-Alerting: when servers are about to run out of disk, monitoring should notify the owners - https://phabricator.wikimedia.org/T394955#10850474 (10Dzahn) I am not trying to say the disk checks need to be migrated to a different system or anything. The `check_disk` Nagios/Icinga plugin is... [20:06:23] jhancock@cumin2002 reimage (PID 19465) is awaiting input [20:06:49] any deployers around? [20:07:10] (03PS1) 10Dzahn: zuul: let puppet manage docker service, install docker-compose [puppet] - 10https://gerrit.wikimedia.org/r/1149474 (https://phabricator.wikimedia.org/T393873) [20:08:16] (03CR) 10CI reject: [V:04-1] zuul: let puppet manage docker service, install docker-compose [puppet] - 10https://gerrit.wikimedia.org/r/1149474 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [20:09:11] MatmaRex: do y'all need someone to push buttons? [20:09:26] bd808: it seems so [20:11:08] is there a code review step in between? 
[20:11:24] I was going to do other things, but I can probably push some buttons on spiderpig [20:12:27] It looks like we can do this in 2 batches, one to do MatmaRex's revert backport and one for all of the config changes. [20:12:42] do you mind being last MatmaRex? [20:13:12] no problem. thanks [20:13:33] katherine_g, ZhaoFJx, danisztls: do all of you know how you will test your changes on the debug servers? [20:13:39] yes [20:13:52] bd808: yes [20:13:56] bd808 yes [20:14:34] cool. I'll try to get us moving then :) [20:14:39] thanks [20:17:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147881 (https://phabricator.wikimedia.org/T391248) (owner: 10Scardenasmolinar) [20:17:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149404 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [20:17:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx) [20:17:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) (owner: 10ZhaoFJx) [20:17:29] (03PS2) 10Dzahn: zuul: let puppet manage docker service, install docker-compose [puppet] - 10https://gerrit.wikimedia.org/r/1149474 (https://phabricator.wikimedia.org/T393873) [20:18:23] (03Merged) 10jenkins-bot: Add AutoModerator to eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147881 (https://phabricator.wikimedia.org/T391248) (owner: 10Scardenasmolinar) [20:18:26] (03Merged) 10jenkins-bot: Design Research survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149404 (https://phabricator.wikimedia.org/T394315) 
(owner: 10DDesouza) [20:18:28] (03Merged) 10jenkins-bot: arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx) [20:18:30] (03Merged) 10jenkins-bot: arbcom_zhwiki: Enable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) (owner: 10ZhaoFJx) [20:18:44] !log bd808@deploy1003 Started scap sync-world: Backport for [[gerrit:1147881|Add AutoModerator to eswiki (T391248)]], [[gerrit:1149404|Design Research survey: Undeploy (T394315)]], [[gerrit:1148888|arbcom_zhwiki: Change wgWhitelistRead Setting (T394919)]], [[gerrit:1148894|arbcom_zhwiki: Enable local upload (T394920)]] [20:18:52] T391248: Enable AutoModerator on Spanish Wikipedia - https://phabricator.wikimedia.org/T391248 [20:18:52] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [20:18:52] T394919: Add accessable mainpage for non-logged-in readers on zh.arbcom - https://phabricator.wikimedia.org/T394919 [20:18:52] T394920: Enable local file upload on zh.arbcom - https://phabricator.wikimedia.org/T394920 [20:20:10] (03CR) 10Dzahn: [C:04-1] "doc1004 is still on "insetup". It would need the doc role on it first, right?" [puppet] - 10https://gerrit.wikimedia.org/r/1149469 (owner: 10AOkoth) [20:20:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1060.eqiad.wmnet with reason: host reimage [20:20:36] !log bd808@deploy1003 bd808, zhaofjx, dani, suecarmol: Backport for [[gerrit:1147881|Add AutoModerator to eswiki (T391248)]], [[gerrit:1149404|Design Research survey: Undeploy (T394315)]], [[gerrit:1148888|arbcom_zhwiki: Change wgWhitelistRead Setting (T394919)]], [[gerrit:1148894|arbcom_zhwiki: Enable local upload (T394920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be [20:20:36] verified there. [20:20:52] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Upgrading to Java 11.0.27 - eevans@cumin1002 [20:21:39] k, I'm good to sync [20:21:39] katherine_g, ZhaoFJx, danisztls: time to test your config changes [20:21:45] bd808: looks good [20:21:57] (03CR) 10Dzahn: [C:04-1] "I guess what I am trying to say is "your message clains you apply the role on doc1004 but the code does not do it yet." [puppet] - 10https://gerrit.wikimedia.org/r/1149469 (owner: 10AOkoth) [20:22:05] bd808 checked, all working as expected [20:22:25] yay. let's send things out to the wikis then [20:22:30] !log bd808@deploy1003 bd808, zhaofjx, dani, suecarmol: Continuing with sync [20:23:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1060.eqiad.wmnet with reason: host reimage [20:27:51] bd808: thanks! [20:28:51] still syncing, but almost done I think [20:29:32] !log bd808@deploy1003 Finished scap sync-world: Backport for [[gerrit:1147881|Add AutoModerator to eswiki (T391248)]], [[gerrit:1149404|Design Research survey: Undeploy (T394315)]], [[gerrit:1148888|arbcom_zhwiki: Change wgWhitelistRead Setting (T394919)]], [[gerrit:1148894|arbcom_zhwiki: Enable local upload (T394920)]] (duration: 10m 47s) [20:29:39] T391248: Enable AutoModerator on Spanish Wikipedia - https://phabricator.wikimedia.org/T391248 [20:29:39] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [20:29:39] T394919: Add accessable mainpage for non-logged-in readers on zh.arbcom - https://phabricator.wikimedia.org/T394919 [20:29:40] T394920: Enable local file upload on zh.arbcom - https://phabricator.wikimedia.org/T394920 [20:29:57] (03CR) 10Dzahn: [C:03+2] zuul: let puppet manage docker service, install docker-compose [puppet] - 10https://gerrit.wikimedia.org/r/1149474 (https://phabricator.wikimedia.org/T393873) 
(owner: 10Dzahn) [20:30:09] katherine_g, ZhaoFJx, danisztls: now would be a good time to double check the wikis your config should have changed [20:31:03] bd808 working perfectly, thanks for deploy! [20:31:51] all as expected, thanks again [20:32:10] your up next MatmaRex [20:32:15] *you're [20:32:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149466 (https://phabricator.wikimedia.org/T395034) (owner: 10Bartosz Dziewoński) [20:32:32] yup [20:32:59] (03PS1) 10Dzahn: zuul: enforce puppet7 on new zuul::main role [puppet] - 10https://gerrit.wikimedia.org/r/1149476 (https://phabricator.wikimedia.org/T393873) [20:33:41] (03Merged) 10jenkins-bot: Revert "In ParserAfterTidy use the new ParserOptions::isMessage" [extensions/DiscussionTools] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149466 (https://phabricator.wikimedia.org/T395034) (owner: 10Bartosz Dziewoński) [20:33:56] !log bd808@deploy1003 Started scap sync-world: Backport for [[gerrit:1149466|Revert "In ParserAfterTidy use the new ParserOptions::isMessage" (T395034)]] [20:34:01] T395034: Incorrect and doubled empty talk page onboard messages on every talk page of Chinese Wikipedia - https://phabricator.wikimedia.org/T395034 [20:35:03] ah, I saw that bug "discovered" in the tech discord channel earlier today [20:35:52] !log bd808@deploy1003 bd808, matmarex: Backport for [[gerrit:1149466|Revert "In ParserAfterTidy use the new ParserOptions::isMessage" (T395034)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:37:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.205s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:37:16] bd808: works as expected [20:37:33] I will push the button then. thanks MatmaRex [20:37:37] !log bd808@deploy1003 bd808, matmarex: Continuing with sync [20:37:41] thanks for deploying [20:37:47] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate remaining translation notifications jobs [puppet] - 10https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [20:38:18] it is so easy with spiderpig :) You just have to not worry too much that you aren't staring at an ssh session [20:38:55] but you still need ssh access to use it :'( [20:39:33] this is true. 
That should go away once we have 2FA available on the Developer accounts themselves [20:40:18] I am in the camp that everyone with +2 should also have prod shell access and use it, but opinions vary [20:42:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.085s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:42:41] ^ that's how the system lets us know when deploys are finished :p [20:42:41] i'd be much more comfortable having access to a limited and verbosely logged tool, than basically uncontrolled shell access, from the risk perspective [20:43:54] its pretty far from uncontrolled, but I understand that not everyone has jumped over the "how badly really can I mess this up?" hurdle in their head :) [20:44:38] !log bd808@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149466|Revert "In ParserAfterTidy use the new ParserOptions::isMessage" (T395034)]] (duration: 10m 41s) [20:44:44] T395034: Incorrect and doubled empty talk page onboard messages on every talk page of Chinese Wikipedia - https://phabricator.wikimedia.org/T395034 [20:45:59] that's today's late backport window. Thanks for flying spiderpig airlines. the only airline with no plaines but ascii pigs [20:46:19] not worried about myself, more about losing my credentials [20:46:25] anyway. thank you bd808 [20:46:32] yw MatmaRex [20:48:07] is stil more scared of the content of the patches itself. my worst case something like "merge that rewrite rule that results in a rewrite loop for all pages and it's cached now". sometimes missing a review step between "developer uploads it" and "deployer clicks button". 
[20:48:20] at least for me the constant fear of accidentally managing to break something never went away. so far I've managed to avoid doing anything catastrophically bad :-) [20:49:50] this.. not sure if it changed by seeing it in a different type of window.. but it is also true that spiderpig did make me deploy my own config change [20:49:52] if you are not a bit scared then you probably are not paying close attention, but at some point you just gotta trust that if things get weird you and your online friends can figure it out [20:52:07] fair [20:52:29] (03CR) 10Dzahn: [C:03+2] zuul: enforce puppet7 on new zuul::main role [puppet] - 10https://gerrit.wikimedia.org/r/1149476 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [20:53:23] but things do sometimes go wrong, and that's natural if you do enough things. it's been a few years at this point but i once deployed a patch that accidentally removed some important access controls. then we fixed that issue, improved the system to make that harder to accidentally repeat, and then moved on [20:54:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1060.eqiad.wmnet with OS bullseye [20:56:18] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Upgrading to Java 11.0.27 - eevans@cumin1002 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T2100) [21:00:46] (03CR) 10Dzahn: [C:03+2] "manually fixed /etc/puppet/puppet.conf (replace with a version from a puppet7-enabled host) and ran puppet to get out of the CRL cert erro" [puppet] - 10https://gerrit.wikimedia.org/r/1149476 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [21:03:09] Hey all (Web Team) - we’ve got some fairly urgent security mitigations to deploy. Any objections? 
[21:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:38] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1055.eqiad.wmnet|elastic1056.eqiad.wmnet|elastic1074.eqiad.wmnet|elastic1075.eqiad.wmnet|elastic1076.eqiad.wmnet|elastic1077.eqiad.wmnet|elastic1078.eqiad.wmnet|elastic1079.eqiad.wmnet|elastic1085.eqiad.wmnet|elastic1086.eqiad.wmnet [21:11:50] (03CR) 10Dzahn: [V:03+1 C:03+2] "approved per https://phabricator.wikimedia.org/T394819#10849192" [puppet] - 10https://gerrit.wikimedia.org/r/1148937 (https://phabricator.wikimedia.org/T394819) (owner: 10Dzahn) [21:13:04] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [21:13:08] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [21:14:26] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1089.eqiad.wmnet|elastic1090.eqiad.wmnet|elastic1091.eqiad.wmnet|elastic1092.eqiad.wmnet|elastic1093.eqiad.wmnet|elastic1094.eqiad.wmnet|elastic1095.eqiad.wmnet|elastic1108.eqiad.wmnet|elastic1109.eqiad.wmnet [21:16:11] (03PS2) 10Dzahn: zuul: add contint-roots admin group to new zuul::main role [puppet] - 10https://gerrit.wikimedia.org/r/1148937 (https://phabricator.wikimedia.org/T394819) [21:16:39] (03CR) 10BryanDavis: [C:03+1] "Cherry picked to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud and applied on deployment-cache-{text,upload}08" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [21:17:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:18:02] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1080.eqiad.wmnet|cirrussearch1081.eqiad.wmnet|cirrussearch1082.eqiad.wmnet|cirrussearch1083.eqiad.wmnet|cirrussearch1087.eqiad.wmnet|cirrussearch1088.eqiad.wmnet|cirrussearch1118.eqiad.wmnet|cirrussearch1119.eqiad.wmnet [21:18:52] (03CR) 10Dzahn: [C:03+2] zuul: add contint-roots admin group to new zuul::main role [puppet] - 10https://gerrit.wikimedia.org/r/1148937 (https://phabricator.wikimedia.org/T394819) (owner: 10Dzahn) [21:24:53] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade), 13Patch-For-Review: give contint-roots access to new zuul VMs (was: create new admin group for "zuul devs") - https://phabricator.wikimedia.org/T394819#10850765 (10Dzahn) 05Open→03In progress [21:26:17] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.eqiad.wmnet [21:27:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10850771 (10Dzahn) [21:28:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10850777 (10Dzahn) confirmed user has signed L3. Since the group approver is also the person who created this ticket we can also consider it approved. Which should make this ready to go. 
[21:29:12] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10850778 (10Dzahn) [21:31:51] (03PS1) 10Dzahn: admin: add jdlrobson to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1149488 (https://phabricator.wikimedia.org/T393723) [21:33:13] (03PS1) 10Kosta Harlan: Enable EmailAuth for users with good ip reputation [extensions/WikimediaEvents] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149489 [21:33:21] (03PS2) 10Dzahn: admin: add jdlrobson to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1149488 (https://phabricator.wikimedia.org/T393723) [21:33:29] jouncebot: nowandnext [21:33:29] For the next 0 hour(s) and 26 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250522T2100) [21:33:29] In 8 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250523T0600) [21:33:41] !log tgr@deploy1003 Locking from deployment [MediaWiki]: T395073 [21:33:47] T395073: Incident: May 2025 Account Compromises - https://phabricator.wikimedia.org/T395073 [21:35:38] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10850797 (10Dzahn) Santhosh has signed L3 on Sep 10 2015 :) [21:36:02] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10850800 (10Dzahn) [21:37:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:37:21] (03PS1) 
10Bking: cirrussearch: add cirrussearch row D/remove elastic row E [puppet] - 10https://gerrit.wikimedia.org/r/1149493 (https://phabricator.wikimedia.org/T388610) [21:38:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.548s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:38:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149493 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:41:11] (03PS2) 10Bking: cirrussearch: add cirrussearch row D/remove elastic row E [puppet] - 10https://gerrit.wikimedia.org/r/1149493 (https://phabricator.wikimedia.org/T388610) [21:41:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149493 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:41:37] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10850810 (10Dzahn) Santhosh already has other shell access, so no new SSH key is involved or needs to be verified. Tyler made the ticket so we can consider it "group approved". The only thin... [21:42:15] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10850811 (10Dzahn) [21:42:44] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10850812 (10Dzahn) a:03Arrbee Hello @Arrbee do you approve of this request? 
[21:43:05] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10850814 (10Dzahn) 05Open→03In progress [21:43:34] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10850815 (10Dzahn) 05Open→03In progress [21:44:59] (03PS3) 10Bking: cirrussearch: add cirrussearch row D/remove elastic row E [puppet] - 10https://gerrit.wikimedia.org/r/1149493 (https://phabricator.wikimedia.org/T388610) [21:45:49] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: add cirrussearch row D/remove elastic row E [puppet] - 10https://gerrit.wikimedia.org/r/1149493 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:45:59] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10850817 (10Dzahn) Hi @Seddon this one is still waiting for your input. Alternatively you could verify by putting the new key into your existing home dir on a server. Anything really... 
[21:46:08] (03CR) 10Bking: [C:03+2] cirrussearch: add cirrussearch row D/remove elastic row E [puppet] - 10https://gerrit.wikimedia.org/r/1149493 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:46:54] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10850819 (10Dzahn) 05In progress→03Stalled [21:48:08] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, and 2 others: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10850823 (10Dzahn) a:03thcipriani [21:48:22] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.eqiad.wmnet [21:48:37] FIRING: [3x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:53:47] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1089 to cirrussearch1089 [21:54:00] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:54:41] FIRING: [3x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:58:10] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, and 2 others: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10850839 (10thcipriani) a:05thcipriani→03None Approved as keeper of contint-admins. Also, I am @P... 
[21:59:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:59:44] bking@cumin2002 rename (PID 108538) is awaiting input [22:03:13] (03PS1) 10Ladsgroup: Add one more HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149502 [22:03:32] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:04:56] about to merge one of my scariest patches I've made. I really wish HIBP had a better way to confirm domains [22:05:30] (03CR) 10Ladsgroup: [C:03+2] Add one more HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149502 (owner: 10Ladsgroup) [22:05:50] !log ladsgroup@dns1004 START - running authdns-update [22:06:15] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84846MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [22:06:34] !log ladsgroup@dns1004 END - running authdns-update [22:07:47] (03PS1) 10Ladsgroup: Add a new HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149503 [22:08:18] phew, now the next one [22:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:42] (03CR) 10Ladsgroup: [C:03+2] Add a new HIBP TXT record [dns] - 
10https://gerrit.wikimedia.org/r/1149503 (owner: 10Ladsgroup) [22:08:52] !log ladsgroup@dns1004 START - running authdns-update [22:09:38] !log ladsgroup@dns1004 END - running authdns-update [22:10:59] (03PS1) 10Ladsgroup: Add en.m.wikipedia.org HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149504 [22:11:01] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094 (10SDeckelmann-WMF) 03NEW [22:11:37] (03CR) 10Ladsgroup: [C:03+2] Add en.m.wikipedia.org HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149504 (owner: 10Ladsgroup) [22:11:45] !log ladsgroup@dns1004 START - running authdns-update [22:12:30] !log ladsgroup@dns1004 END - running authdns-update [22:13:41] (03PS1) 10Ladsgroup: Add auth.wikimedia.org HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149506 [22:14:55] (03CR) 10Ladsgroup: [C:03+2] Add auth.wikimedia.org HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149506 (owner: 10Ladsgroup) [22:15:02] !log ladsgroup@dns1004 START - running authdns-update [22:15:49] !log ladsgroup@dns1004 END - running authdns-update [22:16:34] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, and 2 others: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10850867 (10Dzahn) @hashar So it requires 2 things, membership in LDAP group ciadmin and also shell ac... [22:17:04] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, and 2 others: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10850868 (10Dzahn) I already did the LDAP group membership just now after Tyler's approval. 
[22:17:08] (03PS1) 10Ladsgroup: Add logstash.wikimedia.org HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149507 [22:18:26] (03CR) 10Ladsgroup: [C:03+2] Add logstash.wikimedia.org HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149507 (owner: 10Ladsgroup) [22:18:34] !log ladsgroup@dns1004 START - running authdns-update [22:19:05] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1089 to cirrussearch1089 - bking@cumin2002" [22:19:21] !log ladsgroup@dns1004 END - running authdns-update [22:19:23] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1089 to cirrussearch1089 - bking@cumin2002" [22:19:23] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:19:23] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1089 on all recursors [22:19:27] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1089 on all recursors [22:19:27] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1089 [22:19:43] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1089 [22:20:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1089 to cirrussearch1089 [22:21:42] (03PS1) 10Ladsgroup: Add another wiki HIBP TXT domain [dns] - 10https://gerrit.wikimedia.org/r/1149508 [22:22:36] (03CR) 10Ladsgroup: [C:03+2] Add another wiki HIBP TXT domain [dns] - 10https://gerrit.wikimedia.org/r/1149508 (owner: 10Ladsgroup) [22:22:45] !log ladsgroup@dns1004 START - running authdns-update [22:23:34] !log ladsgroup@dns1004 END - running authdns-update [22:25:09] (03PS1) 10Ladsgroup: Add another wiki HBIP TXT record [dns] - 
10https://gerrit.wikimedia.org/r/1149509 [22:26:03] (03CR) 10Dzahn: [C:03+2] admin: Add phedenskog to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/1148264 (https://phabricator.wikimedia.org/T394749) (owner: 10Hashar) [22:26:26] (03CR) 10Ladsgroup: [C:03+2] Add another wiki HBIP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149509 (owner: 10Ladsgroup) [22:26:35] !log ladsgroup@dns1004 START - running authdns-update [22:27:22] !log ladsgroup@dns1004 END - running authdns-update [22:29:00] (03PS1) 10Ladsgroup: Remove HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149510 [22:29:49] (03CR) 10Ladsgroup: [C:03+2] Remove HIBP TXT record [dns] - 10https://gerrit.wikimedia.org/r/1149510 (owner: 10Ladsgroup) [22:29:58] !log ladsgroup@dns1004 START - running authdns-update [22:30:48] !log ladsgroup@dns1004 END - running authdns-update [22:34:09] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, 07Jenkins: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10850897 (10Dzahn) 05Open→03Resolved a:03Dzahn Done. Peter has a shell user on contint* machi... [22:38:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:42:16] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10850905 (10RobH) Please note @SDeckelmann-WMF will also need access to netbox, which includes access to the wmf ldap group. 
[22:42:51] (03PS4) 10Scott French: profile::prometheus::k8s: drop terminated pod targets [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) [22:43:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:02:33] !log tgr@deploy1003 Unlocked for deployment [MediaWiki]: T395073 (duration: 88m 51s) [23:02:39] T395073: Incident: May 2025 Account Compromises - https://phabricator.wikimedia.org/T395073 [23:05:32] kind of surprised you can log to a private ticket [23:08:10] Stashbot is a subscriber on that task [23:08:24] I am sure I fixed this issue before... [23:11:26] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Disable dashboard sync to ugprade Grafana version [puppet] - 10https://gerrit.wikimedia.org/r/1148915 (https://phabricator.wikimedia.org/T394470) (owner: 10Andrea Denisse) [23:12:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149489 (owner: 10Kosta Harlan) [23:13:07] https://phabricator.wikimedia.org/T301082 [23:14:20] (03Merged) 10jenkins-bot: Enable EmailAuth for users with good ip reputation [extensions/WikimediaEvents] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149489 (owner: 10Kosta Harlan) [23:14:37] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1149489|Enable EmailAuth for users with good ip reputation]] [23:14:54] sbassett: oh! 
makes sense now [23:15:20] Amir1: Phabricator internals changed and also well, don't subscribe the announce bot to secret tasks [23:15:26] Amir1: hah! ok [23:16:00] I don't think it ever occurred to me to subscribe the IRC bot to tickets [23:16:29] !log tgr@deploy1003 kharlan, tgr: Backport for [[gerrit:1149489|Enable EmailAuth for users with good ip reputation]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:17:10] I intentionally did it for this issue because I wanted to track things on the task. But I didn’t remember that it was leaky. I can unsub it now. [23:17:24] amir already did [23:18:20] tx, Amir1 [23:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:18:36] and if nothing else it is leaky in that stashbot is an unprivileged account run by me as a volunteer from toolforge [23:32:20] okay I think it's working [23:32:31] sbassett: do you want to check or is it good to go? 
[23:32:48] tgr: good to go for me [23:33:51] (advantages of spiderpig: no realizing that I need to switch connections to test LoginNotify IP checks and forgot to use screen) [23:34:23] !log tgr@deploy1003 kharlan, tgr: Continuing with sync [23:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1149513 [23:38:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1149513 (owner: 10TrainBranchBot) [23:41:17] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149489|Enable EmailAuth for users with good ip reputation]] (duration: 26m 39s) [23:41:33] we are done [23:41:57] tx, tgr [23:41:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [23:43:04] !log deployed mitigations for T395073 [23:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1149513 (owner: 10TrainBranchBot) [23:53:03] (03CR) 10Cwhite: [C:03+2] ci: clean up statsite includes [puppet] - 10https://gerrit.wikimedia.org/r/1148436 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)