[00:07:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197059 [00:07:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197059 (owner: 10TrainBranchBot) [00:11:39] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11288227 (10toni.stoev) Now that mobile and desktop are served from the same URL, I am kind of satisfied... [00:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:45:03] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197059 (owner: 10TrainBranchBot) [00:52:34] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:14:32] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 52s) [01:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:32:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:24] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:10:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:16:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: 10Seanleong-wmde) [02:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:07:11] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:10:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:46:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) (owner: 10BPirkle) [03:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:06:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:26:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - 
cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:52:34] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:54:35] (03PS1) 10Marostegui: mariadb: Productionize db2245 [puppet] - 10https://gerrit.wikimedia.org/r/1197061 (https://phabricator.wikimedia.org/T406551) [04:56:23] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2245 [puppet] - 10https://gerrit.wikimedia.org/r/1197061 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [05:03:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2248.codfw.wmnet onto db2245.codfw.wmnet [05:03:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2248 - Depool db2248.codfw.wmnet to then clone it to db2245.codfw.wmnet - marostegui@cumin1003 [05:04:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2248 - Depool db2248.codfw.wmnet to then clone it to db2245.codfw.wmnet - marostegui@cumin1003 [05:04:44] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2248.codfw.wmnet onto db2245.codfw.wmnet [05:05:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2248.codfw.wmnet onto db2245.codfw.wmnet [05:08:29] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:08:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:14:11] (03PS1) 10Marostegui: db1206: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197062 (https://phabricator.wikimedia.org/T407463) [05:14:58] (03CR) 10Marostegui: [C:03+2] db1206: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197062 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [05:17:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1206.eqiad.wmnet with reason: Maintenance [05:17:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1206 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84069 and previous config saved to /var/cache/conftool/dbconfig/20251020-051712-marostegui.json [05:19:27] (03PS1) 10Marostegui: instances.yaml: Remove es1027 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197065 (https://phabricator.wikimedia.org/T407595) [05:19:56] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1027 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197065 (https://phabricator.wikimedia.org/T407595) (owner: 10Marostegui) [05:20:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1027 from dbctl T407595', diff saved to https://phabricator.wikimedia.org/P84070 and previous config saved to /var/cache/conftool/dbconfig/20251020-052057-marostegui.json [05:21:03] T407595: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595 [05:24:39] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84071 and previous config saved to /var/cache/conftool/dbconfig/20251020-052438-root.json [05:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:28:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:30:31] (03PS1) 10Marostegui: mariadb: Decommission es1027 [puppet] - 10https://gerrit.wikimedia.org/r/1197066 (https://phabricator.wikimedia.org/T407595) [05:33:32] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission es1027 [puppet] - 10https://gerrit.wikimedia.org/r/1197066 (https://phabricator.wikimedia.org/T407595) (owner: 10Marostegui) [05:34:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1027.eqiad.wmnet [05:34:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:32] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1197067 (owner: 10L10n-bot) [05:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:39:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84072 and previous config saved to /var/cache/conftool/dbconfig/20251020-053944-root.json [05:39:59] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [05:43:17] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1027.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [05:43:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1027.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [05:43:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:43:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1027.eqiad.wmnet [05:46:48] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11288401 (10Marostegui) a:05Marostegui→03None [05:46:50] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11288405 (10Marostegui) This is ready for DCOps [05:48:52] (03PS1) 10Marostegui: 
db1261: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197071 (https://phabricator.wikimedia.org/T406550) [05:49:53] (03CR) 10Marostegui: [C:03+2] db1261: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197071 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [05:54:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84074 and previous config saved to /var/cache/conftool/dbconfig/20251020-055450-root.json [05:56:31] (03PS1) 10Marostegui: instances.yaml: Add db1261 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197074 (https://phabricator.wikimedia.org/T406550) [05:57:01] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1261 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197074 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [05:59:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1261 depooled T406550', diff saved to https://phabricator.wikimedia.org/P84075 and previous config saved to /var/cache/conftool/dbconfig/20251020-055859-marostegui.json [05:59:04] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [05:59:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 1%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84076 and previous config saved to /var/cache/conftool/dbconfig/20251020-055942-root.json [06:09:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84077 and previous config saved to /var/cache/conftool/dbconfig/20251020-060956-root.json [06:14:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 5%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84078 and previous config saved to /var/cache/conftool/dbconfig/20251020-061449-root.json [06:14:54] T406550: Productionize 
db126[0-3] - https://phabricator.wikimedia.org/T406550 [06:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:29:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 7%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84079 and previous config saved to /var/cache/conftool/dbconfig/20251020-062955-root.json [06:29:59] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [06:45:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 10%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84080 and previous config saved to /var/cache/conftool/dbconfig/20251020-064501-root.json [06:45:06] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T0700). [07:00:05] cormacparle, sergi0, and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 20%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84081 and previous config saved to /var/cache/conftool/dbconfig/20251020-070007-root.json [07:00:12] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [07:00:27] o/ [07:04:38] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7301/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [07:07:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:08:14] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm from the gitlab-runner side. Also `docker::network` is just used by gitlab-runners afaict." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [07:15:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 25%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84082 and previous config saved to /var/cache/conftool/dbconfig/20251020-071513-root.json [07:15:18] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [07:20:00] (03PS1) 10Marostegui: db1218: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197080 (https://phabricator.wikimedia.org/T407463) [07:20:31] (03CR) 10Marostegui: [C:03+2] db1218: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197080 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [07:21:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1218.eqiad.wmnet with reason: Maintenance [07:21:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1218 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84083 and previous config saved to /var/cache/conftool/dbconfig/20251020-072153-marostegui.json [07:22:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:23:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:23:38] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [07:24:10] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[07:27:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es2032.codfw.wmnet,sretest2003.codfw.wmnet with reason: Cloning [07:28:33] !log Stop MariaDB on es2032 to clone sretest2003 T407472 [07:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:37] T407472: Install a testing db with Debian Trixie - https://phabricator.wikimedia.org/T407472 [07:29:21] meh, I overslept, I will move my change for later window [07:29:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84084 and previous config saved to /var/cache/conftool/dbconfig/20251020-072939-root.json [07:29:52] (03CR) 10Marostegui: [C:03+1] "We are not removing the check from icinga yet, right?" [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [07:30:15] (03CR) 10Marostegui: [C:03+1] site.pp: set role for db-test* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1196910 (https://phabricator.wikimedia.org/T400056) (owner: 10Federico Ceratto) [07:30:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 30%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84085 and previous config saved to /var/cache/conftool/dbconfig/20251020-073019-root.json [07:30:24] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [07:34:38] (03CR) 10Majavah: [C:03+2] remote: Support timezone-aware objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196139 (https://phabricator.wikimedia.org/T401581) (owner: 10Majavah) [07:35:30] !log Stop MariaDB on es2032 to clone sretest2003 T407352 [07:35:31] (03PS1) 10Marostegui: sretest2003: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1197081 [07:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:35] T407352: Test config H 
1P in external store - https://phabricator.wikimedia.org/T407352 [07:35:50] (03CR) 10Majavah: [C:03+1] Remove Hiera option to disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) (owner: 10Muehlenhoff) [07:36:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2248 gradually with 4 steps - Pool db2248.codfw.wmnet in after cloning [07:37:13] (03CR) 10Marostegui: [C:03+2] sretest2003: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1197081 (owner: 10Marostegui) [07:41:41] Uhm so no one is available to deploy? [07:43:57] (03Merged) 10jenkins-bot: remote: Support timezone-aware objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196139 (https://phabricator.wikimedia.org/T401581) (owner: 10Majavah) [07:44:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84088 and previous config saved to /var/cache/conftool/dbconfig/20251020-074446-root.json [07:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:45:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 50%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84089 and previous config saved to /var/cache/conftool/dbconfig/20251020-074525-root.json [07:45:30] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [07:46:38] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet - https://phabricator.wikimedia.org/T407605#11288528 (10WMDECyn) Approved from WMDE side [07:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:53:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [07:53:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [07:56:29] (03CR) 10Federico Ceratto: [C:03+2] site.pp: set role for db-test* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1196910 (https://phabricator.wikimedia.org/T400056) (owner: 10Federico Ceratto) [07:59:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84091 and previous config saved to /var/cache/conftool/dbconfig/20251020-075952-root.json [08:00:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 60%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84092 and previous config saved to /var/cache/conftool/dbconfig/20251020-080031-root.json [08:00:35] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [08:04:12] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 60051 [08:04:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [08:04:38] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 60051 [08:07:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1051 to es3 primary as es1028 will be decommissioned T406690 T407720', diff saved to https://phabricator.wikimedia.org/P84094 and previous config saved to /var/cache/conftool/dbconfig/20251020-080721-marostegui.json [08:07:27] T406690: Decommission es1026 - es1034 - https://phabricator.wikimedia.org/T406690 [08:07:27] T407720: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720 [08:08:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1028 T407720', diff saved to https://phabricator.wikimedia.org/P84095 and previous config saved to /var/cache/conftool/dbconfig/20251020-080804-marostegui.json [08:08:50] (03PS1) 10Marostegui: es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197191 (https://phabricator.wikimedia.org/T407720) [08:09:34] (03CR) 10Marostegui: [C:03+2] es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197191 (https://phabricator.wikimedia.org/T407720) (owner: 10Marostegui) [08:11:43] federico3: I think your puppet-merge is waiting for your answer, as it's been locked for 20 mins now, can you double check? 
[08:14:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84096 and previous config saved to /var/cache/conftool/dbconfig/20251020-081458-root.json [08:15:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 75%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84097 and previous config saved to /var/cache/conftool/dbconfig/20251020-081537-root.json [08:15:42] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [08:17:59] 06SRE, 10Cloud-VPS, 06DC-Ops, 06cloud-services-team (FY2025/26-Q1): Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11288584 (10dcaro) Nice! I'm eager to see the results of adding it to the cluster, as now a single NIC might be a... [08:20:02] federico3: ping [08:21:02] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo) [08:21:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2248 gradually with 4 steps - Pool db2248.codfw.wmnet in after cloning [08:21:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2248.codfw.wmnet onto db2245.codfw.wmnet [08:26:13] 06SRE: Sendemail network error (deployment) - https://phabricator.wikimedia.org/T407723 (10MKopec) 03NEW [08:28:43] 06SRE: Sendmail network error (deployment) - https://phabricator.wikimedia.org/T407723#11288627 (10MKopec) [08:30:01] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2032 gradually with 4 steps - Pool es2032.codfw.wmnet in after cloning [08:30:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 100%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84100 and previous config saved to 
/var/cache/conftool/dbconfig/20251020-083043-root.json [08:30:48] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [08:31:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es2028.codfw.wmnet,sretest2003.codfw.wmnet with reason: Cloning [08:33:35] (03PS4) 10Jcrespo: cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) [08:34:21] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) es2032 gradually with 4 steps - Pool es2032.codfw.wmnet in after cloning [08:36:03] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone_es (exit_code=99) of es2032.codfw.wmnet onto es2055.codfw.wmnet [08:37:01] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2032 - Cloning issue [08:37:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2032 - Cloning issue [08:38:50] (03CR) 10Jcrespo: [C:03+2] cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo) [08:39:33] PROBLEM - MariaDB read only es1 on es2032 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:39:34] PROBLEM - mysqld processes #page on es2032 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:40:00] the host had a glitch, looking at it [08:40:05] mmmm [08:40:43] !incidents [08:40:44] 6889 (UNACKED) es2032 (paged)/mysqld processes (paged) [08:40:48] !ack 6689 [08:40:48] Attempt to ack incident 6689 failed. 
[08:40:53] !ack 6889 [08:40:54] 6889 (ACKED) es2032 (paged)/mysqld processes (paged) [08:41:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2028 to clone sretest2003', diff saved to https://phabricator.wikimedia.org/P84102 and previous config saved to /var/cache/conftool/dbconfig/20251020-084143-marostegui.json [08:42:05] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on es2032.codfw.wmnet with reason: Cloning tool bug [08:42:35] federico3: If you will reclone it, maybe it will need more than 4 hours given the size of external store? [08:42:43] (03CR) 10Santiago Faci: [C:03+1] MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [08:43:05] marostegui: reclone *from* it or reclone es2032 itself? [08:43:26] federico3: i don't know if you have to reclone or not, just asking if 4h is enough for anything you need [08:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:43:50] it's probably longer but transfer.py does not calculate ETA... [08:45:01] federico3: then maybe extend the downtime a bit more [08:45:08] to avoid paging [08:46:31] I don't know yet if we want to repool it now or clone it. Is the bug in transfer.py able to cause data corruption? 
(in theory it should be only reading from the source host, not make changes) [08:47:07] it shouldn't make data corruption on the source host, no [08:47:13] SRE, Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726 (cmooney) NEW p:Triage→Low [08:48:03] federico3: you should be fine to repool [08:48:34] SRE, Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11288728 (cmooney) [08:49:34] odd, pigz and nc terminated by themselves [08:50:34] RECOVERY - mysqld processes #page on es2032 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:50:34] RECOVERY - MariaDB read only es1 on es2032 is OK: Version 10.11.13-MariaDB-log, Uptime 34s, read_only: True, event_scheduler: True, 24.45 QPS, connection latency: 0.032657s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:50:54] SRE, Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11288760 (cmooney) [08:52:34] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:56:10] (PS1) Brouberol: airflow: enable the triggerer to hit the Kubernetes API servers with appropriate permissions [deployment-charts] - https://gerrit.wikimedia.org/r/1197207 (https://phabricator.wikimedia.org/T406958) [08:59:57] (CR) Kevin Bazira: [C:+1] airflow: enable the triggerer to hit the Kubernetes API servers with appropriate permissions [deployment-charts] - https://gerrit.wikimedia.org/r/1197207 (https://phabricator.wikimedia.org/T406958) (owner: Brouberol)
[09:00:38] (CR) Brouberol: [C:+2] airflow: enable the triggerer to hit the Kubernetes API servers with appropriate permissions [deployment-charts] - https://gerrit.wikimedia.org/r/1197207 (https://phabricator.wikimedia.org/T406958) (owner: Brouberol) [09:03:09] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [09:05:56] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2032.codfw.wmnet [09:05:57] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2032.codfw.wmnet [09:06:16] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2032 gradually with 4 steps - Pooling in [09:07:29] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [09:07:54] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [09:12:06] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [09:16:57] (PS1) Jgiannelos: mw-experimental: Fix motd so user with wikidev permissions can restart the timers [puppet] - https://gerrit.wikimedia.org/r/1197210 [09:18:16] (PS2) Jgiannelos: mw-experimental: Fix motd for users with wikidev permissions [puppet] - https://gerrit.wikimedia.org/r/1197210 [09:27:55] (PS1) Marostegui: db2247: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197211 (https://phabricator.wikimedia.org/T406551) [09:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:31:13] (CR) Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: Krinkle) [09:33:50] (CR)
Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: Krinkle) [09:34:33] (CR) Marostegui: [C:+2] db2247: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197211 (https://phabricator.wikimedia.org/T406551) (owner: Marostegui) [09:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:37:12] (PS1) Marostegui: db2247: Add to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197214 (https://phabricator.wikimedia.org/T406551) [09:38:23] (CR) Marostegui: [C:+2] db2247: Add to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197214 (https://phabricator.wikimedia.org/T406551) (owner: Marostegui) [09:42:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2247 to dbctl T406551', diff saved to https://phabricator.wikimedia.org/P84106 and previous config saved to /var/cache/conftool/dbconfig/20251020-094207-marostegui.json [09:42:12] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [09:42:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 1%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84107 and previous config saved to /var/cache/conftool/dbconfig/20251020-094212-root.json [09:44:29] (PS1) Federico Ceratto: es2055.yaml: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) [09:45:02] (CR) Marostegui: [C:+1] "All green in icinga" [puppet] -
https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [09:45:35] (PS2) Federico Ceratto: es2055.yaml, instances.yaml: prepare es2055 [puppet] - https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) [09:46:07] (CR) Marostegui: [C:+1] es2055.yaml, instances.yaml: prepare es2055 [puppet] - https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [09:51:43] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2032 gradually with 4 steps - Pooling in [09:54:32] (CR) FNegri: [C:+2] docker::network allow custom MTU value [puppet] - https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: FNegri) [09:57:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 5%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84109 and previous config saved to /var/cache/conftool/dbconfig/20251020-095718-root.json [09:57:23] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1000) [10:01:42] (CR) Federico Ceratto: [C:+2] es2055.yaml, instances.yaml: prepare es2055 [puppet] - https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [10:04:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Add es2055 T402859', diff saved to https://phabricator.wikimedia.org/P84110 and previous config saved to /var/cache/conftool/dbconfig/20251020-100419-fceratto.json [10:04:24] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [10:04:33] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2055.codfw.wmnet [10:04:34] !log fceratto@cumin1003 END (PASS) - Cookbook
sre.hosts.remove-downtime (exit_code=0) for es2055.codfw.wmnet [10:10:28] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2055 gradually with 4 steps - Pooling in new host [10:12:24] gah! sorry folks, mixed up the times for that deployment I had scheduled - will schedule for this afternoon instead [10:12:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 7%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84111 and previous config saved to /var/cache/conftool/dbconfig/20251020-101224-root.json [10:12:29] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [10:14:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: Cparle) [10:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:20:24] (CR) Hnowlan: [C:+1] [DNM] Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: Aaron Schulz) [10:20:31] (CR) Hnowlan: [C:+1] Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1190743 (https://phabricator.wikimedia.org/T396805) (owner: Aaron Schulz) [10:27:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 10%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84112 and previous config saved to /var/cache/conftool/dbconfig/20251020-102730-root.json [10:27:35] T406551: Productionize
db224[5-8] - https://phabricator.wikimedia.org/T406551 [10:30:17] (PS1) Effie Mouzeli: mw-experimental-mediawiki-image-update: support environment in release [puppet] - https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110) [10:31:52] (CR) Effie Mouzeli: [C:+1] mw-experimental-mediawiki-image-update: support environment in release [puppet] - https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110) (owner: Effie Mouzeli) [10:32:11] (CR) Effie Mouzeli: [C:+1] mw-experimental: Fix motd for users with wikidev permissions [puppet] - https://gerrit.wikimedia.org/r/1197210 (owner: Jgiannelos) [10:33:19] (CR) Jgiannelos: [C:+1] mw-experimental-mediawiki-image-update: support environment in release [puppet] - https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110) (owner: Effie Mouzeli) [10:34:04] (CR) Effie Mouzeli: [C:+2] mw-experimental: Fix motd for users with wikidev permissions [puppet] - https://gerrit.wikimedia.org/r/1197210 (owner: Jgiannelos) [10:34:18] (CR) Effie Mouzeli: [C:+2] mw-experimental-mediawiki-image-update: support environment in release [puppet] - https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110) (owner: Effie Mouzeli) [10:42:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 20%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84114 and previous config saved to /var/cache/conftool/dbconfig/20251020-104236-root.json [10:42:39] SRE, Infrastructure-Foundations, Mail: Sendmail network error (deployment) - https://phabricator.wikimedia.org/T407723#11289002 (Aklapper) [10:42:41] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [10:42:56] sre-alert-triage, SRE Observability: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) -
https://phabricator.wikimedia.org/T407484#11289007 (tappof) a:tappof [10:48:17] (PS1) Marostegui: db1219: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1197228 (https://phabricator.wikimedia.org/T407463) [10:49:11] (CR) Marostegui: [C:+2] db1219: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1197228 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui) [10:49:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1219.eqiad.wmnet with reason: Maintenance [10:50:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1219 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84115 and previous config saved to /var/cache/conftool/dbconfig/20251020-105002-marostegui.json [10:53:36] (PS1) Effie Mouzeli: proxoid: fix healthchecks [puppet] - https://gerrit.wikimedia.org/r/1197230 (https://phabricator.wikimedia.org/T407615) [10:57:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 25%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84117 and previous config saved to /var/cache/conftool/dbconfig/20251020-105742-root.json [10:57:48] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [10:57:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84118 and previous config saved to /var/cache/conftool/dbconfig/20251020-105754-root.json [10:58:04] (PS1) Slyngshede: P::cache::haproxy enable x-is-browser everywhere [puppet] - https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) [11:02:40] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7302/console" [puppet] - https://gerrit.wikimedia.org/r/1197231
(https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [11:02:56] (CR) Hnowlan: [C:+1] "I think this is enough of a general concern for SRE at large (and beyond) that keeping SRE as the team here makes sense to me." [puppet] - https://gerrit.wikimedia.org/r/1196943 (https://phabricator.wikimedia.org/T407120) (owner: Tiziano Fogli) [11:06:44] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7303/console" [puppet] - https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [11:07:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:12:27] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7304/co" [puppet] - https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [11:12:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 30%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84120 and previous config saved to /var/cache/conftool/dbconfig/20251020-111248-root.json [11:12:53] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [11:13:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84121 and previous config saved to /var/cache/conftool/dbconfig/20251020-111300-root.json [11:16:43] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [11:18:06]
(PS1) Jelto: admin: remove legacy ssh key for jelto [puppet] - https://gerrit.wikimedia.org/r/1197233 (https://phabricator.wikimedia.org/T407606) [11:19:33] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Thanos [11:21:02] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2055 gradually with 4 steps - Pooling in new host [11:23:31] (CR) Vgutierrez: [C:+1] P::cache::haproxy enable x-is-browser everywhere [puppet] - https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [11:24:17] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 50%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84123 and previous config saved to /var/cache/conftool/dbconfig/20251020-112754-root.json [11:27:59] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [11:28:04] (CR) Vgutierrez: [C:+1] "tested against the 4 realservers using `curl --connect-to ::$(dig +short hcaptcha1001.wikimedia.org):4260 https://hcaptcha.wikimedia.org/h" [puppet] - https://gerrit.wikimedia.org/r/1197230 (https://phabricator.wikimedia.org/T407615) (owner: Effie Mouzeli) [11:28:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84124 and previous config saved to /var/cache/conftool/dbconfig/20251020-112806-root.json [11:29:17] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:06] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11289124 (Jclark-ctr) a:Jclark-ctr [11:31:22] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11289127 (Jclark-ctr) Open→Resolved [11:33:05] (PS1) Federico Ceratto: site.pp, es2056.yaml, preseed.yaml: Prepare es2056 for es2 [puppet] - https://gerrit.wikimedia.org/r/1197238 (https://phabricator.wikimedia.org/T402859) [11:34:47] (CR) Fabfur: [C:+1] P::cache::haproxy enable x-is-browser everywhere [puppet] - https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [11:39:57] (CR) Slyngshede: [V:+1 C:+2] P::cache::haproxy enable x-is-browser everywhere [puppet] - https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [11:42:44] (PS1) Majavah: admin: home: Add mux alias for taavi [puppet] - https://gerrit.wikimedia.org/r/1197240 [11:43:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 60%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84125 and previous config saved to /var/cache/conftool/dbconfig/20251020-114300-root.json [11:43:04] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [11:43:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84126 and previous config saved to /var/cache/conftool/dbconfig/20251020-114312-root.json [11:44:08] (CR) Marostegui: [C:+1]
site.pp, es2056.yaml, preseed.yaml: Prepare es2056 for es2 [puppet] - https://gerrit.wikimedia.org/r/1197238 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [11:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:45:39] (CR) Federico Ceratto: [C:+2] site.pp, es2056.yaml, preseed.yaml: Prepare es2056 for es2 [puppet] - https://gerrit.wikimedia.org/r/1197238 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [11:47:04] (CR) Hnowlan: [C:+1] Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: Aaron Schulz) [11:48:57] (CR) Hnowlan: [C:+1] "I can get this one out for you today if you'd like."
[puppet] - https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: Aaron Schulz) [11:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:52:43] !log add cloudcephosd1051 to the cluster via wmcs.ceph.osd.bootstrap_and_add - T405478 [11:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:48] T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478 [11:54:01] SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1): Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11289156 (fgiunchedi) >>! In T405478#11288584, @dcaro wrote: > Nice! I'm eager to see the results of adding it...
[11:56:19] ops-eqiad, SRE, DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11289159 (Jclark-ctr) a:Jclark-ctr→Eevans [11:58:05] (CR) Brouberol: "I think we won't need to, cf the WIP work in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1196700" [deployment-charts] - https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) (owner: Btullis) [11:58:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 75%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84127 and previous config saved to /var/cache/conftool/dbconfig/20251020-115805-root.json [11:58:11] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [12:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.169s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:05:52] Is it just me or does gerrit feel slow?
[12:06:15] Like refreshing the page gets a slow response and my last attempt gets a `ERR_CONNECTION_RESET` error [12:06:56] (PS1) Majavah: toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) [12:06:58] (PS1) Majavah: toolforge: toolviews: Add initial HAProxy support [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) [12:07:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.103s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:07:36] (CR) CI reject: [V:-1] toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:07:42] (CR) CI reject: [V:-1] toolforge: toolviews: Add initial HAProxy support [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:08:45] Gerrit seems to be back to normal for me now [12:09:03] (PS1) Filippo Giunchedi: cloudceph: set mtu only when interfaces exist [puppet] - https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478) [12:12:20] (PS2) Majavah: toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) [12:12:24] (PS2) Majavah: toolforge: toolviews: Add initial HAProxy support [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) [12:13:12] !log marostegui@cumin1003
dbctl commit (dc=all): 'db2247 (re)pooling @ 100%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84128 and previous config saved to /var/cache/conftool/dbconfig/20251020-121311-root.json [12:13:17] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [12:13:54] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:14:33] !log ozge@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:15:47] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7306/co" [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:35:37] (CR) Slyngshede: [C:+1] admin: remove legacy ssh key for jelto [puppet] - https://gerrit.wikimedia.org/r/1197233 (https://phabricator.wikimedia.org/T407606) (owner: Jelto) [12:36:23] (CR) Marostegui: "This is an interesting discussion, and I understand both sides" [puppet] - https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [12:41:33] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2056.codfw.wmnet with reason: Setting up new ES host [12:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:43:46] (CR) Filippo Giunchedi: "LGTM, see also inline" [puppet] -
https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:44:56] (CR) Filippo Giunchedi: [C:+1] toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:50:40] (PS3) Majavah: toolforge: toolviews: Add initial HAProxy support [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) [12:50:53] (CR) Majavah: toolforge: toolviews: Add initial HAProxy support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:51:25] (CR) Majavah: [C:+2] toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:51:46] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7307/co" [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:51:53] (CR) Filippo Giunchedi: [C:+1] toolforge: toolviews: Add initial HAProxy support [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:52:04] (CR) Majavah: [V:+1 C:+2] toolforge: toolviews: Add initial HAProxy support [puppet] - https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [12:52:18] (CR) Majavah: [C:+2] admin: home: Add mux alias for taavi [puppet] - https://gerrit.wikimedia.org/r/1197240 (owner: Majavah) [12:52:34] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate -
TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:53:21] (CR) Kamila Součková: [C:+1] "Thank you Effie!" [puppet] - https://gerrit.wikimedia.org/r/1197230 (https://phabricator.wikimedia.org/T407615) (owner: Effie Mouzeli) [12:55:34] (CR) Kamila Součková: "Not really needed given Ie8a088958116fd9db24c3c678540f3dc3ff65281 ." [puppet] - https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) (owner: Kamila Součková) [12:57:21] (CR) Jelto: [C:+2] admin: remove legacy ssh key for jelto [puppet] - https://gerrit.wikimedia.org/r/1197233 (https://phabricator.wikimedia.org/T407606) (owner: Jelto) [12:57:36] (CR) Majavah: "q: Is there a risk of an ordering issue here where the MTU is not set at all? i.e. is it fine to not run the command, or should this have " [puppet] - https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478) (owner: Filippo Giunchedi) [12:59:32] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11289265 (Jclark-ctr) After discussing this with @cmooney over IRC, I reviewed the moves on the Eqiad side and noted that we had one fr... [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1300). [13:00:05] edsanders, bpirkle, sergi0, seanleong-wmde, phuedx, and cormacparle: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:09] o/ [13:00:10] I can self deploy [13:00:13] o/ [13:00:26] o/ [13:00:52] edsanders: go ahead :) [13:01:04] erm ...
my wikimedia debug extension says "unspecified backend" [13:01:07] (looks like Flow backport CI is pretty fast, so no need to put a config change ahead of it I think) [13:01:08] is this expected? [13:01:24] (CR) TrainBranchBot: [C:+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196884 (https://phabricator.wikimedia.org/T407357) (owner: Esanders) [13:01:24] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone_es of es2033.codfw.wmnet onto es2056.codfw.wmnet [13:01:25] cormacparle: are you on a WMF production domain? [13:01:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2033 - Depool es2033.codfw.wmnet to then clone it to es2056.codfw.wmnet - fceratto@cumin1003 [13:01:34] (the dropdown contents change depending on which domain you’re on) [13:01:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2033 - Depool es2033.codfw.wmnet to then clone it to es2056.codfw.wmnet - fceratto@cumin1003 [13:02:05] o/ [13:02:21] Lucas_WMDE: no, on beta [13:02:30] (which seems to be down :( ) [13:02:40] it's just a beta config change I want to deploy [13:02:56] you can’t use WikimediaDebug on beta afaik [13:03:09] the config change will just be deployed, and ca.
10 minutes later you can check if it worked or not [13:03:14] aha ok grand [13:03:21] (beta WFM) [13:04:03] (03Merged) 10jenkins-bot: Follow-up I6698875: Set insert-ignore on all insert queries [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196884 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [13:04:23] (03CR) 10Filippo Giunchedi: "I'm not aware of ordering issues no, if the interface is down when interface::setting runs then mtu will be set the next time the interfac" [puppet] - 10https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [13:04:48] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1196884|Follow-up I6698875: Set insert-ignore on all insert queries (T407357)]] [13:04:48] fceratto@cumin1003 clone_es (PID 1381498) is awaiting input [13:04:53] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [13:09:41] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Can confirm that this is unused in wmf.23:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [13:10:06] once the current deploy is done I think we can do the changes for bpirkle, sergi0 and cormacparle together [13:10:15] one actual change, one cleanup that should be a no-op, and one beta change [13:10:26] sounds good to me [13:10:40] 👍 [13:11:26] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1197247 (owner: 10L10n-bot) [13:15:24] scap is taking a while building those container images [13:20:40] “Waiting 300 seconds for swift after full mediawiki image build (T390251)” [13:20:40] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [13:20:49] (that was 13:19:12 UTC) [13:20:53] yeah [13:21:26] not sure why it was a full image build, your backport doesn’t include i18n changes [13:21:55] maybe because it’s the first backport this week? there was an earlier window this morning but it seemingly only deployed config changes, maybe that’s different [13:24:37] hmm - finished now at least [13:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:29:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11289373 (10cmooney) >>! In T405579#11289265, @Jclark-ctr wrote: > After discussing this with @cmooney over IRC, I reviewed the moves on... [13:30:05] !log esanders@deploy2002 esanders: Backport for [[gerrit:1196884|Follow-up I6698875: Set insert-ignore on all insert queries (T407357)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:30:09] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [13:30:26] !log esanders@deploy2002 esanders: Continuing with sync [13:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:37:49] Sorry I'm late [13:37:56] o/ [13:38:08] no worries, you didn’t miss anything yet ^^ [13:38:12] we’re still in the first deployment [13:38:33] That is both good and bad [13:38:38] (: [13:39:00] (kinda tempted to !bash that, with timestamps, ngl) [13:39:30] * phuedx reads the scrollback [13:39:46] D: [13:41:00] Lucas_WMDE Hii, the config changes will be at the last? [13:41:21] I was planning to do the config changes for bpirkle, sergi0 and cormacparle together next [13:41:30] and then yours and that by phuedx afterwards, not yet sure if together or separately [13:42:01] actually, is sergi0 around? [13:42:25] Mine is a NOOP. 
It can be bundled [13:43:05] ok [13:43:24] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196884|Follow-up I6698875: Set insert-ignore on all insert queries (T407357)]] (duration: 38m 36s) [13:43:28] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [13:44:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) (owner: 10BPirkle) [13:44:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [13:44:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [13:44:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: 10Cparle) [13:45:59] (03Merged) 10jenkins-bot: Enable REST Sandbox on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) (owner: 10BPirkle) [13:46:01] (03Merged) 10jenkins-bot: Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [13:46:26] (03Merged) 10jenkins-bot: MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [13:46:28] (03Merged) 10jenkins-bot: 
Enable Special:EditWatchlist pagination on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: 10Cparle) [13:46:46] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1196492|Enable REST Sandbox on all wikis (T389409)]], [[gerrit:1192913|Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled (T396382)]], [[gerrit:1196857|MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames (T406332)]], [[gerrit:1196703|Enable Special:EditWatchlist pagination on beta (T41510)]] [13:46:56] T389409: Release REST API Sandbox on all remaining wikis - https://phabricator.wikimedia.org/T389409 [13:46:56] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [13:46:57] T406332: Make XLAB_STREAMS allowlist configurable - https://phabricator.wikimedia.org/T406332 [13:46:57] T41510: Opening Special:EditWatchlist with a large watchlist hits server timeout (Create watchlist pager) - https://phabricator.wikimedia.org/T41510 [13:49:16] (03PS8) 10Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 [13:49:33] (03CR) 10Federico Ceratto: "(see comments)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [13:51:18] !log lucaswerkmeister-wmde@deploy2002 sgimeno, bpirkle, phuedx, lucaswerkmeister-wmde, cparle: Backport for [[gerrit:1196492|Enable REST Sandbox on all wikis (T389409)]], [[gerrit:1192913|Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled (T396382)]], [[gerrit:1196857|MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames (T406332)]], [[gerrit:1196703|Enable Special:EditWatchlist paginati [13:51:18] on on beta (T41510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:51:40] sergi0, bpirkle, phuedx: please test :) [13:52:39] Mine looks good, thank you! [13:54:30] Lucas_WMDE: LGTM. As I said, it's a NOP. I did take a moment to confirm the name though :) [13:54:39] ok :) [13:54:59] !log enable 2x40G lag from asw2-c-eqiad to ssw1-dX-eqiad T405579 [13:55:01] sergi0’s should be a no-op as well but I wouldn’t mind if he could confirm it ^^ [13:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:03] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [13:55:09] but otherwise I’ll just click the “yes” button in a moment [13:55:42] (PS1) Majavah: toolforge: toolviews: Fix parsing HAProxy logs [puppet] - https://gerrit.wikimedia.org/r/1197270 (https://phabricator.wikimedia.org/T284558) [13:56:04] !log lucaswerkmeister-wmde@deploy2002 sgimeno, bpirkle, phuedx, lucaswerkmeister-wmde, cparle: Continuing with sync [13:56:08] Lucas_WMDE, is there still space for this backport to revert the qual and ref change? [13:56:25] there’s always a chance to deploy reverts that fix UBNs ;) [13:56:27] jouncebot: next [13:56:27] In 0 hour(s) and 33 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430) [13:56:35] and there’s a half-hour break before the next window, so sure [13:56:40] beta is still down so I can't test anything :/ [13:56:58] it’s still working for me [13:57:09] what does “down” look like?
[13:57:30] https://usercontent.irccloud-cdn.com/file/r2fCFMVm/image.png [13:57:48] (03CR) 10Majavah: [C:03+2] toolforge: toolviews: Fix parsing HAProxy logs [puppet] - 10https://gerrit.wikimedia.org/r/1197270 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [13:57:53] cormacparle: your IP might be blocked from beta heh [13:57:55] please see the bottom of the screen [13:58:06] (not included in the screenshot but I’m making an educated guess at what might be there :P) [13:58:22] (03PS1) 10Brouberol: deployment_server: create kubeconfigs to deploy postgresql-growthbook [puppet] - 10https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) [13:58:25] (what cdanis said) [13:58:28] (03PS1) 10Brouberol: cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578) [13:58:30] (03PS1) 10Brouberol: Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578) [13:58:46] Error: 403, Requests from your IP have been blocked, please see https://wikitech.wikimedia.org/wiki/Beta/Blocked for more information. at Mon, 20 Oct 2025 13:57:53 GMT [13:58:48] hah! [13:58:51] ok [13:59:02] yeah, that :) [13:59:04] <_joe_> cormacparle: you naughty boy what did you do with beta to get banned? [13:59:04] how do I get unblocked? [13:59:13] I would start from that link :-) [13:59:14] * cormacparle looks innocent [14:00:25] okay, doing the revert and scheduling it now Lucas_WMDE, thanks [14:00:36] alright, thanks! [14:01:57] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [14:02:31] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196492|Enable REST Sandbox on all wikis (T389409)]], [[gerrit:1192913|Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled (T396382)]], [[gerrit:1196857|MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames (T406332)]], [[gerrit:1196703|Enable Special:EditWatchlist pagination on beta (T41510)]] (duration [14:02:31] : 15m 45s) [14:02:40] T389409: Release REST API Sandbox on all remaining wikis - https://phabricator.wikimedia.org/T389409 [14:02:40] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [14:02:41] T406332: Make XLAB_STREAMS allowlist configurable - https://phabricator.wikimedia.org/T406332 [14:02:41] T41510: Opening Special:EditWatchlist with a large watchlist hits server timeout (Create watchlist pager) - https://phabricator.wikimedia.org/T41510 [14:04:16] (backport+config window is still open, waiting to deploy a Wikibase revert) [14:04:44] np! Lucas_WMDE, regarding the feature flag for visual changes, the patch is here https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1196896 and will be +2 today for the train tmr. Is it possible to deploy the config change now or do you prefer it tmr? 
[14:05:20] Currently waiting [14:05:20] Thank you @Lucas_WMDE [14:05:43] seanleong-wmde: config changes should only be deployed once the code using the config has rolled out with the train [14:05:51] > (backport+config window is still open, waiting to deploy a Wikibase revert) [14:05:51] Currently waiting* for the tests to pass and will be on it's way to backport [14:05:56] so that any potential issues can be checked when the config change is deployed, and not when the train rolls out [14:06:01] 10ops-eqiad, 06SRE, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11289616 (10Eevans) >>! In T407414#11285096, @Jclark-ctr wrote: > @Eevans are you able to reimage the server i have had no luck due to no root partition error. and preseed file has -efi for raid configuration for a s... [14:06:19] seanleong-wmde: I would cherry-pick https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197274 to the wmf branch and +2 it immediately [14:06:33] (it’ll still have to go through CI there and that will take long enough anyway. no need to wait for that on the master branch imho) [14:07:44] 10ops-eqiad, 06SRE, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11289657 (10Eevans) >>! In T407414#11289616, @Eevans wrote: >>>! In T407414#11285096, @Jclark-ctr wrote: >> @Eevans are you able to reimage the server i have had no luck due to no root partition error. and preseed fi... [14:09:05] !log cleaning up IPVS leftovers from HTTPS migration of wdqs-internal services - T193473 [14:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:10] T193473: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473 [14:10:03] got it, for the config we will schedule another backport afterwards, for the cherry pick Lucas_WMDE, to this branch wmf/1.45.0-wmf.23? 
[14:10:22] yes [14:10:34] jouncebot: nowandnext [14:10:34] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [14:10:34] In 0 hour(s) and 19 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430) [14:10:43] hnowlan: I’m about to deploy a Wikibase revert [14:10:53] (PS1) Neslihan Turan: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1197276 (https://phabricator.wikimedia.org/T401290) [14:10:56] Lucas_WMDE: ack, no worries [14:11:15] (CR) Hnowlan: [C:+1] "I think this looks good to go. Let me know when you'd like to try the rollout." [deployment-charts] - https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: Daniel Kinzler) [14:11:22] let’s try it [14:11:31] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1197276 (https://phabricator.wikimedia.org/T401290) (owner: Neslihan Turan) [14:13:09] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:13:27] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:13:43] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:16:09] Lucas_WMDE I can't add more patches to this timeslot https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197276 [14:16:39] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1300 [14:16:44] you can edit the wiki page manually [14:16:57] (sorry, those messages were supposed to be the other way around but my IRC client ate them)
[14:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:18:25] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D DC Ops host migrations - https://phabricator.wikimedia.org/T405021#11289791 (10Jclark-ctr) T405560 2 servers where racked previously on this ticket and are cabled to nokia switches [14:21:29] hahaha no worries, added now, thanks [14:21:37] Lucas_WMDE o7 [14:21:43] nice, thanks! [14:22:03] (03CR) 10Krinkle: [C:03+1] Add virtual domain mapping for OAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [14:27:29] (03Merged) 10jenkins-bot: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197276 (https://phabricator.wikimedia.org/T401290) (owner: 10Neslihan Turan) [14:27:50] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197276|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] [14:27:57] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [14:27:58] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [14:27:58] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [14:28:43] let’s see how it goes [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430) [14:31:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: 
Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11289937 (10bking) {F66767261} Thanks Luca, I'm learning a lot about the process. A few more questions. > If you are reimaging a node... [14:31:24] I’m still deploying, sorry xLab’ers [14:31:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, neslihanturan: Backport for [[gerrit:1197276|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:32:07] seanleong-wmde: please test! [14:32:10] * Lucas_WMDE also looks [14:32:30] https://fi.wikipedia.org/wiki/Vantaa looks okay again on WikimediaDebug, phew [14:33:02] In the meantime, Lucas_WMDE, our config change for visual change will be on hewiki, cawiki (group1), ukwiki (group2), in this case can we schedule the config deployment on this Thursday? [14:33:03] (it’s that place what where lentokenttä is!) [14:33:10] Lucas_WMDE testing now [14:33:56] https://no.wikipedia.org/wiki/Roberta_Williams also has four references on WikimediaDebug [14:34:16] hm, nevermind, it also has four references without it (even after purging) [14:34:35] ah, they worked around it https://phabricator.wikimedia.org/T407684#11287349 [14:35:17] okay, with those instructions I can see a difference between WikimediaDebug and normal [14:35:49] yea they change it from ipairs to pairs [14:35:56] but it's working fine now [14:36:08] looks like it [14:36:15] okay to continue? or do you want to test anything else? 
[14:36:16] I think it's due to the schema change of the table masking [14:36:27] okay to continue [14:36:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, neslihanturan: Continuing with sync [14:36:49] thanks for helping with the tests as well Lucas_WMDE o/ [14:36:52] and about the config change, I think it would be okay to do it on Wednesday (you just wouldn’t be able to test it on ukwiki then) [14:37:03] got it [14:37:28] also depends on whether the train happens at 10:00 or 20:00 CEST this week, I guess [14:37:32] I never know how to tell [14:37:38] both windows are in the deployment calendar and idk which is the “real” one [14:38:43] got it! we will schedule it accordingly this week [14:38:46] (03PS1) 10Dreamy Jazz: Define CheckUser SuggestedInvestigations event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177) [14:38:47] o_O scap died, what [14:38:53] canary checks failed [14:38:59] retrying them… [14:39:19] oh no [14:39:34] Top 1 errors: InvalidArgumentException: $aspect must use one of the XXX_USAGE constants, "CQR" given [14:39:56] that’s bad news [14:40:23] yea, because we introduced a new aspect to the DB [14:40:26] oh no [14:40:30] (03PS2) 10Dreamy Jazz: Define CheckUser Suggested Investigations event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177) [14:40:36] oh, to the database! [14:40:39] ah fuck [14:41:00] yea C is further granularized to C and CQR [14:41:20] oh god that’s already 1070 hits in logstash [14:41:33] across all sorts of wikis [14:41:41] shit [14:41:45] I don’t think we can deploy that then [14:42:05] can we stop the deployment now? 
[14:42:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:42:37] (PS1) Lucas Werkmeister (WMDE): Restore "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1197281 [14:42:37] we can only fix the patch now [14:42:55] reverting will only work if we retouch all the affected pages [14:42:57] (CR) LSobanski: "Approved in the IF meeting." [puppet] - https://gerrit.wikimedia.org/r/1196090 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney) [14:43:04] (PS2) Lucas Werkmeister (WMDE): Restore "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290) [14:43:16] (CR) TrainBranchBot: [C:+2] "Copied votes on follow-up patch sets have been updated:" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290) (owner: Lucas Werkmeister (WMDE)) [14:43:19] sorry Lucas_WMDE [14:43:42] I’m reverting the revert [14:43:46] (CR) TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290) (owner: Lucas Werkmeister (WMDE)) [14:43:56] because right now the revert is still sitting on the canary servers, soaking up user traffic and causing errors [14:44:05] so that’s my top priority right now [14:44:08] okay [14:44:22] meanwhile, please try to put together a version of the revert that 
won’t have this InvalidArgumentException [14:44:54] probably still most of the revert code, but some code that reads the usage from the DB, whenever it sees "CQR", just, idk, ignore it or something [14:45:00] and then we can try rolling that out [14:45:38] or "retouch all the affected pages" as you said [14:45:48] but I’m skeptical that that’s realistic [14:45:49] retouch is probably not possible [14:45:54] seemed to affect a lot of pages looking at logstash [14:45:54] yeah [14:45:55] will do the first suggestion [14:46:01] thank you [14:46:04] then patch it back asap afterwards [14:46:11] sorry for the inconvenience [14:46:21] (CR) Lucas Werkmeister (WMDE): [V:+2 C:+2] "skipping gate-and-submit" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290) (owner: Lucas Werkmeister (WMDE)) [14:46:33] (PS1) Majavah: toolforge: toolviews: Ignore requests for *.svc.toolforge.org [puppet] - https://gerrit.wikimedia.org/r/1197283 [14:46:49] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197281|Restore "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] [14:46:57] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [14:46:57] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [14:46:57] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [14:47:07] (PS1) Scott French: hieradata: enable analytics-web listener in mediawiki [puppet] - https://gerrit.wikimedia.org/r/1196733 (https://phabricator.wikimedia.org/T309738) [14:47:09] (PS1) Scott French: hieradata: allow access to analytics-web from wikikube [puppet] - 
https://gerrit.wikimedia.org/r/1196734 (https://phabricator.wikimedia.org/T309738) [14:47:10] (PS1) Scott French: mw-*: update network policy for access to analytics-web [deployment-charts] - https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738) [14:47:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:47:23] that ^ *might* be me [14:47:34] *looks at logstash* [14:47:36] oh god oh fuck [14:47:37] yeah definitely [14:47:41] fix is already rolling out [14:47:51] at https://spiderpig.wikimedia.org/jobs/776 [14:48:02] what a day [14:48:33] why does scap not have an option “yes, the canary servers were correct, this code should be immediately undeployed, please roll back to the previous replicaset of the deployment” [14:48:34] (CR) CDanis: [C:+1] multirootca: add the client auth usage to the dse_k8s discovery issuer profile [puppet] - https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) (owner: Brouberol) [14:49:18] that logstash volume is *just from the canary servers* [14:49:21] (I think) [14:49:39] yeah the Top Hosts table all says mw-api-int.codfw.canary-[hex] [14:50:13] because that wasn't possible pre-mw-on-k8s ('previous' deployment was not a thing then), and I guess no-one implemented an easy option for that afterwards [14:50:24] yeah looks like it [14:50:37] achievement unlocked: make logstash alert on quantity with only canary logs :P [14:50:42] (congratulations.) [14:50:47] /o\ [14:50:55] :blobfoxnotlikethis: [14:50:59] interesting that it's across everything (-web, -api-ext, -api, even -jobrunner) [14:51:02] I can haz sticker?
[14:51:05] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1197281|Restore "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:51:14] just waiting for the testservers check [14:51:25] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:51:25] I’m not manually testing this, I’ll just [14:51:27] trust the revert [14:51:35] * ihurbain gives a sticker and a :pat: :pat to Lucas_WMDE [14:51:55] curious what the canaries will say now [14:53:08] they were happy! [14:53:16] sync-prod-k8s is running [14:53:35] “Counted 0 error(s) in the last 20 seconds.” X doubt [14:53:39] (I guess it means 0 *new* errors ^^) [14:53:44] (03PS1) 10Esanders: Follow-up Iedb6361: Set insert-ignore on all insertSelect queries [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) [14:54:34] (https://spiderpig.wikimedia.org/jobs/775 is an interesting scap crash btw, I’ll report that later) [14:54:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [14:54:56] volume appears to be going down again [14:55:31] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197281|Restore "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] (duration: 08m 43s) [14:55:40] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [14:55:40] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - 
https://phabricator.wikimedia.org/T407684 [14:55:40] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [14:55:47] right. [14:55:50] * Lucas_WMDE looks at alerts [14:56:08] * Lucas_WMDE does not understand the alerts website [14:56:42] https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate is empty, but jinxer-wm didn’t say anything about it resolving yet… [14:57:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:57:19] yay [14:57:33] jouncebot: nowandnext [14:57:33] For the next 0 hour(s) and 2 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430) [14:57:33] In 0 hour(s) and 32 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530) [14:57:59] so. ideally we’d still deploy a version of that revert which won’t cause a flood of production errors [14:58:13] but I don’t know how long it would take to put that version of the change together [14:58:54] 06SRE, 10SRE-Access-Requests: Enroll Jeltos YubiKey for production access - https://phabricator.wikimedia.org/T407606#11290070 (10Jelto) 05Open→03Resolved p:05Triage→03Medium My new FIDO ssh key was added and works and the old ssh key was removed. I'll resolve the task. 
[14:59:16] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11290073 (10LSobanski) p:05Triage→03Medium [15:03:19] 06SRE, 06Infrastructure-Foundations, 10netops: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11290097 (10cmooney) p:05Triage→03Low a:03cmooney Gonna leave this a few days before closing, we've had a few fla... [15:03:31] posted a summary at https://phabricator.wikimedia.org/T407684#11290101 [15:07:12] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:18] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:08:29] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:39] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding franio2004 to codfw - jhancock@cumin1003" [15:11:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding franio2004 to codfw - jhancock@cumin1003" [15:11:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:11:55] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [15:11:58] Lucas_WMDE Hi sry, got disconnected, we did a quick temp fix patch, pushing it now [15:12:04] is it still 
possible? [15:12:19] I think so [15:12:20] jouncebot: nowandnext [15:12:20] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [15:12:21] In 0 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530) [15:12:35] I *think* mediawiki deploys don’t usually conflict with portals deploys [15:13:06] jan_drewniak: just checking, is it okay to deploy a MediaWiki backport (revert, hopefully fixes UBNs) even if it runs into the portals window? [15:13:31] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11290148 (10Jhancock.wm) [15:13:48] seanleong-wmde: can you link the change? (I also left a comment at https://phabricator.wikimedia.org/T407684#11290101, idk if you saw that yet) [15:14:30] (03CR) 10Scott French: "Many thanks for the follow-up on the task, Balthazar. If I could have your review on this when you get a chance, that would be greatly app" [puppet] - 10https://gerrit.wikimedia.org/r/1196734 (https://phabricator.wikimedia.org/T309738) (owner: 10Scott French) [15:14:34] Lucas_WMDE: yes, go ahead, I'm not planning a portal deployment this week [15:14:40] great, thank you :) [15:15:07] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [15:15:33] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [15:16:20] (03CR) 10Btullis: [C:03+1] cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [15:16:50] yes I just read it Lucas_WMDE, for now the new C usage will be remain as normal like last time but only the current CQR aspect currently in the DB will show as the new Ref and Aliases [15:17:12] ok [15:17:15] !log hnowlan@deploy1003 helmfile 
[staging] START helmfile.d/services/kartotherian: apply [15:17:54] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [15:18:09] (03CR) 10Btullis: [C:03+1] Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [15:18:37] created T407767 for the scap error I mentioned above btw [15:18:38] T407767: scap crash in SpiderPig job #775 (change was edited after creating job): TypeError: prompt_for_approval_or_exit() missing 1 required positional argument: 'exit_message' - https://phabricator.wikimedia.org/T407767 [15:19:12] (03CR) 10Btullis: "PCC failure appears unrelated, so +1 in principle from me." [puppet] - 10https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [15:20:39] will include that phab ticket into the revert patch as well [15:20:44] gimme a few more min [15:20:58] the scap one? no need imho, that could’ve happened with any change (I assume) [15:21:15] and thanks :) [15:21:34] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11290204 (10Jhancock.wm) a:03Jgreen @Jgreen this is ready for you. please lemme know if you need anything. 
[15:22:54] (03CR) 10CDanis: [C:03+2] varnish: WMF-Uniq -> Analytics: fix frequency bug [puppet] - 10https://gerrit.wikimedia.org/r/1196154 (https://phabricator.wikimedia.org/T405783) (owner: 10CDanis) [15:27:11] Lucas_WMDE nope, the CQR aspects introduction [15:27:53] yeah, but the fourth Bug: line in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197289/2 is not necessary IMHO [15:27:56] (the first three are useful) [15:28:00] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11290252 (10Raine) [15:28:08] the changes in there look good to me so far btw [15:28:29] (but it will need to be squashed into the parent change, at least for deployment) [15:28:45] 06SRE, 06Traffic-Icebox: Improve how we build the 'haproxy_allowed_healthcheck_sources' list of IPs - https://phabricator.wikimedia.org/T407769 (10cmooney) 03NEW p:05Triage→03Low [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530). [15:30:08] Lucas_WMDE can you guide on how to squash it? [15:30:25] yeah [15:30:31] I’m not sure if the Gerrit UI has an option for it [15:30:55] I would, in a local terminal, run something like `git rebase -i master` (assuming you’re currently on a branch with those changes) [15:31:11] and then, in the “todo list”, change the beginning of the second line from “pick” to “squash” [15:31:23] and then git should squash them together and let you edit the commit message [15:31:46] 06SRE, 06Traffic: Improve how we build the 'haproxy_allowed_healthcheck_sources' list of IPs - https://phabricator.wikimedia.org/T407769#11290283 (10ssingh) Thanks for filing this task! I think this is a good idea to reduce the manual updates to this list, and something we have failed to keep updated. We will...
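(editor's note) The squash steps described at 15:30:55 to 15:31:23 amount to editing the todo list that `git rebase -i` opens. A minimal sketch of that edit in Python follows, assuming the todo list uses git's usual `pick <hash> <subject>` lines; the function name `squash_todo` and the sample hashes are hypothetical and were not used in the channel. It marks every commit after the first as `squash`, which for the two-commit case discussed here is the same as changing the second line by hand.

```python
def squash_todo(todo_text: str) -> str:
    """Rewrite a `git rebase -i` todo list so that every commit after the
    first is squashed into it. Hypothetical helper, for illustration only."""
    out = []
    seen_first_pick = False
    for line in todo_text.splitlines():
        parts = line.split(maxsplit=1)
        # Only touch "pick" lines; blank lines and "#" comments pass through.
        if parts and parts[0] in ("pick", "p"):
            if seen_first_pick:
                rest = parts[1] if len(parts) > 1 else ""
                line = "squash " + rest
            seen_first_pick = True
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample todo list with made-up hashes and subjects.
    sample = "pick abc123 Revert change\npick def456 Temp fix\n"
    print(squash_todo(sample), end="")
```

After saving such a todo list, git combines the commits and opens an editor for the merged commit message, which is where the two pasted-together messages mentioned later at 15:43:36 would be tidied up.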
[15:33:51] (03PS5) 10Aaron Schulz: [DNM] Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) [15:34:17] (03PS6) 10Aaron Schulz: Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) [15:36:37] (03CR) 10Alexandros Kosiaris: [C:03+1] hieradata: enable analytics-web listener in mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1196733 (https://phabricator.wikimedia.org/T309738) (owner: 10Scott French) [15:37:25] (03CR) 10Alexandros Kosiaris: [C:03+1] "I am wondering whether this makes sense to put only in mw-cron specific yaml values files, but I am probably over thinking this?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738) (owner: 10Scott French) [15:38:29] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:39] Lucas_WMDE okay done [15:43:04] do we create a cherry pick now? [15:43:17] just a moment [15:43:36] the commit message shouldn’t be two commit messages pasted together ^^ [15:43:42] I’ll fix it locally [15:44:36] okay thanks, our changes is just adding back the lines in EntityUsage.php, but since it reverts the revert, so the file is now missing in the current patch [15:45:01] uploaded a new patch set at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197274 [15:45:03] (03CR) 10Aaron Schulz: "Sounds good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [15:45:09] and there we can see the diff https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197274/2..3/client/includes/Usage/EntityUsage.php [15:45:20] hm, I wonder if Gerrit will even let us cherry pick this [15:45:20] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [15:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:45:26] since there’s already a change with this Change-Id on the wmf.23 branch 🤔 [15:45:29] jouncebot: nowandnext [15:45:29] For the next 0 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530) [15:45:29] In 1 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700) [15:45:29] In 1 hour(s) and 14 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700) [15:45:36] let’s try it [15:45:50] nope [15:45:52] Could not perform action: Cherry-pick with Change-Id Ib6ddef47e577a413ccc11d9cca5f71973faaeae7 could not update the existing change 1197276 in destination branch refs/heads/wmf/1.45.0-wmf.23 of project mediawiki/extensions/Wikibase, because the change was closed (MERGED) [15:45:57] ok, new Change-Id then [15:46:00] just a heads-up, I am applying some changes to kartotherian which will only affect maps. 
I'll be keeping an eye but if you see anything weird maps-adjacent let me know [15:46:05] !log dancy@deploy2002 Installing scap version "4.215.0" for 2 host(s) [15:46:07] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [15:46:13] ack [15:47:00] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197294 (https://phabricator.wikimedia.org/T401290) [15:47:12] there’s our cherry-pick to deploy [15:47:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197294 (https://phabricator.wikimedia.org/T401290) (owner: 10Lucas Werkmeister (WMDE)) [15:47:52] !log dancy@deploy2002 Installation of scap version "4.215.0" completed for 2 hosts [15:47:55] I’ll let CI run normally on this, it’s not as urgent as the revert-revert earlier [15:48:39] okay [15:49:26] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: apply [15:50:05] 10ops-codfw, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772 (10phaultfinder) 03NEW [15:50:36] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [15:50:49] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs-codfw [15:51:06] (03PS2) 10DLynch: Edit check: fix some eslint warnings [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197295 [15:51:08] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS trixie [15:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:52:50] (03PS3) 10DLynch: Edit check: fix some eslint warnings [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) [15:54:24] FIRING: [4x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:43] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:55:15] I have a pretty urgent editing-fix that I'm going to deploy, if nobody has any objections: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1197295 [15:55:47] (Once Lucas_WMDE is done, I mean.) [15:57:14] * Lucas_WMDE looks [15:57:30] ack [15:59:52] (03CR) 10DLynch: "The commit message sounds very non-severe because the original patch didn't *realize* that it was fixing a bug which breaks editcheck pre-" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: 10DLynch) [16:00:22] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) - https://phabricator.wikimedia.org/T407484#11290430 (10tappof) I found that the certificates used by Prometheus to authenticate against Kubernetes are being renewed every hour. I believe the r... 
[16:01:16] (03CR) 10Lucas Werkmeister (WMDE): "And here I thought it meant something like “we’re accidentally showing lots of fake eslint warnings to people who are CodeMirror’ing on-wi" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: 10DLynch) [16:02:10] (03Merged) 10jenkins-bot: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197294 (https://phabricator.wikimedia.org/T401290) (owner: 10Lucas Werkmeister (WMDE)) [16:02:30] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197294|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] [16:02:37] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [16:02:37] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [16:02:38] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [16:05:53] (03CR) 10Hnowlan: Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [16:07:13] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1197294|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:07:23] seanleong-wmde: please test [16:07:24] Testing it now [16:07:36] https://en.wikipedia.org/w/index.php?title=Samuel_Freeman_(philosopher)&action=info doesn’t crash, which is promising [16:07:42] (I have to make a doctor appointment, so I will do my backport when I get back instead.) [16:07:48] good luck! [16:07:53] it even still shows “Some statements (with qualifiers and references)”, I guess the revert didn’t remove the i18n message [16:08:40] https://fi.wikipedia.org/wiki/Vantaa is fixed [16:08:44] nope, that's another patch, for the curr fix we just make sure that the current CQR entities will remain [16:08:55] I'll find some crashing stuff in the report to check [16:09:16] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:09:29] the “preview page with this template” bit looks like the ipairs() references issue is fixed too, so far so good [16:09:54] (03CR) 10Scott French: "I was wondering the same, yeah. 
In an ideal world, there would be a straightforward way to both enable the listener and open up the networ" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738) (owner: 10Scott French) [16:10:16] trying some URLs from logstash [16:11:06] hm, https://arz.wikipedia.org/w/rest.php/v1/page/1990_%D8%A8%D8%B7%D9%88%D9%84%D8%A9_%D8%A7%D9%88%D8%B1%D9%88%D8%A8%D8%A7_%D9%84%D8%A7%D9%84%D8%B9%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D9%88%D9%89_1500_%D9%85%D8%AA%D8%B1_%D8%B3%D9%8A%D8%AF%D8%A7%D8%AA/html shows “خطأ لوا في وحدة:External_links على السطر 843: bad argument #1 to [16:11:06] 'ipairs' (table expected, got nil).” [16:11:12] not sure what to make of that [16:11:43] ah I think that ticket have typo [16:11:50] but it seems to show the same thing without WikimediaDebug [16:11:53] if you copy and paste the pairs one [16:11:56] and also the message quickly vanishes on https://arz.wikipedia.org/wiki/1990_%D8%A8%D8%B7%D9%88%D9%84%D8%A9_%D8%A7%D9%88%D8%B1%D9%88%D8%A8%D8%A7_%D9%84%D8%A7%D9%84%D8%B9%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D9%88%D9%89_1500_%D9%85%D8%AA%D8%B1_%D8%B3%D9%8A%D8%AF%D8%A7%D8%AA [16:11:58] and just add an i manually [16:12:03] it should work [16:12:11] manually add an i before the pairs [16:12:40] ok, https://arz.wikipedia.org/wiki/1990_%D8%A8%D8%B7%D9%88%D9%84%D8%A9_%D8%A7%D9%88%D8%B1%D9%88%D8%A8%D8%A7_%D9%84%D8%A7%D9%84%D8%B9%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D9%88%D9%89_1500_%D9%85%D8%AA%D8%B1_%D8%B3%D9%8A%D8%AF%D8%A7%D8%AA?safemode=1 shows the lua error [16:12:48] I guess they have some site JS that hides lua errors by default 🤷 [16:12:59] but it happens with or without WikimediaDebug, so not the revert’s fault [16:14:12] okay, not sure about that issue [16:14:14] I think we should be good to go [16:14:22] but so far the bug report ones are fixed [16:14:24] I tried some more URLs and found no errors [16:14:29] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - 
https://phabricator.wikimedia.org/T407228#11290474 (10Raine) a:03Ahoelzl Assigning to @Ahoelzl for approval. [16:14:32] nothing in mwdebug logstash either [16:14:51] okay, let's go [16:15:00] (well, plenty of boring debug messages, like our two accounts being autocreated on arzwiki :P but no errors) [16:15:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [16:15:09] let’s roll [16:15:10] xD [16:15:41] (I’m slightly surprised I didn’t have an account yet, I thought I visited arzwiki before ^^) [16:16:01] * Lucas_WMDE now can’t read arzwiki without thinking of https://de.wikipedia.org/wiki/So_klingt%E2%80%99s_bei_uns_im_Arzgebirg [16:16:43] (03CR) 10Andrea Denisse: [C:03+2] alertmanager: Add Slack route for the rweb team [puppet] - 10https://gerrit.wikimedia.org/r/1196533 (https://phabricator.wikimedia.org/T406689) (owner: 10Andrea Denisse) [16:17:01] (03PS1) 10Tiziano Fogli: k8s/client_cert: adjust Prometheus certificate renewal timing [puppet] - 10https://gerrit.wikimedia.org/r/1197303 (https://phabricator.wikimedia.org/T407484) [16:18:24] wow, spread out over the past 24 hours, the $aspect error is actually less common than the one from T402548 [16:18:24] T402548: PHP Warning: DOMNode::appendChild(): Document Fragment is empty - https://phabricator.wikimedia.org/T402548 [16:18:37] that one has 5k, $aspect 4.6k [16:18:55] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-ulsfo and not P{cp4037*} and A:cp [16:18:56] anyway, nothing concerning in mediawiki-errors so far as this rolls out [16:19:13] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197294|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] (duration: 16m 43s) [16:19:14] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [16:19:21] T401290: Implement new usage types for 
qualifiers and references - https://phabricator.wikimedia.org/T401290 [16:19:21] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [16:19:21] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [16:19:38] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407774 (10phaultfinder) 03NEW [16:19:55] hahaha that sounds like a more serious issue [16:20:27] last occurrence of T407744 is at 16:16:08 UTC [16:21:25] fingers crossed [16:21:32] no more after the patch [16:22:19] what a great incident to start the week [16:22:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11290513 (10Dzahn) I just wanted to add that I still just see a logical conflict between 2 statements around this. The first is made by... [16:22:41] yeah [16:23:32] I think that qualifies you for one of those “I broke Wikipedia but then I fixed it” stickers (t-shirts?) but I have no idea where to get those [16:23:54] I'll stay for a bit to monitor, but thank you for the help Lucas_WMDE! appreciate it, it was a great journey o7 [16:24:15] thank you too! [16:24:16] I def will ask around [16:24:19] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775 (10phaultfinder) 03NEW [16:24:53] (03CR) 10Alexandros Kosiaris: [C:03+1] "ack and agreed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738) (owner: 10Scott French) [16:27:50] (03CR) 10Tiziano Fogli: "More details on the task." 
[puppet] - 10https://gerrit.wikimedia.org/r/1197303 (https://phabricator.wikimedia.org/T407484) (owner: 10Tiziano Fogli) [16:28:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [16:29:03] still nothing new in logstash, I’ll close the window [16:29:24] !log UTC afternoon backport+config window (belatedly, more or less) done [16:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:41] Kemayo: I’m done, feel free to deploy when you’re back :) [16:30:01] (03CR) 10Tiziano Fogli: "Yes, right." [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [16:32:00] Lucas_WMDE \o/ [16:32:15] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4038.ulsfo.wmnet [16:32:30] (03CR) 10Tiziano Fogli: [C:03+2] haproxy: enable nrpe2nodexp wrapper on haproxy_alive check [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [16:32:54] (03CR) 10Tiziano Fogli: [C:03+2] mariadb::proxy::master: enable nrpe2ndoexp wrapper on haproxy_failover [puppet] - 10https://gerrit.wikimedia.org/r/1196925 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [16:32:59] (03PS1) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [16:34:14] (03PS1) 10Majavah: toolforge: toolviews: Remove obsolete version check [puppet] - 10https://gerrit.wikimedia.org/r/1197305 (https://phabricator.wikimedia.org/T407750) [16:36:51] (03PS2) 10Tiziano Fogli: monitoring: enable nrpe2nodexp wrapper on _owned [puppet] - 10https://gerrit.wikimedia.org/r/1196943 (https://phabricator.wikimedia.org/T407120) [16:37:32] (03CR) 10Tiziano Fogli: [C:03+2] monitoring: enable nrpe2nodexp wrapper on _owned [puppet] - 10https://gerrit.wikimedia.org/r/1196943 
(https://phabricator.wikimedia.org/T407120) (owner: 10Tiziano Fogli) [16:37:54] (03PS1) 10Dzahn: zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) [16:38:11] (03CR) 10CI reject: [V:04-1] zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: 10Dzahn) [16:38:14] (03CR) 10CI reject: [V:04-1] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [16:39:00] (03PS1) 10Majavah: P:toolforge: Move toolviews processing to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1197308 (https://phabricator.wikimedia.org/T284558) [16:39:16] (03PS2) 10Dzahn: zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) [16:39:34] (03CR) 10CI reject: [V:04-1] zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: 10Dzahn) [16:40:38] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11290600 (10BCornwall) Thank you, all. :) This has been migrated and things should continue to behave as expected. If that's not true, please re-open this ticket so we can look into it! 
[16:40:43] (03CR) 10Marostegui: [C:03+1] clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [16:40:47] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11290601 (10BCornwall) 05In progress→03Resolved [16:41:44] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:44:17] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:44:44] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:45:01] ^ expected [16:45:44] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:46:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS trixie [16:46:54] (03PS3) 10Dzahn: zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) [16:48:48] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS trixie [16:53:44] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:53:52] ^ expected [16:54:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:54:52] 10ops-codfw, 06SRE, 06DC-Ops: 
Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290631 (10phaultfinder) [16:55:09] (03PS6) 10Ssingh: P:cache::haproxy: exempt releases.wikimedia.org from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) [16:55:56] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7308/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700) [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700). [17:00:44] (03CR) 10Dzahn: [V:04-1 C:04-1] "https://puppet-compiler.wmflabs.org/output/1197306/7309/zuul2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: 10Dzahn) [17:03:04] 06SRE, 06Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11290653 (10jhathaway) From a brief look, most of these conntrack entries are from `an-coord1003.eqiad.wmnet`, along with log entries of the form: ` presto/an-coord1003.eq... 
[17:04:27] (03PS4) 10Dzahn: zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) [17:06:18] (03PS5) 10Dzahn: zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) [17:07:12] (03CR) 10Bking: [C:03+2] ganeti-jumbo: Add hosts and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196952 (https://phabricator.wikimedia.org/T405964) (owner: 10Bking) [17:07:54] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time. These are net-new hosts, so I'm not aware of any risks that are involved here." [puppet] - 10https://gerrit.wikimedia.org/r/1196952 (https://phabricator.wikimedia.org/T405964) (owner: 10Bking) [17:08:26] (03PS6) 10Dzahn: zuul: stop using path including hardcode host name [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) [17:09:39] (03CR) 10FNegri: [C:03+1] toolforge: toolviews: Ignore requests for *.svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1197283 (owner: 10Majavah) [17:10:11] (03CR) 10FNegri: [C:03+1] toolforge: toolviews: Remove obsolete version check [puppet] - 10https://gerrit.wikimedia.org/r/1197305 (https://phabricator.wikimedia.org/T407750) (owner: 10Majavah) [17:13:21] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1197306/7311/zuul2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: 10Dzahn) [17:13:38] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4039.ulsfo.wmnet [17:15:57] (03CR) 10Btullis: [C:03+2] "This is actually a no-op, since the canary-events resources are absented. 
I'll merge it, but then follow up with a patch to remove the res" [puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [17:19:03] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [17:24:30] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [17:26:40] (03CR) 10Majavah: [C:03+2] toolforge: toolviews: Ignore requests for *.svc.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1197283 (owner: 10Majavah) [17:26:48] (03CR) 10Majavah: [C:03+2] toolforge: toolviews: Remove obsolete version check [puppet] - 10https://gerrit.wikimedia.org/r/1197305 (https://phabricator.wikimedia.org/T407750) (owner: 10Majavah) [17:29:12] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:37] (03CR) 10Majavah: [C:04-2] "Holding for now, the switch needs to happen at the same time we move traffic to keep the unique IP counter happy." 
[puppet] - 10https://gerrit.wikimedia.org/r/1197308 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [17:39:12] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:39:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290753 (10phaultfinder) [17:42:44] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:42:51] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS trixie [17:47:25] (03PS2) 10Krinkle: varnish: Remove unused "Mobile Redirect" logic [puppet] - 10https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931) [17:52:50] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqsin and A:cp [17:52:59] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqsin and A:cp [17:54:56] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4040.ulsfo.wmnet [18:02:57] (03PS19) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [18:04:08] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5025.eqsin.wmnet [18:05:02] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:aqs-codfw [18:05:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for 
device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290875 (10phaultfinder) [18:06:05] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host zuul1001.eqiad.wmnet with OS trixie [18:06:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:06:10] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5017.eqsin.wmnet [18:06:11] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290886 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:08:14] (03PS20) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [18:11:00] (03PS21) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [18:13:44] (03PS22) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [18:17:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul1001.eqiad.wmnet with reason: host reimage [18:18:20] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11290935 (10VRiley-WMF) [18:18:25] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407774#11290937 (10VRiley-WMF) →14Duplicate dup:03T407775 [18:18:40] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS 
Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11290941 (10VRiley-WMF) a:03VRiley-WMF [18:19:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:21:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:21:46] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:23:12] (03PS1) 10Cathal Mooney: homer-diff-checker: move execution from cumin1002 to cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380) [18:24:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: host reimage [18:25:09] (03CR) 10Cathal Mooney: "Riccardo, sorry to put you on this one but you are probably the one who knows best if this is the correct way to do this. 
I'm guessing it" [puppet] - 10https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380) (owner: 10Cathal Mooney) [18:28:50] (03PS23) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [18:31:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:36:09] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4041.ulsfo.wmnet [18:41:13] Okay, I'm back and will do that backport now. [18:41:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: 10DLynch) [18:47:08] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5026.eqsin.wmnet [18:49:24] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5018.eqsin.wmnet [18:51:30] (03PS1) 10CDanis: varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - 10https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092) [18:52:25] (03Merged) 10jenkins-bot: Edit check: fix some eslint warnings [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: 10DLynch) [18:52:44] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1197295|Edit check: fix some eslint warnings (T407747)]] [18:52:49] T407747: Screen freezes for new editors if no or few references are added - https://phabricator.wikimedia.org/T407747 [18:56:43] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1197295|Edit 
check: fix some eslint warnings (T407747)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:57:25] !log kemayo@deploy2002 kemayo: Continuing with sync [18:59:33] (03PS24) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:01:31] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197295|Edit check: fix some eslint warnings (T407747)]] (duration: 08m 46s) [19:01:36] T407747: Screen freezes for new editors if no or few references are added - https://phabricator.wikimedia.org/T407747 [19:03:14] !log rzl@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [19:03:39] (03PS4) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:03:42] !log rzl@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [19:05:41] (03CR) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:06:10] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [19:06:20] !log rzl@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [19:06:52] !log rzl@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [19:08:40] (03PS25) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:09:12] FIRING: [5x] SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:48] (03PS2) 10CDanis: varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - 10https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092) [19:14:38] (03CR) 10Volans: [C:03+1] "LGTM, the current puppettization will take care of absenting the resource on the old host." [puppet] - 10https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380) (owner: 10Cathal Mooney) [19:15:07] (03PS26) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:17:06] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4042.ulsfo.wmnet [19:17:19] (03CR) 10Herron: [V:03+1 C:03+2] thanos-rule: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [19:18:53] (03PS27) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:22:22] (03CR) 10Ssingh: [C:03+1] varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - 10https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092) (owner: 10CDanis) [19:22:52] (03CR) 10CDanis: [C:03+2] varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - 10https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092) (owner: 10CDanis) [19:26:06] FIRING: [2x] MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:30:18] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5027.eqsin.wmnet [19:32:46] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5019.eqsin.wmnet [19:34:12] FIRING: [2x] SLOMetricAbsent: wdqs-scholarly-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:36:06] FIRING: [2x] MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:38:33] (03PS28) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:41:23] (03PS1) 10Herron: ThanosRecordingRuleGaps: update thanos-rule to thanos-rule@main [alerts] - 10https://gerrit.wikimedia.org/r/1197326 (https://phabricator.wikimedia.org/T406054) [19:43:47] (03CR) 10Herron: [C:03+2] ThanosRecordingRuleGaps: update thanos-rule to thanos-rule@main [alerts] - 10https://gerrit.wikimedia.org/r/1197326 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [19:45:00] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [19:45:03] (03PS29) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:45:43] (03Merged) 10jenkins-bot: ThanosRecordingRuleGaps: update thanos-rule to thanos-rule@main [alerts] - 
10https://gerrit.wikimedia.org/r/1197326 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [19:49:12] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:54:12] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:56:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:56:11] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{aqs[1014-1022]*} and P{P:Cassandra} [19:58:14] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4043.ulsfo.wmnet [19:59:03] (03PS1) 10Dzahn: zuul: use wmflib mkdir_p to ensure /var/www/zuul exists [puppet] - 10https://gerrit.wikimedia.org/r/1197327 (https://phabricator.wikimedia.org/T395938) [19:59:12] FIRING: ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1014-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:50] (03CR) 10CI reject: [V:04-1] zuul: use wmflib mkdir_p to ensure /var/www/zuul exists [puppet] - 10https://gerrit.wikimedia.org/r/1197327 
(https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T2000). [20:00:05] edsanders: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:02:07] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11291315 (10VRiley-WMF) reseated cable and it came back [20:02:16] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11291316 (10VRiley-WMF) 05Open→03Resolved [20:04:49] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:34] (03PS2) 10Dzahn: zuul: ensure /var/www exists [puppet] - 10https://gerrit.wikimedia.org/r/1197327 (https://phabricator.wikimedia.org/T395938) [20:07:42] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7313/co" [puppet] - 10https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [20:09:04] Hi any deployer available? I scheduled 3 patches for the morning window (also mergeable together), I waited an entire hour, but there was no one active this morning... 
[20:09:28] (03PS30) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [20:09:40] (03CR) 10Dzahn: [C:03+2] zuul: ensure /var/www exists [puppet] - 10https://gerrit.wikimedia.org/r/1197327 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:13:18] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5028.eqsin.wmnet [20:13:29] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173 [20:13:33] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [20:13:53] 10ops-esams, 06DC-Ops, 06Infrastructure-Foundations, 10netops: esams switch oritentation migration - https://phabricator.wikimedia.org/T407794 (10RobH) 03NEW p:05Triage→03Medium [20:15:03] RECOVERY - Host sretest2001 is UP: PING WARNING - Packet loss = 33%, RTA = 30.46 ms [20:15:07] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Remove unused "Mobile Redirect" logic [puppet] - 10https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [20:15:19] (03CR) 10BCornwall: [V:03+2 C:03+2] "Tests are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [20:16:06] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5020.eqsin.wmnet [20:19:56] (03PS2) 10BCornwall: Remove wikimedia_trust ACLs from varnish/haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) [20:21:25] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7315/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) (owner: 10BCornwall) [20:22:51] 
!log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul1001.eqiad.wmnet with OS trixie [20:26:22] (03PS31) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [20:29:39] (03PS1) 10CDanis: varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - 10https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092) [20:33:45] RoanKattouw urbanecm TheresNoTime cjming Sorry for multi-pinging, but are any of you available for deploy? otherwise I won't wait, thanks :) [20:39:34] (03PS2) 10CDanis: varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - 10https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092) [20:41:22] 10ops-esams, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: esams switch oritentation migration - https://phabricator.wikimedia.org/T407794#11291474 (10RobH) [20:41:47] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4044.ulsfo.wmnet [20:43:51] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:44:15] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:44:27] (03PS3) 10CDanis: varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - 10https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092) [20:46:51] (03CR) 10BBlack: [C:03+1] "easy peasy right? 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092) (owner: 10CDanis) [20:48:16] (03PS1) 10Dzahn: zookeeper: drop safety check for buster, no more buster [puppet] - 10https://gerrit.wikimedia.org/r/1197334 [20:51:16] (03CR) 10CDanis: [C:03+2] varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - 10https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092) (owner: 10CDanis) [20:54:09] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:54:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:55:35] 10ops-esams, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: esams switch orientation migration - https://phabricator.wikimedia.org/T407794#11291516 (10Krinkle) [20:56:30] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5029.eqsin.wmnet [20:59:17] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5021.eqsin.wmnet [20:59:40] Hey all - one security patch to get out today! [21:00:04] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T2100). 
[21:03:57] (03PS1) 10Dzahn: zookeeper: add support for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) [21:10:19] !log Deployed security fix for T406639 [21:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:00] (03PS2) 10Dzahn: zookeeper: add support for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) [21:14:11] (03PS1) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite [puppet] - 10https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931) [21:14:38] (03CR) 10CI reject: [V:04-1] zookeeper: add support for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:16:58] (03PS3) 10Dzahn: zookeeper: add support for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) [21:22:18] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4045.ulsfo.wmnet [21:22:38] (03PS1) 10Dzahn: zookeeper: replace legacy facts, fix lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1197342 [21:28:04] (03PS2) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite [puppet] - 10https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931) [21:28:04] (03PS1) 10Krinkle: varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug [puppet] - 10https://gerrit.wikimedia.org/r/1197343 (https://phabricator.wikimedia.org/T405931) [21:29:12] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:46] (03CR) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [21:32:05] /// [21:32:08] er [21:33:41] (03PS1) 10Clare Ming: Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) [21:34:36] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on P{aqs[1014-1022]*} and P{P:Cassandra} [21:35:10] (03CR) 10Clare Ming: Add config for xLab MW Module experiment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [21:39:12] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:39:21] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5030.eqsin.wmnet [21:42:30] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5022.eqsin.wmnet