[00:08:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1162636 [00:08:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1162636 (owner: TrainBranchBot) [00:25:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [00:28:18] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1162636 (owner: TrainBranchBot) [00:46:41] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/48c1d186f13fe1db2265691a55c716d8b29707b635fa808aa232512dcad4de39/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:58:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:06:41] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:03:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:33:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:29:42] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:55:30] (CR) KartikMistry: [C:+2] "Plan is to deploy this in staging and then to production if everything is fine." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1159696 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry) [03:57:26] (Merged) jenkins-bot: machinetranslation: Use S3 storage for production [deployment-charts] - https://gerrit.wikimedia.org/r/1159696 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry) [04:25:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [04:36:55] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:37:55] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:38:30] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:42:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:52] (CR) Tim Starling: "I'm planning to deploy this tomorrow." 
[puppet] - https://gerrit.wikimedia.org/r/1161727 (https://phabricator.wikimedia.org/T397267) (owner: Tim Starling) [05:32:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:33:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:34:33] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:38:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [05:38:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T396130)', diff saved to https://phabricator.wikimedia.org/P78552 and previous config saved to /var/cache/conftool/dbconfig/20250623-053857-marostegui.json [05:39:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:41:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Primary switchover x1 T397419 [05:41:31] T397419: Switchover x1 master (db2215 -> db2196) - https://phabricator.wikimedia.org/T397419 [05:42:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2196 with weight 0 T397419', diff saved to https://phabricator.wikimedia.org/P78553 and previous config saved to /var/cache/conftool/dbconfig/20250623-054206-root.json [05:43:10] (CR) Marostegui: [C:+2] mariadb: Promote db2196 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1161449 (https://phabricator.wikimedia.org/T397419) 
(owner: Gerrit maintenance bot) [05:44:44] (PS1) Gerrit maintenance bot: mariadb: Promote es2037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1162647 (https://phabricator.wikimedia.org/T397597) [05:45:54] !log Starting x1 codfw failover from db2215 to db2196 - T397419 [05:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2196 to x1 primary T397419', diff saved to https://phabricator.wikimedia.org/P78554 and previous config saved to /var/cache/conftool/dbconfig/20250623-054616-marostegui.json [05:46:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T396130)', diff saved to https://phabricator.wikimedia.org/P78555 and previous config saved to /var/cache/conftool/dbconfig/20250623-054633-marostegui.json [05:46:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:47:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2215 T397419', diff saved to https://phabricator.wikimedia.org/P78556 and previous config saved to /var/cache/conftool/dbconfig/20250623-054725-marostegui.json [05:47:30] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [05:47:30] T397419: Switchover x1 master (db2215 -> db2196) - https://phabricator.wikimedia.org/T397419 [05:48:30] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [05:50:15] (PS1) Marostegui: db2215: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162648 (https://phabricator.wikimedia.org/T397279) [05:51:41] (CR) Marostegui: [C:+2] db2215: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162648 (https://phabricator.wikimedia.org/T397279) (owner: Marostegui) [05:55:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: T397597 [05:55:41] T397597: Switchover es6 master (es2035 -> es2037) - https://phabricator.wikimedia.org/T397597 [05:58:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78557 and previous config saved to /var/cache/conftool/dbconfig/20250623-055840-root.json [06:01:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P78558 and previous config saved to /var/cache/conftool/dbconfig/20250623-060140-marostegui.json [06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:09:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: T397597 [06:09:38] T397597: Switchover es6 master (es2035 -> es2037) - https://phabricator.wikimedia.org/T397597 [06:11:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2037 with weight 0 T397597', diff saved to https://phabricator.wikimedia.org/P78559 and previous config saved to /var/cache/conftool/dbconfig/20250623-061143-root.json [06:12:12] (CR) Marostegui: [C:+2] mariadb: Promote es2037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1162647 (https://phabricator.wikimedia.org/T397597) (owner: Gerrit maintenance bot) [06:13:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78560 and previous config saved to /var/cache/conftool/dbconfig/20250623-061346-root.json [06:13:49] !log Starting es6 codfw failover from es2035 to es2037 - T397597 [06:13:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2037 to es6 primary and set section read-write T397597', diff saved to https://phabricator.wikimedia.org/P78561 and previous config saved to /var/cache/conftool/dbconfig/20250623-061416-marostegui.json [06:15:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2035 T397597', diff saved to https://phabricator.wikimedia.org/P78562 and previous config saved to /var/cache/conftool/dbconfig/20250623-061511-marostegui.json [06:15:16] T397597: Switchover es6 master (es2035 -> es2037) - https://phabricator.wikimedia.org/T397597 [06:15:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P78563 and previous config saved to /var/cache/conftool/dbconfig/20250623-061554-root.json [06:16:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P78564 and previous config saved to /var/cache/conftool/dbconfig/20250623-061648-marostegui.json [06:17:07] (PS1) Gerrit maintenance bot: mariadb: Promote es2038 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1162649 (https://phabricator.wikimedia.org/T397599) [06:17:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: T397597 [06:18:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 T397599 [06:18:13] T397599: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T397599 [06:24:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2038 with weight 0 T397599', diff saved to https://phabricator.wikimedia.org/P78565 and previous config saved to /var/cache/conftool/dbconfig/20250623-062420-root.json [06:24:26] T397599: Switchover es7 master 
(es2039 -> es2038) - https://phabricator.wikimedia.org/T397599 [06:24:36] (Abandoned) KartikMistry: MinT: Increase liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1125093 (https://phabricator.wikimedia.org/T386889) (owner: KartikMistry) [06:24:43] (CR) Marostegui: [C:+2] mariadb: Promote es2038 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1162649 (https://phabricator.wikimedia.org/T397599) (owner: Gerrit maintenance bot) [06:28:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78566 and previous config saved to /var/cache/conftool/dbconfig/20250623-062852-root.json [06:29:22] !log Starting es7 codfw failover from es2039 to es2038 - T397599 [06:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:27] T397599: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T397599 [06:29:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2038 to es7 primary and set section read-write T397599', diff saved to https://phabricator.wikimedia.org/P78567 and previous config saved to /var/cache/conftool/dbconfig/20250623-062949-marostegui.json [06:30:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2039 T397599', diff saved to https://phabricator.wikimedia.org/P78568 and previous config saved to /var/cache/conftool/dbconfig/20250623-063050-marostegui.json [06:31:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78569 and previous config saved to 
/var/cache/conftool/dbconfig/20250623-063100-root.json [06:31:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78570 and previous config saved to /var/cache/conftool/dbconfig/20250623-063123-root.json [06:31:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T396130)', diff saved to https://phabricator.wikimedia.org/P78571 and previous config saved to /var/cache/conftool/dbconfig/20250623-063155-marostegui.json [06:32:01] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:32:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1190.eqiad.wmnet with reason: Maintenance [06:32:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T396130)', diff saved to https://phabricator.wikimedia.org/P78572 and previous config saved to /var/cache/conftool/dbconfig/20250623-063217-marostegui.json [06:39:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:39:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T396130)', diff saved to https://phabricator.wikimedia.org/P78573 and previous config saved to /var/cache/conftool/dbconfig/20250623-063956-marostegui.json [06:40:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:43:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2215 (re)pooling @ 100%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P78574 and previous config saved to /var/cache/conftool/dbconfig/20250623-064358-root.json [06:44:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:46:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78575 and previous config saved to /var/cache/conftool/dbconfig/20250623-064606-root.json [06:46:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P78576 and previous config saved to /var/cache/conftool/dbconfig/20250623-064628-root.json [06:48:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2235].codfw.wmnet,db[1164,1217,1228].eqiad.wmnet with reason: m5 master switch T397413 [06:48:58] T397413: Switchover m5 master db1228 -> db1164 - https://phabricator.wikimedia.org/T397413 [06:52:55] (PS1) Muehlenhoff: Remove LDAP access for ncreasy [puppet] - https://gerrit.wikimedia.org/r/1162721 [06:54:35] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1161622 (https://phabricator.wikimedia.org/T396978) (owner: Jforrester) [06:54:36] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154121 (owner: Jforrester) [06:54:37] (CR) TrainBranchBot: [C:+2] "Approved by 
hashar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156351 (owner: Jforrester) [06:54:49] I am starting the deployment now given it is going to take an hour to rebuild images etc [06:55:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P78577 and previous config saved to /var/cache/conftool/dbconfig/20250623-065503-marostegui.json [06:55:23] (CR) Muehlenhoff: [C:+2] Remove LDAP access for ncreasy [puppet] - https://gerrit.wikimedia.org/r/1162721 (owner: Muehlenhoff) [06:55:25] (PS3) Slyngshede: data.yaml: pwaigi1- offboarding [puppet] - https://gerrit.wikimedia.org/r/1152559 [06:55:33] (Merged) jenkins-bot: captureSpeedtest: Drop PHP 7 check, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/1154121 (owner: Jforrester) [06:55:35] (Merged) jenkins-bot: diffConfig: Add a quick list of affected wikis to the end of the output [mediawiki-config] - https://gerrit.wikimedia.org/r/1156351 (owner: Jforrester) [06:56:06] (Merged) jenkins-bot: ApiQueryZFunctionReference: Return an actual empty array instead of [false] [extensions/WikiLambda] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1161622 (https://phabricator.wikimedia.org/T396978) (owner: Jforrester) [06:56:43] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1161622|ApiQueryZFunctionReference: Return an actual empty array instead of [false] (T396978)]], [[gerrit:1154121|captureSpeedtest: Drop PHP 7 check, no longer needed]], [[gerrit:1156351|diffConfig: Add a quick list of affected wikis to the end of the output]] [06:56:48] T396978: Function View: function page shows items for "benjamin" item in implementations and tests tables - https://phabricator.wikimedia.org/T396978 [06:57:26] hashar: Oh, thanks! 
[06:57:46] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede) [06:58:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2235].codfw.wmnet,db[1164,1217,1228].eqiad.wmnet with reason: m5 master switch T397413 [06:58:04] T397413: Switchover m5 master db1228 -> db1164 - https://phabricator.wikimedia.org/T397413 [06:58:20] (CR) Slyngshede: [C:+2] data.yaml: pwaigi1- offboarding [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede) [07:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T0700). [07:00:05] James_F: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:12] Yes yes. 
[07:00:44] (PS1) Marostegui: mariadb: Promote db1164 to m5 master [puppet] - https://gerrit.wikimedia.org/r/1162740 (https://phabricator.wikimedia.org/T397413) [07:01:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78578 and previous config saved to /var/cache/conftool/dbconfig/20250623-070112-root.json [07:01:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P78579 and previous config saved to /var/cache/conftool/dbconfig/20250623-070134-root.json [07:02:16] (CR) Marostegui: [C:+2] mariadb: Promote db1164 to m5 master [puppet] - https://gerrit.wikimedia.org/r/1162740 (https://phabricator.wikimedia.org/T397413) (owner: Marostegui) [07:06:04] !log Failover m5 from db1228 to db1164 - T397413 [07:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:09] T397413: Switchover m5 master db1228 -> db1164 - https://phabricator.wikimedia.org/T397413 [07:07:58] 07:07:45 [mediawiki-publish-81] Waiting 300 seconds for swift after full mediawiki image build (T390251) [07:07:58] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [07:08:01] * hashar whistles [07:08:19] (PS1) Kosta Harlan: UserInfoCard: Enable by default for named users on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) [07:08:26] I thought we had a patch to upstream code to fix that [07:09:06] (CR) CI reject: [V:-1] UserInfoCard: Enable by default for named users on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) (owner: Kosta Harlan) [07:10:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P78580 
and previous config saved to /var/cache/conftool/dbconfig/20250623-071011-marostegui.json [07:11:56] hashar: I saw that the registry was re-platformed to newer code and on more than one machine at once. But maybe it's not fixed enough? [07:12:09] E_NO_CLUE [07:13:01] I only remember Ahmon mentioned writing patches for upstream code to fix a corruption in the registry/swift backend https://phabricator.wikimedia.org/T390251#10765976 [07:13:03] anyway [07:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78581 and previous config saved to /var/cache/conftool/dbconfig/20250623-071618-root.json [07:16:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78582 and previous config saved to /var/cache/conftool/dbconfig/20250623-071639-root.json [07:16:48] the image push takes 7 minutes and IIRC it is CPU bound (Docker compresses layers serially over a single thread) [07:18:37] !log hashar@deploy1003 hashar, jforrester: Backport for [[gerrit:1161622|ApiQueryZFunctionReference: Return an actual empty array instead of [false] (T396978)]], [[gerrit:1154121|captureSpeedtest: Drop PHP 7 check, no longer needed]], [[gerrit:1156351|diffConfig: Add a quick list of affected wikis to the end of the output]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:18:43] T396978: Function View: function page shows items for "benjamin" item in implementations and tests tables - https://phabricator.wikimedia.org/T396978 [07:19:54] (CR) Stevemunene: [C:+2] zookeeper: remove an-conf100[1-3] from the cluster [puppet] - https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) (owner: Stevemunene) [07:20:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1228.eqiad.wmnet with reason: Maintenance [07:21:59] James_F: patches are on the debug servers [07:22:06] and it looks like it went faster than I expected :] [07:22:06] hashar: Yeah, checking. [07:23:52] hashar: Yup, fixed. [07:24:07] !log hashar@deploy1003 hashar, jforrester: Continuing with sync [07:25:05] !log stevemunene@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [07:25:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T396130)', diff saved to https://phabricator.wikimedia.org/P78583 and previous config saved to /var/cache/conftool/dbconfig/20250623-072519-marostegui.json [07:25:24] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:25:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1199.eqiad.wmnet with reason: Maintenance [07:25:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T396130)', diff saved to https://phabricator.wikimedia.org/P78584 and previous config saved to /var/cache/conftool/dbconfig/20250623-072542-marostegui.json [07:28:23] hashar: Thank you! 
[07:28:35] (CR) Filippo Giunchedi: [C:+1] grafana: Disable dashboard sync for a version upgrade [puppet] - https://gerrit.wikimedia.org/r/1161628 (https://phabricator.wikimedia.org/T397442) (owner: Andrea Denisse) [07:29:29] (PS2) Kosta Harlan: UserInfoCard: Enable by default for named users on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) [07:29:42] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:31:25] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [07:31:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78585 and previous config saved to /var/cache/conftool/dbconfig/20250623-073145-root.json [07:33:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T396130)', diff saved to https://phabricator.wikimedia.org/P78586 and previous config saved to /var/cache/conftool/dbconfig/20250623-073316-marostegui.json [07:33:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:34:46] (PS1) Volans: setup.py, tox: remove support for older Python [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 [07:37:44] (CR) Muehlenhoff: setup.py, tox: remove support for older Python (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 (owner: Volans) [07:37:50] !log hashar@deploy1003 Finished scap sync-world: Backport for 
[[gerrit:1161622|ApiQueryZFunctionReference: Return an actual empty array instead of [false] (T396978)]], [[gerrit:1154121|captureSpeedtest: Drop PHP 7 check, no longer needed]], [[gerrit:1156351|diffConfig: Add a quick list of affected wikis to the end of the output]] (duration: 41m 07s) [07:37:55] T396978: Function View: function page shows items for "benjamin" item in implementations and tests tables - https://phabricator.wikimedia.org/T396978 [07:38:52] (CR) JMeybohm: [C:+1] Update codfw pod ip range [puppet] - https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) (owner: Kamila Součková) [07:39:24] (CR) Brouberol: [C:+2] Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: Aqu) [07:39:49] (CR) JMeybohm: [C:+1] admin_ng: Change codfw pod ip range to 10.194.128.0/17 [deployment-charts] - https://gerrit.wikimedia.org/r/1161948 (https://phabricator.wikimedia.org/T375845) (owner: Kamila Součková) [07:40:18] (PS2) Volans: setup.py, tox: remove support for older Python [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 [07:40:25] (CR) Volans: "addressed comment" [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 (owner: Volans) [07:41:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [07:42:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [07:42:29] (CR) Muehlenhoff: [C:+1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 (owner: Volans) [07:44:51] (CR) JMeybohm: [C:+1] Update codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1161929 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [07:45:40] FIRING: [4x] SLOMetricAbsent: 
wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:48:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P78587 and previous config saved to /var/cache/conftool/dbconfig/20250623-074824-marostegui.json [07:49:17] (CR) Abijeet Patro: [C:+1] Make functionally identical descriptions the same [dumps] - https://gerrit.wikimedia.org/r/1149464 (owner: Amire80) [07:50:07] (PS2) JMeybohm: Update codfw to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [07:50:16] (CR) JMeybohm: Update codfw to k8s 1.31 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [07:50:57] (PS3) JMeybohm: Update codfw to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [07:51:13] (CR) JMeybohm: Update codfw to k8s 1.31 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [07:53:32] (CR) Filippo Giunchedi: [C:+2] thanos: split query-frontend logs into their own file [puppet] - https://gerrit.wikimedia.org/r/1161505 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi) [07:53:39] (PS2) Filippo Giunchedi: thanos: split query-frontend logs into their own file [puppet] - https://gerrit.wikimedia.org/r/1161505 (https://phabricator.wikimedia.org/T394318) [07:53:45] (CR) Filippo Giunchedi: [V:+2 C:+2] thanos: split query-frontend logs into their own file [puppet] - https://gerrit.wikimedia.org/r/1161505 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi) 
[07:54:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.171s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:59:21] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.688s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:00:21] (CR) Filippo Giunchedi: [C:+2] thanos: add option for series limits to store [puppet] - https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi) [08:03:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P78588 and previous config saved to /var/cache/conftool/dbconfig/20250623-080332-marostegui.json [08:09:17] (PS1) Marostegui: db2233: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162827 (https://phabricator.wikimedia.org/T397602) [08:09:22] (CR) Elukey: [C:+1] setup.py, tox: remove support for older Python [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 (owner: Volans) [08:09:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2233.codfw.wmnet with reason: Maintenance [08:09:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2160.codfw.wmnet with reason: 
Maintenance [08:09:50] (CR) Volans: [C:+2] setup.py, tox: remove support for older Python [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 (owner: Volans) [08:10:02] (CR) Marostegui: [C:+2] db2233: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162827 (https://phabricator.wikimedia.org/T397602) (owner: Marostegui) [08:10:40] (Merged) jenkins-bot: setup.py, tox: remove support for older Python [software/debmonitor] - https://gerrit.wikimedia.org/r/1162824 (owner: Volans) [08:14:53] SRE, Data-Engineering, Data-Engineering-Radar, Infrastructure-Foundations, Data-Platform-SRE (2025.06.13 - 2025.07.04): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10937359 (MoritzMuehlenhoff) The removal of bullseye-backport... [08:15:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1217,1228].eqiad.wmnet with reason: Maintenance [08:18:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T396130)', diff saved to https://phabricator.wikimedia.org/P78590 and previous config saved to /var/cache/conftool/dbconfig/20250623-081839-marostegui.json [08:18:45] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:18:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1221.eqiad.wmnet with reason: Maintenance [08:19:11] (PS1) Marostegui: db1228: Move to m2 [puppet] - https://gerrit.wikimedia.org/r/1162828 (https://phabricator.wikimedia.org/T397602) [08:19:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:19:20] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Depooling db1221 (T396130)', diff saved to https://phabricator.wikimedia.org/P78591 and previous config saved to /var/cache/conftool/dbconfig/20250623-081920-marostegui.json [08:20:11] (CR) Marostegui: [C:+2] db1228: Move to m2 [puppet] - https://gerrit.wikimedia.org/r/1162828 (https://phabricator.wikimedia.org/T397602) (owner: Marostegui) [08:20:12] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:20:23] ^ expected [08:20:26] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:20:32] same ^ [08:22:48] (PS1) Marostegui: site.pp: Move db1228 to m2 [puppet] - https://gerrit.wikimedia.org/r/1162829 (https://phabricator.wikimedia.org/T397602) [08:24:06] (CR) Marostegui: [C:+2] site.pp: Move db1228 to m2 [puppet] - https://gerrit.wikimedia.org/r/1162829 (https://phabricator.wikimedia.org/T397602) (owner: Marostegui) [08:25:44] (CR) JMeybohm: "Hmm, mw-api-ext "Template did not render correctly (HEAD of local branch)."" [deployment-charts] - https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [08:25:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:25:57] (PS4) JMeybohm: Update codfw to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [08:26:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T396130)', diff saved to https://phabricator.wikimedia.org/P78592 and previous config saved to /var/cache/conftool/dbconfig/20250623-082600-marostegui.json [08:26:06] 
T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:31:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2006.codfw.wmnet [08:36:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2006.codfw.wmnet [08:38:18] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2007.codfw.wmnet [08:38:38] (CR) JMeybohm: [C:+1] "After rebase the render error of mw-api-ext is gone. admin_ng/codfw stays but since HEAD of local branch renders fine I think we're good." [deployment-charts] - https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: Kamila Součková) [08:39:22] (PS1) Marostegui: db1224: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162832 (https://phabricator.wikimedia.org/T397279) [08:39:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1224.eqiad.wmnet with reason: Maintenance [08:39:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1224', diff saved to https://phabricator.wikimedia.org/P78594 and previous config saved to /var/cache/conftool/dbconfig/20250623-083954-root.json [08:40:05] (CR) Marostegui: [C:+2] db1224: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162832 (https://phabricator.wikimedia.org/T397279) (owner: Marostegui) [08:41:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P78595 and previous config saved to /var/cache/conftool/dbconfig/20250623-084108-marostegui.json [08:44:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2007.codfw.wmnet [08:48:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P78596 and previous config saved to /var/cache/conftool/dbconfig/20250623-084800-root.json [08:54:25] (PS7) Jforrester: Avoid using wikitech dblist in configs [mediawiki-config] - https://gerrit.wikimedia.org/r/1137266 (owner: Ladsgroup) [08:54:57] (CR) Jforrester: [C:+1] "PS7: Manually rebased." [mediawiki-config] - https://gerrit.wikimedia.org/r/1137266 (owner: Ladsgroup) [08:55:11] (CR) CI reject: [V:-1] Avoid using wikitech dblist in configs [mediawiki-config] - https://gerrit.wikimedia.org/r/1137266 (owner: Ladsgroup) [08:56:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P78597 and previous config saved to /var/cache/conftool/dbconfig/20250623-085616-marostegui.json [09:01:46] (PS1) Marostegui: installserver: Remove pc2018 [puppet] - https://gerrit.wikimedia.org/r/1162835 [09:03:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78598 and previous config saved to /var/cache/conftool/dbconfig/20250623-090305-root.json [09:04:00] (CR) Marostegui: [C:+2] installserver: Remove pc2018 [puppet] - https://gerrit.wikimedia.org/r/1162835 (owner: Marostegui) [09:06:06] (PS1) Marostegui: db1220: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162836 (https://phabricator.wikimedia.org/T397279) [09:06:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1220', diff saved to https://phabricator.wikimedia.org/P78599 and previous config saved to /var/cache/conftool/dbconfig/20250623-090619-root.json [09:06:43] (PS8) Jforrester: Avoid using wikitech dblist in configs [mediawiki-config] - https://gerrit.wikimedia.org/r/1137266 (owner: Ladsgroup) [09:07:34] (CR) CI reject: [V:-1] Avoid using wikitech dblist in configs [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1137266 (owner: Ladsgroup) [09:07:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1220.eqiad.wmnet with reason: Maintenance [09:07:45] (CR) Marostegui: [C:+2] db1220: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162836 (https://phabricator.wikimedia.org/T397279) (owner: Marostegui) [09:11:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T396130)', diff saved to https://phabricator.wikimedia.org/P78600 and previous config saved to /var/cache/conftool/dbconfig/20250623-091123-marostegui.json [09:11:29] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:11:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1238.eqiad.wmnet with reason: Maintenance [09:11:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T396130)', diff saved to https://phabricator.wikimedia.org/P78601 and previous config saved to /var/cache/conftool/dbconfig/20250623-091146-marostegui.json [09:14:08] (PS1) Marostegui: db1222: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162838 (https://phabricator.wikimedia.org/T396549) [09:15:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) (owner: Kosta Harlan) [09:18:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78602 and previous config saved to /var/cache/conftool/dbconfig/20250623-091811-root.json [09:18:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 25%: Repooling', 
diff saved to https://phabricator.wikimedia.org/P78603 and previous config saved to /var/cache/conftool/dbconfig/20250623-091825-root.json [09:19:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T396130)', diff saved to https://phabricator.wikimedia.org/P78604 and previous config saved to /var/cache/conftool/dbconfig/20250623-091930-marostegui.json [09:19:36] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:22:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1222', diff saved to https://phabricator.wikimedia.org/P78605 and previous config saved to /var/cache/conftool/dbconfig/20250623-092230-root.json [09:22:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1222.eqiad.wmnet with reason: Maintenance [09:22:43] (CR) Marostegui: [C:+2] db1222: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162838 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui) [09:25:21] (CR) Filippo Giunchedi: [C:+2] hieradata: set thanos sidecar and store series limits [puppet] - https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi) [09:26:23] (PS1) Fabfur: install_server: UEFI setup for cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) [09:28:41] (CR) CI reject: [V:-1] install_server: UEFI setup for cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: Fabfur) [09:33:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78606 and previous config saved to /var/cache/conftool/dbconfig/20250623-093317-root.json [09:33:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 
(re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P78607 and previous config saved to /var/cache/conftool/dbconfig/20250623-093331-root.json [09:34:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P78608 and previous config saved to /var/cache/conftool/dbconfig/20250623-093438-marostegui.json [09:38:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:33] (PS1) Vgutierrez: hiera: Switch lvs6002 (upload) to katran [puppet] - https://gerrit.wikimedia.org/r/1162843 (https://phabricator.wikimedia.org/T396561) [09:38:58] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1162843 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez) [09:42:31] (PS2) Fabfur: install_server: UEFI setup for cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) [09:44:45] (CR) CI reject: [V:-1] install_server: UEFI setup for cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: Fabfur) [09:45:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78609 and previous config saved to /var/cache/conftool/dbconfig/20250623-094519-root.json [09:48:03] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1255 gradually with 4 steps - Work done [09:48:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1255 gradually with 4 steps - Work done [09:48:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P78610 and previous config saved to /var/cache/conftool/dbconfig/20250623-094822-root.json [09:48:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P78611 and previous config saved to /var/cache/conftool/dbconfig/20250623-094837-root.json [09:49:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P78612 and previous config saved to /var/cache/conftool/dbconfig/20250623-094945-marostegui.json [09:54:32] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10937636 (Ladsgroup) After the thumbnail steps, they are not growing rapidly anymore. Only one has hit 91% after months. If I find so... [09:55:11] (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1162843 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez) [09:56:25] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2155 gradually with 4 steps - Work done [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1000) [10:00:05] jayme, Raine, and claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[10:00:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78614 and previous config saved to /var/cache/conftool/dbconfig/20250623-100024-root.json [10:03:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78615 and previous config saved to /var/cache/conftool/dbconfig/20250623-100342-root.json [10:04:23] (CR) Tiziano Fogli: [C:+1] centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse) [10:04:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T396130)', diff saved to https://phabricator.wikimedia.org/P78616 and previous config saved to /var/cache/conftool/dbconfig/20250623-100453-marostegui.json [10:04:59] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:05:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1241.eqiad.wmnet with reason: Maintenance [10:05:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T396130)', diff saved to https://phabricator.wikimedia.org/P78617 and previous config saved to /var/cache/conftool/dbconfig/20250623-100516-marostegui.json [10:05:53] (PS1) Marostegui: db1228: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162850 (https://phabricator.wikimedia.org/T397602) [10:05:56] Amir1: effie: We're going to update the wikikube codfw cluster in a bit - T397148 [10:05:56] T397148: Update wikikube codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T397148 [10:06:11] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:06:22] (CR) 
Marostegui: [C:+2] db1228: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1162850 (https://phabricator.wikimedia.org/T397602) (owner: Marostegui) [10:06:27] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:06:38] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T384998 to instances [puppet] - https://gerrit.wikimedia.org/r/1155136 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:07:57] jayme: cheers [10:08:01] gl! [10:08:07] ack [10:10:15] !log upload liberica 0.22 to apt.wm.o (bookworm-wikimedia) [10:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:48] (PS1) Gerrit maintenance bot: mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1162851 (https://phabricator.wikimedia.org/T397612) [10:10:52] (PS1) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612) [10:11:01] (PS3) Fabfur: install_server: UEFI setup for cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) [10:11:19] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/codfw: maintenance [10:11:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T396130)', diff saved to https://phabricator.wikimedia.org/P78619 and previous config saved to /var/cache/conftool/dbconfig/20250623-101159-marostegui.json [10:12:04] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:12:23] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T362397 to instances [puppet] - https://gerrit.wikimedia.org/r/1155612 (https://phabricator.wikimedia.org/T395443) (owner: 
Tiziano Fogli) [10:13:26] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs1013.eqiad.wmnet} and A:liberica [10:13:47] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs1013.eqiad.wmnet} and A:liberica [10:14:45] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T385587 to instances [puppet] - https://gerrit.wikimedia.org/r/1155135 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:14:54] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7002.magru.wmnet} and A:liberica [10:15:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78620 and previous config saved to /var/cache/conftool/dbconfig/20250623-101530-root.json [10:15:42] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7002.magru.wmnet} and A:liberica [10:16:19] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7001.magru.wmnet} and A:liberica [10:17:08] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7001.magru.wmnet} and A:liberica [10:18:38] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6002.drmrs.wmnet} and A:liberica (T396561) [10:18:43] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [10:18:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78621 and previous config saved to /var/cache/conftool/dbconfig/20250623-101848-root.json [10:18:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6002.drmrs.wmnet} and A:liberica (T396561) [10:19:05] (CR) Vgutierrez: [C:+2] hiera: 
Switch lvs6002 (upload) to katran [puppet] - https://gerrit.wikimedia.org/r/1162843 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez) [10:19:57] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs6002.drmrs.wmnet with reason: switching to katran [10:20:03] (Abandoned) Tiziano Fogli: monitoring services: add migration task T357099 to instances [puppet] - https://gerrit.wikimedia.org/r/1155145 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:21:12] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T332764 to instances [puppet] - https://gerrit.wikimedia.org/r/1155627 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:22:32] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T358029 to instances [puppet] - https://gerrit.wikimedia.org/r/1155611 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:23:32] !log kamila@cumin1003 END (ERROR) - Cookbook sre.k8s.pool-depool-cluster (exit_code=93) depool all services in codfw/codfw: maintenance [10:24:37] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/codfw: maintenance [10:25:12] (CR) Máté Szabó: [C:-2] "Yup, I'll -2 this until then." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1152770 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó) [10:26:04] (PS1) Volans: client: remove self-update capability [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1162855 [10:26:56] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs6002.drmrs.wmnet [10:26:56] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs6002.drmrs.wmnet [10:27:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P78623 and previous config saved to /var/cache/conftool/dbconfig/20250623-102706-marostegui.json [10:27:29] (PS1) Muehlenhoff: Add staging VM for debmonitor [puppet] - https://gerrit.wikimedia.org/r/1162858 [10:27:37] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T321808 to instances [puppet] - https://gerrit.wikimedia.org/r/1155607 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:28:14] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304) (owner: Federico Ceratto) [10:28:51] (PS1) Vgutierrez: hiera: Repool lvs6002 using katran [puppet] - https://gerrit.wikimedia.org/r/1162859 (https://phabricator.wikimedia.org/T396561) [10:29:37] (PS1) Volans: debmonitor: remove client self-update capability [software/debmonitor] - https://gerrit.wikimedia.org/r/1162860 [10:29:45] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T385590 to instances [puppet] - https://gerrit.wikimedia.org/r/1155600 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:30:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78625 and previous config saved to 
/var/cache/conftool/dbconfig/20250623-103036-root.json [10:30:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:07] (PS2) Muehlenhoff: Add staging VM for debmonitor [puppet] - https://gerrit.wikimedia.org/r/1162858 [10:31:15] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T385583 to instances [puppet] - https://gerrit.wikimedia.org/r/1155598 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:32:01] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1162858 (owner: Muehlenhoff) [10:32:15] (CR) Muehlenhoff: [C:+2] Add staging VM for debmonitor [puppet] - https://gerrit.wikimedia.org/r/1162858 (owner: Muehlenhoff) [10:32:16] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T384305 to instances [puppet] - https://gerrit.wikimedia.org/r/1155124 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:34:58] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T384303 to instances [puppet] - https://gerrit.wikimedia.org/r/1155127 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:35:13] !log scap lock --all "Kubernetes upgrade" [10:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:26] (CR) Vgutierrez: [C:-1] 
install_server: UEFI setup for cp20[43-58] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: Fabfur) [10:36:10] !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/codfw: maintenance [10:36:40] (PS1) Volans: client: remove dependency on client [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1162861 [10:36:47] PHPFPMTooBusy is kind of expected, but I'll keep an eye on it [10:36:58] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T375166 to instances [puppet] - https://gerrit.wikimedia.org/r/1155128 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:37:08] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T374839 to instances [puppet] - https://gerrit.wikimedia.org/r/1155129 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:37:20] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T374823 to instances [puppet] - https://gerrit.wikimedia.org/r/1155130 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:38:19] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T370526 to instances [puppet] - https://gerrit.wikimedia.org/r/1155132 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:38:25] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T384309 to instances [puppet] - https://gerrit.wikimedia.org/r/1155133 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:38:30] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T371083 to instances [puppet] - https://gerrit.wikimedia.org/r/1155131 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [10:38:58] !log kamila@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster wikikube-codfw: 
Kubernetes upgrade [10:39:13] !log cookbook sre.k8s.wipe-cluster --k8s-cluster wikikube-codfw -H 2 --reason "Kubernetes upgrade" - T397148 [10:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:18] T397148: Update wikikube codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T397148 [10:40:38] (PS2) Tiziano Fogli: monitoring services: add migration task T367065 to instances [puppet] - https://gerrit.wikimedia.org/r/1155137 (https://phabricator.wikimedia.org/T395443) [10:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:41:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2155 gradually with 4 steps - Work done [10:42:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P78627 and previous config saved to /var/cache/conftool/dbconfig/20250623-104214-marostegui.json [10:42:30] (CR) Federico Ceratto: [C:+2] CAS: Add wmf group for Zarcillo, remove ops (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304) (owner: Federico Ceratto) [10:42:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:27] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1162859 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez) [10:43:51] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task 
T367065 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155137 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [10:44:14] !log cgoubert@deploy1003 Forcefully removing global lock: Kubernetes upgrade [10:46:23] (03CR) 10Kamila Součková: [C:03+2] Update codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1161929 (https://phabricator.wikimedia.org/T397148) (owner: 10Kamila Součková) [10:46:27] (03CR) 10Kamila Součková: [C:03+2] Update codfw pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [10:46:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2235.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2191.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2230.codfw.wmnet, wikikube-worker2084. 
[10:46:29] net, wikikube-worker2155.codfw.wmnet, wikikube-worker2099.codfw.wmnet, wikikube-worker2113.codfw.wmnet, wikikube-worker2158.codfw.wmnet, wikikube-worker2171.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2110.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2138.codfw.wmnet, wikikube-worker2215.codfw.wmnet, wikikube [10:46:29] 022.codfw.wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2213.codfw.wmnet, wikikube-worker2236.codfw.wmnet, wik https://wikitech.wikimedia.org/wiki/PyBal [10:46:39] this is us [10:46:39] federico3: feel free to merge also my patchset whenever you're ready [10:46:55] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2233.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2120.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2202.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2172.codfw.wmnet, wikikube-worker2225. 
[10:46:55] net, wikikube-worker2036.codfw.wmnet, wikikube-worker2150.codfw.wmnet, wikikube-worker2185.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2132.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2138.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube [10:46:55] 213.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wik https://wikitech.wikimedia.org/wiki/PyBal [10:47:24] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs6002 using katran [puppet] - 10https://gerrit.wikimedia.org/r/1162859 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:47:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:57] FIRING: [21x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:48:06] !incidents [10:48:06] 6397 (UNACKED) [21x] ProbeDown sre (ip4 probes/service codfw) [10:48:13] !ack 6397 [10:48:14] 6397 (ACKED) [21x] ProbeDown sre (ip4 probes/service codfw) [10:48:30] jayme: is this expected? 
[10:48:39] kamila@cumin1003 wipe-cluster (PID 2793869) is awaiting input [10:49:14] yeah [10:49:20] Cluster's wiped [10:49:33] tx for acking [10:50:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:50:32] (03CR) 10Kamila Součková: [C:03+2] Update codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: 10Kamila Součková) [10:50:39] (03CR) 10Kamila Součková: [C:03+2] admin_ng: Change codfw pod ip range to 10.194.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161948 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [10:51:42] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:51:51] FIRING: [14x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:52:36] that's us as well [10:55:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:56:23] 
(03PS1) 10Marostegui: Revert "CAS: Add wmf group for Zarcillo, remove ops" [puppet] - 10https://gerrit.wikimedia.org/r/1162865 [10:56:50] (03CR) 10Marostegui: [V:03+2 C:03+2] Revert "CAS: Add wmf group for Zarcillo, remove ops" [puppet] - 10https://gerrit.wikimedia.org/r/1162865 (owner: 10Marostegui) [10:57:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T396130)', diff saved to https://phabricator.wikimedia.org/P78628 and previous config saved to /var/cache/conftool/dbconfig/20250623-105722-marostegui.json [10:57:31] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:57:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1242.eqiad.wmnet with reason: Maintenance [10:57:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T396130)', diff saved to https://phabricator.wikimedia.org/P78629 and previous config saved to /var/cache/conftool/dbconfig/20250623-105746-marostegui.json [10:59:08] (03Merged) 10jenkins-bot: Update codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) (owner: 10Kamila Součková) [10:59:46] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host debmonitor-dev2001.codfw.wmnet [10:59:47] (03Merged) 10jenkins-bot: admin_ng: Change codfw pod ip range to 10.194.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161948 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [10:59:48] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [11:00:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.34% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:01:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:02:01] That's not us [11:02:10] ^ Emperor ? [11:02:12] !incidents [11:02:13] 6397 (ACKED) [21x] ProbeDown sre (ip4 probes/service codfw) [11:02:13] 6398 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [11:02:17] !ack 6398 [11:02:18] 6398 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [11:02:31] steep increase [11:02:45] doesn't look like that slow degradation we had seen other times [11:02:48] ok taking a look [11:03:12] (03CR) 10Jgiannelos: [C:03+1] changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [11:03:18] started at 10:46 [11:03:33] swift@codfw is struggling for some reason [11:03:45] the 3 PoPs using it are reporting an increased number of 5xx [11:04:15] so I mention the degradation because in the past a rolling restart was the right fix, but I am not sure this is it atm [11:04:24] I am checking superset [11:04:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T396130)', diff saved to https://phabricator.wikimedia.org/P78630 and previous config saved to /var/cache/conftool/dbconfig/20250623-110428-marostegui.json [11:04:32] 302 -sre [11:04:34] T396130: Add afl_ip_hex column and 
afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:05:21] jmm@cumin1003 makevm (PID 2798871) is awaiting input [11:06:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:07:25] let me debug this [11:07:33] Emperor: cc [11:08:19] we are on sre- amir [11:08:30] we pinged him but discussing what to do there [11:10:02] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM debmonitor-dev2001.codfw.wmnet - jmm@cumin1003" [11:10:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM debmonitor-dev2001.codfw.wmnet - jmm@cumin1003" [11:10:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:10:07] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache debmonitor-dev2001.codfw.wmnet on all recursors [11:10:10] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) debmonitor-dev2001.codfw.wmnet on all recursors [11:10:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1162861 (owner: 10Volans) [11:10:40] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM debmonitor-dev2001.codfw.wmnet - jmm@cumin1003" [11:10:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM debmonitor-dev2001.codfw.wmnet - jmm@cumin1003" [11:13:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/1162860 (owner: 10Volans) [11:14:39] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [11:15:02] jmm@cumin1003 makevm (PID 2798871) is awaiting input [11:15:09] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host debmonitor-dev2001.codfw.wmnet with OS bookworm [11:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:17:30] ^ could this be just a result of the depool (higher eqiad load?) [11:17:53] this is [11:18:06] Just ignore it, I'm watching latency [11:18:15] I can silence if y'all want [11:18:30] no worries, just mentioned it because of the ongoing swift issue [11:18:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [11:18:38] (03CR) 10Muehlenhoff: client: remove self-update capability (032 comments) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1162855 (owner: 10Volans) [11:19:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P78631 and previous config saved to /var/cache/conftool/dbconfig/20250623-111935-marostegui.json [11:19:58] tgr|away, Tchanders, seanleong-wmde: we're likely to overrun the schedule with the kubernetes upgrade due to running into errors, so you probably won't be able to start your deploy window on time, maybe you could start later? 
[11:20:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:20:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:23:28] !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.wipe-cluster (exit_code=99) Wipe the K8s cluster wikikube-codfw: Kubernetes upgrade [11:23:35] !log kamila@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster wikikube-codfw: Kubernetes upgrade [11:26:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:27:44] (03PS1) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [11:29:42] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:06] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162023 (owner: 10PipelineBot) [11:31:15] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162022 (owner: 10PipelineBot) 
[11:32:54] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162023 (owner: 10PipelineBot) [11:33:15] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162022 (owner: 10PipelineBot) [11:33:22] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on debmonitor-dev2001.codfw.wmnet with reason: host reimage [11:34:33] !log cgoubert@deploy1003 conftool action : set/pooled=false; selector: dnsdisc=thumbor.*,name=codfw [11:34:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P78633 and previous config saved to /var/cache/conftool/dbconfig/20250623-113443-marostegui.json [11:37:01] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on debmonitor-dev2001.codfw.wmnet with reason: host reimage [11:37:46] !log restart swift-object-replicator on ms-be1071 [11:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:54] (03CR) 10CI reject: [V:04-1] Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [11:39:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2069-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:40:08] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@47dcd3f]: bump section topics to v1.5.0 [11:40:42] (03PS1) 10Muehlenhoff: Record extended contract date for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/1162871 [11:40:51] !log mfossati@deploy1003 
Finished deploy [airflow-dags/platform_eng@47dcd3f]: bump section topics to v1.5.0 (duration: 00m 54s) [11:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:45:42] (03PS2) 10Volans: client: remove self-update capability [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1162855 [11:45:51] Deploying MinT in staging. [11:45:52] (03CR) 10Volans: "addressed comments" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1162855 (owner: 10Volans) [11:46:46] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:47:33] (03CR) 10Muehlenhoff: [C:03+2] Record extended contract date for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/1162871 (owner: 10Muehlenhoff) [11:47:51] !log jiji@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [11:48:34] !log Ran fixStuckGlobalRename.php for T397601 [11:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:38] T397601: Unblock stuck global rename of Rentangan - https://phabricator.wikimedia.org/T397601 [11:49:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T396130)', diff saved to https://phabricator.wikimedia.org/P78634 and previous config saved to /var/cache/conftool/dbconfig/20250623-114950-marostegui.json [11:49:56] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - 
https://phabricator.wikimedia.org/T396130 [11:50:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1243.eqiad.wmnet with reason: Maintenance [11:50:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T396130)', diff saved to https://phabricator.wikimedia.org/P78635 and previous config saved to /var/cache/conftool/dbconfig/20250623-115013-marostegui.json [11:51:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:51:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:52:20] !incidents [11:52:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host debmonitor-dev2001.codfw.wmnet with OS bookworm [11:52:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host debmonitor-dev2001.codfw.wmnet [11:52:20] 6397 (ACKED) [21x] ProbeDown sre (ip4 probes/service codfw) [11:52:21] 6398 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [11:53:12] !log jiji@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=swift-rw,name=codfw [11:54:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2069-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:56:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1162855 (owner: 10Volans) [11:56:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T396130)', diff saved to https://phabricator.wikimedia.org/P78636 and previous config saved to /var/cache/conftool/dbconfig/20250623-115659-marostegui.json [11:57:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:57:55] !log jiji@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=codfw [11:59:39] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2069-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:00:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:01:28] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average 
message produce rate in last 30m on alert1002 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [12:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.341s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:02:18] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1002 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [12:02:48] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T315866 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155139 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:05:59] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T367149 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155141 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:06:56] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [12:08:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:08:08] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T228830 to instances [puppet] - 
10https://gerrit.wikimedia.org/r/1155144 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:08:14] FIRING: [2x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:08:27] expected [12:09:04] kart_: please don't deploy to production [12:09:09] codfw is being upgraded [12:09:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:09:39] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2069-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:09:44] kamila@cumin1003 wipe-cluster (PID 2801966) is awaiting input [12:09:55] (03PS1) 10Muehlenhoff: debmonitor: Remove support for using non-cfssl certs [puppet] - 10https://gerrit.wikimedia.org/r/1162876 [12:10:13] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T370157 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155134 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:11:08] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. 
[12:11:39] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T328502 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155138 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:12:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm [12:12:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P78637 and previous config saved to /var/cache/conftool/dbconfig/20250623-121206-marostegui.json [12:12:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10938036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm [12:13:14] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:14:26] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162876 (owner: 10Muehlenhoff) [12:17:04] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[12:19:20] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384939 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155254 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:19:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2069-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:20:39] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384938 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155251 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:20:41] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/apertium: apply [12:20:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: maintenance [12:20:59] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/api-gateway: apply [12:22:06] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2007.codfw.wmnet with OS bookworm [12:22:07] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/changeprop: apply [12:22:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.429s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:22:35] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/changeprop-jobqueue: apply 
[12:23:04] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/chart-renderer: apply [12:23:15] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/cirrus-streaming-updater: apply [12:23:37] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/citoid: apply [12:23:57] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/commons-impact-analytics: apply [12:24:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1008.eqiad.wmnet with reason: host reimage [12:24:19] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/cxserver: apply [12:24:33] FIRING: [2x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:39] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2069-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:24:39] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/data-gateway: apply [12:24:53] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/developer-portal: apply [12:25:10] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/device-analytics: apply [12:25:52] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/echostore: apply [12:25:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:26:08] !log kamila@deploy1003 
helmfile [codfw] OK helmfile.d/services/edit-analytics: apply [12:26:25] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/editor-analytics: apply [12:27:01] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/eventgate-analytics: apply [12:27:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1008.eqiad.wmnet with reason: host reimage [12:27:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P78638 and previous config saved to /var/cache/conftool/dbconfig/20250623-122713-marostegui.json [12:27:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.021s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:27:41] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/eventgate-analytics-external: apply [12:27:45] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384933 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155250 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:28:06] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/eventgate-logging-external: apply [12:28:30] RESOLVED: [3x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:32] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/eventgate-main: apply [12:29:00] (03PS1) 10Muehlenhoff: debmonitor: Remove unused Hiera option [puppet] - 
10https://gerrit.wikimedia.org/r/1162887 [12:29:01] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384922 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155245 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:29:15] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/eventstreams: apply [12:29:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162887 (owner: 10Muehlenhoff) [12:29:39] RESOLVED: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2069-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:29:42] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/eventstreams-internal: apply [12:30:04] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/geo-analytics: apply [12:30:25] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/image-suggestion: apply [12:30:26] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384924 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155248 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:30:46] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384427 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155230 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:30:51] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/ipoid: apply [12:31:34] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/kartotherian: apply [12:31:39] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384425 to instances [puppet] - 
10https://gerrit.wikimedia.org/r/1155226 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:31:42] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:32:16] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/linkrecommendation: apply [12:32:18] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384308 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155218 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:32:38] claime: sorry, I missed ping. Yes. No plan for production yet. [12:32:48] kart_: ok cool ty [12:34:40] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage [12:34:59] (03PS2) 10Tiziano Fogli: monitoring services: add migration task T384321 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155222 (https://phabricator.wikimedia.org/T395443) [12:35:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:38:08] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage [12:39:12] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384321 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155222 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:40:08] (03CR) 10Volans: 
[C:03+1] "LGTM I think it's possible to use cfssl on cloud VPS nowadays in case it would be needed." [puppet] - 10https://gerrit.wikimedia.org/r/1162876 (owner: 10Muehlenhoff) [12:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.03% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:41:11] (03CR) 10Muehlenhoff: [C:03+2] debmonitor: Remove support for using non-cfssl certs [puppet] - 10https://gerrit.wikimedia.org/r/1162876 (owner: 10Muehlenhoff) [12:41:11] kart_: should machinetranslation be deployable (to prod)? We're seeing strange error messages [12:41:19] Downloading using s3cmd: https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434/nllb/nllb200-600M.tgz [12:41:36] ERROR: /home/somebody/.s3cfg: None [12:42:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T396130)', diff saved to https://phabricator.wikimedia.org/P78639 and previous config saved to /var/cache/conftool/dbconfig/20250623-124221-marostegui.json [12:42:26] !log kamila@deploy1003 helmfile [codfw] FAIL (1) helmfile.d/services/machinetranslation: apply [12:42:27] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:42:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance [12:42:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:42:48] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2008.codfw.wmnet with OS 
bookworm [12:42:49] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/mathoid: apply [12:43:04] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/media-analytics: apply [12:43:06] (03CR) 10Volans: "two questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/1162887 (owner: 10Muehlenhoff) [12:43:18] kart_: putting it differently: I think machinetranslation can't be deployed currently [12:43:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:43:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm [12:43:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10938161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm complete... 
[12:45:44] (03PS1) 10JMeybohm: Revert "machinetranslation: Use S3 storage for production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162894 [12:46:03] (03CR) 10Tiziano Fogli: [C:03+2] systemd::service: parameterize to support migration_task [puppet] - 10https://gerrit.wikimedia.org/r/1155618 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:46:22] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:46:42] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:47:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.469s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:47:22] jayme: o/ [12:47:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1247.eqiad.wmnet with reason: Maintenance [12:47:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T396130)', diff saved to https://phabricator.wikimedia.org/P78640 and previous config saved to /var/cache/conftool/dbconfig/20250623-124734-marostegui.json [12:47:36] I noticed the revert, any specific reason for the failure? 
[12:47:40] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:47:49] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:47:59] because I followed the move a while ago, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1159696 is missing network policies to fetch data from thanos swift afaics [12:48:15] elukey: pods don't come up [12:48:22] that's my main reason [12:48:28] it's not trying to pull from s3 [12:48:39] (03PS2) 10Anzx: brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) [12:48:47] it tries Downloading using s3cmd: https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434/nllb/nllb200-600M.tgz [12:48:57] so, s3 via analytics.w.o [12:49:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) (owner: 10Anzx) [12:49:23] (03CR) 10CI reject: [V:04-1] brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) (owner: 10Anzx) [12:49:29] lol ok [12:49:41] anzx: backport window will be delayed [12:49:53] claime: ok [12:49:59] kubernetes upgrade of codfw cluster is taking a little longer than expected [12:50:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:50:24] (03PS2) 
10JMeybohm: Revert "machinetranslation: Use S3 storage for production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162894 [12:51:20] (03PS4) 10Cparle: Add UploadWizard tables [puppet] - 10https://gerrit.wikimedia.org/r/1161562 (https://phabricator.wikimedia.org/T393793) [12:51:22] (03CR) 10Ladsgroup: [C:03+2] Add UploadWizard tables [puppet] - 10https://gerrit.wikimedia.org/r/1161562 (https://phabricator.wikimedia.org/T393793) (owner: 10Cparle) [12:51:26] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Add UploadWizard tables [puppet] - 10https://gerrit.wikimedia.org/r/1161562 (https://phabricator.wikimedia.org/T393793) (owner: 10Cparle) [12:51:43] jayme: ok so surely related to https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1147812/7/entrypoint.sh [12:52:02] plausible, did not check [12:52:03] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162894 [12:52:32] (03CR) 10Elukey: [C:03+1] Revert "machinetranslation: Use S3 storage for production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162894 (owner: 10JMeybohm) [12:53:02] (03CR) 10JMeybohm: [C:03+2] Revert "machinetranslation: Use S3 storage for production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162894 (owner: 10JMeybohm) [12:53:16] !log kamila@deploy1003 helmfile [codfw] FAIL (1) helmfile.d/services/miscweb: apply [12:53:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10938170 (10Jclark-ctr) [12:53:36] (03PS3) 10Anzx: brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) [12:53:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10938172 (10Jclark-ctr) @Stevemunene completed drive 
swaps [12:54:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T396130)', diff saved to https://phabricator.wikimedia.org/P78641 and previous config saved to /var/cache/conftool/dbconfig/20250623-125413-marostegui.json [12:54:18] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:54:22] jayme: I'll ping folks in the ML channel for this, only staging right? [12:54:26] (03CR) 10CI reject: [V:04-1] brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) (owner: 10Anzx) [12:54:35] elukey: nope, it's a prod change [12:54:37] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2007.codfw.wmnet with OS bookworm [12:54:39] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/mobileapps: apply [12:54:53] (03Merged) 10jenkins-bot: Revert "machinetranslation: Use S3 storage for production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162894 (owner: 10JMeybohm) [12:54:59] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/page-analytics: apply [12:55:10] elukey: that's why we're failing deploying it to the updated wikikube cluster [12:55:21] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage [12:55:41] !log dropping uw_campaign_conf in all wikis that have it (gerrit:1161562) [12:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:52] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: sync [12:56:04] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/proton: apply [12:56:08] (03PS4) 10Anzx: brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) 
[12:56:21] jayme: ok right I get it, it was deployed in staging this morning and then it went out with Rai*ne's deployment [12:56:34] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/push-notifications: apply [12:56:36] I'll follow up [12:56:47] (03PS5) 10Anzx: brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) [12:56:53] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/ratelimit: apply [12:56:54] yep, exactly elukey [12:57:01] elukey: exactly, but the change was unfortunately not done to staging only but rather to all releases [12:57:01] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/rdf-streaming-updater: apply [12:57:07] going to follow up with ML and friends! [12:57:21] thanks! [12:57:21] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.056s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:57:28] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/recommendation-api: apply [12:57:31] ^ this is fine [12:57:36] elukey: and the chart change and deployment values change were bundled into one, so no one-click revert possible as well [12:57:43] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/rest-gateway: apply [12:57:46] (03PS1) 10Jelto: Revert "miscweb(design-landing-page): bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162897 (https://phabricator.wikimedia.org/T397148) [12:57:59] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/sessionstore: apply [12:58:11] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: 
apply [12:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:58:37] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/shellbox: apply [12:58:46] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage [12:58:54] jayme: /me nods [12:58:55] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/shellbox-constraints: apply [12:59:04] (03CR) 10Clément Goubert: [C:03+1] Revert "miscweb(design-landing-page): bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162897 (https://phabricator.wikimedia.org/T397148) (owner: 10Jelto) [12:59:17] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/shellbox-media: apply [12:59:33] FIRING: [6x] ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:37] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/shellbox-syntaxhighlight: apply [12:59:41] (03PS2) 10Muehlenhoff: debmonitor: Remove unused Hiera option [puppet] - 10https://gerrit.wikimedia.org/r/1162887 [12:59:47] (03CR) 10Muehlenhoff: debmonitor: Remove unused Hiera option (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1162887 (owner: 10Muehlenhoff) [13:00:02] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/shellbox-timeline: apply [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1300). [13:00:05] seanleong-wmde, tgr, Tchanders, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:33] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/shellbox-video: apply [13:00:57] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/tegola-vector-tiles: apply [13:01:00] (03CR) 10Jelto: [C:03+2] Revert "miscweb(design-landing-page): bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162897 (https://phabricator.wikimedia.org/T397148) (owner: 10Jelto) [13:01:06] o/ [13:01:12] o/ [13:01:27] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/termbox: apply [13:01:51] I'll go to the end of the queue, I'm AFK until the second half of the hour [13:01:54] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384214 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155619 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:02:36] please wait until we are done with the kubernetes upgrade in codfw [13:02:43] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/thumbor: apply [13:02:45] we will ping you [13:03:08] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2009.codfw.wmnet with OS bookworm [13:03:11] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/toolhub: apply [13:03:26] (03Merged) 10jenkins-bot: Revert "miscweb(design-landing-page): bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162897 (https://phabricator.wikimedia.org/T397148) (owner: 10Jelto) [13:03:30] RESOLVED: [10x] ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:03:34] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/wikidata-query-gui: apply [13:03:59] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/wikifeeds: apply [13:04:53] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1162887 (owner: 10Muehlenhoff) [13:05:56] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: sync [13:07:00] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/wikifunctions: apply [13:07:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.222s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:07:21] !log kamila@deploy1003 helmfile [codfw] OK helmfile.d/services/zotero: apply [13:07:35] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:07:50] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:08:30] FIRING: [12x] ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:08:53] (03PS4) 10Fabfur: install_server: UEFI setup for cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) [13:09:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.443s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:09:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P78643 and previous config saved to /var/cache/conftool/dbconfig/20250623-130920-marostegui.json [13:09:31] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T350694 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155140 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:09:33] RESOLVED: [12x] ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:35] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:09:53] nemo-yiannis: don't deploy to prod please [13:10:08] ok, i thought i did staging [13:10:12] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:10:19] !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster wikikube-codfw: Kubernetes upgrade [13:10:28] nemo-yiannis: yeah I'm just warning :D [13:10:34] ah ok [13:10:43] i thought i typed the wrong env :) [13:10:50] heh, sorry for the confusion [13:11:21] FWIW staging deployment failed [13:12:03] !log cgoubert@deploy1003 Started scap sync-world: Redeploying mediawiki following kubernets upgrade T397148 [13:12:09] T397148: Update wikikube codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T397148 [13:14:23] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2008.codfw.wmnet with OS bookworm [13:15:45] (03CR) 
10Muehlenhoff: [C:03+2] debmonitor: Remove unused Hiera option [puppet] - 10https://gerrit.wikimedia.org/r/1162887 (owner: 10Muehlenhoff) [13:15:51] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage [13:16:13] (03PS1) 10Klausman: services/machinetranslation: add network policy to allow access to Thanos/Swift S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162900 [13:18:07] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: sync [13:18:16] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [13:18:57] (03PS1) 10Andrew Bogott: Revert "neutron: update policy.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1162901 [13:19:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.26s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:19:29] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage [13:19:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:20:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:20:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.476s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:21:09] (03CR) 10Andrew Bogott: [C:03+2] Revert "neutron: update policy.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1162901 (owner: 10Andrew Bogott) [13:21:27] (03PS1) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [13:21:43] (03PS2) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [13:22:11] (03CR) 10CI reject: [V:04-1] Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [13:23:54] (03PS3) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [13:24:10] jayme: yes. on it. [13:24:14] not deployable. 
[13:24:19] (03CR) 10CI reject: [V:04-1] Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [13:24:26] kart_: see -ml [13:24:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P78644 and previous config saved to /var/cache/conftool/dbconfig/20250623-132427-marostegui.json [13:25:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.359s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:25:20] (03PS1) 10Effie Mouzeli: memcached: enable extstore on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1162904 (https://phabricator.wikimedia.org/T352885) [13:25:56] (03PS4) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [13:26:46] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: sync [13:27:03] (03PS2) 10Effie Mouzeli: memcached: enable extstore on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1162904 (https://phabricator.wikimedia.org/T352885) [13:27:08] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: sync [13:27:09] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: sync [13:27:29] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync [13:27:30] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: sync [13:27:39] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: sync [13:27:40] !log kamila@deploy1003 helmfile [codfw] START 
helmfile.d/services/mw-debug: sync [13:27:46] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [13:27:47] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: sync [13:27:58] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: sync [13:27:59] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: sync [13:28:13] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: sync [13:28:14] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: sync [13:28:30] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162904 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [13:30:53] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: sync [13:30:56] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: sync [13:32:20] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6002.drmrs.wmnet} and A:liberica (T396561) [13:32:22] !log repool lvs6002 using katran - T396561 [13:32:26] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [13:32:26] (03PS3) 10Effie Mouzeli: memcached: enable extstore on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1162904 (https://phabricator.wikimedia.org/T352885) [13:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:34] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162904 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [13:32:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6002.drmrs.wmnet} and A:liberica (T396561) [13:33:24] (03PS2) 10Alexandros Kosiaris: calico: Switch default-deny to using 
services instead of ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161533 (https://phabricator.wikimedia.org/T397341) [13:33:32] (03CR) 10Alexandros Kosiaris: [C:03+2] calico: Switch default-deny to using services instead of ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161533 (https://phabricator.wikimedia.org/T397341) (owner: 10Alexandros Kosiaris) [13:35:21] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2009.codfw.wmnet with OS bookworm [13:37:59] (03PS5) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [13:38:25] (03CR) 10CI reject: [V:04-1] Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [13:38:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:29] (03PS6) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [13:39:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T396130)', diff saved to https://phabricator.wikimedia.org/P78645 and previous config saved to /var/cache/conftool/dbconfig/20250623-133935-marostegui.json [13:39:41] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:39:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1248.eqiad.wmnet with reason: Maintenance [13:39:52] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162908 [13:39:59] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Depooling db1248 (T396130)', diff saved to https://phabricator.wikimedia.org/P78646 and previous config saved to /var/cache/conftool/dbconfig/20250623-133958-marostegui.json [13:40:09] (03Merged) 10jenkins-bot: calico: Switch default-deny to using services instead of ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161533 (https://phabricator.wikimedia.org/T397341) (owner: 10Alexandros Kosiaris) [13:40:33] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: sync [13:40:34] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-misc: sync [13:40:44] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-misc: sync [13:40:45] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: sync [13:41:01] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: sync [13:41:08] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: sync [13:41:11] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: sync [13:41:12] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: sync [13:41:40] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: sync [13:41:41] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: sync [13:41:52] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: sync [13:42:21] (03PS1) 10Alexandros Kosiaris: rdb201[12]: site.pp entries [puppet] - 10https://gerrit.wikimedia.org/r/1162909 (https://phabricator.wikimedia.org/T393121) [13:42:54] !log cgoubert@deploy1003 Started scap sync-world: Redeploying mediawiki following kubernets upgrade T397148 [13:43:00] T397148: Update wikikube codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T397148 [13:43:23] !log cgoubert@deploy1003 cgoubert: Redeploying mediawiki following kubernets upgrade T397148 synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:43:40] (03CR) 10Alexandros Kosiaris: [C:03+2] rdb201[12]: site.pp entries [puppet] - 10https://gerrit.wikimedia.org/r/1162909 (https://phabricator.wikimedia.org/T393121) (owner: 10Alexandros Kosiaris) [13:43:54] !log cgoubert@deploy1003 cgoubert: Continuing with sync [13:44:45] (03CR) 10Marostegui: "You may need to also restart replication on all sanitarium hosts for all sections." [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:44:55] !log cgoubert@deploy1003 Finished scap sync-world: Redeploying mediawiki following kubernets upgrade T397148 (duration: 02m 00s) [13:45:44] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:magru and A:cp - 9.2.11 upgrade (T397456) [13:45:49] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [13:45:50] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:46:42] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:47:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T396130)', diff saved to https://phabricator.wikimedia.org/P78647 and previous config saved to /var/cache/conftool/dbconfig/20250623-134737-marostegui.json [13:47:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:52:57] FIRING: [5x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:53:16] still expected? 
[13:53:18] (03PS2) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [13:53:20] !incidents [13:53:21] 6397 (ACKED) [21x] ProbeDown sre (ip4 probes/service codfw) [13:53:21] 6398 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [13:53:40] vgutierrez: fpm too busy expected, the other one not sure [13:53:42] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:54:15] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:54:56] the other one being the acked probedown? [13:55:11] other one expected as well [13:55:19] we forgot to redeploy ingress [13:55:19] ok [13:55:23] it's fine it's still depooled [13:56:25] (03CR) 10Ladsgroup: "so much sacrifice for getting rid of that list but a sacrifice I'm willing to take and restart the sanitarium hosts too." [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:57:57] RESOLVED: [5x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:58:17] 06SRE, 06Traffic, 03FY2025-26 WE 3.3.4 Reading Lists on Web: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#10938446 (10Jdrewniak) [13:58:30] FIRING: [5x] ProbeDown: Service 
chart-renderer:30443 has failed probes (http_chart-renderer_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:58] (03CR) 10Volans: [C:03+2] client: remove self-update capability (032 comments) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1162855 (owner: 10Volans) [13:59:11] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/codfw: maintenance [13:59:20] (03CR) 10Volans: [C:03+2] debmonitor: remove client self-update capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1162860 (owner: 10Volans) [13:59:30] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:59:33] RESOLVED: [5x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:38] (03CR) 10Volans: [C:03+2] client: remove dependency on client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1162861 (owner: 10Volans) [13:59:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10938448 (10Jclark-ctr) Completed all drive swaps except for an-worker1175 need to sort through th... [13:59:54] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:00:49] Hii, I might have missed the previous messages. Is the deployment still happening? Thanks. 
[14:00:54] (03Merged) 10jenkins-bot: client: remove self-update capability [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1162855 (owner: 10Volans) [14:00:54] (03Merged) 10jenkins-bot: debmonitor: remove client self-update capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1162860 (owner: 10Volans) [14:01:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.02% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:01:16] (03CR) 10Marostegui: "https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1131954/15/cookbooks/sre/mysql/sanitarium_restart.py" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:02:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P78648 and previous config saved to /var/cache/conftool/dbconfig/20250623-140245-marostegui.json [14:02:55] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:03:06] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10938456 (10elukey) @Jhancock.wm Hi! The I/F team is doing an hackathon this week so I'll try to work on this but I can't promise a lot of progress :( From a quick che... 
[14:03:17] (03Merged) 10jenkins-bot: client: remove dependency on client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1162861 (owner: 10Volans) [14:03:32] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube.*,name=codfw [14:03:38] 10SRE-SLO, 06collaboration-services: Implement service level indicator measurement for Gerrit - https://phabricator.wikimedia.org/T396979#10938457 (10ABran-WMF) [14:03:41] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:04:46] (03CR) 10CI reject: [V:04-1] Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [14:05:18] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1002 is OK: (C)0 le (W)3 le 3.384 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [14:05:20] 12 [14:05:24] uff [14:05:30] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on alert1002 is OK: (C)0 le (W)3 le 3.575 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [14:05:30] !log kamila@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. 
[14:06:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.02% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:06:33] !log kamila@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [14:06:36] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [14:06:54] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:07:11] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:07:16] !log kamila@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:07:22] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:07:42] !log kamila@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:07:46] !log kamila@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:08:54] !log kamila@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:08:58] !log kamila@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:09:36] !log kamila@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:09:40] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:10:12] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:10:17] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:10:44] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:13:34] Lucas_WMDE, Urbanecm, and TheresNoTime, seanleong-wmde, tgr, Tchanders, and anzx so sorry for the wait, you should be able to deploy now [14:13:39] !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/codfw: maintenance [14:14:43] (03PS1) 10Vgutierrez: hiera: Switch lvs6001 (text) to katran [puppet] - 10https://gerrit.wikimedia.org/r/1162915 (https://phabricator.wikimedia.org/T396561) [14:15:31] (03CR) 10Ssingh: [C:03+1] hiera: Switch lvs6001 (text) to katran [puppet] - 10https://gerrit.wikimedia.org/r/1162915 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:15:35] (03PS1) 10Federico Ceratto: wmf_root_client.pp: install wmfdb-admin on cumin [puppet] - 10https://gerrit.wikimedia.org/r/1160707 (https://phabricator.wikimedia.org/T393990) [14:15:35] (03CR) 10Federico Ceratto: "Installing the package on cumin1003" [puppet] - 10https://gerrit.wikimedia.org/r/1160707 (https://phabricator.wikimedia.org/T393990) (owner: 10Federico Ceratto) [14:15:52] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162915 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:17:04] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, simpler option inline" [puppet] - 10https://gerrit.wikimedia.org/r/1160707 (https://phabricator.wikimedia.org/T393990) (owner: 10Federico Ceratto) [14:17:13] !log repool ms swift in codfw [14:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:28] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [14:17:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P78649 and previous config saved to /var/cache/conftool/dbconfig/20250623-141753-marostegui.json [14:18:09] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6001.drmrs.wmnet} and 
A:liberica (T396561) [14:18:14] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=codfw [14:18:15] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [14:18:31] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6001.drmrs.wmnet} and A:liberica (T396561) [14:19:35] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs6001.drmrs.wmnet with reason: switching to katran [14:20:35] (03PS1) 10Andrew Bogott: Update horizon version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1162916 (https://phabricator.wikimedia.org/T397272) [14:21:05] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: sync [14:21:35] (03CR) 10Andrew Bogott: [C:03+2] Update horizon version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1162916 (https://phabricator.wikimedia.org/T397272) (owner: 10Andrew Bogott) [14:22:56] I was in an interview anyway [14:23:01] anything left to deploy from the window? [14:24:11] (03PS1) 10Eevans: sessionstore2005: reimage to JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1162917 (https://phabricator.wikimedia.org/T390514) [14:24:13] (03PS1) 10Eevans: sessionstore2005: reconfigure instance for JBOD devices [puppet] - 10https://gerrit.wikimedia.org/r/1162918 (https://phabricator.wikimedia.org/T390514) [14:24:54] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs6001 (text) to katran [puppet] - 10https://gerrit.wikimedia.org/r/1162915 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:27:26] (03PS1) 10BPirkle: REST: Introduce new RestModuleOverride config value. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162919 (https://phabricator.wikimedia.org/T395719) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1430) [14:30:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:09] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: sync [14:32:15] (03PS1) 10Andrew Bogott: Correct horizon version in codfw1dev; typo [puppet] - 10https://gerrit.wikimedia.org/r/1162920 (https://phabricator.wikimedia.org/T397272) [14:32:50] (03CR) 10Andrew Bogott: [C:03+2] Correct horizon version in codfw1dev; typo [puppet] - 10https://gerrit.wikimedia.org/r/1162920 (https://phabricator.wikimedia.org/T397272) (owner: 10Andrew Bogott) [14:33:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T396130)', diff saved to https://phabricator.wikimedia.org/P78650 and previous config saved to /var/cache/conftool/dbconfig/20250623-143300-marostegui.json [14:33:03] Lucas_WMDE Hii, do you mind helping me deploy my changes? 
[14:33:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1249.eqiad.wmnet with reason: Maintenance [14:33:06] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:33:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T396130)', diff saved to https://phabricator.wikimedia.org/P78651 and previous config saved to /var/cache/conftool/dbconfig/20250623-143311-marostegui.json [14:33:37] seanleong-wmde: in principle no, but right now there’s another deploy window and I don’t know if it’s okay to deploy mediawiki stuff during that window [14:33:45] (since I have no idea who or what the xLab is) [14:34:38] anyone here know more about it? [14:35:26] (03CR) 10MVernon: [C:03+1] sessionstore2005: reimage to JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1162917 (https://phabricator.wikimedia.org/T390514) (owner: 10Eevans) [14:36:02] (03CR) 10MVernon: [C:03+1] sessionstore2005: reconfigure instance for JBOD devices [puppet] - 10https://gerrit.wikimedia.org/r/1162918 (https://phabricator.wikimedia.org/T390514) (owner: 10Eevans) [14:36:51] ahh okay, it's a config change, it'll great if it's possible, if not then I can also schedule another backport tmr, thanks! 
- https://gerrit.wikimedia.org/r/c/1141852/ [14:38:22] (03CR) 10Eevans: [C:03+2] sessionstore2005: reimage to JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1162917 (https://phabricator.wikimedia.org/T390514) (owner: 10Eevans) [14:39:00] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs6001.drmrs.wmnet [14:39:01] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs6001.drmrs.wmnet [14:39:46] (03PS1) 10Vgutierrez: hiera: Repool lvs6001 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1162922 (https://phabricator.wikimedia.org/T396561) [14:40:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T396130)', diff saved to https://phabricator.wikimedia.org/P78652 and previous config saved to /var/cache/conftool/dbconfig/20250623-144005-marostegui.json [14:40:11] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162922 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:40:12] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:41:24] (03CR) 10Ssingh: [C:03+1] hiera: Repool lvs6001 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1162922 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:45:35] (03PS1) 10Ayounsi: Puppet: add a get_facts function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162923 [14:46:32] (03CR) 10Ssingh: "Can we expand the country codes in the commit message for our own understanding and for posterity?" 
[dns] - 10https://gerrit.wikimedia.org/r/1162078 (owner: 10CDobbins) [14:48:44] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:50:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:50:51] (03Abandoned) 10Ayounsi: Puppet: add a get_facts function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162923 (owner: 10Ayounsi) [14:52:35] I am trying to deploy mobileapps on staging but helm hangs. Any idea why this happens? [14:53:41] (03PS3) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [14:53:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:55:00] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10938608 (10MoritzMuehlenhoff) [14:55:02] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs6001 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1162922 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:55:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P78653 and previous config saved to /var/cache/conftool/dbconfig/20250623-145514-marostegui.json [14:57:04] (03PS2) 10Scott French: deployment_server: use bookworm httpd in mw-debug/next mw-*/migration [puppet] - 10https://gerrit.wikimedia.org/r/1162036 (https://phabricator.wikimedia.org/T378128) 
[14:58:53] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:58:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:59:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6001.drmrs.wmnet} and A:liberica (T396561) [14:59:13] !log repool lvs6001 using katran - T396561 [14:59:15] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [14:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:28] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6001.drmrs.wmnet} and A:liberica (T396561) [15:00:30] !log decommission Cassandra/sessionstore2005-a — T391544 [15:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:36] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [15:00:40] jouncebot: nowandnext [15:00:40] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [15:00:40] In 0 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1530) [15:00:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:00:48] seanleong-wmde: we could deploy the config change now [15:00:51] unless anyone objects? 
[15:00:56] nemo-yiannis: kubectl get events tells me "Error creating: pods "mobileapps-staging-68c4cd89f5-5r7v9" is forbidden: failed quota: quota-compute-resources: must specify limits.cpu for: staging-metrics-exporter; limits.memory for: staging-metrics-exporter" [15:00:56] So a wild guess without more context: some pod is not specifying limits.cpu and limits.memory? [15:01:18] also cc tgr and anzx if you’re still around (and we have enough time before the next window) [15:01:29] jelto: i think i so a bump in the cpu in the helm diff but i didn't put it [15:01:42] maybe its required by prod but staging doesn't allow the resources ? [15:01:44] * nemo-yiannis checks [15:02:35] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye [15:02:42] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10938665 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host... [15:02:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [15:02:47] * Lucas_WMDE deploying ^ [15:03:05] (03CR) 10CI reject: [V:04-1] Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [15:03:21] Lucas_WMDE : ಒ/ [15:03:27] o/ [15:03:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10938672 (10Jclark-ctr) @wiki_willy @Stevemunene Figured out the shortage original ticket Parent... 
[15:04:26] (03Merged) 10jenkins-bot: Create feature flags to resolve Wikibase item labels on the Watchlist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [15:04:32] (03CR) 10Lucas Werkmeister (WMDE): brwiki: add patroller usergroup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) (owner: 10Anzx) [15:04:43] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1141852|Create feature flags to resolve Wikibase item labels on the Watchlist. (T388685)]] [15:04:49] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [15:04:58] Lucas_WMDE o/ [15:05:12] (03PS4) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:54] !log lucaswerkmeister-wmde@deploy1003 neslihanturan, lucaswerkmeister-wmde: Backport for [[gerrit:1141852|Create feature flags to resolve Wikibase item labels on the Watchlist. (T388685)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:07:01] (03PS1) 10Muehlenhoff: Update account meta data [puppet] - 10https://gerrit.wikimedia.org/r/1162926 [15:07:15] hm, I guess there’s not a lot to test [15:07:23] seanleong-wmde: please test that the feature isn’t yet enabled in production :P [15:07:29] (03PS6) 10Anzx: brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) [15:07:35] even on WikimediaDebug [15:08:00] anzx: I can deploy your config change afterwards :) [15:08:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] brwiki: add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) (owner: 10Anzx) [15:08:30] FIRING: [2x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:38] Lucas_WMDE Nope, it's not there, we just want it on Beta Cluster [15:08:49] Not in production and not in WMDebug [15:08:55] and you checked that it’s not there? ^^ [15:09:22] but I also did not see it in beta cluster [15:09:23] (03PS1) 10Jgiannelos: Revert "wikifeeds: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162928 [15:09:30] (03PS1) 10KartikMistry: MinT: Update to 2025-06-23-145751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162929 [15:09:35] yeah, that’ll take ca. 10 minutes longer [15:09:40] (03PS1) 10Jgiannelos: Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162930 [15:09:51] okay, I will confirm after 10 min, thanks @Luka [15:09:56] !log lucaswerkmeister-wmde@deploy1003 neslihanturan, lucaswerkmeister-wmde: Continuing with sync [15:10:00] Lucas_WMDE thanks! 
[15:10:00] ok, let’s continue the deployment then
[15:10:11] (CR) Muehlenhoff: [C:+2] Update account meta data [puppet] - https://gerrit.wikimedia.org/r/1162926 (owner: Muehlenhoff)
[15:10:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P78654 and previous config saved to /var/cache/conftool/dbconfig/20250623-151021-marostegui.json
[15:10:28] (CR) Klausman: [C:+2] MinT: Update to 2025-06-23-145751-production [deployment-charts] - https://gerrit.wikimedia.org/r/1162929 (owner: KartikMistry)
[15:11:56] (PS2) Jgiannelos: Revert "mobileapps: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1162930
[15:12:10] (CR) Jgiannelos: [C:+2] Revert "mobileapps: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1162930 (owner: Jgiannelos)
[15:12:15] (CR) Jgiannelos: [C:+2] Revert "wikifeeds: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1162928 (owner: Jgiannelos)
[15:12:18] (Merged) jenkins-bot: MinT: Update to 2025-06-23-145751-production [deployment-charts] - https://gerrit.wikimedia.org/r/1162929 (owner: KartikMistry)
[15:13:26] (CR) Vgutierrez: install_server: UEFI setup for cp20[43-58] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: Fabfur)
[15:14:04] (Merged) jenkins-bot: Revert "mobileapps: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1162930 (owner: Jgiannelos)
[15:14:04] (Merged) jenkins-bot: Revert "wikifeeds: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1162928 (owner: Jgiannelos)
[15:15:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host rdb2011.codfw.wmnet with OS bookworm
[15:15:28] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] -
https://phabricator.wikimedia.org/T393121#10938711 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host rdb2011.codfw.wmnet with OS bookworm
[15:15:31] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host rdb2012.codfw.wmnet with OS bookworm
[15:15:38] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10938714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host rdb2012.codfw.wmnet with OS bookworm
[15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:51] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141852|Create feature flags to resolve Wikibase item labels on the Watchlist.
(T388685)]] (duration: 12m 07s)
[15:16:56] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685
[15:17:25] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) (owner: Anzx)
[15:18:06] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[15:18:08] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2005.codfw.wmnet with OS bullseye
[15:18:14] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10938733 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess...
[15:18:30] RESOLVED: [2x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:18:38] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[15:18:52] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10938734 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host...
[15:18:56] (Merged) jenkins-bot: brwiki: add patroller usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/1162889 (https://phabricator.wikimedia.org/T397576) (owner: Anzx)
[15:19:08] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1162889|brwiki: add patroller usergroup (T397576)]]
[15:19:13] T397576: Create a patroller group for the Breton Wikipedia - https://phabricator.wikimedia.org/T397576
[15:19:44] Lucas_WMDE it's working fine thanks!
[15:20:24] \o/
[15:21:09] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1162889|brwiki: add patroller usergroup (T397576)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:21:11] Lucas_WMDE: checking
[15:22:14] Lucas_WMDE: looks good
[15:22:18] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync
[15:22:19] thanks!
[15:22:23] (LGTM too FWIW)
[15:22:40] ((“LGTM2FWIW”?
🤔))
[15:25:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T396130)', diff saved to https://phabricator.wikimedia.org/P78655 and previous config saved to /var/cache/conftool/dbconfig/20250623-152529-marostegui.json
[15:25:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:magru and A:cp - 9.2.11 upgrade (T397456)
[15:25:34] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[15:25:39] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456
[15:25:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1252.eqiad.wmnet with reason: Maintenance
[15:25:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1252 (T396130)', diff saved to https://phabricator.wikimedia.org/P78656 and previous config saved to /var/cache/conftool/dbconfig/20250623-152551-marostegui.json
[15:26:42] (CR) Ssingh: "Disclaimer: Not an expert by any shot on this so please factor that in.
Some "obvious" stuff below:" [puppet] - https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: Fabfur)
[15:27:08] (PS1) Jgiannelos: wikifeeds: Test nodejs 20 on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1162937
[15:28:11] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2011.codfw.wmnet with reason: host reimage
[15:28:11] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2012.codfw.wmnet with reason: host reimage
[15:28:37] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2005.codfw.wmnet with OS bullseye
[15:28:51] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10938769 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess...
[15:29:03] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[15:29:12] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10938771 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host...
[15:29:23] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1162889|brwiki: add patroller usergroup (T397576)]] (duration: 10m 14s)
[15:29:27] T397576: Create a patroller group for the Breton Wikipedia - https://phabricator.wikimedia.org/T397576
[15:29:31] Lucas_WMDE: Thanks for deploying
[15:29:42] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:29:55] anzx: np :)
[15:30:04] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1530). Please do the needful.
[15:30:52] (PS1) Jgiannelos: mobileapps: Test nodejs 20 on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1162940
[15:30:52] (PS1) Jgiannelos: proton: Test nodejs 20 on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1162941
[15:30:52] (PS1) Jgiannelos: push-notifications: Test nodejs 20 on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1162942
[15:32:27] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2011.codfw.wmnet with reason: host reimage
[15:32:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T396130)', diff saved to https://phabricator.wikimedia.org/P78657 and previous config saved to /var/cache/conftool/dbconfig/20250623-153235-marostegui.json
[15:32:41] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[15:34:21] (PS2) Jgiannelos: proton: Test nodejs 20 on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1162941
[15:34:21] (PS2) Jgiannelos: push-notifications: Test nodejs 20 on
staging [deployment-charts] - https://gerrit.wikimedia.org/r/1162942
[15:34:21] (PS1) Jgiannelos: mobileapps: Test nodejs 20 on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1162943
[15:35:08] (PS1) Andrew Bogott: Update horizon version in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1162945 (https://phabricator.wikimedia.org/T397272)
[15:35:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:36:00] (CR) Andrew Bogott: [C:+2] Update horizon version in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1162945 (https://phabricator.wikimedia.org/T397272) (owner: Andrew Bogott)
[15:36:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2012.codfw.wmnet with reason: host reimage
[15:37:28] jelto: Can you help me unblock staging mobileapps deployments?
I reduced the cpu limits here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162943
[15:38:18] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[15:38:30] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:38:36] (PS2) Vgutierrez: hiera,cirrus: Enable IPIP on search*@codfw services [puppet] - https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309)
[15:40:07] (CR) Vgutierrez: [C:-2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez)
[15:40:15] (CR) Vgutierrez: hiera,cirrus: Enable IPIP on search*@codfw services [puppet] - https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez)
[15:41:41] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[15:42:40] SRE, Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10938853 (Vgutierrez) ` $ host -t DS pywikipedia.org pywikipedia.org has no DS record ` looking good, I'll attempt to issue a certificate for pywikipedia.org soon
[15:43:11] (PS1) Vgutierrez: Revert^4 "acmechief: Add pywikipedia.org to the cert list" [puppet] - https://gerrit.wikimedia.org/r/1162949 (https://phabricator.wikimedia.org/T388809)
[15:43:27] (PS2) Vgutierrez: Revert^4 "acmechief: Add pywikipedia.org to the cert list" [puppet] - https://gerrit.wikimedia.org/r/1162949 (https://phabricator.wikimedia.org/T388809)
[15:44:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[15:44:54] !log
brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[15:45:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[15:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:45:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[15:46:24] (PS1) Jelto: peopleweb: add KUBEPOD ranges to firewall [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T335491)
[15:46:50] (PS2) Jelto: peopleweb: add KUBEPOD ranges to firewall [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T397148)
[15:46:51] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[15:47:05] jouncebot: nowandnext
[15:47:05] For the next 0 hour(s) and 12 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1530)
[15:47:06] In 1 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1700)
[15:47:06] In 1 hour(s) and 12 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1700)
[15:47:13] !log drop backup users from es1-es5 hosts T387892
[15:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:18] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892
[15:47:20] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2005.codfw.wmnet with OS bullseye
[15:47:27] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security,
Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10938888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess...
[15:47:33] (CR) Elukey: [C:+1] "Thanks a lot :)" [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T397148) (owner: Jelto)
[15:47:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[15:47:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P78658 and previous config saved to /var/cache/conftool/dbconfig/20250623-154743-marostegui.json
[15:48:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[15:48:14] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[15:48:51] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[15:48:58] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6049/co" [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T397148) (owner: Jelto)
[15:51:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[15:51:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2011.codfw.wmnet with OS bookworm
[15:51:27] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10938924 (ops-monitoring-bot) Cookbook
cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host rdb2011.codfw.wmnet with OS bookworm completed: - rdb2011 (**PASS**) - Remov...
[15:52:46] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[15:52:55] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[15:53:05] (CR) Jelto: [V:+1] "port 443 seems to be missing, I'll amend another patchset" [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T397148) (owner: Jelto)
[15:54:08] (PS3) Jelto: peopleweb: add KUBEPOD ranges to firewall [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T397148)
[15:54:38] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[15:54:38] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2012.codfw.wmnet with OS bookworm
[15:54:58] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10938961 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host rdb2012.codfw.wmnet with OS bookworm completed: - rdb2012 (**PASS**) - Remov...
[15:55:08] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10938964 (Jhancock.wm) Open→Resolved a:akosiaris→Jhancock.wm
[15:55:24] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10938968 (Jhancock.wm) @akosiaris all done!
[15:55:51] (CR) CDobbins: "They do.
South America has 12 sovereign states and three non-sovereign territories, one of which (South Georgia and the South Sandwich Isl" [dns] - https://gerrit.wikimedia.org/r/1162078 (owner: CDobbins)
[15:56:38] (PS1) Brouberol: mediawiki-dumps-legacy: deploy the sync toolbox with the sync image [deployment-charts] - https://gerrit.wikimedia.org/r/1162954 (https://phabricator.wikimedia.org/T388378)
[15:56:49] (PS2) CDobbins: geo-maps: update default for South America [dns] - https://gerrit.wikimedia.org/r/1162078
[15:57:03] (CR) CDobbins: geo-maps: update default for South America (1 comment) [dns] - https://gerrit.wikimedia.org/r/1162078 (owner: CDobbins)
[15:58:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:58:05] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6050/co" [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T397148) (owner: Jelto)
[15:58:43] (CR) Btullis: [C:+1] mediawiki-dumps-legacy: deploy the sync toolbox with the sync image [deployment-charts] - https://gerrit.wikimedia.org/r/1162954 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[15:58:48] (CR) Ssingh: "Thank you. Can you please add here the mappings for SR, GY, GF, FK? I know you did previously but so that we have a reference for later."
[dns] - https://gerrit.wikimedia.org/r/1162078 (owner: CDobbins)
[16:01:40] jhancock@cumin1003 provision (PID 2844930) is awaiting input
[16:02:05] (CR) Alexandros Kosiaris: [C:+1] deployment_server: use bookworm httpd in mw-debug/next mw-*/migration [puppet] - https://gerrit.wikimedia.org/r/1162036 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[16:02:23] (CR) Clément Goubert: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1162036 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[16:02:33] jouncebot: nowandnext
[16:02:33] No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[16:02:33] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1700)
[16:02:33] In 0 hour(s) and 57 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1700)
[16:02:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P78659 and previous config saved to /var/cache/conftool/dbconfig/20250623-160250-marostegui.json
[16:02:53] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[16:03:00] (PS4) Urbanecm: [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958)
[16:03:04] (CR) Urbanecm: [C:+2] [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: Urbanecm)
[16:04:09] (Merged) jenkins-bot: [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: Urbanecm)
[16:04:11] (CR)
TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: Urbanecm)
[16:04:23] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1159465|[Growth] Prepare for the Get Started notification experiment (T394958)]]
[16:04:28] T394958: Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958
[16:05:54] SRE-swift-storage, Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#10939037 (Fabfur) a:Fabfur
[16:06:17] (PS2) Vgutierrez: hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309)
[16:06:19] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1159465|[Growth] Prepare for the Get Started notification experiment (T394958)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:06:33] (CR) Vgutierrez: hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez)
[16:06:37] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez)
[16:08:11] !log urbanecm@deploy1003 urbanecm: Continuing with sync
[16:09:03] (CR) Ssingh: [C:+1] Revert^4 "acmechief: Add pywikipedia.org to the cert list" [puppet] - https://gerrit.wikimedia.org/r/1162949 (https://phabricator.wikimedia.org/T388809) (owner: Vgutierrez)
[16:09:35] (CR) Vgutierrez: [C:+2] Revert^4 "acmechief: Add pywikipedia.org to the cert list" [puppet] - https://gerrit.wikimedia.org/r/1162949 (https://phabricator.wikimedia.org/T388809) (owner: Vgutierrez)
[16:10:53] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[16:11:01] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2005.codfw.wmnet with OS bullseye
[16:11:12] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939062 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess...
[16:12:01] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[16:12:06] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[16:12:12] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939070 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host...
[16:13:14] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:13:55] (CR) AOkoth: gerrit: read-only plugin orchestration in failover (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: Arnaudb)
[16:14:31] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: deploy the sync toolbox with the sync image [deployment-charts] - https://gerrit.wikimedia.org/r/1162954 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[16:14:59] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159465|[Growth] Prepare for the Get Started notification experiment (T394958)]] (duration: 10m 36s)
[16:15:05] T394958: Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958
[16:17:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T396130)', diff saved to https://phabricator.wikimedia.org/P78660 and previous config saved to /var/cache/conftool/dbconfig/20250623-161757-marostegui.json
[16:18:04] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[16:18:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:18:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[16:20:21] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[16:21:53] (PS1) Scott French: mw-(api-ext|web): pilot 5% of traffic on new httpd images [deployment-charts] - https://gerrit.wikimedia.org/r/1162962 (https://phabricator.wikimedia.org/T378128)
[16:23:11] (CR) Arnaudb: gerrit: read-only
plugin orchestration in failover (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: Arnaudb)
[16:24:08] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[16:24:13] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[16:25:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[16:27:28] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[16:27:33] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[16:28:25] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[16:28:31] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[16:29:16] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[16:29:20] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[16:30:07] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[16:30:10] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[16:31:17] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2005.codfw.wmnet with OS bullseye
[16:31:34] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess...
[16:31:57] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[16:32:11] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host...
[16:32:13] (CR) AOkoth: gerrit: read-only plugin orchestration in failover (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: Arnaudb)
[16:32:16] (PS1) Cmelo: Release the CampaignEvents extension to all remaining Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784)
[16:32:34] SRE, Pywikibot, Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10939191 (Vgutierrez) change has been merged, acme-chief won't attempt to issue a new one till 2025-06-26 cause the current certificate is on the certificate st...
[16:32:57] (PS2) Cmelo: Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784)
[16:33:27] !log jelto@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-wikikube-ro,name=codfw
[16:34:26] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[16:34:33] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:57] (CR) Jelto: [V:+1 C:+2] peopleweb: add KUBEPOD ranges to firewall [puppet] - https://gerrit.wikimedia.org/r/1162952 (https://phabricator.wikimedia.org/T397148) (owner: Jelto)
[16:35:45] (PS1) Brouberol: Revert "mediawiki-dumps-legacy: deploy the sync toolbox with the sync image" [deployment-charts] - https://gerrit.wikimedia.org/r/1162968
[16:35:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:36:02] (CR) Btullis: [C:+1] Revert "mediawiki-dumps-legacy: deploy the sync toolbox with the sync image" [deployment-charts] - https://gerrit.wikimedia.org/r/1162968 (owner: Brouberol)
[16:42:05] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2005.codfw.wmnet with OS bullseye
[16:42:12] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939265 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess...
[16:42:49] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye [16:42:56] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939268 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host... [16:44:37] !log sgimeno@deploy1003 run `foreachwikiindblist growthexperiments CommunityConfiguration:setVersionData GrowthSuggestedEdits 1.0.0` — T393769 [16:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:43] T393769: Disable task type X for a user when the configured threshold is reached - https://phabricator.wikimedia.org/T393769 [16:48:20] (03CR) 10CDobbins: "`GF => [codfw, ulsfo, drmrs, eqiad, eqsin, esams, magru], # French Guiana" [dns] - 10https://gerrit.wikimedia.org/r/1162078 (owner: 10CDobbins) [16:52:16] (03CR) 10Andrea Denisse: [C:03+2] grafana: Disable dashboard sync for a version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1161628 (https://phabricator.wikimedia.org/T397442) (owner: 10Andrea Denisse) [16:54:17] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:54:36] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [16:57:16] (03PS1) 10JHathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1162970 [16:57:40] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [16:58:43] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [16:59:56] (03CR) 10JHathaway: [C:03+2] update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1162970 (owner: 10JHathaway) [17:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) 
deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1700). [17:00:05] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T1700). Please do the needful. [17:00:25] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage [17:00:26] o/ [17:00:29] (03PS1) 10Andrea Denisse: Revert "grafana: Disable dashboard sync for a version upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/1162973 [17:00:30] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [17:01:22] I'll be getting started shortly [17:02:16] (03CR) 10Andrea Denisse: [C:03+2] Revert "grafana: Disable dashboard sync for a version upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/1162973 (owner: 10Andrea Denisse) [17:03:47] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage [17:04:16] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1162036 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:04:58] (03CR) 10Scott French: [C:03+2] deployment_server: use bookworm httpd in mw-debug/next mw-*/migration [puppet] - 10https://gerrit.wikimedia.org/r/1162036 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:09:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:10:37] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:11:20] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [17:11:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:11:21] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [17:11:35] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:12:32] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search-psi,name=eqiad [17:14:51] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:15:02] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:15:06] !log swfrench@deploy1003 Started scap sync-world: Deploy bookworm httpd images to mw-debug/next - T378128 [17:15:11] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:16:34] !log swfrench@deploy1003 swfrench: Deploy bookworm httpd images to mw-debug/next - T378128 synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:17:22] swfrench-wmf: I have a scap update to deploy when you're done. [17:17:26] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:17:37] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:17:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:17:56] dancy: sounds good - I should be out of your way in a couple of minutes [17:18:04] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:18:30] FIRING: ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2005-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:18:35] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015#10939408 (10Jhancock.wm) [17:19:19] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [17:19:33] FIRING: [2x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:56] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [17:20:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install build2003.codfw.wmnet - 
https://phabricator.wikimedia.org/T393015#10939425 (10Jhancock.wm) This server isn't getting a clean provisioning run Traceback (most recent call last): File "/usr/lib/python3/dist-packages/sp... [17:21:26] !log swfrench@deploy1003 swfrench: Continuing with sync [17:21:51] (03PS1) 10Andrew Bogott: Keystone: increase ldap pool size [puppet] - 10https://gerrit.wikimedia.org/r/1162979 (https://phabricator.wikimedia.org/T379550) [17:22:30] !log swfrench@deploy1003 Finished scap sync-world: Deploy bookworm httpd images to mw-debug/next - T378128 (duration: 08m 00s) [17:22:35] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:22:57] dancy: over to you :) [17:23:02] thx [17:23:07] !log dancy@deploy1003 Installing scap version "4.182.0" for 2 host(s) [17:23:29] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:23:30] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2005.codfw.wmnet with OS bullseye [17:23:35] (03CR) 10Andrew Bogott: [C:03+2] Keystone: increase ldap pool size [puppet] - 10https://gerrit.wikimedia.org/r/1162979 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [17:23:38] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939439 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess... 
[17:24:52] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:24:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015#10939450 (10akosiaris) >>! In T393015#10939425, @Jhancock.wm wrote: > This server isn't getting a clean provisioning run > > Traceback (most r... [17:24:57] !log dancy@deploy1003 Installation of scap version "4.182.0" completed for 2 hosts [17:25:50] !log dancy@deploy1003 Locking from deployment [ALL REPOSITORIES]: test [17:25:52] !log dancy@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: test (duration: 00m 02s) [17:30:00] (03PS3) 10Cmelo: Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) [17:31:22] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [17:33:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015#10939527 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @akosiaris I attempted BIOS and UEFI on this one but it has the same fail b... [17:34:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [17:36:52] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10939544 (10Volans) Those new servers are of generations 17, that is the first one shipped with iDRAC 10 and a firmware version of 1.20.x.x. It's Redfish support is slig... 
[17:40:13] 06SRE, 06Infrastructure-Foundations, 06serviceops, 07ARM support: Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811#10939570 (10akosiaris) Our first arm64 server just got racked. We 'll need to figure out how to incorporate it in our to... [17:41:35] jhancock@cumin1003 provision (PID 2856451) is awaiting input [17:42:02] JennH: FYI ^^^ [17:42:25] Checking [17:43:05] FYI this is the mechanism to notify people when there is a cookbook pending input in action ;) [17:44:25] Yeah. I goof that sometimes. The bots like to tattle on me. [17:45:00] lol [17:45:58] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10939591 (10KFrancis) Hello all, the NDA is complete. Thanks! [17:46:08] it's all fun and games until we don't give it any input, and one day, volans' sentient cookbook that controls all the other cookbooks decides what it wants to do [17:46:13] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search-omega,name=eqiad [17:46:13] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search-psi,name=eqiad [17:46:14] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [17:46:36] ryankemper: oh nice, you are testing the search* changes we rolled out. gl :) [17:47:03] (03CR) 10Daimona Eaytoy: Release the CampaignEvents extension to all Wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [17:47:58] sukhe: yes indeed! 
we didn't see all the traffic switch over when we just tested it on psi 20 mins ago (some moved but not all), now we're seeing what happens with the other clusters [17:52:24] (03CR) 10Fabfur: install_server: UEFI setup for cp20[43-58] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) (owner: 10Fabfur) [17:52:53] (03PS5) 10Fabfur: install_server: UEFI setup for cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) [17:53:27] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host sessionstore2005.codfw.wmnet [17:53:41] !log eevans@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host sessionstore2005.codfw.wmnet [17:54:12] interesting problem report that someone here may be able to answer: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Connection_problem_with_wikipedia.org_and_its_wikis_(no_other_site)_at_home_only [17:54:31] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host sessionstore2005.codfw.wmnet [17:58:27] (03PS2) 10Brouberol: Revert "mediawiki-dumps-legacy: deploy the sync toolbox with the sync image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162968 [18:00:21] (03CR) 10Brouberol: [C:03+2] Revert "mediawiki-dumps-legacy: deploy the sync toolbox with the sync image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162968 (owner: 10Brouberol) [18:00:46] ryankemper: there is at least a 5min TTL on the discovery record itself. which means that even if you switched, the old record will still be cached up to five mins, unless you explicitly clear the caches. [18:00:54] so at least factor in that much time. 
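A minimal sketch of the timing point sukhe makes above, assuming the 5-minute TTL he quotes is the only cache lifetime involved (he says "at least", so the real wait may be longer). The function names are illustrative and not part of any Wikimedia tool.

```python
from datetime import datetime, timedelta

# Assumption: 300 s matches the "5min TTL" mentioned in the channel.
DEFAULT_TTL_SECONDS = 300


def earliest_fully_switched(changed_at: datetime,
                            ttl_seconds: int = DEFAULT_TTL_SECONDS) -> datetime:
    """First moment at which no resolver can still hold the old record.

    A resolver that cached the record just before the change keeps it
    for at most one full TTL.
    """
    return changed_at + timedelta(seconds=ttl_seconds)


def may_still_be_cached(changed_at: datetime, now: datetime,
                        ttl_seconds: int = DEFAULT_TTL_SECONDS) -> bool:
    """True while some client could still be using the old record."""
    return now < earliest_fully_switched(changed_at, ttl_seconds)


if __name__ == "__main__":
    # Timestamp of the last depool logged at 17:46:14.
    depooled = datetime(2025, 6, 23, 17, 46, 14)
    print(earliest_fully_switched(depooled))
    print(may_still_be_cached(depooled, datetime(2025, 6, 23, 17, 50, 0)))
```

Traffic measured before the returned time can legitimately still reach the depooled site, so it should not be counted as evidence that the switch failed.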
[18:00:56] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2005.codfw.wmnet [18:01:29] (03CR) 10CDobbins: "I forgot to mention that there wasn't enough data to even get results for the Falkland Islands, so the omission was intentional" [dns] - 10https://gerrit.wikimedia.org/r/1162078 (owner: 10CDobbins) [18:01:52] sukhe: yeah, we waited well beyond the 5min TTL. I suspect we're missing one more piece to get this all working properly. not sure what that piece might be yet tho :) [18:02:37] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-omega,name=eqiad [18:02:37] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-psi,name=eqiad [18:02:37] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad [18:03:06] ryankemper: out of curiosity, how are you measuring traffic that has moved or not? [18:05:09] anyway, please do ping if you need something from Traffic's side. gl :) [18:06:27] (03CR) 10Eevans: [C:03+2] sessionstore2005: reconfigure instance for JBOD devices [puppet] - 10https://gerrit.wikimedia.org/r/1162918 (https://phabricator.wikimedia.org/T390514) (owner: 10Eevans) [18:06:29] MatmaRex: "This site can't be reached" can mean different things, even basic stuff like failed DNS lookups [18:06:45] it's a Canadian ISP so I am going to rule out DNS filtering but perhaps we can ask them to run https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue and report this to us through Phabricator [18:06:52] jhancock@cumin1003 provision (PID 2856451) is awaiting input [18:07:47] we don't drop connections anyway and even if the IP was on the throttled list, they would get a 429 and not a RST from our servers.
[18:08:05] for sure [18:08:25] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [18:08:35] which is what points to site not reachable, either a failed DNS lookup, a middle box sending RST, or us sending RST (which we don't). that leaves some connection issues which the above link is helpful for them reporting it to us [18:08:37] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2003'] [18:08:46] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-omega,name=codfw [18:08:46] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-psi,name=codfw [18:08:47] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw [18:08:48] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2003'] [18:09:05] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2003'] [18:09:12] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2003'] [18:09:23] the user seems to be a bit of a techie, i've even seen them on phabricator, so they might be willing to go through the steps :) [18:09:36] MatmaRex: thanks, I am also happy to respond with the above link if you don't want to or can't. 
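The reasoning above separates three outcomes a client might see: a failed DNS lookup, a TCP reset (from a middle box, since the servers are said not to send one), and an HTTP 429 for a throttled IP. A rough sketch of that triage follows, assuming a client written in Python; the category labels are hypothetical and only illustrate the distinction.

```python
import socket


def classify_outcome(outcome) -> str:
    """Map a client-side result to one of the cases discussed above.

    `outcome` is either an exception raised while connecting or an
    integer HTTP status code. The labels are illustrative only.
    """
    if isinstance(outcome, socket.gaierror):
        # Name resolution failed before any connection was attempted.
        return "dns_lookup_failed"
    if isinstance(outcome, ConnectionResetError):
        # A RST arrived; per the discussion this would point at a middle box.
        return "connection_reset"
    if outcome == 429:
        # The server answered, so the site was reachable but throttled.
        return "throttled"
    return "other"


if __name__ == "__main__":
    print(classify_outcome(socket.gaierror("lookup failed")))
    print(classify_outcome(ConnectionResetError()))
    print(classify_outcome(429))
```

A 429 means a full HTTP exchange took place, which is why it would not show up as "This site can't be reached" in a browser.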
[18:09:41] and thanks for sharing it here btw [18:09:45] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [18:09:59] sukhe: per e.bernhardson: combination of `watch curl ...` command running that sums up the number of search requests the cluster has run and prints it every 2s, and https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=now-30m&to=now&timezone=utc&var-cluster=elasticsearch&var-exported_cluster=production-search&viewPanel=panel-54 [18:10:26] sukhe: however I just discovered that the codfw main cluster was depooled which we didn't realize. so re-running the tests again with that fixed [18:10:30] sukhe: i think the encouragement to file a bug about this might carry more weight if it comes from someone from the SRE team, it'd be nice if you could comment [18:10:47] i can respond too if you're busy [18:10:55] MatmaRex: not at all, responding. [18:10:58] thanks [18:11:59] ryankemper: ok. I am mostly curious from a service/discovery PoV, on why the traffic is not moving over. hence the questions! [18:12:08] !log bootstrapping Cassandra/sessionstore2005 — T391544 [18:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:13] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [18:12:21] sukhe: yup understood! will follow up with you guys when we have some better info [18:12:54] jhancock@cumin1003 provision (PID 2859502) is awaiting input [18:13:28] ryankemper: sure. [18:13:30] MatmaRex: done. 
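The measurement ryankemper describes, a running sum of search requests across the cluster, could look roughly like the sketch below. The actual `curl` command is elided in the log and is not reconstructed here; the sketch assumes the payload has the shape of an Elasticsearch node stats response (`nodes.<id>.indices.search.query_total`), and the function names are hypothetical.

```python
def total_search_queries(node_stats: dict) -> int:
    """Sum the per-node search query counters in a node stats payload.

    Assumes the Elasticsearch node stats layout; nodes that lack the
    counter contribute zero.
    """
    nodes = node_stats.get("nodes", {})
    return sum(
        node.get("indices", {}).get("search", {}).get("query_total", 0)
        for node in nodes.values()
    )


def queries_between(before: dict, after: dict) -> int:
    """Number of search requests served between two samples."""
    return total_search_queries(after) - total_search_queries(before)


if __name__ == "__main__":
    # Two made-up samples taken a couple of seconds apart.
    first = {"nodes": {"a": {"indices": {"search": {"query_total": 100}}},
                       "b": {"indices": {"search": {"query_total": 250}}}}}
    second = {"nodes": {"a": {"indices": {"search": {"query_total": 120}}},
                        "b": {"indices": {"search": {"query_total": 250}}}}}
    print(queries_between(first, second))
```

If the difference between consecutive samples stays above zero on a depooled cluster after the TTL has passed, some client is still sending traffic there.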
[18:13:30] FIRING: [2x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:13:37] ty [18:16:53] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search-omega,name=eqiad [18:16:53] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search-psi,name=eqiad [18:16:53] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [18:17:55] (03PS1) 10Kosta Harlan: Map pre-save RR scores to predefined values [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) [18:18:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [18:19:22] (03PS1) 10Andrew Bogott: Neutron policy.yaml: remove firewall overrides [puppet] - 10https://gerrit.wikimedia.org/r/1162999 [18:19:23] (03PS1) 10Andrew Bogott: Remove qos policy defs [puppet] - 10https://gerrit.wikimedia.org/r/1163000 [18:19:23] (03PS1) 10Andrew Bogott: Neutron policy.yaml: remove loadbalancer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1163001 [18:19:23] (03PS1) 10Andrew Bogott: Neutron policy.yaml: remove a whole grab bag of obsolete overrides [puppet] - 10https://gerrit.wikimedia.org/r/1163002 [18:19:24] (03PS1) 10Andrew Bogott: Neutron policy: Update port policies [puppet] - 10https://gerrit.wikimedia.org/r/1163003 [18:19:52] jhancock@cumin1003 provision (PID 2859502) is awaiting input [18:20:42] (03PS1) 10Kosta Harlan: Revert "ores: Disable AbuseFilter integration by default" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163004 (https://phabricator.wikimedia.org/T364705) [18:22:13] (03CR) 10Ssingh: "Yeah thanks, that's fair. I am fine with moving ahead on this change given the low sample size. We can either bump that or we can merge th" [dns] - 10https://gerrit.wikimedia.org/r/1162078 (owner: 10CDobbins) [18:26:13] jinxer-wm: nowandnext [18:26:17] jouncebot: nowandnext [18:26:17] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [18:26:17] In 1 hour(s) and 33 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T2000) [18:27:51] (03PS1) 10Urbanecm: Revert "[Growth] Prepare for the Get Started notification experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163005 (https://phabricator.wikimedia.org/T394958) [18:28:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163005 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [18:30:23] (03CR) 10JHathaway: New structure for sshd_config starting with trixie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [18:30:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:41] (03Merged) 10jenkins-bot: Revert "[Growth] Prepare for the Get Started notification experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163005 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [18:30:55] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1163005|Revert "[Growth] Prepare for the Get Started notification experiment" (T394958)]] [18:31:01] T394958: Shorten 
the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958 [18:31:55] (03PS1) 10Eevans: adjust sessionstore disk utilization for JBOD [alerts] - 10https://gerrit.wikimedia.org/r/1163007 (https://phabricator.wikimedia.org/T391544) [18:40:21] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163005|Revert "[Growth] Prepare for the Get Started notification experiment" (T394958)]] (duration: 09m 25s) [18:40:27] T394958: Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958 [18:41:37] (03CR) 10CDobbins: geo-maps: update default for South America (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1162078 (owner: 10CDobbins) [18:42:30] (03CR) 10CDobbins: geo-maps: update default for South America (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1162078 (owner: 10CDobbins) [18:42:36] (03CR) 10CDobbins: geo-maps: update default for South America (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1162078 (owner: 10CDobbins) [18:45:14] (03CR) 10Hashar: "I could use a +1 just to be sure but I can otherwhise self deploy. 
I have CC Amir who removed `wikitech.php` and Bryan who IIRC wrote the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161871 (https://phabricator.wikimedia.org/T371592) (owner: 10Hashar) [18:45:52] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:49:06] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2006-dev to codfw - jhancock@cumin1003" [18:49:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2006-dev to codfw - jhancock@cumin1003" [18:49:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:12] (03PS1) 10Urbanecm: Backport Getting Started notification code [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163012 (https://phabricator.wikimedia.org/T394957) [18:49:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [18:50:06] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10939728 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2003.codfw.wmnet with OS bookworm [18:53:32] (03PS2) 10Andrew Bogott: Neutron policy: Update port policies [puppet] - 10https://gerrit.wikimedia.org/r/1163003 [18:53:32] (03PS1) 10Andrew Bogott: Neutron policy.yaml: update subnetpool rules [puppet] - 10https://gerrit.wikimedia.org/r/1163016 [18:54:02] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-omega,name=eqiad [18:54:02] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-psi,name=eqiad [18:54:02] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad 
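Since the confusion earlier in the log came down to a cluster that was depooled without anyone noticing, one way to check state from the log itself is to replay the conftool `!log` entries. The sketch below assumes the message format seen in this channel (`set/pooled=<bool>; selector: dnsdisc=<service>,name=<site>`); it only reads log text and does not call conftool.

```python
import re

# Matches the conftool messages as they appear in this log.
CONFTOOL_RE = re.compile(
    r"conftool action : set/pooled=(true|false); "
    r"selector: dnsdisc=([\w-]+),name=(\w+)"
)


def replay_pooled_state(lines):
    """Return {(service, site): pooled} after applying each log line in order."""
    state = {}
    for line in lines:
        match = CONFTOOL_RE.search(line)
        if match:
            pooled, service, site = match.groups()
            state[(service, site)] = pooled == "true"
    return state


if __name__ == "__main__":
    log = [
        "[18:08:47] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw",
        "[18:16:53] !log ryankemper@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad",
        "[18:54:02] !log ryankemper@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad",
    ]
    for key, pooled in sorted(replay_pooled_state(log).items()):
        print(key, pooled)
```

This only reflects changes that were logged; a site depooled before the start of the log, as codfw apparently was, will not appear until someone touches it.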
[18:57:40] (03PS3) 10Andrew Bogott: Neutron policy: Update port policies [puppet] - 10https://gerrit.wikimedia.org/r/1163003 [18:57:40] (03PS2) 10Andrew Bogott: Neutron policy.yaml: update subnetpool rules [puppet] - 10https://gerrit.wikimedia.org/r/1163016 [18:59:53] (03CR) 10Andrew Bogott: [C:03+2] Neutron policy.yaml: remove firewall overrides [puppet] - 10https://gerrit.wikimedia.org/r/1162999 (owner: 10Andrew Bogott) [18:59:57] (03CR) 10Andrew Bogott: [C:03+2] Remove qos policy defs [puppet] - 10https://gerrit.wikimedia.org/r/1163000 (owner: 10Andrew Bogott) [19:00:01] (03CR) 10Andrew Bogott: [C:03+2] Neutron policy.yaml: remove loadbalancer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1163001 (owner: 10Andrew Bogott) [19:05:30] (03CR) 10Urbanecm: [C:03+2] Backport Getting Started notification code [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163012 (https://phabricator.wikimedia.org/T394957) (owner: 10Urbanecm) [19:05:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163012 (https://phabricator.wikimedia.org/T394957) (owner: 10Urbanecm) [19:06:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:06:52] (03CR) 10JHathaway: [C:03+1] "looks good, one minor suggestion" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [19:08:04] (03Merged) 10jenkins-bot: Backport Getting Started notification code [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163012 (https://phabricator.wikimedia.org/T394957) (owner: 10Urbanecm) [19:08:19] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1163012|Backport Getting Started notification code (T394957)]] [19:08:24] 
T394957: Support delaying NotificationGetStartedJob differently based on user variant - https://phabricator.wikimedia.org/T394957 [19:09:41] sukhe: just to follow up, turns out the switching search datacenters was working as intended once we flipped the codfw cluster that was depooled [19:10:30] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1163012|Backport Getting Started notification code (T394957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:11:35] ok :) [19:13:34] !log urbanecm@deploy1003 urbanecm: Continuing with sync [19:16:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10939779 (10Jclark-ctr) [19:16:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10939782 (10Jclark-ctr) 05Open→03Resolved [19:17:33] (03PS1) 10Eevans: cassandra-dev2001: reimage to test JBOD changes [puppet] - 10https://gerrit.wikimedia.org/r/1163019 [19:20:06] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: reimage to test JBOD changes [puppet] - 10https://gerrit.wikimedia.org/r/1163019 (owner: 10Eevans) [19:20:22] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163012|Backport Getting Started notification code (T394957)]] (duration: 12m 03s) [19:20:28] T394957: Support delaying NotificationGetStartedJob differently based on user variant - https://phabricator.wikimedia.org/T394957 [19:21:22] (03PS1) 10Urbanecm: Revert^2 "[Growth] Prepare for the Get Started notification experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163022 (https://phabricator.wikimedia.org/T394958) [19:22:05] (03CR) 10Urbanecm: [C:03+2] Revert^2 "[Growth] Prepare for the Get Started notification experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163022 
(https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [19:22:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163022 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [19:23:04] (03Merged) 10jenkins-bot: Revert^2 "[Growth] Prepare for the Get Started notification experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163022 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [19:23:19] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1163022|Revert^2 "[Growth] Prepare for the Get Started notification experiment" (T394958)]] [19:23:24] T394958: Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958 [19:25:26] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1163022|Revert^2 "[Growth] Prepare for the Get Started notification experiment" (T394958)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:26:59] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye [19:27:10] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2001.... 
[19:28:02] !log urbanecm@deploy1003 urbanecm: Continuing with sync [19:29:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/CentralAuth] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161950 (https://phabricator.wikimedia.org/T395372) (owner: 10Gergő Tisza) [19:31:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160157 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza) [19:34:58] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163022|Revert^2 "[Growth] Prepare for the Get Started notification experiment" (T394958)]] (duration: 11m 39s) [19:35:03] T394958: Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958 [19:38:48] (03PS2) 10Scott French: hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) [19:39:00] (03PS2) 10Scott French: hieradata: use cfssl/pki for nginx on all codfw configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090585 (https://phabricator.wikimedia.org/T352245) [19:39:22] (03PS3) 10Scott French: hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) [19:39:31] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Enable memcached-based batching for ZObjects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester) [19:39:57] (03CR) 10SBassett: [C:04-2] "Did this go through Legal review? 
We can't just grant secure rights like this to groups who aren't even NDA'd." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97) [19:41:24] (03Merged) 10jenkins-bot: wikifunctions: Enable memcached-based batching for ZObjects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester) [19:42:09] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [19:43:59] (03CR) 10EggRoll97: "@sbassett@wikimedia.org See https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_Access_to_Nonpublic_Personal_Data_Policy/Exc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97) [19:44:28] (03PS6) 10Scott French: P:etcd::tlsproxy: add support for PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) [19:44:28] (03PS3) 10Scott French: hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) [19:44:28] (03PS3) 10Scott French: hieradata: use cfssl/pki for nginx on all codfw configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090585 (https://phabricator.wikimedia.org/T352245) [19:44:29] (03PS4) 10Scott French: hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) [19:45:53] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [19:49:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1163004 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [19:51:08] (03PS1) 10Urbanecm: [Growth] Disable the Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163027 (https://phabricator.wikimedia.org/T397515) [19:51:10] (03PS1) 10Urbanecm: [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) [19:53:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:54:57] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658 (10phaultfinder) 03NEW [19:57:08] EggRoll97: thanks for helping out with the 2FA verification task! we aren't sure if that change is fully cleared or it needs more internal discussion, so please hold off from deploying that today. [19:58:36] tgr: Alright [19:59:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T2000). 
[20:00:05] EggRoll97, Tchanders, kostajh, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:40] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:00:40] (03CR) 10SBassett: [C:04-2] "I'm going to ask that we hold off on deploying this until we re-confirm that policy with WMF Legal. Apologies for any confusion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97) [20:00:41] o/ [20:01:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:01:21] hi, I'm here [20:01:36] I can deploy [20:02:23] o/ [20:02:26] tgr: your config patch is independent of the wmf.6 patch you have scheduled, right? [20:03:35] kostajh: Thanks [20:03:52] kostajh: yes [20:03:58] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye [20:04:08] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10939872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2001.codf... [20:04:40] alright. 
I'd like to sync the wmf.6 patches first, then do the three config patches together. Tchanders is that OK? it would mean your config patch goes out in about 20-30 minutes [20:04:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:04:59] kostajh: Works for me [20:05:00] alternatively, I can do tgr and Tchanders config patches first, then wmf.6 patches, then the config patch I have [20:05:28] kostajh: Do what's easier for you :) [20:06:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:06:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [20:06:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161950 (https://phabricator.wikimedia.org/T395372) (owner: 10Gergő Tisza) [20:06:34] Tchanders: ty [20:07:12] (03CR) 10Tchanders: Configure event stream for IP auto-reveal instrument (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) (owner: 10Tchanders) [20:08:21] (03CR) 10Kosta Harlan: [C:03+1] Configure event stream for IP auto-reveal instrument 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) (owner: 10Tchanders) [20:08:34] (03PS3) 10Tchanders: Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) [20:11:49] (03PS4) 10Tchanders: Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) [20:12:12] (03CR) 10Kosta Harlan: Configure event stream for IP auto-reveal instrument (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) (owner: 10Tchanders) [20:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:13:29] (03Merged) 10jenkins-bot: Map pre-save RR scores to predefined values [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [20:14:10] (03Merged) 10jenkins-bot: Fix password handling for non-existent users [extensions/CentralAuth] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161950 (https://phabricator.wikimedia.org/T395372) (owner: 10Gergő Tisza) [20:14:27] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1162998|Map pre-save RR scores to predefined values (T364705)]], [[gerrit:1161950|Fix password handling for non-existent users (T395372 T397262)]] [20:14:34] T364705: Provide AbuseFilter condition for revertrisk threshold - https://phabricator.wikimedia.org/T364705 [20:14:35] T395372: Handle scrambled password type in CentralAuth - https://phabricator.wikimedia.org/T395372 [20:14:35] T397262: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T397262 
[20:15:21] (03CR) 10Cwhite: [C:03+2] logstash: bump phatality version [puppet] - 10https://gerrit.wikimedia.org/r/1155774 (https://phabricator.wikimedia.org/T387606) (owner: 10Cwhite) [20:22:12] (03PS5) 10Tchanders: Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) [20:23:47] we're still waiting on `K8s images build/push output redirected to /var/lib/spiderpig/scap-image-build-and-push-log`, might be a while [20:24:56] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10939936 (10phaultfinder) [20:26:00] Kosta: Looks like localisation files were touched, so it'll be a long backport: 5351595 [20:29:14] dancy: ack [20:29:54] (03PS6) 10Tchanders: Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) [20:30:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [20:31:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [20:31:07] ok, we're now on a syncing phase [20:33:03] (03CR) 10Tchanders: Configure event stream for IP auto-reveal instrument (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) (owner: 10Tchanders) [20:36:37] tgr: do you need to verify your wmf.6 patch? 
[20:38:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:46] !log kharlan@deploy1003 kharlan, tgr: Backport for [[gerrit:1162998|Map pre-save RR scores to predefined values (T364705)]], [[gerrit:1161950|Fix password handling for non-existent users (T395372 T397262)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:38:54] T364705: Provide AbuseFilter condition for revertrisk threshold - https://phabricator.wikimedia.org/T364705 [20:38:54] T395372: Handle scrambled password type in CentralAuth - https://phabricator.wikimedia.org/T395372 [20:38:55] T397262: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T397262 [20:40:04] tgr: it's up on the debug server, if you want to test [20:44:39] ok, I will sync it [20:45:02] (03CR) 10Andrew Bogott: [C:03+2] Neutron policy.yaml: remove a whole grab bag of obsolete overrides [puppet] - 10https://gerrit.wikimedia.org/r/1163002 (owner: 10Andrew Bogott) [20:45:16] !log kharlan@deploy1003 kharlan, tgr: Continuing with sync [20:46:00] (03CR) 10Kosta Harlan: Configure event stream for IP auto-reveal instrument (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) (owner: 10Tchanders) [20:46:46] kostajh: sorry, I got distracted. It's not directly verifiable but I checked that login in general works. 
[20:47:08] (03PS2) 10Kosta Harlan: Reapply "ores: Disable AbuseFilter integration by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163004 (https://phabricator.wikimedia.org/T364705) [20:47:22] tgr: ack, thanks [20:54:33] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397366#10940039 (10colewhite) p:05Triage→03High The host is unserviceable in its current state and the rest of the cluster has picked up the slack. The problem is exacerbated due to {T390215}. I'll keep the... [20:58:57] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1162998|Map pre-save RR scores to predefined values (T364705)]], [[gerrit:1161950|Fix password handling for non-existent users (T395372 T397262)]] (duration: 44m 29s) [20:59:04] T364705: Provide AbuseFilter condition for revertrisk threshold - https://phabricator.wikimedia.org/T364705 [20:59:05] T395372: Handle scrambled password type in CentralAuth - https://phabricator.wikimedia.org/T395372 [20:59:05] T397262: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T397262 [20:59:07] ok, on to the config patches [20:59:28] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T2100) [21:00:06] (03PS3) 10Kosta Harlan: Reapply "ores: Disable AbuseFilter integration by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163004 (https://phabricator.wikimedia.org/T364705) [21:00:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163004 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [21:00:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) (owner: 10Tchanders) [21:00:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160157 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza) [21:00:46] we're still finishing up the deployment window [21:01:30] (03Merged) 10jenkins-bot: Reapply "ores: Disable AbuseFilter integration by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163004 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [21:01:33] (03Merged) 10jenkins-bot: Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) (owner: 10Tchanders) [21:01:36] (03Merged) 10jenkins-bot: Reapply "Use GetSecurityLogContext hook for goodpass/badpass logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160157 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza) [21:01:54] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163004|Reapply "ores: Disable AbuseFilter integration by default" (T364705)]], [[gerrit:1155725|Configure event stream for IP auto-reveal instrument (T387600)]], [[gerrit:1160157|Reapply "Use GetSecurityLogContext hook 
for goodpass/badpass logging" (T395204)]] [21:02:03] T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600 [21:02:03] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [21:04:28] !log kharlan@deploy1003 kharlan, tgr, tchanders: Backport for [[gerrit:1163004|Reapply "ores: Disable AbuseFilter integration by default" (T364705)]], [[gerrit:1155725|Configure event stream for IP auto-reveal instrument (T387600)]], [[gerrit:1160157|Reapply "Use GetSecurityLogContext hook for goodpass/badpass logging" (T395204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now [21:04:28] be verified there. [21:04:35] T364705: Provide AbuseFilter condition for revertrisk threshold - https://phabricator.wikimedia.org/T364705 [21:04:56] tgr Tchanders please verify your change on mwdebug [21:06:23] kostajh: still backporting? [21:06:38] sbassett: yes, nearly done [21:06:39] I don't have anything specific to test (event stream, nothing logging to it yet), but the sites don't look broken [21:06:56] i.e. lgtm [21:07:08] kostajh: looks good [21:07:54] !log kharlan@deploy1003 kharlan, tgr, tchanders: Continuing with sync [21:07:56] ok [21:11:55] tgr: I was going to deploy the patch from T389009 today during the sec window. How comfortable are you with deploying the other two from https://phabricator.wikimedia.org/T389010#10916135? [21:13:24] sbassett: not sure if anyone reviewed them [21:14:52] I don't think they are particularly dangerous, most likely if they are wrong they'd break the specific feature they are trying to fix, which is not that disruptive [21:15:01] still, code review would be nice [21:15:03] I looked at the first one and I’m fine with deploying that one for sure. It’d be nice if someone with more CA expertise could look at the other two. 
[21:15:42] I'll find a reviewer [21:16:05] (03CR) 10Volans: [C:04-1] "One small error, couple of questions/suggestions inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [21:16:46] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163004|Reapply "ores: Disable AbuseFilter integration by default" (T364705)]], [[gerrit:1155725|Configure event stream for IP auto-reveal instrument (T387600)]], [[gerrit:1160157|Reapply "Use GetSecurityLogContext hook for goodpass/badpass logging" (T395204)]] (duration: 14m 51s) [21:16:54] T364705: Provide AbuseFilter condition for revertrisk threshold - https://phabricator.wikimedia.org/T364705 [21:16:55] T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600 [21:16:55] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [21:16:59] sbassett: all done, over to you [21:17:09] tx, kostajh [21:17:13] !log UTC late deploys done [21:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:27] kostajh: thanks [21:26:45] (03PS1) 10Btullis: Add the geoip databases to the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1163040 (https://phabricator.wikimedia.org/T369845) [21:30:40] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:31:39] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1163040 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [21:34:52] !log Deployed security fix for T389009 
[21:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:33] I think that ^ should be it for today’s security deployment window [21:42:24] (03PS2) 10Volans: redfish: add support for iDRAC 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162986 (https://phabricator.wikimedia.org/T392851) [21:51:06] tgr, sbassett, I can review https://phabricator.wikimedia.org/T389010 tomorrow. [22:04:58] (03PS1) 10Dwisehaupt: Swap in frnetmon1002 and remove frnetmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/1163044 (https://phabricator.wikimedia.org/T369565) [22:09:25] (03PS2) 10Dwisehaupt: Swap in frnetmon1002 and remove frnetmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/1163044 (https://phabricator.wikimedia.org/T395831) [22:13:14] 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10940279 (10MatthewVernon) FWIW, after today's incident we ended up with both `swift-rw` resources depooled: ` mvernon@cumin2002:~$ confctl --object-type dis... [22:20:07] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397658#10940303 (10phaultfinder) [22:20:19] 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10940304 (10Scott_French) @MatthewVernon - Ah, that's great! Yes, let's keep those pointed to failoid, then. I'll post a patch shortly to do the "manual equi... 
[22:27:05] (03PS1) 10Cwhite: logstash: reduce replicas for high-volume logstash-ml logs [puppet] - 10https://gerrit.wikimedia.org/r/1163053 (https://phabricator.wikimedia.org/T390215) [22:30:45] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:38:04] (03PS1) 10Scott French: wmnet: direct swift-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1163055 (https://phabricator.wikimedia.org/T376237) [22:48:28] 06SRE, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10940444 (10thcipriani) One other data point, I never received an email that's in the [[https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/BPK37IBK... [22:48:56] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye [22:49:07] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10940445 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2001.... 
[22:54:00] (03CR) 10Cwhite: [C:03+2] logstash: reduce replicas for high-volume logstash-ml logs [puppet] - 10https://gerrit.wikimedia.org/r/1163053 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250623T2300) [23:00:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:01:42] (03CR) 10Tim Starling: [C:03+2] Suppress mobile redirect for Googlebot Smartphone on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1161727 (https://phabricator.wikimedia.org/T397267) (owner: 10Tim Starling) [23:04:48] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [23:08:33] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [23:22:08] 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover, 13Patch-For-Review: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10940459 (10Scott_French) [23:34:12] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye [23:34:23] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10940474 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2001.codf... 
[23:38:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1163062 [23:38:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1163062 (owner: 10TrainBranchBot) [23:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:47:54] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1163062 (owner: 10TrainBranchBot)