[00:00:14] ops-eqiad, SRE, DC-Ops, Traffic, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10897773 (BCornwall) [00:01:33] ops-eqiad, SRE, DC-Ops, Traffic, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10897774 (BCornwall) [00:08:32] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154904 [00:08:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154904 (owner: TrainBranchBot) [00:09:03] (PS5) BCornwall: hiera: Add lvs1016 to high-traffic1 [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) [00:12:41] (PS1) BCornwall: Promote lvs1016 over lvs1017 [puppet] - https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) [00:55:53] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154904 (owner: TrainBranchBot) [00:56:13] (CR) Ssingh: "Also update site.pp to set lvs1016 to lvs::balancer from insetup_noferm (current)." [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: BCornwall) [00:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines.
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [01:04:00] (CR) Ssingh: hiera: Add lvs1016 to high-traffic1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: BCornwall) [01:06:53] PROBLEM - dump of x3 in codfw on backupmon1001 is CRITICAL: Last dump for x3 at codfw (db2200) taken on 2025-06-10 00:45:37 is 35 GiB, but the previous one was 266 GiB, a change of -86.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:08:21] (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.5 [core] (wmf/1.45.0-wmf.5) - https://gerrit.wikimedia.org/r/1154910 (https://phabricator.wikimedia.org/T392175) [01:08:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.5 [core] (wmf/1.45.0-wmf.5) - https://gerrit.wikimedia.org/r/1154910 (https://phabricator.wikimedia.org/T392175) (owner: TrainBranchBot) [01:11:31] (CR) Ssingh: "Looks good; also remove from site.pp."
[puppet] - https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) (owner: BCornwall) [01:21:57] (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.5 [core] (wmf/1.45.0-wmf.5) - https://gerrit.wikimedia.org/r/1154910 (https://phabricator.wikimedia.org/T392175) (owner: TrainBranchBot) [01:36:51] PROBLEM - dump of x3 in eqiad on backupmon1001 is CRITICAL: Last dump for x3 at eqiad (db1216) taken on 2025-06-10 00:55:37 is 35 GiB, but the previous one was 266 GiB, a change of -86.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:53:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:58:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T0200) [02:31:35] * Krinkle testing ad-hoc on mwdebug1001 [02:51:57] PROBLEM - Disk space on backup2013 is CRITICAL: DISK CRITICAL - free space: /srv/backups 311059MiB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup2013&var-datasource=codfw+prometheus/ops [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and
vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T0300) [03:24:46] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T0400) [04:04:26] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.2 (duration: 04m 22s) [04:46:01] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 2098 MB (3% inode=95%): /tmp 2098 MB (3% inode=95%): /var/tmp 2098 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [04:51:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2213.codfw.wmnet with reason: Maintenance [04:58:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1159.eqiad.wmnet with reason: Maintenance [04:58:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T396130)', diff saved to https://phabricator.wikimedia.org/P77378 and previous config saved to /var/cache/conftool/dbconfig/20250610-045809-marostegui.json [04:58:13] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [04:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [05:01:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T396130)', diff saved to https://phabricator.wikimedia.org/P77379 and previous config saved to /var/cache/conftool/dbconfig/20250610-050107-marostegui.json [05:01:38] (PS1) Marostegui: db1231: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1154919 (https://phabricator.wikimedia.org/T395989) [05:02:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1231.eqiad.wmnet with reason: Maintenance [05:02:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231', diff saved to https://phabricator.wikimedia.org/P77380 and previous config saved to /var/cache/conftool/dbconfig/20250610-050215-marostegui.json [05:02:47] (CR) Marostegui: [C:+2] db1231: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1154919 (https://phabricator.wikimedia.org/T395989) (owner: Marostegui) [05:06:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77381 and previous config saved to /var/cache/conftool/dbconfig/20250610-050616-root.json [05:06:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:06:51] FIRING: [4x]
CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:07:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Confed_ulsfo&var-bgp_neighbor=cr4-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:11:51] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:12:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:16:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P77382 and previous config saved to /var/cache/conftool/dbconfig/20250610-051614-marostegui.json [05:21:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77383 and previous config saved to /var/cache/conftool/dbconfig/20250610-052122-root.json [05:21:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1036.eqiad.wmnet with reason: Maintenance [05:21:56] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Depool es1036', diff saved to https://phabricator.wikimedia.org/P77384 and previous config saved to /var/cache/conftool/dbconfig/20250610-052155-marostegui.json [05:23:29] PROBLEM - Disk space on backup1013 is CRITICAL: DISK CRITICAL - free space: /srv/backups 293349MiB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1013&var-datasource=eqiad+prometheus/ops [05:30:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77385 and previous config saved to /var/cache/conftool/dbconfig/20250610-053048-root.json [05:31:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2034', diff saved to https://phabricator.wikimedia.org/P77386 and previous config saved to /var/cache/conftool/dbconfig/20250610-053119-marostegui.json [05:31:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P77387 and previous config saved to /var/cache/conftool/dbconfig/20250610-053128-marostegui.json [05:31:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2034.codfw.wmnet with reason: Maintenance [05:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77388 and previous config saved to /var/cache/conftool/dbconfig/20250610-053627-root.json [05:37:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77389 and previous config saved to /var/cache/conftool/dbconfig/20250610-053735-root.json [05:39:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P77390 and previous config saved to 
/var/cache/conftool/dbconfig/20250610-053902-marostegui.json [05:39:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2030.codfw.wmnet with reason: Maintenance [05:39:42] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2030.codfw.wmnet [05:39:52] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2030 - Upgrading es2030.codfw.wmnet [05:39:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2030 - Upgrading es2030.codfw.wmnet [05:43:29] RECOVERY - Disk space on backup1013 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1013&var-datasource=eqiad+prometheus/ops [05:45:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77391 and previous config saved to /var/cache/conftool/dbconfig/20250610-054554-root.json [05:46:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T396130)', diff saved to https://phabricator.wikimedia.org/P77392 and previous config saved to /var/cache/conftool/dbconfig/20250610-054635-marostegui.json [05:46:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2030.codfw.wmnet [05:46:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:46:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [05:46:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T396130)', diff saved to 
https://phabricator.wikimedia.org/P77393 and previous config saved to /var/cache/conftool/dbconfig/20250610-054705-marostegui.json [05:47:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77394 and previous config saved to /var/cache/conftool/dbconfig/20250610-054713-root.json [05:50:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T396130)', diff saved to https://phabricator.wikimedia.org/P77395 and previous config saved to /var/cache/conftool/dbconfig/20250610-055003-marostegui.json [05:51:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77396 and previous config saved to /var/cache/conftool/dbconfig/20250610-055132-root.json [05:52:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77397 and previous config saved to /var/cache/conftool/dbconfig/20250610-055241-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T0600) [06:00:06] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T0600) [06:00:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77398 and previous config saved to /var/cache/conftool/dbconfig/20250610-060059-root.json [06:02:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77399 and previous config saved to /var/cache/conftool/dbconfig/20250610-060218-root.json [06:05:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P77400 and previous config saved to /var/cache/conftool/dbconfig/20250610-060510-marostegui.json [06:05:21] SRE, SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10898070 (Arrbee) This is approved. [06:06:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:06:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77401 and previous config saved to /var/cache/conftool/dbconfig/20250610-060638-root.json [06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:07:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo -
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:07:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77402 and previous config saved to /var/cache/conftool/dbconfig/20250610-060746-root.json [06:11:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:12:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:16:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77403 and previous config saved to /var/cache/conftool/dbconfig/20250610-061604-root.json [06:17:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77404 and previous config saved to /var/cache/conftool/dbconfig/20250610-061724-root.json [06:19:56] (PS1) Muehlenhoff: Remove bastion role from bast7001 [puppet] - https://gerrit.wikimedia.org/r/1154923 (https://phabricator.wikimedia.org/T394263) [06:20:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P77405 and previous config saved to /var/cache/conftool/dbconfig/20250610-062017-marostegui.json [06:21:19] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host install7002.wikimedia.org [06:21:21] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [06:21:51] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport:
cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:22:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:22:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77406 and previous config saved to /var/cache/conftool/dbconfig/20250610-062252-root.json [06:23:13] (CR) Muehlenhoff: [C:+2] Remove bastion role from bast7001 [puppet] - https://gerrit.wikimedia.org/r/1154923 (https://phabricator.wikimedia.org/T394263) (owner: Muehlenhoff) [06:25:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2031', diff saved to https://phabricator.wikimedia.org/P77407 and previous config saved to /var/cache/conftool/dbconfig/20250610-062501-marostegui.json [06:25:10] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2031.codfw.wmnet [06:25:10] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7002.wikimedia.org - jmm@cumin1003" [06:25:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install7002.wikimedia.org - jmm@cumin1003" [06:25:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:25:15] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache install7002.wikimedia.org on all recursors [06:25:18] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install7002.wikimedia.org on all recursors [06:25:31] !log marostegui@cumin1002
START - Cookbook sre.mysql.depool es2031 - Upgrading es2031.codfw.wmnet [06:25:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2031 - Upgrading es2031.codfw.wmnet [06:25:48] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7002.wikimedia.org - jmm@cumin1003" [06:25:48] (CR) Cyndywikime: [C:+1] [beta] GrowthExperiments: enable limiting add a link task via config [mediawiki-config] - https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: Sergio Gimeno) [06:25:55] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install7002.wikimedia.org - jmm@cumin1003" [06:26:51] (Abandoned) Cyndywikime: Growth-Beta: Enable starter difficulty for newcomer tasks [mediawiki-config] - https://gerrit.wikimedia.org/r/1151244 (https://phabricator.wikimedia.org/T393769) (owner: Cyndywikime) [06:26:57] PROBLEM - Disk space on an-worker1124 is CRITICAL: DISK CRITICAL - free space: / 2076 MB (3% inode=93%): /tmp 2076 MB (3% inode=93%): /var/tmp 2076 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1124&var-datasource=eqiad+prometheus/ops [06:28:24] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host install7002.wikimedia.org with OS bookworm [06:31:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77408 and previous config saved to /var/cache/conftool/dbconfig/20250610-063110-root.json [06:31:51] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:32:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77409 and previous config saved to /var/cache/conftool/dbconfig/20250610-063229-root.json [06:33:42] marostegui@cumin1002 upgrade (PID 3895100) is awaiting input [06:34:44] jmm@cumin1003 decommission (PID 952270) is awaiting input [06:35:12] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts bast7001.wikimedia.org [06:35:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T396130)', diff saved to https://phabricator.wikimedia.org/P77410 and previous config saved to /var/cache/conftool/dbconfig/20250610-063524-marostegui.json [06:35:28] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:35:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2031.codfw.wmnet [06:35:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [06:35:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T396130)', diff saved to https://phabricator.wikimedia.org/P77411 and previous config saved to /var/cache/conftool/dbconfig/20250610-063547-marostegui.json [06:36:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77412 and previous config saved to /var/cache/conftool/dbconfig/20250610-063608-root.json [06:36:10] FIRING: BFDdown: BFD session down between cr2-codfw and 208.80.153.222 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:36:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:37:24] PROBLEM - Check whether ferm is active by checking the default input chain on es2031 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:37:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77413 and previous config saved to /var/cache/conftool/dbconfig/20250610-063757-root.json [06:39:39] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [06:41:10] RESOLVED: BFDdown: BFD session down between cr2-codfw and 208.80.153.222 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:41:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:44:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T396130)', diff saved to https://phabricator.wikimedia.org/P77414 and previous config saved to /var/cache/conftool/dbconfig/20250610-064420-marostegui.json [06:44:25] T396130: Add afl_ip_hex column and 
afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:45:16] jmm@cumin1003 decommission (PID 952270) is awaiting input [06:46:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77415 and previous config saved to /var/cache/conftool/dbconfig/20250610-064615-root.json [06:47:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:47:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77416 and previous config saved to /var/cache/conftool/dbconfig/20250610-064735-root.json [06:51:14] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154369 (https://phabricator.wikimedia.org/T396178) (owner: Bunnypranav) [06:51:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77417 and previous config saved to /var/cache/conftool/dbconfig/20250610-065114-root.json [06:52:09] (PS2) Jcrespo: dbbackups: Upgrade s6, s2 to 10.11 and produce new backups on codfw [puppet] - https://gerrit.wikimedia.org/r/1153641 (https://phabricator.wikimedia.org/T395989) [06:52:09] (PS1) Jcrespo: dbbackups: Add grants for zuul backups @ m1 [puppet] - https://gerrit.wikimedia.org/r/1155082 (https://phabricator.wikimedia.org/T394844) [06:52:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki
memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:52:22] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [06:52:25] (PS2) Jcrespo: dbbackups: Add grants for zuul backups @ m1 [puppet] - https://gerrit.wikimedia.org/r/1155082 (https://phabricator.wikimedia.org/T394844) [06:52:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [06:52:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:52:48] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast7001.wikimedia.org [06:52:59] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10898130 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `bast7001.wikimedia.org` - bast7001.wikimedia.org (**PASS**)...
[06:53:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77418 and previous config saved to /var/cache/conftool/dbconfig/20250610-065303-root.json [06:53:46] (PS3) Jcrespo: dbbackups: Add grants for zuul backups @ m1 [puppet] - https://gerrit.wikimedia.org/r/1155082 (https://phabricator.wikimedia.org/T394844) [06:53:47] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on install7002.wikimedia.org with reason: host reimage [06:53:57] (PS4) Jcrespo: dbbackups: Add grants for zuul backups @ m1 [puppet] - https://gerrit.wikimedia.org/r/1155082 (https://phabricator.wikimedia.org/T394844) [06:57:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:57:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install7002.wikimedia.org with reason: host reimage [06:59:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P77419 and previous config saved to /var/cache/conftool/dbconfig/20250610-065927-marostegui.json [07:00:01] (PS5) Jcrespo: dbbackups: Add grants for zuul backups @ m1 [puppet] - https://gerrit.wikimedia.org/r/1155082 (https://phabricator.wikimedia.org/T394844) [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T0700) [07:00:05] DreamRimmer and bunnypranav: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] (03PS6) 10Jcrespo: dbbackups: Add grants for zuul backups @ m1 [puppet] - 10https://gerrit.wikimedia.org/r/1155082 (https://phabricator.wikimedia.org/T394844) [07:00:13] o/ [07:00:43] o/ [07:02:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77420 and previous config saved to /var/cache/conftool/dbconfig/20250610-070240-root.json [07:06:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77421 and previous config saved to /var/cache/conftool/dbconfig/20250610-070620-root.json [07:07:23] RECOVERY - Check whether ferm is active by checking the default input chain on es2031 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:07:48] ACKNOWLEDGEMENT - dump of x3 in codfw on backupmon1001 is CRITICAL: Last dump for x3 at codfw (db2200) taken on 2025-06-10 00:45:37 is 35 GiB, but the previous one was 266 GiB, a change of -86.9 % Jcrespo expected, x3 maintenance - The acknowledgement expires at: 2025-06-17 09:07:05. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:07:48] ACKNOWLEDGEMENT - dump of x3 in eqiad on backupmon1001 is CRITICAL: Last dump for x3 at eqiad (db1216) taken on 2025-06-10 00:55:37 is 35 GiB, but the previous one was 266 GiB, a change of -86.9 % Jcrespo expected, x3 maintenance - The acknowledgement expires at: 2025-06-17 09:07:05. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:08:45] (03CR) 10Kai Nissen (WMDE): "We created a new event logging schema for our fundraising banners. May I ask you to have a look at this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148322 (owner: 10Abban Dunne) [07:13:00] Hmm. Is anyone there and willing to deploy? 
[07:13:38] https://wikitech.wikimedia.org/wiki/Talk:Deployments#Delete_the_UTC_morning_backport_window%3F [07:14:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install7002.wikimedia.org with OS bookworm [07:14:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install7002.wikimedia.org [07:14:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P77422 and previous config saved to /var/cache/conftool/dbconfig/20250610-071434-marostegui.json [07:15:22] DreamRimmer: Perfect timing, xD [07:16:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer) [07:19:04] I have moved my change to the afternoon backport window [07:19:28] I think I will also do the same, no one seems to be around now. 
[07:21:18] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 10310 [07:21:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77423 and previous config saved to /var/cache/conftool/dbconfig/20250610-072125-root.json [07:24:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 10310 [07:25:00] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10898181 (10RKemper) 05Open→03Resolved All these hosts are on `DL7C` now [07:25:35] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 60427 [07:25:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 60427 [07:26:34] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28173 [07:26:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28173 [07:27:14] (03PS1) 10Muehlenhoff: Remove access for mwilliams [puppet] - 10https://gerrit.wikimedia.org/r/1155086 [07:28:26] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 8849 [07:28:29] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [07:28:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8849 [07:29:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T396130)', diff saved to 
https://phabricator.wikimedia.org/P77424 and previous config saved to /var/cache/conftool/dbconfig/20250610-072941-marostegui.json [07:29:45] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:29:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [07:30:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T396130)', diff saved to https://phabricator.wikimedia.org/P77425 and previous config saved to /var/cache/conftool/dbconfig/20250610-073003-marostegui.json [07:31:57] RECOVERY - Disk space on backup2013 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup2013&var-datasource=codfw+prometheus/ops [07:32:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T396130)', diff saved to https://phabricator.wikimedia.org/P77426 and previous config saved to /var/cache/conftool/dbconfig/20250610-073234-marostegui.json [07:32:38] (03CR) 10Muehlenhoff: [C:03+2] Remove access for mwilliams [puppet] - 10https://gerrit.wikimedia.org/r/1155086 (owner: 10Muehlenhoff) [07:35:11] (03CR) 10Tiziano Fogli: [C:03+1] titan: deploy local memcached [puppet] - 10https://gerrit.wikimedia.org/r/1154844 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [07:36:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77427 and previous config saved to /var/cache/conftool/dbconfig/20250610-073631-root.json [07:47:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P77428 and previous config saved to /var/cache/conftool/dbconfig/20250610-074742-marostegui.json [07:52:55] (03CR) 10Filippo 
Giunchedi: [C:03+2] titan: deploy local memcached [puppet] - 10https://gerrit.wikimedia.org/r/1154844 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [07:53:44] (03PS1) 10Alexandros Kosiaris: tlsproxy: Switch retry_on to 5xx [puppet] - 10https://gerrit.wikimedia.org/r/1155117 (https://phabricator.wikimedia.org/T380958) [07:53:46] (03PS3) 10Jcrespo: dbbackups: Upgrade s6, s2 to 10.11 and produce new backups on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1153641 (https://phabricator.wikimedia.org/T395989) [07:54:20] (03CR) 10Jcrespo: [C:03+1] "this is ready, I would appreciate a +1/ok." [puppet] - 10https://gerrit.wikimedia.org/r/1153641 (https://phabricator.wikimedia.org/T395989) (owner: 10Jcrespo) [08:00:10] FIRING: BFDdown: BFD session down between cr4-ulsfo and 198.35.26.203 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:01:06] !log deploying grants for zuul backups @ m1 T394844 [08:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:09] T394844: Request mariadb database for Zuul - https://phabricator.wikimedia.org/T394844 [08:02:37] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5861/co" [puppet] - 10https://gerrit.wikimedia.org/r/1154782 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P77429 and previous config saved to /var/cache/conftool/dbconfig/20250610-080248-marostegui.json [08:03:35] (03CR) 10Jcrespo: [C:03+2] dbbackups: Add grants for zuul backups @ m1 [puppet] - 
10https://gerrit.wikimedia.org/r/1155082 (https://phabricator.wikimedia.org/T394844) (owner: 10Jcrespo) [08:03:40] (03CR) 10Alexandros Kosiaris: [C:03+2] tlsproxy: Switch retry_on to 5xx [puppet] - 10https://gerrit.wikimedia.org/r/1155117 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [08:03:46] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] thanos: add tracing define [puppet] - 10https://gerrit.wikimedia.org/r/1154782 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:04:12] merging your patches akosiaris jynus [08:04:16] (03PS6) 10Brouberol: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:04:20] ok, I was about to ask [08:04:43] I did a mcgyver when he slides past the closing automatic gate [08:04:44] 3 patches in 10 seconds, nice [08:05:10] RESOLVED: BFDdown: BFD session down between cr4-ulsfo and 198.35.26.203 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:05:13] mine is a noop for production [08:05:14] (03CR) 10Brouberol: Airflow: Add local settings to enable the xcom_sidecar functionality (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:05:21] (03PS7) 10Brouberol: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:05:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 32098 [08:06:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 
'configure' for AS: 32098 [08:06:18] (03CR) 10Brouberol: [C:03+1] Airflow: Increase k8s check frequency in analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152681 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:06:20] (03PS5) 10Tiziano Fogli: monitoring services: add migration task as parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150709 (https://phabricator.wikimedia.org/T395443) [08:06:55] (03PS13) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) [08:06:58] https://www.youtube.com/watch?v=yOEe1uzurKo#t=1m06 for reference [08:08:04] godog: thanks! [08:08:29] yw akosiaris [08:10:25] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5863/console" [puppet] - 10https://gerrit.wikimedia.org/r/1154784 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:10:34] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] hieradata: set default otel-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1154784 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:11:26] (03PS3) 10Filippo Giunchedi: thanos-sidecar: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1154785 (https://phabricator.wikimedia.org/T394318) [08:12:41] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] thanos-sidecar: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1154785 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:15:00] godog: I think my grandpa watched that show [08:15:43] good taste your granpa jynus, it is a great show [08:16:01] (just kidding, one of my recent ttrpg characters was a mix of mcgyver and colonel o'neill) [08:16:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1158.eqiad.wmnet with reason: Maintenance [08:16:38] (03PS1) 10Brouberol: Convert an-db100[1-2] to dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1155118 (https://phabricator.wikimedia.org/T395557) [08:16:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:16:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T395241)', diff saved to https://phabricator.wikimedia.org/P77430 and previous config saved to /var/cache/conftool/dbconfig/20250610-081647-fceratto.json [08:17:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T396130)', diff saved to https://phabricator.wikimedia.org/P77431 and previous config saved to /var/cache/conftool/dbconfig/20250610-081756-marostegui.json [08:17:59] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:18:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance [08:18:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T396130)', diff saved to https://phabricator.wikimedia.org/P77432 and previous config saved to /var/cache/conftool/dbconfig/20250610-081817-marostegui.json [08:19:22] (03PS1) 10Brouberol: Convert an-db100[1-2] to dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1155119 (https://phabricator.wikimedia.org/T395557) [08:19:23] (03PS1) 10Brouberol: Configure dse-k8s-worker100[2-3] with the dse_k8s::worker role [puppet] - 10https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557) [08:19:43] (03Abandoned) 10Brouberol: Convert an-db100[1-2] to dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1155118 (https://phabricator.wikimedia.org/T395557) (owner: 
10Brouberol) [08:19:47] (03CR) 10CI reject: [V:04-1] Convert an-db100[1-2] to dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1155119 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [08:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T396130)', diff saved to https://phabricator.wikimedia.org/P77433 and previous config saved to /var/cache/conftool/dbconfig/20250610-082114-marostegui.json [08:21:45] (03PS2) 10Brouberol: Convert an-db100[1-2] to dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1155119 (https://phabricator.wikimedia.org/T395557) [08:21:45] (03PS2) 10Brouberol: Configure dse-k8s-worker100[2-3] with the dse_k8s::worker role [puppet] - 10https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557) [08:22:18] (03PS4) 10Jcrespo: dbbackups: Upgrade s6, s2 to 10.11 and produce new backups on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1153641 (https://phabricator.wikimedia.org/T395989) [08:23:10] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable object storage on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154020 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:24:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T395241)', diff saved to https://phabricator.wikimedia.org/P77434 and previous config saved to /var/cache/conftool/dbconfig/20250610-082454-fceratto.json [08:25:27] (03CR) 10Brouberol: Airflow: Add local settings to enable the xcom_sidecar functionality (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:28:02] (03CR) 10Filippo Giunchedi: [C:03+1] monitoring services: add migration task as parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150709 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [08:31:50] (03CR) 10Tiziano Fogli: [C:03+2] monitoring 
services: add migration task as parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150709 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [08:34:07] (03PS12) 10Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [08:34:26] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [08:34:33] (03CR) 10CI reject: [V:04-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [08:35:25] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10898434 (10Jelto) [08:36:08] (03PS1) 10Brouberol: airflow: inject an AIRFLOW_ENVIRONMENT env var with dev/prod values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155123 (https://phabricator.wikimedia.org/T394297) [08:36:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P77435 and previous config saved to /var/cache/conftool/dbconfig/20250610-083622-marostegui.json [08:40:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P77436 and previous config saved to /var/cache/conftool/dbconfig/20250610-084002-fceratto.json [08:40:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:51:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P77437 and previous config saved to /var/cache/conftool/dbconfig/20250610-085128-marostegui.json [08:52:06] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [08:52:58] (03PS1) 10Alexandros Kosiaris: mesh: Add configuration_1.14 (copy/paste patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155125 (https://phabricator.wikimedia.org/T380958) [08:52:59] (03PS1) 10Alexandros Kosiaris: mesh: Support retry_policy for upstream cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155126 (https://phabricator.wikimedia.org/T380958) [08:53:15] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:53:45] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:54:42] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:55:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P77438 and previous config saved to /var/cache/conftool/dbconfig/20250610-085508-fceratto.json [08:56:32] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: 10Sergio Gimeno) [08:56:51] (03PS4) 10Aqu: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) [08:58:17] (03CR) 10CI reject: [V:04-1] Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [09:00:51] (03PS1) 10Jelto: gitlab: remove artifacts from failover backup [puppet] - 10https://gerrit.wikimedia.org/r/1155146 (https://phabricator.wikimedia.org/T378922) [09:06:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T396130)', diff saved to https://phabricator.wikimedia.org/P77439 and previous config saved to /var/cache/conftool/dbconfig/20250610-090635-marostegui.json [09:06:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:06:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [09:07:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: 
Maintenance [09:08:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:10:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T395241)', diff saved to https://phabricator.wikimedia.org/P77440 and previous config saved to /var/cache/conftool/dbconfig/20250610-091016-fceratto.json [09:10:17] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:10:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:10:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T395241)', diff saved to https://phabricator.wikimedia.org/P77441 and previous config saved to /var/cache/conftool/dbconfig/20250610-091040-fceratto.json [09:11:38] (03PS1) 10Effie Mouzeli: orchestrator: testing keyroute Change-Id: I121bac5dcf4ceb2fb23482ef161ab528e6af7757 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155148 [09:12:46] (03CR) 10Alexandros Kosiaris: [C:03+1] "OK, but please put a comment in there explaining why and link to task." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [09:13:53] (03Abandoned) 10Effie Mouzeli: orchestrator: testing keyroute Change-Id: I121bac5dcf4ceb2fb23482ef161ab528e6af7757 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155148 (owner: 10Effie Mouzeli) [09:15:09] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:15:31] (03PS1) 10Majavah: hieradata: Remove old Cloud VPS proxies [puppet] - 10https://gerrit.wikimedia.org/r/1155149 (https://phabricator.wikimedia.org/T379175) [09:17:34] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet,dbprov2003.codfw.wmnet with reason: Downtime hosts for MariaDB 10.11 upgrade [09:18:53] (03PS1) 10Effie Mouzeli: orchestrator: testing mcrouter fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155150 [09:20:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T395241)', diff saved to https://phabricator.wikimedia.org/P77442 and previous config saved to /var/cache/conftool/dbconfig/20250610-092011-fceratto.json [09:20:51] (03CR) 10Majavah: [C:03+2] hieradata: Remove old Cloud VPS proxies [puppet] - 10https://gerrit.wikimedia.org/r/1155149 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [09:24:19] (03PS2) 10Effie Mouzeli: orchestrator: testing mcrouter fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155150 [09:24:53] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet [09:24:59] (03PS1) 10Muehlenhoff: Apply installserver role to install7002 [puppet] - 10https://gerrit.wikimedia.org/r/1155151 (https://phabricator.wikimedia.org/T394263) [09:25:24] (03CR) 10Tiziano Fogli: query-frontend: enable memcached on titan[21]001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1154845 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [09:26:54] !log upgrade db2197 to MariaDB 10.11 T394487 [09:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:57] T394487: Migrate backup sources to MariaDB 10.11 - https://phabricator.wikimedia.org/T394487 [09:26:58] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading clouddbs T394372 [09:27:02] 
T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [09:27:26] (03PS1) 10Jelto: gitlab: bump gitlab-settings to v1.8.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155152 (https://phabricator.wikimedia.org/T395014) [09:27:47] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:28:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: 10Sergio Gimeno) [09:28:39] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade s6, s2 to 10.11 and produce new backups on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1153641 (https://phabricator.wikimedia.org/T395989) (owner: 10Jcrespo) [09:29:09] (03PS4) 10JMeybohm: Use Wikimedia DNS IPs as mock [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 (https://phabricator.wikimedia.org/T396107) [09:29:10] (03PS4) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) [09:29:10] (03PS4) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [09:29:10] (03PS4) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) [09:29:43] (03PS1) 10Filippo Giunchedi: thanos: enable tracing for store [puppet] - 10https://gerrit.wikimedia.org/r/1155153 (https://phabricator.wikimedia.org/T394318) [09:29:48] (03CR) 10FNegri: [C:03+2] clouddb1019: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154803 
(https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [09:30:47] (03PS1) 10Effie Mouzeli: mcrouter: update to 1.3.4 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155154 [09:31:37] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:31:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance [09:32:05] 06SRE, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Technical-Debt: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430#10898749 (10hashar) [09:32:40] (03CR) 10Hashar: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [09:32:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [09:32:49] (03PS9) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [09:32:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T396130)', diff saved to https://phabricator.wikimedia.org/P77443 and previous config saved to /var/cache/conftool/dbconfig/20250610-093252-marostegui.json [09:32:54] (03CR) 10Hashar: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [09:32:57] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:33:09] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet [09:35:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P77444 and 
previous config saved to /var/cache/conftool/dbconfig/20250610-093518-fceratto.json [09:36:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:36:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2030 to es1 master T395241', diff saved to https://phabricator.wikimedia.org/P77445 and previous config saved to /var/cache/conftool/dbconfig/20250610-093628-root.json [09:36:42] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2032.codfw.wmnet [09:37:02] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2032 - Upgrading es2032.codfw.wmnet [09:37:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2032 - Upgrading es2032.codfw.wmnet [09:37:55] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [09:38:27] (03PS1) 10Marostegui: db1187: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155155 (https://phabricator.wikimedia.org/T395989) [09:38:28] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [09:38:44] (03PS3) 10Filippo Giunchedi: query-frontend: enable memcached on titan[21]001 [puppet] - 10https://gerrit.wikimedia.org/r/1154845 (https://phabricator.wikimedia.org/T394319) [09:38:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 T395989', diff saved to https://phabricator.wikimedia.org/P77447 and previous config saved to /var/cache/conftool/dbconfig/20250610-093846-marostegui.json [09:38:50] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [09:39:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1187.eqiad.wmnet with reason: Maintenance [09:39:58] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:40:08] (03CR) 10Marostegui: [C:03+2] db1187: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155155 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [09:40:18] (03CR) 10Filippo Giunchedi: query-frontend: enable memcached on titan[21]001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1154845 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [09:40:55] jmm@cumin1003 drain-node (PID 971328) is awaiting input [09:42:53] (03PS1) 10Effie Mouzeli: mcrouter: update to 1.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155157 [09:43:02] (03CR) 10Btullis: [C:03+1] "Nice and simple, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155123 (https://phabricator.wikimedia.org/T394297) (owner: 10Brouberol) [09:43:41] (03CR) 10Alexandros Kosiaris: [C:03+1] mcrouter: update to 1.3.4 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155154 (owner: 10Effie Mouzeli) [09:43:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2032.codfw.wmnet [09:43:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77448 and previous config saved to /var/cache/conftool/dbconfig/20250610-094345-root.json [09:43:54] aux-k8s-etcd2005 is going down for a Ganeti reboot [09:43:59] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet [09:44:01] (03CR) 10Alexandros Kosiaris: [C:03+1] mcrouter: update to 1.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155157 (owner: 10Effie Mouzeli) [09:44:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2031 to es2 master T395241', diff saved to https://phabricator.wikimedia.org/P77449 and previous config saved to /var/cache/conftool/dbconfig/20250610-094401-root.json [09:44:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling', 
diff saved to https://phabricator.wikimedia.org/P77450 and previous config saved to /var/cache/conftool/dbconfig/20250610-094429-root.json [09:44:52] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2033.codfw.wmnet [09:45:12] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2033 - Upgrading es2033.codfw.wmnet [09:45:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2033 - Upgrading es2033.codfw.wmnet [09:46:00] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 2017 MB (3% inode=95%): /tmp 2017 MB (3% inode=95%): /var/tmp 2017 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [09:46:12] PROBLEM - Host aux-k8s-etcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [09:46:34] !log installing postgresql-15 security updates [09:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:04] (03CR) 10Tiziano Fogli: [C:03+1] query-frontend: enable memcached on titan[21]001 [puppet] - 10https://gerrit.wikimedia.org/r/1154845 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [09:47:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T396130)', diff saved to https://phabricator.wikimedia.org/P77453 and previous config saved to /var/cache/conftool/dbconfig/20250610-094731-marostegui.json [09:47:36] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:47:57] (03CR) 10Brouberol: [C:03+2] airflow: inject an AIRFLOW_ENVIRONMENT env var with dev/prod values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155123 (https://phabricator.wikimedia.org/T394297) (owner: 10Brouberol) [09:48:33] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: 
name=clouddb1019.eqiad.wmnet [09:49:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet [09:49:46] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet [09:50:03] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet [09:50:14] (03PS1) 10Effie Mouzeli: function-orchestrator: update mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155158 (https://phabricator.wikimedia.org/T396074) [09:50:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P77454 and previous config saved to /var/cache/conftool/dbconfig/20250610-095025-fceratto.json [09:50:40] RECOVERY - Host aux-k8s-etcd2005 is UP: PING OK - Packet loss = 0%, RTA = 30.74 ms [09:50:42] (03Abandoned) 10Effie Mouzeli: orchestrator: testing mcrouter fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155150 (owner: 10Effie Mouzeli) [09:51:17] (03CR) 10Muehlenhoff: [C:03+2] Apply installserver role to install7002 [puppet] - 10https://gerrit.wikimedia.org/r/1155151 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:51:21] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.4 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155154 (owner: 10Effie Mouzeli) [09:52:03] (03PS2) 10Effie Mouzeli: mcrouter: update to 1.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155157 [09:52:44] (03Merged) 10jenkins-bot: mcrouter: update to 1.3.4 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155154 (owner: 10Effie Mouzeli) [09:53:36] (03PS1) 10Marostegui: dbstore1009: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155159 (https://phabricator.wikimedia.org/T394373) [09:54:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [09:54:21] (03PS3) 10Effie Mouzeli: mcrouter: update to 1.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155157 [09:54:33] (03CR) 10Marostegui: [C:03+2] dbstore1009: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155159 (https://phabricator.wikimedia.org/T394373) (owner: 10Marostegui) [09:54:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2033.codfw.wmnet [09:55:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77455 and previous config saved to /var/cache/conftool/dbconfig/20250610-095527-root.json [09:55:30] kubestagemaster2004 is going down for a Ganeti reboot [09:55:35] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet [09:57:46] PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [09:58:20] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update to 1.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155157 (owner: 10Effie Mouzeli) [09:58:50] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:58:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77456 and previous config saved to /var/cache/conftool/dbconfig/20250610-095850-root.json [09:59:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77457 and previous config saved to /var/cache/conftool/dbconfig/20250610-095933-root.json [09:59:38] (03Merged) 10jenkins-bot: mcrouter: update to 1.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155157 (owner: 10Effie Mouzeli) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1000) [10:00:30] RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms [10:00:53] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet [10:01:00] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2039.codfw.wmnet [10:01:23] (03PS1) 10Kamila Součková: modules/nginx: install extra modules after main nginx package [puppet] - 10https://gerrit.wikimedia.org/r/1155160 [10:02:24] (03CR) 10Kamila Součková: "In particular, installing the lua module fails without this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155160 (owner: 10Kamila Součková) [10:02:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P77458 and previous config saved to /var/cache/conftool/dbconfig/20250610-100239-marostegui.json [10:02:56] (03PS1) 10Jcrespo: dbbackups: Upgrade dbprov2004, downgrade dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/1155162 (https://phabricator.wikimedia.org/T394487) [10:03:40] (03CR) 10JMeybohm: [C:03+2] Use Wikimedia DNS IPs as mock [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:03:54] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade dbprov2004, downgrade dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/1155162 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [10:05:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T395241)', diff saved to https://phabricator.wikimedia.org/P77459 and previous config saved to /var/cache/conftool/dbconfig/20250610-100532-fceratto.json [10:05:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:05:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T395241)', diff saved to https://phabricator.wikimedia.org/P77460 and previous config saved to /var/cache/conftool/dbconfig/20250610-100558-fceratto.json [10:06:48] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet [10:08:52] !log installing ninja2 security updates [10:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:58] !log installing jinja2 security updates [10:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 
(re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77461 and previous config saved to /var/cache/conftool/dbconfig/20250610-101032-root.json [10:10:44] (03PS1) 10Marostegui: dbstore1008: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155167 (https://phabricator.wikimedia.org/T394373) [10:11:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [10:11:44] (03CR) 10Marostegui: [C:03+2] dbstore1008: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155167 (https://phabricator.wikimedia.org/T394373) (owner: 10Marostegui) [10:12:02] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet [10:13:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77462 and previous config saved to /var/cache/conftool/dbconfig/20250610-101355-root.json [10:14:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T395241)', diff saved to https://phabricator.wikimedia.org/P77463 and previous config saved to /var/cache/conftool/dbconfig/20250610-101406-fceratto.json [10:14:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77464 and previous config saved to /var/cache/conftool/dbconfig/20250610-101438-root.json [10:17:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet [10:17:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet [10:17:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P77465 and previous config saved to 
/var/cache/conftool/dbconfig/20250610-101745-marostegui.json [10:17:58] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet [10:20:54] (03CR) 10Hnowlan: [C:03+2] rest-gateway: enable per-route statistics for all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154854 (owner: 10Hnowlan) [10:22:00] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:23:04] (03Merged) 10jenkins-bot: Use Wikimedia DNS IPs as mock [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:25:33] jmm@cumin1003 drain-node (PID 974671) is awaiting input [10:25:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77466 and previous config saved to /var/cache/conftool/dbconfig/20250610-102538-root.json [10:25:54] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:26:29] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet [10:27:18] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:29:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77467 and previous config saved to /var/cache/conftool/dbconfig/20250610-102900-root.json [10:29:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P77468 and previous config saved to /var/cache/conftool/dbconfig/20250610-102913-fceratto.json [10:29:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77469 and previous config saved to /var/cache/conftool/dbconfig/20250610-102943-root.json [10:31:03] (03PS2) 10Clément Goubert: mw-cron: Disable memory limit [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1155175 (https://phabricator.wikimedia.org/T395436) [10:31:17] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA records for WMCS cloud-private IPs in eqiad - cmooney@cumin1003" [10:31:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA records for WMCS cloud-private IPs in eqiad - cmooney@cumin1003" [10:31:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:31:30] !log taavi@cumin1003 START - Cookbook sre.dns.wipe-cache 'private.eqiad.wikimedia.cloud$' on eqiad recursors [10:31:31] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.eqiad.wikimedia.cloud$' on eqiad recursors [10:31:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2041.codfw.wmnet [10:31:55] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet [10:32:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T396130)', diff saved to https://phabricator.wikimedia.org/P77470 and previous config saved to /var/cache/conftool/dbconfig/20250610-103252-marostegui.json [10:32:56] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:33:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [10:33:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T396130)', diff saved to https://phabricator.wikimedia.org/P77471 and previous config saved to /var/cache/conftool/dbconfig/20250610-103315-marostegui.json [10:33:48] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node 
for draining ganeti node ganeti2042.codfw.wmnet [10:37:13] is jenkins a little under the weather? [10:38:53] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#10899029 (10jijiki) @Jhancock.wm thank you for digging into this! This set of servers is very new, however, they are not in production, which is why there was no alert about the server going offline. Accord... [10:39:04] (03CR) 10Hnowlan: [C:03+1] mw-cron: Disable memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155175 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [10:39:29] (03PS2) 10Effie Mouzeli: function-orchestrator: update mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155158 (https://phabricator.wikimedia.org/T396074) [10:39:35] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:39:35] (03CR) 10Stevemunene: [C:03+2] zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [10:39:56] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Disable memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155175 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [10:40:23] (03Merged) 10jenkins-bot: rest-gateway: enable per-route statistics for all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154854 (owner: 10Hnowlan) [10:40:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77472 and previous config saved to /var/cache/conftool/dbconfig/20250610-104043-root.json [10:41:29] (03PS1) 10Marostegui: db1180: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155176 (https://phabricator.wikimedia.org/T395989) [10:41:32] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet [10:41:41] (03Merged) 
10jenkins-bot: mw-cron: Disable memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155175 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [10:41:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1180 T395989', diff saved to https://phabricator.wikimedia.org/P77473 and previous config saved to /var/cache/conftool/dbconfig/20250610-104143-marostegui.json [10:41:46] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [10:42:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:42:13] (03CR) 10Marostegui: [C:03+2] db1180: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155176 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [10:42:34] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:42:45] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change AAAA records for eqiad cloudsw cloud-private GW IRB address - cmooney@cumin1003" [10:42:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change AAAA records for eqiad cloudsw cloud-private GW IRB address - cmooney@cumin1003" [10:42:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:42:54] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:42:54] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:43:44] PROBLEM - TFTP service on install7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [10:43:44] !log hnowlan@deploy1003 helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply [10:44:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77474 and previous config saved to /var/cache/conftool/dbconfig/20250610-104406-root.json [10:44:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P77475 and previous config saved to /var/cache/conftool/dbconfig/20250610-104419-fceratto.json [10:44:31] !log taavi@cumin1003 START - Cookbook sre.dns.wipe-cache 'private.eqiad.wikimedia.cloud$' on eqiad recursors [10:44:32] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.eqiad.wikimedia.cloud$' on eqiad recursors [10:44:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77476 and previous config saved to /var/cache/conftool/dbconfig/20250610-104449-root.json [10:45:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77477 and previous config saved to /var/cache/conftool/dbconfig/20250610-104556-root.json [10:46:59] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet [10:47:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2042.codfw.wmnet [10:47:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T396130)', diff saved to https://phabricator.wikimedia.org/P77478 and previous config saved to /var/cache/conftool/dbconfig/20250610-104745-marostegui.json [10:47:49] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:48:06] (03PS1) 10Joal: Temporarily bump analytics webrequest retention [puppet] - 
10https://gerrit.wikimedia.org/r/1155178 (https://phabricator.wikimedia.org/T395934) [10:53:23] (03CR) 10Clément Goubert: [C:03+1] mesh: Add configuration_1.14 (copy/paste patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155125 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [10:53:44] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2043.codfw.wmnet [10:54:25] (03CR) 10Stevemunene: [C:03+2] Temporarily bump analytics webrequest retention [puppet] - 10https://gerrit.wikimedia.org/r/1155178 (https://phabricator.wikimedia.org/T395934) (owner: 10Joal) [10:55:25] (03PS1) 10Majavah: hieradata: cloudlb: Bind codfw1dev mysql listener to VIP [puppet] - 10https://gerrit.wikimedia.org/r/1155180 [10:55:25] (03PS1) 10Majavah: hieradata: Announce eqiad1 OpenStack API VIP on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) [10:55:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77479 and previous config saved to /var/cache/conftool/dbconfig/20250610-105548-root.json [10:57:31] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155126 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [10:57:54] 06SRE, 10Observability-Alerting, 07SecTeam-Processed, 07Security: Update MediaWikiElevatedUnknownLogins alert recipients - https://phabricator.wikimedia.org/T395117#10899102 (10kostajh) >>! In T395117#10887774, @tappof wrote: > Please, have a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+... 
[10:58:06] (03PS3) 10JMeybohm: Make simple-cfssl usable for local WMF PKI deployments [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1154266 (https://phabricator.wikimedia.org/T396107) [10:58:12] (03PS2) 10Majavah: hieradata: cloudlb: Bind codfw1dev mysql listener to VIP [puppet] - 10https://gerrit.wikimedia.org/r/1155180 [10:58:12] (03PS2) 10Majavah: hieradata: Announce eqiad1 OpenStack API VIP on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) [10:58:12] (03PS1) 10Majavah: cloudlb: Support firewall config with multiple frontends on same port [puppet] - 10https://gerrit.wikimedia.org/r/1155182 [10:58:56] (03CR) 10CI reject: [V:04-1] cloudlb: Support firewall config with multiple frontends on same port [puppet] - 10https://gerrit.wikimedia.org/r/1155182 (owner: 10Majavah) [10:59:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77480 and previous config saved to /var/cache/conftool/dbconfig/20250610-105911-root.json [10:59:12] (03PS1) 10Novem Linguae: tables-catalog: add PageTriage [puppet] - 10https://gerrit.wikimedia.org/r/1155183 (https://phabricator.wikimedia.org/T391582) [10:59:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T395241)', diff saved to https://phabricator.wikimedia.org/P77481 and previous config saved to /var/cache/conftool/dbconfig/20250610-105926-fceratto.json [10:59:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:59:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T395241)', diff saved to https://phabricator.wikimedia.org/P77482 and previous config saved to /var/cache/conftool/dbconfig/20250610-105951-fceratto.json [11:00:00] (03PS2) 10Majavah: cloudlb: Support firewall config with multiple frontends on same port [puppet] - 
10https://gerrit.wikimedia.org/r/1155182 [11:00:00] (03PS3) 10Majavah: hieradata: cloudlb: Bind codfw1dev mysql listener to VIP [puppet] - 10https://gerrit.wikimedia.org/r/1155180 [11:00:00] (03PS3) 10Majavah: hieradata: Announce eqiad1 OpenStack API VIP on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) [11:00:27] (03CR) 10CI reject: [V:04-1] cloudlb: Support firewall config with multiple frontends on same port [puppet] - 10https://gerrit.wikimedia.org/r/1155182 (owner: 10Majavah) [11:01:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77483 and previous config saved to /var/cache/conftool/dbconfig/20250610-110101-root.json [11:01:14] (03PS3) 10Majavah: cloudlb: Support firewall config with multiple frontends on same port [puppet] - 10https://gerrit.wikimedia.org/r/1155182 [11:01:14] (03PS4) 10Majavah: hieradata: cloudlb: Bind codfw1dev mysql listener to VIP [puppet] - 10https://gerrit.wikimedia.org/r/1155180 [11:01:14] (03PS4) 10Majavah: hieradata: Announce eqiad1 OpenStack API VIP on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) [11:01:15] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:01:28] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:02:07] (03PS2) 10Novem Linguae: tables-catalog: add PageTriage [puppet] - 10https://gerrit.wikimedia.org/r/1155183 (https://phabricator.wikimedia.org/T391582) [11:02:09] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: add PageTriage [puppet] - 10https://gerrit.wikimedia.org/r/1155183 (https://phabricator.wikimedia.org/T391582) (owner: 10Novem Linguae) [11:02:11] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5878/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1155180 (owner: 10Majavah) [11:02:12] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: add PageTriage [puppet] - 10https://gerrit.wikimedia.org/r/1155183 (https://phabricator.wikimedia.org/T391582) (owner: 10Novem Linguae) [11:02:50] jmm@cumin1003 drain-node (PID 980040) is awaiting input [11:02:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P77484 and previous config saved to /var/cache/conftool/dbconfig/20250610-110252-marostegui.json [11:03:55] (03PS1) 10Clément Goubert: build2003: Add partman and site.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1155184 (https://phabricator.wikimedia.org/T393015) [11:04:36] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:04:44] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:06:59] (03PS4) 10Majavah: cloudlb: Support firewall config with multiple frontends on same port [puppet] - 10https://gerrit.wikimedia.org/r/1155182 [11:06:59] (03PS5) 10Majavah: hieradata: cloudlb: Bind codfw1dev mysql listener to VIP [puppet] - 10https://gerrit.wikimedia.org/r/1155180 [11:07:00] (03PS5) 10Majavah: hieradata: Announce eqiad1 OpenStack API VIP on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) [11:08:03] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5879/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155180 (owner: 10Majavah) [11:08:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T395241)', diff saved to https://phabricator.wikimedia.org/P77485 and previous config saved to /var/cache/conftool/dbconfig/20250610-110859-fceratto.json [11:09:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) 
- https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:11] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5880/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [11:10:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77486 and previous config saved to /var/cache/conftool/dbconfig/20250610-111054-root.json [11:13:36] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [11:14:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:28] (03PS1) 10Marostegui: db1168: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155188 (https://phabricator.wikimedia.org/T395989) [11:14:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1168 T395989', diff saved to https://phabricator.wikimedia.org/P77487 and previous config saved to /var/cache/conftool/dbconfig/20250610-111440-marostegui.json [11:14:44] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [11:14:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:15:11] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet [11:15:16] (03CR) 10Alexandros 
Kosiaris: [C:03+1] function-orchestrator: update mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155158 (https://phabricator.wikimedia.org/T396074) (owner: 10Effie Mouzeli) [11:15:17] (03CR) 10Marostegui: [C:03+2] db1168: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155188 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [11:16:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77488 and previous config saved to /var/cache/conftool/dbconfig/20250610-111606-root.json [11:16:47] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1155182 (owner: 10Majavah) [11:17:09] (03PS5) 10Majavah: cloudlb: Support firewall config with multiple frontends on same port [puppet] - 10https://gerrit.wikimedia.org/r/1155182 [11:17:09] (03PS6) 10Majavah: hieradata: cloudlb: Bind codfw1dev mysql listener to VIP [puppet] - 10https://gerrit.wikimedia.org/r/1155180 [11:17:09] (03PS6) 10Majavah: hieradata: Announce eqiad1 OpenStack API VIP on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) [11:17:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P77489 and previous config saved to /var/cache/conftool/dbconfig/20250610-111759-marostegui.json [11:18:16] (03CR) 10Btullis: [C:03+1] airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) (owner: 10Brouberol) [11:18:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77490 and 
previous config saved to /var/cache/conftool/dbconfig/20250610-111856-root.json [11:19:06] (03CR) 10Alexandros Kosiaris: [C:03+1] "I don't recall what the exact issue was with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135385, we should retry that. But I too" [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) (owner: 10Ladsgroup) [11:20:02] (03CR) 10Btullis: [C:03+1] airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) (owner: 10Brouberol) [11:20:59] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet [11:21:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2043.codfw.wmnet [11:21:22] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2044.codfw.wmnet [11:21:25] (03CR) 10Majavah: [C:03+2] cloudlb: Support firewall config with multiple frontends on same port [puppet] - 10https://gerrit.wikimedia.org/r/1155182 (owner: 10Majavah) [11:21:32] (03CR) 10Majavah: [C:03+2] hieradata: cloudlb: Bind codfw1dev mysql listener to VIP [puppet] - 10https://gerrit.wikimedia.org/r/1155180 (owner: 10Majavah) [11:23:16] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P77491 and previous config saved to /var/cache/conftool/dbconfig/20250610-112406-fceratto.json [11:26:03] kubestagemaster2003 is going down for a Ganeti reboot [11:26:20] 
!log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet [11:26:57] (03CR) 10Majavah: [C:03+2] hieradata: Announce eqiad1 OpenStack API VIP on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155181 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [11:28:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:18] PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:29] (03PS3) 10Ladsgroup: mariadb: Comment out future sections [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) [11:28:36] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Comment out future sections [puppet] - 10https://gerrit.wikimedia.org/r/1153573 (https://phabricator.wikimedia.org/T395999) (owner: 10Ladsgroup) [11:28:43] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [11:28:46] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:30:33] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.61 ms [11:30:45] RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.72 ms [11:31:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77492 and previous config saved to /var/cache/conftool/dbconfig/20250610-113112-root.json [11:31:39] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ganeti2044.codfw.wmnet [11:31:46] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2044.codfw.wmnet [11:32:08] (03CR) 10Jbond: [WIP] gNMI: spread targets on multiple netflow hosts (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [11:33:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T396130)', diff saved to https://phabricator.wikimedia.org/P77493 and previous config saved to /var/cache/conftool/dbconfig/20250610-113306-marostegui.json [11:33:09] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:33:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2189.codfw.wmnet with reason: Maintenance [11:33:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T396130)', diff saved to https://phabricator.wikimedia.org/P77494 and previous config saved to /var/cache/conftool/dbconfig/20250610-113328-marostegui.json [11:34:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77495 and previous config saved to /var/cache/conftool/dbconfig/20250610-113401-root.json [11:35:32] !log cgoubert@deploy1003 Started scap sync-world: mediawiki-cli: Fix the paths of some of the dumps scripts and config files - T394389 [11:35:36] btullis: ^ [11:35:37] T394389: Migrate the additional dump types from snapshot1016 to Airflow - https://phabricator.wikimedia.org/T394389 [11:37:51] !log failover Ganeti master in codfw to ganeti2032 [11:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P77497 and previous config saved to 
/var/cache/conftool/dbconfig/20250610-113913-fceratto.json [11:40:13] PROBLEM - ganeti-wconfd running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:44:22] !log cgoubert@deploy1003 Finished scap sync-world: mediawiki-cli: Fix the paths of some of the dumps scripts and config files - T394389 (duration: 08m 49s) [11:44:26] T394389: Migrate the additional dump types from snapshot1016 to Airflow - https://phabricator.wikimedia.org/T394389 [11:44:28] btullis: all done & [11:45:16] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T396130)', diff saved to https://phabricator.wikimedia.org/P77499 and previous config saved to /var/cache/conftool/dbconfig/20250610-114556-marostegui.json [11:46:00] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:46:17] (03PS1) 10Filippo Giunchedi: thanos: enforce series limit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) [11:46:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77500 and previous config saved to /var/cache/conftool/dbconfig/20250610-114617-root.json [11:47:28] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host install7002.wikimedia.org [11:47:41] (03PS1) 10Marostegui: dbstore1007: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155191 (https://phabricator.wikimedia.org/T394373) [11:47:56] !log 
marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:48:02] !log installing qemu bugfix updates [11:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77501 and previous config saved to /var/cache/conftool/dbconfig/20250610-114906-root.json [11:50:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:27] PROBLEM - Zookeeper Server on an-conf1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [11:54:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T395241)', diff saved to https://phabricator.wikimedia.org/P77502 and previous config saved to /var/cache/conftool/dbconfig/20250610-115419-fceratto.json [11:54:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [11:54:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T395241)', diff saved to https://phabricator.wikimedia.org/P77503 and previous config saved to /var/cache/conftool/dbconfig/20250610-115444-fceratto.json [11:54:46] (03Abandoned) 10Jforrester: function-orchestrator: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/944159 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [11:54:54] (03Abandoned) 10Jforrester: wikifunctions: enable mcrouter 
for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/944160 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [11:57:20] (03CR) 10Jforrester: [C:03+1] "Thanks! Should I try to deploy this, or did you want to?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155158 (https://phabricator.wikimedia.org/T396074) (owner: 10Effie Mouzeli) [11:58:16] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:01] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10899258 (10MoritzMuehlenhoff) [11:59:17] (03PS1) 10Majavah: Add include for WMCS eqiad1 service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155194 (https://phabricator.wikimedia.org/T379282) [11:59:19] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host install7002.wikimedia.org [11:59:52] (03CR) 10CI reject: [V:04-1] Add include for WMCS eqiad1 service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155194 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1200) [12:00:33] 06SRE, 10Legalpad, 10Phabricator: Allow aklapper to view/edit L3 - https://phabricator.wikimedia.org/T394966#10899262 (10Aklapper) Thank you! 
Confirming it works [12:00:39] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155195 [12:01:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P77504 and previous config saved to /var/cache/conftool/dbconfig/20250610-120103-marostegui.json [12:01:17] (03CR) 10Cathal Mooney: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1155194 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [12:01:24] jouncebot: nowandnext [12:01:24] For the next 0 hour(s) and 58 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1200) [12:01:24] In 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1300) [12:01:32] (03CR) 10Filippo Giunchedi: [C:03+2] query-frontend: enable memcached on titan[21]001 [puppet] - 10https://gerrit.wikimedia.org/r/1154845 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [12:02:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T395241)', diff saved to https://phabricator.wikimedia.org/P77505 and previous config saved to /var/cache/conftool/dbconfig/20250610-120249-fceratto.json [12:03:15] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:20] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [12:03:34] (03PS1) 10Muehlenhoff: Failover cas [dns] - 10https://gerrit.wikimedia.org/r/1155197 [12:04:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 60%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77506 and previous config saved to /var/cache/conftool/dbconfig/20250610-120412-root.json [12:05:43] (03CR) 10Muehlenhoff: [C:03+2] Failover cas [dns] - 10https://gerrit.wikimedia.org/r/1155197 (owner: 10Muehlenhoff) [12:05:47] !log jmm@dns1004 START - running authdns-update [12:06:32] !log jmm@dns1004 END - running authdns-update [12:06:46] (03CR) 10Jbond: [WIP] gNMI: spread targets on multiple netflow hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [12:06:47] !log taavi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:06:48] (03CR) 10Jelto: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [12:06:49] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [12:07:13] (03PS1) 10Hnowlan: trafficserver::multi-dc: route PATCH and DELETE to primary DC [puppet] - 10https://gerrit.wikimedia.org/r/1155198 (https://phabricator.wikimedia.org/T387509) [12:09:27] RECOVERY - Zookeeper Server on an-conf1004 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [12:10:46] !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add AAAA record for openstack.eqiad1.wikimediacloud.org - taavi@cumin1003" [12:11:03] (03CR) 10Majavah: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1155194 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [12:11:04] !log taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add AAAA record for openstack.eqiad1.wikimediacloud.org - taavi@cumin1003" [12:11:05] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:11:27] (03PS2) 10Majavah: 
Add include for WMCS eqiad1 service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155194 (https://phabricator.wikimedia.org/T379282) [12:12:00] (03PS5) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) [12:12:00] (03PS5) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [12:12:00] (03PS5) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) [12:12:14] (03CR) 10Majavah: [C:03+2] Add include for WMCS eqiad1 service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155194 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [12:12:17] !log taavi@dns1004 START - running authdns-update [12:13:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:13:10] !log taavi@dns1004 END - running authdns-update [12:13:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:15:40] !log Ran fixStuckGlobalRename.php for T396371 and T396452 [12:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:45] T396371: Unblock stuck global rename of ReiKaze - https://phabricator.wikimedia.org/T396371 [12:15:45] T396452: Unblock stuck global rename of Jacob Holmén-Holmgren - https://phabricator.wikimedia.org/T396452 [12:16:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P77507 and previous config saved to /var/cache/conftool/dbconfig/20250610-121610-marostegui.json [12:16:28] (03CR) 10JMeybohm: [C:03+1] trafficserver::multi-dc: route 
PATCH and DELETE to primary DC [puppet] - 10https://gerrit.wikimedia.org/r/1155198 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [12:17:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P77508 and previous config saved to /var/cache/conftool/dbconfig/20250610-121756-fceratto.json [12:18:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10899361 (10elukey) [12:19:06] (03PS7) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [12:19:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77509 and previous config saved to /var/cache/conftool/dbconfig/20250610-121917-root.json [12:19:21] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache openstack.eqiad1.wikimediacloud.org on all recursors [12:19:24] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) openstack.eqiad1.wikimediacloud.org on all recursors [12:19:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10899371 (10GGoncalves-WMF) 05Resolved→03Open Hi, I was testing these credentials in the past couple of days. I can use Superset (so I'm in `analy... 
[12:20:05] (03PS4) 10Stevemunene: zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) [12:21:13] (03PS1) 10Elukey: admin: add santhosh to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1155203 (https://phabricator.wikimedia.org/T394740) [12:22:26] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155201 [12:22:26] (03CR) 10Jakob: "Need a +1 before deployment. The list of available images is here: https://docker-registry.wikimedia.org/repos/wmde/wikidata-query-gui/tag" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155201 (owner: 10Jakob) [12:23:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1155203 (https://phabricator.wikimedia.org/T394740) (owner: 10Elukey) [12:24:14] (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver::multi-dc: route PATCH and DELETE to primary DC [puppet] - 10https://gerrit.wikimedia.org/r/1155198 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [12:27:27] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [12:28:35] (03PS1) 10JMeybohm: CI: Remove invasive log message on helmfile compilation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155204 (https://phabricator.wikimedia.org/T396234) [12:30:29] (03CR) 10AOkoth: [C:03+1] gitlab: remove artifacts from failover backup [puppet] - 10https://gerrit.wikimedia.org/r/1155146 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [12:30:37] (03PS1) 10Muehlenhoff: Revert back to install7001 [puppet] - 10https://gerrit.wikimedia.org/r/1155206 [12:31:10] (03CR) 10AOkoth: [C:03+1] gitlab: bump gitlab-settings to v1.8.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155152 (https://phabricator.wikimedia.org/T395014) (owner: 10Jelto) [12:31:18] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2189 (T396130)', diff saved to https://phabricator.wikimedia.org/P77510 and previous config saved to /var/cache/conftool/dbconfig/20250610-123117-marostegui.json [12:31:21] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:31:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance [12:31:38] (03CR) 10Arnaudb: [C:03+1] gitlab: bump gitlab-settings to v1.8.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155152 (https://phabricator.wikimedia.org/T395014) (owner: 10Jelto) [12:31:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T396130)', diff saved to https://phabricator.wikimedia.org/P77511 and previous config saved to /var/cache/conftool/dbconfig/20250610-123140-marostegui.json [12:32:37] 06SRE, 07SRE-Unowned: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#10899417 (10elukey) @Scott_French Today I got a 400 Content-Too-Large from Google for an Arelion event, I tried to manually decrease `MAX_PATH_QUERY_LEN = 14384;` (instead of 16384... 
[12:33:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P77512 and previous config saved to /var/cache/conftool/dbconfig/20250610-123303-fceratto.json [12:34:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77513 and previous config saved to /var/cache/conftool/dbconfig/20250610-123422-root.json [12:34:26] (03CR) 10Effie Mouzeli: [C:03+2] function-orchestrator: update mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155158 (https://phabricator.wikimedia.org/T396074) (owner: 10Effie Mouzeli) [12:34:36] (03CR) 10Muehlenhoff: [C:03+2] Revert back to install7001 [puppet] - 10https://gerrit.wikimedia.org/r/1155206 (owner: 10Muehlenhoff) [12:34:39] jmm@cumin1003 drain-node (PID 989348) is awaiting input [12:35:23] FYI, aux-k8s-etcd1003, dse-k8s-etcd1001 and kubestagemaster1005 will go down for a Ganeti reboot [12:35:29] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [12:35:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T396130)', diff saved to https://phabricator.wikimedia.org/P77514 and previous config saved to /var/cache/conftool/dbconfig/20250610-123541-marostegui.json [12:35:53] (03Merged) 10jenkins-bot: function-orchestrator: update mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155158 (https://phabricator.wikimedia.org/T396074) (owner: 10Effie Mouzeli) [12:37:13] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:41] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:41] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:26] jouncebot: now [12:39:26] For the next 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1200) [12:40:41] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [12:40:45] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [12:41:09] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [12:41:32] (03PS1) 10Majavah: hieradata: cloudlb: Add IPv6 listeners for wiki replica endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1155209 (https://phabricator.wikimedia.org/T379282) [12:41:51] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [12:41:58] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [12:43:16] RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:28] (03PS2) 10Majavah: hieradata: cloudlb: Add IPv6 listeners for wiki replica endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1155209 (https://phabricator.wikimedia.org/T396451) [12:45:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5883/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155209 (https://phabricator.wikimedia.org/T396451) (owner: 10Majavah) [12:46:23] (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155201 (owner: 10Jakob) [12:47:08] (03PS1) 10Ladsgroup: [WIP] mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [12:47:28] !log 
jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [12:47:49] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [12:48:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T395241)', diff saved to https://phabricator.wikimedia.org/P77515 and previous config saved to /var/cache/conftool/dbconfig/20250610-124810-fceratto.json [12:48:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [12:48:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T395241)', diff saved to https://phabricator.wikimedia.org/P77516 and previous config saved to /var/cache/conftool/dbconfig/20250610-124835-fceratto.json [12:48:41] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host install7002.wikimedia.org with OS bullseye [12:48:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10899475 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye [12:50:06] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155201 (owner: 10Jakob) [12:50:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P77517 and previous config saved to /var/cache/conftool/dbconfig/20250610-125048-marostegui.json [12:51:34] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155201 (owner: 10Jakob) [12:51:38] (03PS1) 10JMeybohm: Add a script to visualize the dependencies of admin_ng 
environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155212 (https://phabricator.wikimedia.org/T389080) [12:52:14] (03PS1) 10Majavah: openstack: wmcs-wikireplica-dns: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1155213 (https://phabricator.wikimedia.org/T396451) [12:52:15] (03PS1) 10Majavah: openstack: wmcs-wikireplica-dns: Add data for AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1155214 (https://phabricator.wikimedia.org/T396451) [12:52:34] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [12:52:50] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [12:52:51] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [12:53:14] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [12:53:22] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [12:53:35] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [12:53:39] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install7002.wikimedia.org with OS bullseye [12:53:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10899507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye executed with err... 
[12:54:17] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host install7002.wikimedia.org with OS bullseye [12:54:25] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [12:54:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10899508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye [12:54:43] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [12:56:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T395241)', diff saved to https://phabricator.wikimedia.org/P77518 and previous config saved to /var/cache/conftool/dbconfig/20250610-125641-fceratto.json [12:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [12:58:34] (03PS2) 10JMeybohm: Add a script to visualize the dependencies of admin_ng environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155212 (https://phabricator.wikimedia.org/T389080) [12:58:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:58:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:59:04] jmm@cumin1003 reimage (PID 992184) is awaiting input [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1300). 
[13:00:05] DreamRimmer and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:22] o/ [13:00:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:00:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:01:47] I can self-deploy my change [13:03:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01) [13:05:10] o/ [13:05:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: 10Sergio Gimeno) [13:05:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P77519 and previous config saved to /var/cache/conftool/dbconfig/20250610-130555-marostegui.json [13:06:36] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: enable limiting add a link task via config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: 10Sergio Gimeno) [13:06:55] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1024.eqiad.wmnet [13:06:59] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [13:07:39] Hi @DreamRimmer, I'm self-deploying my patch, can you do yours? 
[13:08:03] I don’t have access [13:09:38] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [13:10:17] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:11:25] I can assist with the deployment but not sure how to test it [13:11:36] I can test [13:11:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P77520 and previous config saved to /var/cache/conftool/dbconfig/20250610-131148-fceratto.json [13:12:54] hey, sergi0 asked me for 2o on the patch DreamRimmer scheduled [13:12:58] looking right now [13:12:58] (03CR) 10Tiziano Fogli: [C:03+1] thanos: enforce series limit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [13:13:03] (03CR) 10Elukey: [C:03+2] admin: add santhosh to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1155203 (https://phabricator.wikimedia.org/T394740) (owner: 10Elukey) [13:13:21] thanks to both of you [13:13:47] Hi sergi0/urbanecm, I also happen to have a patch (https://gerrit.wikimedia.org/r/1083870). I was not sure of my availability, so did not schedule it before. Would you be willing to help with deploying if I schedule it now? I can test it. [13:14:04] bunnypranav: that is the same patch DreamRimmer has scheduled [13:14:44] Wrong link, my bad [13:14:52] https://gerrit.wikimedia.org/r/c/1154369/ is the correct one. [13:15:07] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10899567 (10elukey) 05In progress→03Resolved All done! 
[13:15:37] bunnypranav: i'll take a look at it in a sec [13:15:42] Thanks! [13:15:42] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [13:15:48] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [13:16:39] DreamRimmer: i see the last info on the task from TSP is https://phabricator.wikimedia.org/T378287#10797991, which says "we will give notice here once this is ready to move forward". i don't see that on the task, only a summary from NovemLinguae that links to a checkbox checked _prior_ to mszabo saying they'll comment in the future [13:16:56] let me quickly sync with mszabo to double check [13:17:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154369 (https://phabricator.wikimedia.org/T396178) (owner: 10Bunnypranav) [13:17:04] urbanecm: yes this is good to go sorry [13:17:15] mszabo: okay, you were quicker than me slacking you. thanks! [13:17:20] in that case, this SGTM as well [13:17:47] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install7002.wikimedia.org with OS bullseye [13:17:49] sergi0: are you comfortable with backporting? i can safeguard the testing process (but tbh if DreamRimmer confirms it looks good on mwdebug, then it should be okay syncing) [13:18:01] yep can do [13:18:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10899571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye executed with err... 
[13:18:18] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [13:18:19] (03CR) 10Urbanecm: [C:03+1] "Per conversation with Maté/TSP on IRC:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer) [13:18:30] +1'ed on the patch too [13:18:39] I'm around if a friendly enwiki sysop is needed for testing the patch :) [13:18:48] what about bunnypranav 's can we do both? [13:18:54] I am okay! [13:18:58] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host install7002.wikimedia.org with OS bullseye [13:19:06] sergi0: let me double check it [13:19:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10899576 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye [13:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T396130)', diff saved to https://phabricator.wikimedia.org/P77521 and previous config saved to /var/cache/conftool/dbconfig/20250610-132102-marostegui.json [13:21:06] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:21:17] (03CR) 10Urbanecm: [C:03+1] "This would require any new user to wait 4 days before being able to edit, but if cawikimedia is okay with that..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154369 (https://phabricator.wikimedia.org/T396178) (owner: 10Bunnypranav) [13:21:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2225.codfw.wmnet with reason: Maintenance [13:21:22] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [13:21:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2225 (T396130)', diff saved to https://phabricator.wikimedia.org/P77522 and previous config saved to /var/cache/conftool/dbconfig/20250610-132124-marostegui.json [13:21:26] sergi0: +1'ed too, let's do both [13:21:32] (they can go at once too if you're comfortable with that) [13:21:33] Thanks! :) [13:21:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer) [13:21:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154369 (https://phabricator.wikimedia.org/T396178) (owner: 10Bunnypranav) [13:22:37] (03Merged) 10jenkins-bot: Enable electionclerk user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer) [13:22:41] (03Merged) 10jenkins-bot: core-Permissions:Restrict editing on cawikimedia to autoconfirmed only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154369 (https://phabricator.wikimedia.org/T396178) (owner: 10Bunnypranav) [13:22:57] !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1083870|Enable electionclerk user group on enwiki (T378287)]], [[gerrit:1154369|core-Permissions:Restrict editing on cawikimedia to autoconfirmed only (T396178)]] [13:23:01] T378287: Enable SecurePoll extension and electionclerk user group on enwiki - 
https://phabricator.wikimedia.org/T378287 [13:23:02] T396178: Restrict editing on ca.wikimedia.org to autoconfirmed users only - https://phabricator.wikimedia.org/T396178 [13:23:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:57] James_F: I have deployed on codfw, how can I check if the app is ok? is there a dashboard for the orchestrator? [13:24:58] !log sgimeno@deploy1003 bunnypranav, dreamrimmer, sgimeno: Backport for [[gerrit:1083870|Enable electionclerk user group on enwiki (T378287)]], [[gerrit:1154369|core-Permissions:Restrict editing on cawikimedia to autoconfirmed only (T396178)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:25:30] @bunnypranav @DreamRimmer can you test please? [13:25:38] On it. [13:25:48] (03PS1) 10Majavah: P:openstack: pdns: auth: Bind the API on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) [13:25:50] (03PS1) 10Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) [13:26:28] Looks good [13:26:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P77523 and previous config saved to /var/cache/conftool/dbconfig/20250610-132655-fceratto.json [13:26:59] +1 [13:27:03] Looks good! [13:27:05] (03PS1) 10Kamila Součková: Add fake hcaptcha proxy secrets. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T381265) [13:27:13] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5884/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:27:14] Alright, syncing [13:27:22] !log sgimeno@deploy1003 bunnypranav, dreamrimmer, sgimeno: Continuing with sync [13:27:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [13:27:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [13:30:56] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [13:31:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance [13:31:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10899661 (10VRiley-WMF) [13:32:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc1 T378715', diff saved to https://phabricator.wikimedia.org/P77524 and previous config saved to /var/cache/conftool/dbconfig/20250610-133207-marostegui.json [13:32:13] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715 [13:32:34] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [13:33:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10899668 (10VRiley-WMF) @bking It looks like cirrussearch1063 is 
saying it's still active, would you be able to put this into t... [13:33:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:37] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [13:34:20] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1083870|Enable electionclerk user group on enwiki (T378287)]], [[gerrit:1154369|core-Permissions:Restrict editing on cawikimedia to autoconfirmed only (T396178)]] (duration: 11m 22s) [13:34:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T396130)', diff saved to https://phabricator.wikimedia.org/P77525 and previous config saved to /var/cache/conftool/dbconfig/20250610-133424-marostegui.json [13:34:24] T378287: Enable SecurePoll extension and electionclerk user group on enwiki - https://phabricator.wikimedia.org/T378287 [13:34:25] T396178: Restrict editing on ca.wikimedia.org to autoconfirmed users only - https://phabricator.wikimedia.org/T396178 [13:34:27] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:34:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10899672 (10VRiley-WMF) [13:34:48] @DreamRimmer @bunnypranav your changes are live [13:35:21] Thanks :) [13:35:36] Thanks a lot :D [13:36:28] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [13:36:57] (03CR) 10FNegri: [C:03+1] openstack: wmcs-wikireplica-dns: Add data for AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1155214 (https://phabricator.wikimedia.org/T396451) (owner: 
10Majavah) [13:37:14] (03CR) 10FNegri: [C:03+1] openstack: wmcs-wikireplica-dns: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1155213 (https://phabricator.wikimedia.org/T396451) (owner: 10Majavah) [13:37:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10899693 (10elukey) @GGoncalves-WMF Hi! I verified that your user wasn't there, I created it now. You should have received an email with the instructio... [13:37:54] (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Add IPv6 listeners for wiki replica endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1155209 (https://phabricator.wikimedia.org/T396451) (owner: 10Majavah) [13:37:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:38:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:38:52] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [13:38:53] (03CR) 10FNegri: [C:03+1] P:wmcs::metricsinfra: Log all alerts [puppet] - 10https://gerrit.wikimedia.org/r/1154841 (https://phabricator.wikimedia.org/T396038) (owner: 10Majavah) [13:38:59] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [13:39:58] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [13:41:05] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: cloudlb: Add IPv6 listeners for wiki replica endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1155209 (https://phabricator.wikimedia.org/T396451) (owner: 10Majavah) [13:41:12] (03CR) 10Majavah: [C:03+2] openstack: wmcs-wikireplica-dns: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1155213 (https://phabricator.wikimedia.org/T396451) (owner: 10Majavah) [13:41:23] 
(03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::metricsinfra: Log all alerts [puppet] - 10https://gerrit.wikimedia.org/r/1154841 (https://phabricator.wikimedia.org/T396038) (owner: 10Majavah) [13:42:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T395241)', diff saved to https://phabricator.wikimedia.org/P77526 and previous config saved to /var/cache/conftool/dbconfig/20250610-134202-fceratto.json [13:42:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [13:42:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T395241)', diff saved to https://phabricator.wikimedia.org/P77527 and previous config saved to /var/cache/conftool/dbconfig/20250610-134227-fceratto.json [13:44:05] (03PS1) 10Fabfur: wikimedia.org: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1155227 (https://phabricator.wikimedia.org/T396188) [13:44:42] (03CR) 10Btullis: [C:03+1] zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [13:44:44] effie: Sorry, there's a script in deployment-charts/helmfile.d/services/wikifunctions, check-wf-services.sh, that will show if the services are working as expected. 
[13:44:46] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [13:45:31] (03PS2) 10Majavah: P:openstack: pdns: auth: Bind the API on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) [13:45:32] (03PS2) 10Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) [13:45:32] (03PS1) 10Majavah: P:openstack: pdns: auth: Support binding on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) [13:46:29] (03CR) 10Ssingh: [C:03+1] wikimedia.org: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1155227 (https://phabricator.wikimedia.org/T396188) (owner: 10Fabfur) [13:46:43] (03PS3) 10Jforrester: wikifunctions: Update orchestrator from 2025-05-21-192453 to 2025-06-04-185118 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153617 (https://phabricator.wikimedia.org/T391971) [13:46:50] (03PS1) 10FNegri: Revert "maintain-dbusers: Revert overly strict type" [puppet] - 10https://gerrit.wikimedia.org/r/1155229 [13:47:23] (03PS1) 10Filippo Giunchedi: hieradata: enable memcache on all titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/1155231 (https://phabricator.wikimedia.org/T394319) [13:47:31] !log taavi@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet [13:48:30] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-05-21-192453 to 2025-06-04-185118 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153617 (https://phabricator.wikimedia.org/T391971) (owner: 10Jforrester) [13:48:42] (03CR) 10Fabfur: [C:03+2] wikimedia.org: add google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/1155227 (https://phabricator.wikimedia.org/T396188) (owner: 10Fabfur) [13:48:44] !log jforrester@deploy1003 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [13:48:52] !log fabfur@dns1004 START - running authdns-update [13:48:56] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:49:07] (03CR) 10Muehlenhoff: [C:03+2] ssh: Stop managing /run/sshd with Trixie and later [puppet] - 10https://gerrit.wikimedia.org/r/1154261 (owner: 10Muehlenhoff) [13:49:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P77528 and previous config saved to /var/cache/conftool/dbconfig/20250610-134931-marostegui.json [13:49:38] !log fabfur@dns1004 END - running authdns-update [13:49:58] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-05-21-192453 to 2025-06-04-185118 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153617 (https://phabricator.wikimedia.org/T391971) (owner: 10Jforrester) [13:50:09] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:50:27] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:50:34] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:50:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T395241)', diff saved to https://phabricator.wikimedia.org/P77529 and previous config saved to /var/cache/conftool/dbconfig/20250610-135037-fceratto.json [13:50:42] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [13:50:48] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [13:50:50] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [13:51:21] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [13:51:27] 
!log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [13:51:57] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [13:53:34] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:55:30] (03PS1) 10Aklapper: Increase span of Logstash / phlog debugging before reaching threshold [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155234 [13:55:50] !log taavi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1002.eqiad.wmnet [13:55:53] (03CR) 10Aklapper: [V:03+2 C:03+2] Increase span of Logstash / phlog debugging before reaching threshold [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155234 (owner: 10Aklapper) [13:56:27] !log taavi@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet [13:56:32] (03PS2) 10Ladsgroup: [WIP] mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [13:57:13] (03CR) 10Herron: [C:03+1] thanos: enforce series limit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [13:57:29] (03CR) 10Jelto: [C:03+1] "lgtm!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1155204 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [13:57:56] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [13:58:03] (03PS1) 10Máté Szabó: ores: Disable AbuseFilter integration by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155235 (https://phabricator.wikimedia.org/T364705) [13:58:26] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:59:34] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:02:34] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:04:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P77531 and previous config saved to /var/cache/conftool/dbconfig/20250610-140439-marostegui.json [14:04:48] !log taavi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet [14:04:56] (03CR) 10Kosta Harlan: [C:03+1] ores: Disable AbuseFilter integration by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155235 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [14:04:58] (03PS1) 10Ssingh: Release 9.2.10-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1155237 (https://phabricator.wikimedia.org/T390912) [14:05:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5889/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [14:05:10] (03CR) 10Jelto: [C:03+1] "lgtm now" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [14:05:14] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10899884 (10GGoncalves-WMF) 05Open→03Resolved Yep, I was able to run `kinit`, set my password and run it again. Thank you! [14:05:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P77532 and previous config saved to /var/cache/conftool/dbconfig/20250610-140544-fceratto.json [14:08:42] (03PS3) 10Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) [14:08:42] (03PS2) 10Majavah: P:openstack: pdns: auth: Support binding on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) [14:09:59] (03PS1) 10Jforrester: wikifunctions: Configure memcachedUri for the function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155241 (https://phabricator.wikimedia.org/T390746) [14:10:44] (03PS1) 10Fabfur: wikimedia.org: fix previous commit [dns] - 10https://gerrit.wikimedia.org/r/1155242 (https://phabricator.wikimedia.org/T396188) [14:10:51] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5891/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [14:12:04] (03CR) 10Ssingh: [C:03+1] wikimedia.org: fix previous commit [dns] - 10https://gerrit.wikimedia.org/r/1155242 (https://phabricator.wikimedia.org/T396188) (owner: 10Fabfur) [14:12:06] (03PS13) 10Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 
10https://gerrit.wikimedia.org/r/1154303 [14:12:06] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5892/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [14:12:21] jmm@cumin1003 drain-node (PID 998542) is awaiting input [14:12:29] (03CR) 10CI reject: [V:04-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:12:38] (03CR) 10Fabfur: [C:03+2] wikimedia.org: fix previous commit [dns] - 10https://gerrit.wikimedia.org/r/1155242 (https://phabricator.wikimedia.org/T396188) (owner: 10Fabfur) [14:12:44] (03CR) 10CI reject: [V:04-1] Release 9.2.10-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1155237 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh) [14:12:53] !log fabfur@dns1004 START - running authdns-update [14:13:01] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1028.eqiad.wmnet [14:13:04] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [14:13:40] !log fabfur@dns1004 END - running authdns-update [14:13:53] (03PS14) 10Ayounsi: gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [14:16:07] (03CR) 10Stevemunene: [C:03+2] zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [14:16:14] (03CR) 10Ayounsi: "Big thanks to John !!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:16:17] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:19:22] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1028.eqiad.wmnet [14:19:38] (03PS1) 10Lucas Werkmeister (WMDE): Update searchsuggest message key [extensions/Wikibase] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155244 (https://phabricator.wikimedia.org/T396219) [14:19:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T396130)', diff saved to https://phabricator.wikimedia.org/P77533 and previous config saved to /var/cache/conftool/dbconfig/20250610-141946-marostegui.json [14:19:50] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:19:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Wikibase] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155244 (https://phabricator.wikimedia.org/T396219) (owner: 10Lucas Werkmeister (WMDE)) [14:20:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2226.codfw.wmnet with reason: Maintenance [14:20:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T396130)', diff saved to https://phabricator.wikimedia.org/P77534 and previous config saved to /var/cache/conftool/dbconfig/20250610-142009-marostegui.json [14:20:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P77535 and previous config saved to /var/cache/conftool/dbconfig/20250610-142051-fceratto.json [14:21:38] jouncebot: nowandnext [14:21:38] No deployments scheduled for the next 0 hour(s) and 
38 minute(s) [14:21:38] In 0 hour(s) and 38 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1500) [14:23:45] (03PS3) 10Ladsgroup: [WIP] mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [14:24:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T396130)', diff saved to https://phabricator.wikimedia.org/P77536 and previous config saved to /var/cache/conftool/dbconfig/20250610-142410-marostegui.json [14:24:45] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:26:06] (03PS1) 10Herron: pyrra: update o11y slos to 4w window [puppet] - 10https://gerrit.wikimedia.org/r/1155246 (https://phabricator.wikimedia.org/T395916) [14:27:28] (03PS2) 10Cory Massaro: wikifunctions: Configure memcachedUri for the function-orchestrator and enable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155241 (https://phabricator.wikimedia.org/T390746) (owner: 10Jforrester) [14:28:24] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch1063.eqiad.wmnet [14:29:41] !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host install7002.wikimedia.org with OS bullseye [14:29:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10900028 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye executed with err... [14:30:33] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q4): Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10900032 (10herron) >>! 
In T395916#10884971, @elukey wrote: > +1 for the 4w, my only doubt is about backfilling - do we have... [14:33:20] (03PS1) 10Muehlenhoff: Switch install7002 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1155253 [14:35:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T395241)', diff saved to https://phabricator.wikimedia.org/P77537 and previous config saved to /var/cache/conftool/dbconfig/20250610-143558-fceratto.json [14:36:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:36:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [14:36:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T395241)', diff saved to https://phabricator.wikimedia.org/P77538 and previous config saved to /var/cache/conftool/dbconfig/20250610-143623-fceratto.json [14:36:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:36:47] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:36:55] (03CR) 10Muehlenhoff: [C:03+2] Switch install7002 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1155253 (owner: 10Muehlenhoff) [14:37:29] (03CR) 10Ayounsi: "Hmm, not sure why PCC wants to remove half of the magru targets as there is only one hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:37:37] (03CR) 10Hnowlan: [C:03+1] shellbox: align image version to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127188 (https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [14:38:24] PROBLEM - SSH on bast5004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:39:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P77539 and previous config saved to /var/cache/conftool/dbconfig/20250610-143917-marostegui.json [14:39:24] RECOVERY - SSH on bast5004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:40:08] (03PS2) 10Ssingh: Release 9.2.10-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1155237 (https://phabricator.wikimedia.org/T390912) [14:40:17] (03CR) 10Majavah: [C:03+2] openstack: wmcs-wikireplica-dns: Add data for AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1155214 (https://phabricator.wikimedia.org/T396451) (owner: 10Majavah) [14:40:51] (03CR) 10Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) (owner: 10Brouberol) [14:41:42] (03CR) 10Ayounsi: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5893/console" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:42:00] (03PS4) 10Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) [14:42:21] (03CR) 10Ayounsi: [V:03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5894/console" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:42:22] bking@cumin2002 decommission (PID 2467631) is awaiting input [14:43:29] (03PS15) 10Ayounsi: gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [14:43:53] (03PS16) 10Ayounsi: gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [14:43:56] (03CR) 10CI reject: [V:04-1] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:44:17] (03CR) 10CI reject: [V:04-1] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:44:19] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:48:16] (03PS1) 10Aklapper: Penalize on creating tasks with short task titles [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155255 (https://phabricator.wikimedia.org/T396471) [14:49:23] (03CR) 10Aklapper: [V:03+2 C:03+2] Penalize on creating tasks with short task titles [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155255 (https://phabricator.wikimedia.org/T396471) (owner: 10Aklapper) [14:49:32] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch1063.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:49:36] 06SRE, 06Infrastructure-Foundations: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487 (10MoritzMuehlenhoff) 03NEW [14:49:49] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
cirrussearch1063.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:49:50] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cirrussearch1063.eqiad.wmnet [14:49:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10900090 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: `cirrussearch1063.eqiad... [14:50:45] (03CR) 10Jelto: [C:03+1] "lgtm, image builds fine locally" [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1154266 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [14:51:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc1 T378715', diff saved to https://phabricator.wikimedia.org/P77541 and previous config saved to /var/cache/conftool/dbconfig/20250610-145137-marostegui.json [14:51:41] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715 [14:52:47] jmm@cumin1003 reimage (PID 1002669) is awaiting input [14:53:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:53:36] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465#10900103 (10klausman) 05Open→03Resolved a:03klausman SSDs have been enabled and 1002 is using Ceph homedirs. 
[14:53:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:54:10] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [14:54:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P77542 and previous config saved to /var/cache/conftool/dbconfig/20250610-145424-marostegui.json [14:54:29] (03CR) 10Hnowlan: [C:03+2] trafficserver::multi-dc: route PATCH and DELETE to primary DC [puppet] - 10https://gerrit.wikimedia.org/r/1155198 (https://phabricator.wikimedia.org/T387509) (owner: 10Hnowlan) [14:54:32] jouncebot: nowandnext [14:54:32] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [14:54:32] In 0 hour(s) and 5 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1500) [14:55:50] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [14:56:07] (03PS2) 10Tiziano Fogli: monitoring services: add migration task T328502 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155138 (https://phabricator.wikimedia.org/T395443) [14:56:07] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [14:56:50] (03CR) 10Vgutierrez: [C:04-1] "please remove hieradata/hosts/cp7001.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [14:57:59] (03PS1) 10Majavah: Add include for WMCS eqiad1 private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155256 (https://phabricator.wikimedia.org/T379282) [14:58:02] (03CR) 10Filippo Giunchedi: [C:04-1] "Taavi, I forget: is blackbox-exporter at 0.26 on cloud/metricsinfra too ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [14:58:16] !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add wiki replica cloudlb v6 addresses - taavi@cumin1003" [14:58:21] !log taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add wiki replica cloudlb v6 addresses - taavi@cumin1003" [14:58:21] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:22] (03PS2) 10Fabfur: hiera: x-provenance header on all DCs [puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) [14:58:28] (03CR) 10CI reject: [V:04-1] Add include for WMCS eqiad1 private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155256 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:58:30] (03PS3) 10Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) [14:58:30] (03PS1) 10Alexandros Kosiaris: docker_registry: Move rsyslog rules from init to web.pp [puppet] - 10https://gerrit.wikimedia.org/r/1155257 (https://phabricator.wikimedia.org/T390251) [14:58:31] (03PS1) 10Alexandros Kosiaris: docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) [14:58:36] (03CR) 10Majavah: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1155256 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:59:53] (03PS2) 10Majavah: Add include for WMCS eqiad1 private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155256 (https://phabricator.wikimedia.org/T379282) [15:00:04] jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop 
calling each other? A:They had arguments. Rise for SRE Collaboration Services office hours . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1500). [15:00:42] (03CR) 10Majavah: [C:03+2] Add include for WMCS eqiad1 private service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1155256 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:00:48] !log taavi@dns1004 START - running authdns-update [15:00:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10900156 (10bking) @VRiley-WMF Apologies, as it looks like we forgot to run the decom cookbook against `cirrussearch1063`. I ju... [15:01:19] (03CR) 10CI reject: [V:04-1] docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [15:01:29] (03PS17) 10Ayounsi: gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [15:01:36] !log taavi@dns1004 END - running authdns-update [15:01:40] (03PS1) 10Majavah: openstack: mwopenstackclients: Fix ensuring multiple records for name [puppet] - 10https://gerrit.wikimedia.org/r/1155259 [15:01:54] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet [15:02:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [15:02:09] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [15:02:26] (03PS4) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [15:02:48] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host install7002.wikimedia.org with OS bookworm 
[15:02:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10900176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bookworm [15:03:42] (03CR) 10Jelto: "nice addition! The graph for `admin-ng` looks nice. One small comment in-line" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155212 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [15:04:37] (03CR) 10Ayounsi: "From the clever John's `notify { "${netflow_hosts}": }` it looks like the ghost of netflow7001 is still around PCC :" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [15:05:08] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:08] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update [15:07:12] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: test deploy phab2002 for T396490 [15:07:15] T396490: Deploy Phabricator/Phorge 2025-06-10 - https://phabricator.wikimedia.org/T396490 [15:07:18] (03PS3) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [15:07:45] (03CR) 10CDanis: [C:03+1] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [15:07:48] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1001.eqiad.wmnet [15:07:52] !log 
brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: test deploy phab2002 for T396490 (duration: 00m 40s) [15:07:53] (03CR) 10CDanis: [C:03+1] "thanks john!" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [15:08:13] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [15:08:16] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: deploy phab1004 for T396490 [15:08:55] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: deploy phab1004 for T396490 (duration: 00m 39s) [15:09:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T396130)', diff saved to https://phabricator.wikimedia.org/P77543 and previous config saved to /var/cache/conftool/dbconfig/20250610-150931-marostegui.json [15:09:35] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:09:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2238.codfw.wmnet with reason: Maintenance [15:09:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T396130)', diff saved to https://phabricator.wikimedia.org/P77544 and previous config saved to /var/cache/conftool/dbconfig/20250610-150954-marostegui.json [15:10:27] (03CR) 10Ladsgroup: [C:04-1] "I need to run the check the other way around too to make sure we don't accidentally make a public/partially public table a private one eit" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [15:12:30] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1155259 (owner: 10Majavah) [15:14:06] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [15:14:34] (03CR) 10Vgutierrez: [C:03+1] hiera: x-provenance header on all DCs 
[puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [15:14:54] (03CR) 10Jelto: [C:03+2] gitlab: remove artifacts from failover backup [puppet] - 10https://gerrit.wikimedia.org/r/1155146 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:36] (03PS1) 10JHathaway: postfix: add ability to mask Received header IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) [15:21:04] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [15:21:16] (03CR) 10Ssingh: "Ready for review. Please compare against https://github.com/apache/trafficserver/issues/12171 since this is a backport for 9.2.x." 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1155237 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh) [15:22:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T396130)', diff saved to https://phabricator.wikimedia.org/P77545 and previous config saved to /var/cache/conftool/dbconfig/20250610-152243-marostegui.json [15:22:47] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:27:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T395241)', diff saved to https://phabricator.wikimedia.org/P77546 and previous config saved to /var/cache/conftool/dbconfig/20250610-152738-fceratto.json [15:28:39] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [15:28:44] (03CR) 10Andrew Bogott: [C:03+1] openstack: mwopenstackclients: Fix ensuring multiple records for name [puppet] - 10https://gerrit.wikimedia.org/r/1155259 (owner: 10Majavah) [15:29:31] (03CR) 10Btullis: [C:03+1] Convert an-db100[1-2] to dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1155119 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [15:30:24] (03PS3) 10Btullis: Configure dse-k8s-worker101[2-3] with the dse_k8s::worker role [puppet] - 10https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [15:34:40] (03CR) 10JHathaway: "puppet 5 errors can be ignored" [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [15:36:30] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.5 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1155265 (https://phabricator.wikimedia.org/T392175) [15:36:33] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155265 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [15:36:42] (03CR) 10Majavah: [C:03+2] openstack: mwopenstackclients: Fix ensuring multiple records for name [puppet] - 10https://gerrit.wikimedia.org/r/1155259 (owner: 10Majavah) [15:37:26] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155265 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [15:37:49] !log dancy@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.5 refs T392175 [15:37:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P77547 and previous config saved to /var/cache/conftool/dbconfig/20250610-153750-marostegui.json [15:37:53] T392175: 1.45.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T392175 [15:40:56] (03CR) 10Majavah: "Toolforge seems to be using 0.26, but the metricsinfra servers are still on bullseye / 0.18.0+ds-3+b2." 
[puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [15:42:41] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:42:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P77548 and previous config saved to /var/cache/conftool/dbconfig/20250610-154245-fceratto.json [15:43:54] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on install7002.wikimedia.org with reason: host reimage [15:47:04] (03CR) 10Btullis: "Should we add the new nodes to the hieradata/common/kubernetes.yaml as well now, or do this afterwards?" [puppet] - 10https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [15:47:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install7002.wikimedia.org with reason: host reimage [15:52:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:52:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P77549 and previous config saved to /var/cache/conftool/dbconfig/20250610-155257-marostegui.json [15:57:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P77550 and previous config saved to /var/cache/conftool/dbconfig/20250610-155752-fceratto.json [16:00:05] jhathaway and moritzm: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1600). 
[16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:20] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:01:10] PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:02:13] Hi, I keep experiencing "Error: too many requests" when I try to use Phabricator, but I can't find any indication of an outage that would cause this. I restarted my computer and cleared my cache but it keeps happening, and I'm not sure where to report since...I can't file a Phab issue. https://www.irccloud.com/pastebin/Tb3Se0u0/Error%20message [16:03:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install7002.wikimedia.org with OS bookworm [16:03:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10900547 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bookworm completed: - inst... [16:03:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:04:18] tburm: that's the CDN's protection rate limits having kicked in. Are you in a place where you might be sharing your network connectivity with others? 
[16:05:30] (03CR) 10Volans: "small improvement suggested inline" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [16:05:49] or mayhaps you just started your browser and that like triggered 50+ tabs to phab? [16:06:00] it should fix itself in a bit in a such a case [16:07:28] tburm: what browser and version do you use? you might try to update your browser [16:08:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T396130)', diff saved to https://phabricator.wikimedia.org/P77551 and previous config saved to /var/cache/conftool/dbconfig/20250610-160804-marostegui.json [16:08:08] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:08:10] RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:11:40] (03CR) 10Bernard Wang: [C:04-1] "needs to exclude enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) (owner: 10Bernard Wang) [16:12:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [16:12:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T395241)', diff saved to https://phabricator.wikimedia.org/P77552 and previous config saved to /var/cache/conftool/dbconfig/20250610-161258-fceratto.json [16:13:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1253.eqiad.wmnet with reason: Maintenance [16:13:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T395241)', diff saved to https://phabricator.wikimedia.org/P77553 and previous config saved to 
/var/cache/conftool/dbconfig/20250610-161323-fceratto.json [16:13:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:15:19] (03PS3) 10Jdlrobson: Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) (owner: 10Bernard Wang) [16:18:38] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:20:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T395241)', diff saved to https://phabricator.wikimedia.org/P77554 and previous config saved to /var/cache/conftool/dbconfig/20250610-162022-fceratto.json [16:21:52] !log dancy@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.5 refs T392175 (duration: 44m 02s) [16:21:55] T392175: 1.45.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T392175 [16:23:20] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:23:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2207.codfw.wmnet with reason: Maintenance [16:23:49] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:24:55] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:25:22] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10Mail: Add DMarcian trial-account address to the dmarc-ruf@wikimedia.org mailing list - https://phabricator.wikimedia.org/T396062#10900690 (10Jgreen) [16:25:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:25:48] (03PS1) 10Máté Szabó: Set ORESDeveloperSetup to false by default [extensions/ORES] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155276 (https://phabricator.wikimedia.org/T364705) [16:26:18] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10Mail: Add DMarcian trial-account address to the dmarc-ruf@wikimedia.org postfix mailing list - https://phabricator.wikimedia.org/T396062#10900698 (10Jgreen) [16:26:36] jouncebot: nowandnext [16:26:36] For the next 0 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1600) [16:26:36] In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1700) [16:26:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155276 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [16:26:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155235 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [16:27:47] (03Merged) 10jenkins-bot: ores: Disable AbuseFilter integration by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155235 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [16:32:34] (03PS3) 10Jasmine: wikikube: decommission 
wikikube-worker103[23].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1151808 (https://phabricator.wikimedia.org/T383227) [16:32:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:34:02] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:34:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:34:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:34:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T396130)', diff saved to https://phabricator.wikimedia.org/P77555 and previous config saved to /var/cache/conftool/dbconfig/20250610-163458-marostegui.json [16:35:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:35:08] (03CR) 10CI reject: [V:04-1] Set ORESDeveloperSetup to false by default [extensions/ORES] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155276 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [16:35:14] (03PS1) 10Btullis: Add a prometheus connector for thanos in the test presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/1155278 (https://phabricator.wikimedia.org/T347430) [16:35:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P77556 and previous config saved to /var/cache/conftool/dbconfig/20250610-163529-fceratto.json [16:36:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5895/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1155278 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [16:37:00] (03PS1) 10Máté Szabó: tests: Run only defered updates on LinkRecommendationUpdaterTest [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155280 [16:37:11] (03PS2) 10Marostegui: dbstore1007: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155191 (https://phabricator.wikimedia.org/T394373) [16:37:11] (03PS1) 10Marostegui: db1201: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155281 (https://phabricator.wikimedia.org/T395989) [16:37:28] (03PS2) 10Máté Szabó: Set ORESDeveloperSetup to false by default [extensions/ORES] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155276 (https://phabricator.wikimedia.org/T364705) [16:37:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1201 T395989', diff saved to https://phabricator.wikimedia.org/P77557 and previous config saved to /var/cache/conftool/dbconfig/20250610-163742-marostegui.json [16:37:46] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [16:37:53] (03CR) 10TrainBranchBot: "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155276 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [16:37:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155280 (owner: 10Máté Szabó) [16:38:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:39:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:39:34] (03CR) 10Marostegui: [C:03+2] db1201: Migrate to MariaDB 10.11 [puppet] - 
10https://gerrit.wikimedia.org/r/1155281 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [16:39:59] (03CR) 10Marostegui: [C:03+2] dbstore1007: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155191 (https://phabricator.wikimedia.org/T394373) (owner: 10Marostegui) [16:40:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:42:40] (03CR) 10Dwisehaupt: [C:03+1] "I think this looks good. Is there a header checks file example for hosts that would be replaced?" [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [16:42:46] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:48:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77558 and previous config saved to /var/cache/conftool/dbconfig/20250610-164806-root.json [16:48:54] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003*relforg1004* for testtesttest - bking@cumin2002 - T390565 [16:48:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: relforge1003*relforg1004* for testtesttest - bking@cumin2002 - T390565 [16:48:58] T390565: decommission relforge100[34] - https://phabricator.wikimedia.org/T390565 [16:49:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003*,relforge1004* for testtesttest - bking@cumin2002 - T390565 [16:49:06] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003*,relforge1004* for testtesttest - bking@cumin2002 - T390565 [16:49:08] (03PS4) 10Jdlrobson: Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) (owner: 10Bernard Wang) [16:49:29] (03PS2) 10JHathaway: postfix: add ability to mask Received header IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) [16:49:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T396130)', diff saved to https://phabricator.wikimedia.org/P77559 and previous config saved to /var/cache/conftool/dbconfig/20250610-164930-marostegui.json [16:49:34] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:50:03] (03CR) 10JHathaway: "I just added an example to the commit message, or were you asking for something else?" [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [16:50:26] (03Merged) 10jenkins-bot: tests: Run only defered updates on LinkRecommendationUpdaterTest [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155280 (owner: 10Máté Szabó) [16:50:27] (03Merged) 10jenkins-bot: Set ORESDeveloperSetup to false by default [extensions/ORES] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155276 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [16:50:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P77560 and previous config saved to /var/cache/conftool/dbconfig/20250610-165036-fceratto.json [16:50:59] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1155276|Set ORESDeveloperSetup to false by default (T364705)]], [[gerrit:1155235|ores: Disable AbuseFilter integration by default (T364705)]], [[gerrit:1155280|tests: Run only defered updates on LinkRecommendationUpdaterTest]] [16:51:03] T364705: Provide AbuseFilter condition for revertrisk threshold - 
https://phabricator.wikimedia.org/T364705 [16:52:41] !log mszabo@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.45.0-wmf.4,1.45.0-wmf.5 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.w [16:52:41] mnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.172.0 --label vnd.wikimedia.mediawiki.versions=1.45.0-wmf.4,1.45.0-wmf.5 --label vnd.wikimedia [16:52:42] .scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.172.0) (duration: 01m 41s) [16:53:29] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:45] (03CR) 10Dwisehaupt: "Thanks. That example is good. Looks good to me for moving ahead." [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [16:53:50] E: Failed to fetch http://apt.wikimedia.org/wikimedia/pool/component/php81/t/tideways/php8.1-tideways_5.0.4-16%2bwmf11u1_amd64.deb Could not connect to webproxy:8080 (208.80.154.74), connection timed out [16:53:51] fun [16:54:20] mszabo: I recommend re-running the backport. 
[16:54:46] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:54:53] yeah doing that already [16:55:05] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1155276|Set ORESDeveloperSetup to false by default (T364705)]], [[gerrit:1155235|ores: Disable AbuseFilter integration by default (T364705)]], [[gerrit:1155280|tests: Run only defered updates on LinkRecommendationUpdaterTest]] [16:55:54] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [16:58:34] 10ops-codfw, 06DC-Ops: Alert for device lsw1-c5-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396506 (10phaultfinder) 03NEW [16:59:17] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1155276|Set ORESDeveloperSetup to false by default (T364705)]], [[gerrit:1155235|ores: Disable AbuseFilter integration by default (T364705)]], [[gerrit:1155280|tests: Run only defered updates on LinkRecommendationUpdaterTest]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:59:21] T364705: Provide AbuseFilter condition for revertrisk threshold - https://phabricator.wikimedia.org/T364705 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1700) [17:01:13] !log mszabo@deploy1003 mszabo: Continuing with sync [17:03:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77561 and previous config saved to /var/cache/conftool/dbconfig/20250610-170312-root.json [17:04:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P77562 and previous config saved to /var/cache/conftool/dbconfig/20250610-170437-marostegui.json [17:04:57] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:05:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T395241)', diff saved to https://phabricator.wikimedia.org/P77563 and previous config saved to /var/cache/conftool/dbconfig/20250610-170543-fceratto.json [17:07:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:09:38] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:09:59] (03PS1) 10Bking: cirrussearch: remove references to defunct elastic hosts, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1155288 (https://phabricator.wikimedia.org/T388610) [17:10:11] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155276|Set ORESDeveloperSetup to false by default (T364705)]], [[gerrit:1155235|ores: Disable AbuseFilter integration by default (T364705)]], [[gerrit:1155280|tests: Run only defered updates on LinkRecommendationUpdaterTest]] (duration: 15m 06s) [17:10:15] T364705: Provide AbuseFilter condition for revertrisk threshold - https://phabricator.wikimedia.org/T364705 [17:10:17] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155288 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:14:14] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:14:16] (03PS2) 10Bking: cirrussearch: remove references to defunct elastic hosts, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1155288 (https://phabricator.wikimedia.org/T388610) [17:15:17] RESOLVED: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:15:27] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, and 2 others: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10900974 (10BCornwall) Fun little tidbit: Our power consumption lowered after increasing the fan speeds in magru {F62284645} [17:15:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155288 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:16:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:17:18] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1155289 (https://phabricator.wikimedia.org/T396509) [17:18:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77564 and previous config saved to /var/cache/conftool/dbconfig/20250610-171817-root.json [17:19:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P77565 and previous config saved to 
/var/cache/conftool/dbconfig/20250610-171943-marostegui.json [17:24:31] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:25:08] (03CR) 10Joal: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1155278 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [17:26:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:27:33] (03PS18) 10Ayounsi: gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [17:27:50] (03CR) 10Ayounsi: gNMI: spread targets on multiple netflow hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [17:28:29] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:30:43] (03CR) 10Btullis: [C:03+1] cirrussearch: remove references to defunct elastic hosts, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1155288 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:30:44] (03PS4) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [17:30:44] (03PS1) 10CDobbins: fix if statement [puppet] - 10https://gerrit.wikimedia.org/r/1155293 [17:33:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77566 and previous config 
saved to /var/cache/conftool/dbconfig/20250610-173322-root.json [17:34:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T396130)', diff saved to https://phabricator.wikimedia.org/P77567 and previous config saved to /var/cache/conftool/dbconfig/20250610-173450-marostegui.json [17:34:56] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:35:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [17:35:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T396130)', diff saved to https://phabricator.wikimedia.org/P77568 and previous config saved to /var/cache/conftool/dbconfig/20250610-173514-marostegui.json [17:38:26] akosiaris: yes I am at a coworking space so sharing network connectivity with others [17:38:29] RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:46:27] (03PS1) 10Esanders: Enable DiscussionTools visual enhancements everywhere except 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155295 (https://phabricator.wikimedia.org/T392121) [17:47:36] (03PS2) 10Esanders: Enable DiscussionTools visual enhancements everywhere except 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155295 (https://phabricator.wikimedia.org/T392121) [17:47:36] (03PS1) 10CDobbins: testing change's effects [puppet] - 10https://gerrit.wikimedia.org/r/1155296 [17:47:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:48:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77569 and previous config saved to /var/cache/conftool/dbconfig/20250610-174828-root.json [17:49:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155295 (https://phabricator.wikimedia.org/T392121) (owner: 10Esanders) [17:49:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T396130)', diff saved to https://phabricator.wikimedia.org/P77570 and previous config saved to /var/cache/conftool/dbconfig/20250610-174944-marostegui.json [17:49:48] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:56:11] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [17:58:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:59:03] (03CR) 10JHathaway: [C:03+2] postfix: add ability to mask Received header IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155263 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [18:00:04] thcipriani and thcipriani: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T1800). 
[18:00:53] uhhh [18:00:58] I forgot something [18:01:20] o/ [18:01:22] heh [18:01:42] hrm, I wonder if I assigned this task after the calendar tool already made the deploy windows [18:01:43] * brennen all expectantly waiting for the ping [18:01:57] I'll fix the rest of them :) [18:01:59] yeah, probably. i was a late shuffle to this slot. [18:02:30] well, glad you were waiting for the ping, rather than not expecting the ping :) [18:03:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77571 and previous config saved to /var/cache/conftool/dbconfig/20250610-180333-root.json [18:04:00] (03CR) 10Bking: [C:03+2] cirrussearch: remove references to defunct elastic hosts, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1155288 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:04:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P77572 and previous config saved to /var/cache/conftool/dbconfig/20250610-180451-marostegui.json [18:04:53] (03PS1) 10TChin: [eventgate-analytics-external] bump version v1.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155298 (https://phabricator.wikimedia.org/T391959) [18:05:30] (03PS1) 10Bartosz Dziewoński: Set $wgPHPSessionHandling to 'disable' on testwiki and beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155299 (https://phabricator.wikimedia.org/T362324) [18:06:05] (03CR) 10Bartosz Dziewoński: [C:03+1] "Sure, let's start with that and try to increase our confidence. 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1155299" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [18:06:51] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155301 (https://phabricator.wikimedia.org/T392175) [18:06:52] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155301 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:07:39] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155301 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:13:48] (03PS1) 10Bartosz Dziewoński: Stop logging $wgPHPSessionHandling warnings for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155303 (https://phabricator.wikimedia.org/T393963) [18:16:42] (03PS1) 10JHathaway: postfix: mask crm2001 receive header [puppet] - 10https://gerrit.wikimedia.org/r/1155304 (https://phabricator.wikimedia.org/T383715) [18:16:54] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155304 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [18:17:17] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.5 refs T392175 [18:17:21] T392175: 1.45.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T392175 [18:18:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:19:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P77573 and previous config saved to /var/cache/conftool/dbconfig/20250610-181958-marostegui.json [18:24:44] (03PS1) 10AOkoth: wmnet: switch active doc host 
[dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) [18:32:54] (03CR) 10Umherirrender: "It is very appreciated when you can take or make the deployment step. It is okay to wait another week for deployment. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [18:35:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T396130)', diff saved to https://phabricator.wikimedia.org/P77574 and previous config saved to /var/cache/conftool/dbconfig/20250610-183505-marostegui.json [18:35:10] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [18:35:14] (03CR) 10JHathaway: [C:03+2] postfix: mask crm2001 receive header [puppet] - 10https://gerrit.wikimedia.org/r/1155304 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [18:35:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [18:35:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T396130)', diff saved to https://phabricator.wikimedia.org/P77575 and previous config saved to /var/cache/conftool/dbconfig/20250610-183528-marostegui.json [18:37:38] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10901407 (10herron) If I'm understanding correctly "WDQS update lag" was replaced by "Search update lag" which looks healthy but it seems we've hit a bug wher... 
[18:39:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T396130)', diff saved to https://phabricator.wikimedia.org/P77576 and previous config saved to /var/cache/conftool/dbconfig/20250610-183919-marostegui.json [18:39:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:40:52] (03CR) 10DLynch: [C:03+1] Enable DiscussionTools visual enhancements everywhere except 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155295 (https://phabricator.wikimedia.org/T392121) (owner: 10Esanders) [18:45:58] (03CR) 10Dr0ptp4kt: [C:03+2] [eventgate-analytics-external] bump version v1.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155298 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin) [18:47:30] (03Merged) 10jenkins-bot: [eventgate-analytics-external] bump version v1.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155298 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin) [18:49:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:49:58] (03CR) 10Krinkle: multiversion: Document how it all works (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [18:50:20] (03PS2) 10Krinkle: multiversion: Document how it all works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) [18:50:23] (03PS2) 10Krinkle: multivesion: Remove unused newFromDBName() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154139 [18:50:28] (03PS4) 10Krinkle: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 
(https://phabricator.wikimedia.org/T289318) [18:54:02] (03CR) 10Jforrester: [C:03+1] "Deploy away!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [18:54:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P77577 and previous config saved to /var/cache/conftool/dbconfig/20250610-185426-marostegui.json [18:59:09] (03PS1) 10Bking: cirrus streaming updater staging: replace decom'd hosts in net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155315 (https://phabricator.wikimedia.org/T390565) [18:59:12] 06SRE, 10SRE-SLO, 10Observability-Metrics: Pyrra detail grafana dashboard contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10901452 (10herron) 05Open→03Resolved a:03herron Since we've addressed the misleading panels I think we're ok to resolve. There will... [19:00:04] (03PS2) 10Bking: cirrus streaming updater staging: replace decom'd hosts in net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155315 (https://phabricator.wikimedia.org/T390565) [19:01:00] 10SRE-SLO: Add a section to the SLO template that explains SLO windows, and Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920#10901477 (10herron) [19:03:42] (03PS1) 10Herron: profile::pyrra::filesystem::slos::istio: default to 4w [puppet] - 10https://gerrit.wikimedia.org/r/1155316 (https://phabricator.wikimedia.org/T395916) [19:08:46] (03CR) 10Jdlrobson: [C:03+1] Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) (owner: 10Bernard Wang) [19:09:10] (03PS3) 10D3r1ck01: multiversion: Document how it all works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 
(https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [19:09:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P77578 and previous config saved to /var/cache/conftool/dbconfig/20250610-190934-marostegui.json [19:10:28] (03CR) 10D3r1ck01: [C:03+1] "PS3 fixes some typos but otherwise looks good. Thanks so much for these docs Krinkle. They're very helpful to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [19:10:37] (03PS3) 10Krinkle: multivesion: Remove unused newFromDBName() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154139 [19:10:45] (03PS5) 10Krinkle: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) [19:16:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) (owner: 10SD0001) [19:19:10] (03CR) 10Brouberol: "Oh right, I completely forgot. Again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [19:21:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:24:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T396130)', diff saved to https://phabricator.wikimedia.org/P77579 and previous config saved to /var/cache/conftool/dbconfig/20250610-192441-marostegui.json [19:24:45] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [19:24:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [19:25:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T396130)', diff saved to https://phabricator.wikimedia.org/P77580 and previous config saved to /var/cache/conftool/dbconfig/20250610-192503-marostegui.json [19:26:42] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Drop unused php_fpm* config parameters [puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) [19:28:28] (03CR) 10CI reject: [V:04-1] scap.cfg.erb: Drop unused php_fpm* config parameters [puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [19:28:39] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:28:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T396130)', diff saved to https://phabricator.wikimedia.org/P77581 
and previous config saved to /var/cache/conftool/dbconfig/20250610-192856-marostegui.json [19:29:35] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Drop unused php_fpm* config parameters [puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) [19:31:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:38:23] (03CR) 10Jgreen: [C:03+1] Add civi.frdev.wm.o cname pointing at frdev host [dns] - 10https://gerrit.wikimedia.org/r/1154895 (https://phabricator.wikimedia.org/T396084) (owner: 10Dwisehaupt) [19:38:56] (03PS3) 10Dwisehaupt: Add civi.frdev.wm.o cname pointing at frdev host [dns] - 10https://gerrit.wikimedia.org/r/1154895 (https://phabricator.wikimedia.org/T396084) [19:39:54] (03CR) 10Dwisehaupt: [C:03+2] Add civi.frdev.wm.o cname pointing at frdev host [dns] - 10https://gerrit.wikimedia.org/r/1154895 (https://phabricator.wikimedia.org/T396084) (owner: 10Dwisehaupt) [19:40:13] !log dwisehaupt@dns1004 START - running authdns-update [19:41:03] !log dwisehaupt@dns1004 END - running authdns-update [19:44:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P77582 and previous config saved to /var/cache/conftool/dbconfig/20250610-194403-marostegui.json [19:46:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10901655 (10VRiley-WMF) [19:48:11] (03CR) 10BCornwall: "Thanks for the patch! It looks good save for a few missing comment changes. 
I'm a little concerned that there are no tests for this behavi" [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins) [19:48:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10901656 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This is complete [19:49:28] (03CR) 10BCornwall: [C:04-1] "You'll want to rebase/squash this into the previous CR rather than submit another!" [puppet] - 10https://gerrit.wikimedia.org/r/1155293 (owner: 10CDobbins) [19:50:57] (03CR) 10D3r1ck01: multivesion: Remove unused newFromDBName() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154139 (owner: 10Krinkle) [19:51:22] (03CR) 10BCornwall: [C:03+1] wmnet: switch active doc host [dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [19:59:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P77583 and previous config saved to /var/cache/conftool/dbconfig/20250610-195910-marostegui.json [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T2000). [20:00:05] bwang and sd0001: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:01:05] Gonna deploy the first one now [20:01:56] o/ [20:01:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) (owner: 10Bernard Wang) [20:02:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:03:01] toyofuku: im around for testing [20:03:12] Sounds good! Thank you [20:03:28] (03Merged) 10jenkins-bot: Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154057 (https://phabricator.wikimedia.org/T395344) (owner: 10Bernard Wang) [20:03:32] Hopefully you are not doing too much and getting ready to unwind for the next two months though 👁️ [20:03:53] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1154057|Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage (T395344 T395339)]] [20:03:58] T395344: Deploy Vector empty search recoms to all wikis - https://phabricator.wikimedia.org/T395344 [20:03:58] T395339: Deploy mobile search suggestions to group 1 wikis - https://phabricator.wikimedia.org/T395339 [20:06:03] !log toyofuku@deploy1003 bwang, toyofuku: Backport for [[gerrit:1154057|Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage (T395344 T395339)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:06:30] Testing now! [20:08:50] toyofuku: lgtm [20:09:13] Same - verified no english, spanish desktop, and italian mobile [20:09:18] Proceeding! 
[20:09:21] !log toyofuku@deploy1003 bwang, toyofuku: Continuing with sync [20:09:57] (03PS1) 10Brennen Bearnes: gitlab runners: update buildkitd to v0.22.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) [20:10:06] The spanish wikipedia article for Kenpachi Zaraki is LONG: https://es.wikipedia.org/wiki/Kenpachi_Zaraki [20:10:55] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1185 [20:11:02] !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1185 [20:11:45] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1185 [20:11:51] !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1185 [20:12:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:12:57] Speaking of which, stream YOASOBI: https://open.spotify.com/track/2F9iSs6DypVQu26t5uaeFM [20:14:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T396130)', diff saved to https://phabricator.wikimedia.org/P77584 and previous config saved to /var/cache/conftool/dbconfig/20250610-201418-marostegui.json [20:14:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [20:14:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [20:14:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T396130)', diff saved to https://phabricator.wikimedia.org/P77585 and previous config saved to /var/cache/conftool/dbconfig/20250610-201441-marostegui.json [20:15:21] !log tchin@deploy1003 helmfile [staging] START 
helmfile.d/services/eventgate-analytics-external: apply [20:15:47] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [20:16:13] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10901815 (10SKivlehan-WMF) Hello! Thank you all for the assistance on this ticket -- I don't seem to have access to Turnilo, attempting to login return... [20:16:15] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:16:54] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154057|Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage (T395344 T395339)]] (duration: 13m 01s) [20:16:58] T395344: Deploy Vector empty search recoms to all wikis - https://phabricator.wikimedia.org/T395344 [20:16:59] T395339: Deploy mobile search suggestions to group 1 wikis - https://phabricator.wikimedia.org/T395339 [20:17:06] All set! 
Thanks everyone [20:17:27] sd0001: all yours [20:17:37] (03CR) 10Ahmon Dancy: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) (owner: 10Brennen Bearnes) [20:18:36] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [20:19:29] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [20:19:31] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1185 - vriley@cumin1002" [20:19:36] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1185 - vriley@cumin1002" [20:19:36] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:19:47] sd0001: are you able to self-deploy or do you need a deployer? 
[20:20:09] cjming: yeah, I don't have access [20:20:20] np [20:20:38] (03PS3) 10SD0001: Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) [20:20:49] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:21:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) (owner: 10SD0001) [20:22:07] (03Merged) 10jenkins-bot: Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) (owner: 10SD0001) [20:22:31] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1144484|Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags (T393872)]] [20:22:35] T393872: Make weighted tags no longer be WMF-specific - https://phabricator.wikimedia.org/T393872 [20:22:40] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [20:23:21] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [20:24:06] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1185 - vriley@cumin1002" [20:24:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1185 - vriley@cumin1002" [20:24:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:24:44] !log cjming@deploy1003 cjming, sd: Backport for [[gerrit:1144484|Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags (T393872)]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:25:02] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:25:08] sd0001: testable? ok to sync? [20:25:22] let me check [20:25:52] working fine [20:26:02] cool - syncing [20:26:08] !log cjming@deploy1003 cjming, sd: Continuing with sync [20:27:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T396130)', diff saved to https://phabricator.wikimedia.org/P77586 and previous config saved to /var/cache/conftool/dbconfig/20250610-202713-marostegui.json [20:27:17] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [20:29:01] vriley@cumin1002 provision (PID 474103) is awaiting input [20:32:32] (03PS1) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) [20:32:50] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144484|Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags (T393872)]] (duration: 10m 18s) [20:32:54] sd0001: should be live :) [20:32:55] T393872: Make weighted tags no longer be WMF-specific - https://phabricator.wikimedia.org/T393872 [20:33:01] (03CR) 10CI reject: [V:04-1] hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [20:33:32] cjming: thanks ;) [20:39:10] (03CR) 10Bking: [C:03+2] cirrus streaming updater staging: replace decom'd hosts in net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155315 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [20:39:24] (03CR) 10Bking: [C:03+2] "self-merging, as this only affects a 
staging environment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155315 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [20:40:03] (03PS2) 10Kamila Součková: Add fake hcaptcha proxy secrets. [labs/private] - 10https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T381265) [20:40:39] (03Merged) 10jenkins-bot: cirrus streaming updater staging: replace decom'd hosts in net policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155315 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [20:42:18] (03PS2) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) [20:42:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P77587 and previous config saved to /var/cache/conftool/dbconfig/20250610-204220-marostegui.json [20:42:46] (03CR) 10CI reject: [V:04-1] hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [20:50:22] (03PS3) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) [20:51:19] (03PS1) 10Clare Ming: xLab: Deploy v0.6.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155328 (https://phabricator.wikimedia.org/T396045) [20:52:11] (03CR) 10CI reject: [V:04-1] hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [20:52:41] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.6.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155328 (https://phabricator.wikimedia.org/T396045) (owner: 10Clare Ming) [20:54:07] (03Merged) 10jenkins-bot: xLab: Deploy 
v0.6.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155328 (https://phabricator.wikimedia.org/T396045) (owner: 10Clare Ming) [20:55:25] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:55:26] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:55:27] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:55:51] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:55:52] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:56:20] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:57:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P77588 and previous config saved to /var/cache/conftool/dbconfig/20250610-205727-marostegui.json [20:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250610T2100) [21:08:17] (03PS4) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) [21:10:05] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [21:12:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T396130)', diff saved to https://phabricator.wikimedia.org/P77590 and previous config saved to /var/cache/conftool/dbconfig/20250610-211234-marostegui.json [21:12:38] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [21:12:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [21:14:41] (03CR) 10Jdlrobson: [C:03+1] "@mmuhlenhoff@wikimedia.org you can merge this (ideally tomorrow when I definitely won't need it) but if you need me around today to test a" [puppet] - 10https://gerrit.wikimedia.org/r/1152307 (owner: 10Jdlrobson) [21:17:17] (03PS5) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) [21:17:35] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [21:21:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 11 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155299 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński) [21:21:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155303 (https://phabricator.wikimedia.org/T393963) (owner: 10Bartosz Dziewoński) [21:22:32] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [21:23:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [21:23:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T396130)', diff saved to https://phabricator.wikimedia.org/P77591 and previous config saved to /var/cache/conftool/dbconfig/20250610-212332-marostegui.json [21:23:36] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [21:26:46] (03PS1) 10Catrope: Fixes TypeError: undefined is not an object (evaluating 'sources.map') [extensions/TimedMediaHandler] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155331 (https://phabricator.wikimedia.org/T396370) [21:27:00] (03PS1) 10Catrope: Fixes TypeError: undefined is not an object (evaluating 'sources.map') [extensions/TimedMediaHandler] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1155332 (https://phabricator.wikimedia.org/T396370) [21:27:13] I'm gonna take advantage of the Web window to deploy this TMH fix --^^ [21:27:21] (03PS6) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) [21:27:28] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1229 (T396130)', diff saved to https://phabricator.wikimedia.org/P77592 and previous config saved to /var/cache/conftool/dbconfig/20250610-212727-marostegui.json [21:27:34] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [21:28:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [extensions/TimedMediaHandler] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155331 (https://phabricator.wikimedia.org/T396370) (owner: 10Catrope) [21:28:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [extensions/TimedMediaHandler] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1155332 (https://phabricator.wikimedia.org/T396370) (owner: 10Catrope) [21:38:10] (03PS1) 10Ryan Kemper: wdqs: fork SLOs for wdqs-main and wdqs-scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) [21:38:34] (03Merged) 10jenkins-bot: Fixes TypeError: undefined is not an object (evaluating 'sources.map') [extensions/TimedMediaHandler] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155331 (https://phabricator.wikimedia.org/T396370) (owner: 10Catrope) [21:39:29] (03Merged) 10jenkins-bot: Fixes TypeError: undefined is not an object (evaluating 'sources.map') [extensions/TimedMediaHandler] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1155332 (https://phabricator.wikimedia.org/T396370) (owner: 10Catrope) [21:39:56] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1155331|Fixes TypeError: undefined is not an object (evaluating 'sources.map') (T396370)]], [[gerrit:1155332|Fixes TypeError: undefined is not an object (evaluating 'sources.map') (T396370)]] [21:39:59] T396370: Fixes TypeError: undefined is not an object (evaluating 'sources.map') - 
https://phabricator.wikimedia.org/T396370 [21:41:20] (03CR) 10Ryan Kemper: "@kherron@wikimedia.org : the previous wdqs availability & update lag SLOs are splitting into two. does this look right?" [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [21:42:01] !log catrope@deploy1003 catrope: Backport for [[gerrit:1155331|Fixes TypeError: undefined is not an object (evaluating 'sources.map') (T396370)]], [[gerrit:1155332|Fixes TypeError: undefined is not an object (evaluating 'sources.map') (T396370)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:42:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P77593 and previous config saved to /var/cache/conftool/dbconfig/20250610-214234-marostegui.json [21:43:56] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10902127 (10RKemper) 05Open→03In progress https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155335 should fix this. The metrics... 
[21:44:16] !log catrope@deploy1003 catrope: Continuing with sync [21:50:07] (03PS7) 10Kamila Součková: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) [21:50:44] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155325 (https://phabricator.wikimedia.org/T381265) (owner: 10Kamila Součková) [21:51:16] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155331|Fixes TypeError: undefined is not an object (evaluating 'sources.map') (T396370)]], [[gerrit:1155332|Fixes TypeError: undefined is not an object (evaluating 'sources.map') (T396370)]] (duration: 11m 20s) [21:51:20] T396370: Fixes TypeError: undefined is not an object (evaluating 'sources.map') - https://phabricator.wikimedia.org/T396370 [21:57:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P77594 and previous config saved to /var/cache/conftool/dbconfig/20250610-215741-marostegui.json [22:08:41] 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10902170 (10Eevans) @Jgiannelos: Are you the right one to talk to about the PCS storage transition? I'd like to get a bit more headroom if possible, the `pregenerated_cache` keyspace is now [[ https://grafana-... 
[22:12:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T396130)', diff saved to https://phabricator.wikimedia.org/P77595 and previous config saved to /var/cache/conftool/dbconfig/20250610-221248-marostegui.json [22:12:53] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [22:13:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [22:13:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T396130)', diff saved to https://phabricator.wikimedia.org/P77596 and previous config saved to /var/cache/conftool/dbconfig/20250610-221311-marostegui.json [22:24:35] (03CR) 10Herron: "looks good overall to me, left a couple questions and a suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [22:25:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T396130)', diff saved to https://phabricator.wikimedia.org/P77597 and previous config saved to /var/cache/conftool/dbconfig/20250610-222532-marostegui.json [22:25:38] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [22:27:17] (03PS5) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [22:40:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P77598 and previous config saved to /var/cache/conftool/dbconfig/20250610-224039-marostegui.json [22:42:33] (03CR) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 
(https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [22:42:59] (03PS1) 10Ladsgroup: tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581) [22:55:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P77599 and previous config saved to /var/cache/conftool/dbconfig/20250610-225546-marostegui.json [23:10:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T396130)', diff saved to https://phabricator.wikimedia.org/P77600 and previous config saved to /var/cache/conftool/dbconfig/20250610-231053-marostegui.json [23:10:58] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [23:11:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [23:11:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [23:11:52] (03Merged) 10jenkins-bot: multiversion: Document how it all works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [23:12:18] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1154136|multiversion: Document how it all works (T289318)]] [23:12:21] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [23:14:26] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1154136|multiversion: Document how it all works (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:18:15] !log krinkle@deploy1003 krinkle: Continuing with sync [23:21:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1254.eqiad.wmnet with reason: Maintenance [23:22:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T396130)', diff saved to https://phabricator.wikimedia.org/P77602 and previous config saved to /var/cache/conftool/dbconfig/20250610-232206-marostegui.json [23:22:10] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [23:24:44] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [23:24:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:25:14] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154136|multiversion: Document how it all works (T289318)]] (duration: 12m 56s) [23:25:17] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [23:26:19] (03PS1) 10Cwhite: add ecs validator tool v1 [software/ecs] - 10https://gerrit.wikimedia.org/r/1155342 (https://phabricator.wikimedia.org/T395819) [23:28:34] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [23:34:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T396130)', diff saved to 
https://phabricator.wikimedia.org/P77603 and previous config saved to /var/cache/conftool/dbconfig/20250610-233427-marostegui.json [23:34:33] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [23:38:16] (03CR) 10Krinkle: multivesion: Remove unused newFromDBName() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154139 (owner: 10Krinkle) [23:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1155343 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1155343 (owner: 10TrainBranchBot) [23:39:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [23:39:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:49:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P77604 and previous config saved to /var/cache/conftool/dbconfig/20250610-234934-marostegui.json [23:49:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1155343 (owner: 10TrainBranchBot)