[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0000) [00:16:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:21:00] (03CR) 10Jeena Huneidi: "recheck" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114445 (owner: 10TrainBranchBot) [00:30:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114445 (owner: 10TrainBranchBot) [00:38:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114475 [00:38:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114475 (owner: 10TrainBranchBot) [01:02:06] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114475 (owner: 10TrainBranchBot) [01:05:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10499341 (10Papaul) @VRiley-WMF not yet we have to work on this tomorrow. 
[01:08:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114478 [01:08:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114478 (owner: 10TrainBranchBot) [01:09:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10499343 (10phaultfinder) [01:17:00] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-fe1014 - https://phabricator.wikimedia.org/T384297#10499346 (10Papaul) 05Open→03Resolved a:03Papaul closing this since we have T384317 [01:21:26] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114478 (owner: 10TrainBranchBot) [01:30:49] (03CR) 10Bartosz Dziewoński: "Build failure is unrelated, caused by build failure on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1112730, which prevented it from " [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114478 (owner: 10TrainBranchBot) [01:40:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10499352 (10Papaul) Create Dispatch: Service Tag: JJ3ZWP3 [02:08:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.14 [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114483 (https://phabricator.wikimedia.org/T382365) [02:08:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.14 [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114483 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [02:19:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:20:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:25:27] (03Merged) 10jenkins-bot: Branch commit for 
wmf/1.44.0-wmf.14 [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114483 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10499403 (10phaultfinder) [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0300) [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:38] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:23] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T384892 (10phaultfinder) 03NEW [03:23:38] FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:28:38] RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:38:41] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (203889s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [03:46:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0400) [04:02:09] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114488 (https://phabricator.wikimedia.org/T382365) [04:02:10] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114488 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [04:02:56] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114488 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [04:03:24] !log mwpresync@deploy2002 Started scap 
sync-world: testwikis to 1.44.0-wmf.14 refs T382365 [04:03:28] T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365 [04:11:15] (03CR) 10AikoChou: [C:03+2] "Thanks for the review!!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [04:12:46] (03Merged) 10jenkins-bot: ml-services: update reference-quality storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [04:16:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:17:53] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [04:50:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 218, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:58:24] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0500) [05:03:02] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.14 refs T382365 (duration: 59m 38s) [05:03:06] T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365 [05:04:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:06:27] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.11 (duration: 06m 25s) [05:11:52] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [05:17:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:17:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:35:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:35:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, 
excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:02:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1183.eqiad.wmnet with reason: Maintenance [06:12:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205 T384807', diff saved to https://phabricator.wikimedia.org/P72555 and previous config saved to /var/cache/conftool/dbconfig/20250128-061230-marostegui.json [06:12:35] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [06:16:18] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Index rebuild [06:16:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:18:42] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1175.eqiad.wmnet [06:19:25] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2205.codfw.wmnet [06:25:17] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1175.eqiad.wmnet 
[06:25:24] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2205.codfw.wmnet [06:25:52] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Index rebuild [06:25:56] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Index rebuild [06:28:27] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:28:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175 T384807', diff saved to https://phabricator.wikimedia.org/P72556 and previous config saved to /var/cache/conftool/dbconfig/20250128-062846-marostegui.json [06:28:53] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [06:33:27] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:39:57] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:40:01] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:50:13] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0700) [07:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. 
You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0700). [07:08:38] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2213.codfw.wmnet with reason: Maintenance [07:25:16] (03PS1) 10Marostegui: es1024: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1114640 (https://phabricator.wikimedia.org/T384820) [07:26:01] (03CR) 10Marostegui: [C:03+2] es1024: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1114640 (https://phabricator.wikimedia.org/T384820) (owner: 10Marostegui) [07:27:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es1024 from dbctl T384820', diff saved to https://phabricator.wikimedia.org/P72557 and previous config saved to /var/cache/conftool/dbconfig/20250128-072707-root.json [07:27:13] T384820: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820 [07:29:44] (03PS1) 10Marostegui: mariadb: Remove es1024 [puppet] - 10https://gerrit.wikimedia.org/r/1114642 (https://phabricator.wikimedia.org/T384820) [07:29:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1024.eqiad.wmnet [07:30:53] (03CR) 10Marostegui: [C:03+2] mariadb: Remove es1024 [puppet] - 10https://gerrit.wikimedia.org/r/1114642 (https://phabricator.wikimedia.org/T384820) (owner: 10Marostegui) [07:34:41] (03PS1) 10Marostegui: Revert "db2203: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114643 [07:35:05] (03PS1) 10Marostegui: Revert "db2207,db2148: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114644 [07:35:19] (03CR) 10Marostegui: [C:03+2] Revert "db2203: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114643 (owner: 10Marostegui) [07:35:47] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [07:35:52] (03CR) 10Marostegui: [C:03+2] Revert "db2207,db2148: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114644 (owner: 10Marostegui) [07:37:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:47:25] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2020.codfw.wmnet with reason: remove from cluster for reimage [07:47:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499612 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e9f62dcb-2ecf-4d32-84ca-34c181e86093) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[07:48:01] RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 77%, RTA = 0.32 ms [07:50:57] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [07:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [07:51:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [07:51:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:51:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1024.eqiad.wmnet [07:52:14] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10499614 (10Marostegui) a:05Marostegui→03None [07:52:25] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10499619 (10Marostegui) [07:52:46] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10499621 (10Marostegui) This is ready for #dc-ops [07:54:25] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:54:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2020.codfw.wmnet with OS bookworm [07:54:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - 
https://phabricator.wikimedia.org/T382508#10499624 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bookworm [07:56:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [07:56:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:56:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72558 and previous config saved to /var/cache/conftool/dbconfig/20250128-075636-marostegui.json [07:56:42] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [07:56:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [07:57:02] (03CR) 10Slyngshede: [C:03+2] Upgrade to CAS 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1114388 (owner: 10Slyngshede) [07:57:11] !log slyngshede@dns1004 START - running authdns-update [07:57:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499629 (10ops-monitoring-bot) Draining ganeti2026.codfw.wmnet of running VMs [07:58:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2155 T382842', diff saved to https://phabricator.wikimedia.org/P72559 and previous config saved to /var/cache/conftool/dbconfig/20250128-075857-marostegui.json [07:59:00] !log slyngshede@dns1004 END - running authdns-update [07:59:03] T382842: Upgrade to 10.6.20 and rebuild recentchanges and pagelinks tables - https://phabricator.wikimedia.org/T382842 [07:59:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node 
ganeti2026.codfw.wmnet [07:59:25] (03CR) 10Slyngshede: [C:03+1] "LGTM, but let's just have Moritz confirm that we're not actually using this. My memory is that this is for a previous issue on hardware we" [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [07:59:36] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2155.codfw.wmnet [08:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2186-2187].codfw.wmnet with reason: Index rebuild + upgrade [08:00:33] (03PS1) 10Marostegui: db2155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114646 (https://phabricator.wikimedia.org/T382842) [08:01:54] (03CR) 10Marostegui: [C:03+2] db2155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114646 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [08:01:59] (03PS1) 10Slyngshede: Revert "Upgrade to CAS 7.1" [dns] - 10https://gerrit.wikimedia.org/r/1114647 [08:04:28] (03CR) 10Slyngshede: [C:03+2] Revert "Upgrade to CAS 7.1" [dns] - 10https://gerrit.wikimedia.org/r/1114647 (owner: 10Slyngshede) [08:04:36] !log slyngshede@dns1004 START - running authdns-update [08:06:25] !log slyngshede@dns1004 END - running authdns-update [08:06:52] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2155.codfw.wmnet [08:07:30] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Index rebuild [08:09:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T384592)', diff saved to 
https://phabricator.wikimedia.org/P72560 and previous config saved to /var/cache/conftool/dbconfig/20250128-080945-marostegui.json [08:09:51] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [08:13:10] (03CR) 10Muehlenhoff: [C:03+1] "We retire the check at this point: This was introduced to catch cases where the microcode updates to fix L1TF, SSBD and MDS were not corre" [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [08:14:44] (03PS2) 10Urbanecm: [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) [08:14:48] (03CR) 10Urbanecm: [C:03+2] [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) (owner: 10Urbanecm) [08:15:30] (03Merged) 10jenkins-bot: [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) (owner: 10Urbanecm) [08:16:46] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1114379|[Growth] enwiki: Release Add Link to 10% of newcomers (T384551)]] [08:16:51] T384551: Add a link (Structured task): Increase rollout on English Wikipedia to 10% - https://phabricator.wikimedia.org/T384551 [08:17:17] (03CR) 10Muehlenhoff: [C:03+1] base: absent check_microcode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [08:19:36] jouncebot: nowandnext [08:19:36] For the next 0 hour(s) and 40 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0800) [08:19:36] In 2 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1100) [08:19:57] (03PS2) 10Reedy: SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858) (owner: 10Jforrester) [08:19:59] (03CR) 10Fabfur: hiera: enable haproxykafka on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:20:08] (03CR) 10Fabfur: hiera: enable haproxykafka on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:21:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage [08:21:29] (03CR) 10Reedy: [C:03+2] SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858) (owner: 10Jforrester) [08:23:10] urbanecm: Are you deploying many patches? 
:) [08:23:19] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1114379|[Growth] enwiki: Release Add Link to 10% of newcomers (T384551)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:23:24] T384551: Add a link (Structured task): Increase rollout on English Wikipedia to 10% - https://phabricator.wikimedia.org/T384551 [08:23:24] Reedy: no, just this one [08:23:31] !log urbanecm@deploy2002 urbanecm: Continuing with sync [08:23:33] sweet [08:23:53] i'll ping you when done :) [08:24:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [08:24:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage [08:24:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499648 (10ops-monitoring-bot) Draining ganeti2026.codfw.wmnet of running VMs [08:24:43] (03PS2) 10Reedy: UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093387 [08:24:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P72561 and previous config saved to /var/cache/conftool/dbconfig/20250128-082452-marostegui.json [08:25:01] (03PS2) 10Reedy: CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088655 [08:25:24] (03CR) 10Jelto: [C:03+2] "this is configured at firewall level and can be removed from apache" [puppet] - 10https://gerrit.wikimedia.org/r/1114438 (owner: 10Dzahn) [08:25:56] (03PS2) 10Reedy: Disable Dns Blacklist checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987) [08:26:56] (03CR) 10Filippo Giunchedi: base: absent check_microcode (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [08:27:00] (03PS2) 10Filippo Giunchedi: base: absent check_microcode [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) [08:27:31] (03CR) 10Filippo Giunchedi: base: absent check_microcode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [08:29:35] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: send sigkill as needed to stateless components [puppet] - 10https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi) [08:30:05] (03CR) 10Vgutierrez: [C:03+1] hiera: enable haproxykafka on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:33:33] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114379|[Growth] enwiki: Release Add Link to 10% of newcomers (T384551)]] (duration: 16m 46s) [08:33:37] T384551: Add a link (Structured task): Increase rollout on English Wikipedia to 10% - https://phabricator.wikimedia.org/T384551 [08:35:09] Reedy: over to you! 
[08:35:18] cheers :) [08:35:30] (03CR) 10Reedy: [C:03+2] UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093387 (owner: 10Reedy) [08:35:32] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088655 (owner: 10Reedy) [08:35:34] (03CR) 10Reedy: [C:03+2] Disable Dns Blacklist checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987) (owner: 10Reedy) [08:36:17] (03Merged) 10jenkins-bot: UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093387 (owner: 10Reedy) [08:36:19] (03Merged) 10jenkins-bot: CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088655 (owner: 10Reedy) [08:36:22] (03Merged) 10jenkins-bot: Disable Dns Blacklist checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987) (owner: 10Reedy) [08:37:15] (03CR) 10Jelto: [C:04-1] "two of those UserAgents can be found in the access logs. 
So I'd say let's keep them for now and we can clean that up once this is in reque" [puppet] - 10https://gerrit.wikimedia.org/r/1114442 (owner: 10Dzahn) [08:37:39] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10499657 (10MoritzMuehlenhoff) [08:38:01] (03PS2) 10Reedy: noc: Expose MobileUrlCallback.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091818 [08:38:05] (03CR) 10Reedy: [C:03+2] noc: Expose MobileUrlCallback.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091818 (owner: 10Reedy) [08:38:05] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:38:34] (03CR) 10Filippo Giunchedi: [C:03+2] base: absent check_microcode [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [08:38:49] PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:39:17] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:39:17] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:39:31] (03Merged) 10jenkins-bot: noc: Expose MobileUrlCallback.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091818 (owner: 10Reedy) [08:39:41] PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:39:50] (03PS3) 10Reedy: CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501) [08:39:58] (03CR) 10Reedy: [C:03+2] CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501) (owner: 10Reedy) [08:40:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P72562 and previous config 
saved to /var/cache/conftool/dbconfig/20250128-083959-marostegui.json
[08:40:40] (Merged) jenkins-bot: CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501) (owner: Reedy)
[08:41:23] (Merged) jenkins-bot: SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858) (owner: Jforrester)
[08:42:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:29] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1110053|CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons (T383501)]], [[gerrit:1114454|SimpleCaptcha: Don't look up captcha if no ID was given (T384858)]], [[gerrit:1091818|noc: Expose MobileUrlCallback.php]], [[gerrit:1108179|Disable Dns Blacklist checks (T382987)]], [[gerrit:1088655|CommonSettings.php: Remove deprecated $wg
[08:43:29] OATHAuthDatabase]], [[gerrit:1093387|UcfirstOverrides: Fix indenting of comment]]
[08:43:36] T383501: Add language to footer icons - https://phabricator.wikimedia.org/T383501
[08:43:37] T384858: PHP Deprecated: strtr(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384858
[08:43:37] T382987: Set the default of wgDnsBlacklistUrls to empty - https://phabricator.wikimedia.org/T382987
[08:44:43] (PS1) Muehlenhoff: Add ganeti2045-ganeti2050 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1114649 (https://phabricator.wikimedia.org/T384838)
[08:47:45] (PS1) Filippo Giunchedi: kartotherian: disable icinga check [puppet] -
https://gerrit.wikimedia.org/r/1114650 (https://phabricator.wikimedia.org/T321808)
[08:48:17] !log reedy@deploy2002 reedy, jforrester: Backport for [[gerrit:1110053|CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons (T383501)]], [[gerrit:1114454|SimpleCaptcha: Don't look up captcha if no ID was given (T384858)]], [[gerrit:1091818|noc: Expose MobileUrlCallback.php]], [[gerrit:1108179|Disable Dns Blacklist checks (T382987)]], [[gerrit:1088655|CommonSettings.php: Remove deprecated $wgOATHAu
[08:48:17] thDatabase]], [[gerrit:1093387|UcfirstOverrides: Fix indenting of comment]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:48:31] !log reedy@deploy2002 reedy, jforrester: Continuing with sync
[08:48:38] ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10499694 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH @RobH Why 2046 onwards? Our highest Ganeti server in codfw is 2044; I've filled in the r...
[08:51:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2020.codfw.wmnet with OS bookworm
[08:51:11] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bookworm completed: - ganeti202...
[08:52:43] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.37 ms
[08:52:49] RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.29 ms
[08:52:49] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.83 ms
[08:52:56] (CR) Muehlenhoff: [C:+2] Add ganeti2045-ganeti2050 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1114649 (https://phabricator.wikimedia.org/T384838) (owner: Muehlenhoff)
[08:54:41] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.81 ms
[08:54:41] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.18 ms
[08:55:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72563 and previous config saved to /var/cache/conftool/dbconfig/20250128-085506-marostegui.json
[08:55:11] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:55:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:55:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T384592)', diff saved to https://phabricator.wikimedia.org/P72564 and previous config saved to /var/cache/conftool/dbconfig/20250128-085528-marostegui.json
[08:56:27] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110053|CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons (T383501)]], [[gerrit:1114454|SimpleCaptcha: Don't look up captcha if no ID was given (T384858)]], [[gerrit:1091818|noc: Expose MobileUrlCallback.php]], [[gerrit:1108179|Disable Dns Blacklist checks (T382987)]], [[gerrit:1088655|CommonSettings.php: Remove deprecated $w
[08:56:27] gOATHAuthDatabase]], [[gerrit:1093387|UcfirstOverrides: Fix indenting of comment]] (duration: 12m 58s)
[08:56:34] T383501: Add language to
footer icons - https://phabricator.wikimedia.org/T383501
[08:56:34] T384858: PHP Deprecated: strtr(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384858
[08:56:34] T382987: Set the default of wgDnsBlacklistUrls to empty - https://phabricator.wikimedia.org/T382987
[08:57:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
[08:57:33] (PS1) Filippo Giunchedi: profile: remove obsolete poolcounter icinga checks [puppet] - https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808)
[08:57:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:00:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036', diff saved to https://phabricator.wikimedia.org/P72565 and previous config saved to /var/cache/conftool/dbconfig/20250128-090000-marostegui.json
[09:00:10] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2036.codfw.wmnet
[09:00:54] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: reimage
[09:00:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:01:03] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:01:46] (PS1) Marostegui: Revert "db2155: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114652
[09:02:16] (CR) Marostegui: [C:-2] "Not yet"
[puppet] - https://gerrit.wikimedia.org/r/1114652 (owner: Marostegui)
[09:03:45] (PS1) Muehlenhoff: Switch ganeti2026 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114653
[09:04:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72566 and previous config saved to /var/cache/conftool/dbconfig/20250128-090439-root.json
[09:04:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2028 to es1 codfw master', diff saved to https://phabricator.wikimedia.org/P72567 and previous config saved to /var/cache/conftool/dbconfig/20250128-090454-marostegui.json
[09:05:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72568 and previous config saved to /var/cache/conftool/dbconfig/20250128-090525-root.json
[09:05:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
[09:05:50] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2036.codfw.wmnet
[09:06:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205', diff saved to https://phabricator.wikimedia.org/P72569 and previous config saved to /var/cache/conftool/dbconfig/20250128-090601-marostegui.json
[09:06:10] (PS3) Cyndywikime: Add configurable MinimumTasksPerTopic [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714)
[09:06:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P72570 and previous config saved to /var/cache/conftool/dbconfig/20250128-090620-marostegui.json
[09:06:21] (CR) Cyndywikime: Add configurable MinimumTasksPerTopic (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: Cyndywikime)
[09:06:34] !log root@cumin1002 START -
Cookbook sre.hosts.reimage for host db1150.eqiad.wmnet with OS bookworm
[09:08:36] SRE, SRE-Unowned, Deployments, Release-Engineering-Team, Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10499724 (hashar) Open→Declined I had enough push back that I am not interested in pursuing. I will keep using...
[09:12:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P72571 and previous config saved to /var/cache/conftool/dbconfig/20250128-091242-marostegui.json
[09:13:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72572 and previous config saved to /var/cache/conftool/dbconfig/20250128-091302-root.json
[09:13:16] (PS1) Marostegui: wmnet: Promote es2028 to es1 master [dns] - https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905)
[09:13:29] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2030.codfw.wmnet
[09:13:39] (CR) Muehlenhoff: [C:+2] sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - https://gerrit.wikimedia.org/r/1112171 (owner: Muehlenhoff)
[09:16:27] (PS1) Filippo Giunchedi: dumps: remove nfs port icinga checks [puppet] - https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808)
[09:18:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72574 and previous config saved to /var/cache/conftool/dbconfig/20250128-091846-root.json
[09:18:52] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[09:22:35] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2030.codfw.wmnet
[09:22:38] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1150.eqiad.wmnet with
reason: host reimage
[09:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10499743 (phaultfinder)
[09:26:22] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: host reimage
[09:26:38] (CR) Fabfur: "tnx for the +1!" [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[09:26:40] (CR) Fabfur: [C:+2] hiera: enable haproxykafka on codfw [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[09:27:52] !log installing/enabling haproxykafka on codfw (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114415) (T378578)
[09:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:57] T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578
[09:28:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72575 and previous config saved to /var/cache/conftool/dbconfig/20250128-092808-root.json
[09:28:57] (CR) Effie Mouzeli: "I am afraid I do not have any useful input here 😊" [puppet] - https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024) (owner: Ladsgroup)
[09:33:10] RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 90%, RTA = 30.22 ms
[09:33:34] (PS1) Effie Mouzeli: Enroll 2% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845)
[09:33:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72576 and previous config saved to /var/cache/conftool/dbconfig/20250128-093352-root.json
[09:33:58] T384807: Upgrade and rebuild s3 -
https://phabricator.wikimedia.org/T384807
[09:34:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72577 and previous config saved to /var/cache/conftool/dbconfig/20250128-093423-root.json
[09:34:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2190', diff saved to https://phabricator.wikimedia.org/P72578 and previous config saved to /var/cache/conftool/dbconfig/20250128-093446-marostegui.json
[09:34:56] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2190.codfw.wmnet
[09:39:34] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[09:39:57] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2190.codfw.wmnet
[09:41:13] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Index rebuild
[09:43:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72580 and previous config saved to /var/cache/conftool/dbconfig/20250128-094313-root.json
[09:48:00] (PS1) Muehlenhoff: profile::docker::firewall: Remove unused profile [puppet] - https://gerrit.wikimedia.org/r/1114661
[09:48:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72581 and previous config saved to /var/cache/conftool/dbconfig/20250128-094857-root.json
[09:49:03] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[09:49:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72582 and previous config saved to /var/cache/conftool/dbconfig/20250128-094928-root.json
[09:49:58] (CR) Giuseppe Lavagetto: [C:+1] Enroll 2% of client sessions in PHP 8.1 [mediawiki-config] -
https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[09:50:07] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1150.eqiad.wmnet with OS bookworm
[09:50:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T384592)', diff saved to https://phabricator.wikimedia.org/P72583 and previous config saved to /var/cache/conftool/dbconfig/20250128-095032-marostegui.json
[09:50:37] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:51:21] (PS1) Dreamrimmer: Change "$wgUploadMissingFileUrl" for svwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452)
[09:53:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: Dreamrimmer)
[09:58:06] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10499820 (LSobanski) p:Triage→Medium
[09:58:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72584 and previous config saved to /var/cache/conftool/dbconfig/20250128-095818-root.json
[10:00:18] (CR) Muehlenhoff: [C:-1] "This should be broken down to logical, dependant patches, each with their own commit message detailing the change (adding support for new " [puppet] - https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: Arnaudb)
[10:04:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repooling T384807', diff
saved to https://phabricator.wikimedia.org/P72585 and previous config saved to /var/cache/conftool/dbconfig/20250128-100402-root.json
[10:04:08] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:04:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72586 and previous config saved to /var/cache/conftool/dbconfig/20250128-100434-root.json
[10:05:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P72587 and previous config saved to /var/cache/conftool/dbconfig/20250128-100539-marostegui.json
[10:07:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72588 and previous config saved to /var/cache/conftool/dbconfig/20250128-100754-root.json
[10:10:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:11:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:12:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2155 T384807', diff saved to https://phabricator.wikimedia.org/P72589 and previous config saved to /var/cache/conftool/dbconfig/20250128-101224-marostegui.json
[10:12:29] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:13:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72590 and previous config saved to /var/cache/conftool/dbconfig/20250128-101324-root.json
[10:14:58] (PS1) Jelto:
Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984)
[10:19:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72591 and previous config saved to /var/cache/conftool/dbconfig/20250128-101908-root.json
[10:19:13] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:19:28] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:19:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72592 and previous config saved to /var/cache/conftool/dbconfig/20250128-101939-root.json
[10:19:42] (CR) Vgutierrez: [C:+2] service: Add scheduler_flag field to ServiceLVS [software/spicerack] - https://gerrit.wikimedia.org/r/1114356 (https://phabricator.wikimedia.org/T373027) (owner: Vgutierrez)
[10:20:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P72593 and previous config saved to /var/cache/conftool/dbconfig/20250128-102046-marostegui.json
[10:21:34] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10499923 (MatthewVernon) ` /dev/sda -d scsi # /dev/sda, SCSI device /dev/sdb -d scsi # /dev/sdb, SCSI device /dev/sdc -d scsi # /dev/sdc, SCSI device /dev/sdd -d scsi # /dev/sd...
[10:22:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:23:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:25:11] (CR) Vgutierrez: [C:+2] wmflib,pybal: Add scheduler_flag support [puppet] - https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: Vgutierrez)
[10:28:16] (CR) MVernon: "How does that relate to the nginx and swift-fe services that are being used in confctl to pool/depool these systems,then?" [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez)
[10:28:55] ops-magru, Infrastructure-Foundations, netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10499926 (cmooney) Everything remains stable since the upgrade/reset of the routers yesterday. All protocol adjacencies, interfaces etc look good as are the gene...
[10:33:55] (CR) Vgutierrez: "the provided configuration sets the mapping between local services (envoy and swift-proxy) with conftool services, so the provided scripts" [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez)
[10:34:27] ops-magru, Infrastructure-Foundations, netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10499947 (Vgutierrez) thanks @cmooney, I'll re-pool the site
[10:34:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72594 and previous config saved to /var/cache/conftool/dbconfig/20250128-103444-root.json
[10:34:53] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1112224 (https://phabricator.wikimedia.org/T383707) (owner: Slyngshede)
[10:35:35] !log vgutierrez@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: no reason specified, T384774]
[10:35:39] T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774
[10:35:49] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: no reason specified, T384774]
[10:35:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T384592)', diff saved to https://phabricator.wikimedia.org/P72595 and previous config saved to /var/cache/conftool/dbconfig/20250128-103553-marostegui.json
[10:35:58] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:36:01] jouncebot: nowandnext
[10:36:01] No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[10:36:01] In 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1100)
[10:36:09] !log marostegui@cumin1002 DONE (PASS)
- Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[10:36:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
[10:38:07] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2026.codfw.wmnet with reason: remove from cluster for reimage
[10:38:14] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499969 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bc2c7bb0-3133-43fd-9040-c01d53f22d8f) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[10:39:46] (CR) Muehlenhoff: [C:+2] Switch ganeti2026 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114653 (owner: Muehlenhoff)
[10:39:50] (CR) Vgutierrez: [C:+1] hiera: enable haproxykafka on eqiad [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[10:42:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:43:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:44:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72596 and previous config saved to /var/cache/conftool/dbconfig/20250128-104415-root.json
[10:44:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72597 and previous config saved to
/var/cache/conftool/dbconfig/20250128-104436-root.json
[10:44:41] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:47:10] (PS1) Volans: CHANGELOG: add changelogs for release v9.1.1 [software/spicerack] - https://gerrit.wikimedia.org/r/1114674
[10:54:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2020.codfw.wmnet to cluster codfw and group B
[10:54:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2020.codfw.wmnet to cluster codfw and group B
[10:57:25] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v9.1.1 [software/spicerack] - https://gerrit.wikimedia.org/r/1114674 (owner: Volans)
[10:59:07] (PS1) Volans: Upstream release v9.1.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1114675
[10:59:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72598 and previous config saved to /var/cache/conftool/dbconfig/20250128-105920-root.json
[10:59:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72599 and previous config saved to /var/cache/conftool/dbconfig/20250128-105942-root.json
[10:59:47] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[11:00:05] effie mouzeli: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC mid-day) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1100).
[11:01:50] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[11:02:18] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[11:02:39] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[11:03:03] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[11:04:36] !log installing runc security updates
[11:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:37] (CR) TrainBranchBot: [C:+2] "Approved by jiji@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[11:06:25] (Merged) jenkins-bot: Enroll 2% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[11:06:55] !log jiji@deploy2002 Started scap sync-world: Backport for [[gerrit:1114657|Enroll 2% of client sessions in PHP 8.1 (T383845)]]
[11:07:00] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[11:08:38] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:22] (PS1) Btullis: Fix incompatibility between /mnt/hdfs and envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329)
[11:11:33] !log jiji@deploy2002 jiji: Backport for [[gerrit:1114657|Enroll 2% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:11:50] !log jiji@deploy2002 jiji: Continuing with sync
[11:12:22] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4873/console" [puppet] - https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[11:13:59] (CR) Volans: [C:+2] Upstream release
v9.1.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1114675 (owner: Volans)
[11:14:07] (CR) MVernon: [C:+1] "Thank you for taking the time to explain all this to me again :)" [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez)
[11:14:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72600 and previous config saved to /var/cache/conftool/dbconfig/20250128-111425-root.json
[11:14:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72601 and previous config saved to /var/cache/conftool/dbconfig/20250128-111447-root.json
[11:14:53] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[11:15:58] (PS2) Jelto: Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984)
[11:17:15] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[11:18:44] !log jiji@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114657|Enroll 2% of client sessions in PHP 8.1 (T383845)]] (duration: 11m 48s)
[11:18:49] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[11:19:00] !log uploaded spicerack_9.1.1 to apt.wikimedia.org bullseye-wikimedia
[11:19:01] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[11:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.eqiad.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:25:30] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:25:36] PROBLEM - SSH on prometheus2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:26:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[11:26:26] RECOVERY - SSH on prometheus2006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:26:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T384592)', diff saved to https://phabricator.wikimedia.org/P72602 and previous config saved to /var/cache/conftool/dbconfig/20250128-112631-marostegui.json
[11:26:37] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[11:26:47] (CR) Clément Goubert: [C:+1] fc-list: update font list [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan)
[11:26:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.eqiad.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:27:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:27:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:10] here [11:28:52] !incidents [11:28:52] 5638 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [11:28:52] 5637 (RESOLVED) [4x] ProbeDown sre (probes/custom eqiad) [11:28:53] 5636 (RESOLVED) [4x] ProbeDown sre (probes/custom eqiad) [11:29:01] !ack 5638 [11:29:02] 5638 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [11:29:18] impact: issues loading grafana [11:29:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72603 and previous config saved to /var/cache/conftool/dbconfig/20250128-112931-root.json [11:29:33] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:29:43] (03PS2) 10Btullis: Fix incompatibility between /mnt/hdfs and envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329) [11:29:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72604 and previous config saved to /var/cache/conftool/dbconfig/20250128-112952-root.json [11:29:58] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [11:30:31] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4874/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis) [11:30:37] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:31:56] grafana is back [11:32:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:38] FIRING: [3x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:34:16] !log installed spicerack v9.1.1 on cumin2002 [11:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:50] !log installed spicerack v9.1.1 on cumin1002 [11:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:07] don't know what the trigger is, by the time I get to the host and check envoy is up and listening [11:36:08] FIRING: UdpIRCStreamThroughput: irc1003:16667 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput [11:38:08] is that real or is it due to the prometheus down time? 
[11:38:31] oomkill for prometheus k8s on prometheus1006 [11:38:38] RESOLVED: [3x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:38] godog: ^ [11:38:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T384592)', diff saved to https://phabricator.wikimedia.org/P72605 and previous config saved to /var/cache/conftool/dbconfig/20250128-113845-marostegui.json [11:38:50] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [11:39:16] maybe a restart, although if it is traffic volume-caused, it won't do much [11:41:08] RESOLVED: UdpIRCStreamThroughput: irc1003:16667 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput [11:41:19] claime: ack thx, will take a look [11:41:54] /api/v1/series spiked up to over 2min [11:44:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72606 and previous config saved to /var/cache/conftool/dbconfig/20250128-114436-root.json [11:44:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72607 and previous config saved to /var/cache/conftool/dbconfig/20250128-114458-root.json [11:45:03] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [11:45:06] (03CR) 10Btullis: [V:03+1 C:03+2] Fix incompatibility between /mnt/hdfs and envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis) [11:46:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance 
elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:49:34] (03PS1) 10Slyngshede: Move to CAS 7.1 for debugging [dns] - 10https://gerrit.wikimedia.org/r/1114689 [11:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:51:50] claime: yeah prometheus@k8s-mlserve exploded in memory, I'm looking at https://grafana.wikimedia.org/goto/e5yDMQOHR?orgId=1 and https://grafana.wikimedia.org/goto/Qc6vMwOHR?orgId=1 [11:52:29] I mean other instances too [11:52:44] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1003.eqiad.wmnet with OS bookworm [11:53:00] coincides with the end of the deployment of php8.1 on k8s, but 2% of traffic shouldn't cause such an explosion [11:53:07] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10500134 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host clo... 
[11:53:18] also with a ml-serve apply yeah [11:53:50] I'd think the same re: php rollout hardly be the cause [11:53:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P72608 and previous config saved to /var/cache/conftool/dbconfig/20250128-115352-marostegui.json [11:53:58] (03CR) 10Slyngshede: [C:03+2] Move to CAS 7.1 for debugging [dns] - 10https://gerrit.wikimedia.org/r/1114689 (owner: 10Slyngshede) [11:54:15] !log slyngshede@dns1004 START - running authdns-update [11:56:05] !log slyngshede@dns1004 END - running authdns-update [12:02:44] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1004.eqiad.wmnet with OS bookworm [12:05:28] (03CR) 10Vgutierrez: "I think we should provide further documentation, because after merging this CR we won't be able to set single_backend to `false` on the im" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [12:07:07] (03PS1) 10Reedy: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114701 (https://phabricator.wikimedia.org/T384879) [12:07:15] (03PS1) 10Reedy: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114702 (https://phabricator.wikimedia.org/T384879) [12:08:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P72609 and previous config saved to /var/cache/conftool/dbconfig/20250128-120859-marostegui.json [12:09:37] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage [12:11:09] (03PS1) 10Stang: zhwiki: Add 2025 CNY celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) [12:12:36] !log andrew@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage [12:12:42] (03CR) 10Elukey: [C:03+2] admin_ng: disable PSP mutation for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114423 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [12:18:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: 10Stang) [12:19:01] 06SRE, 06Commons, 10MediaWiki-Uploading, 06Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10500234 (10Vgutierrez) {F58297395} This high TTFB values make me suspect of some kind of connectivity issue. Could you try to reproduce this behavior o... [12:19:38] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1004.eqiad.wmnet with reason: host reimage [12:22:16] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow2003.codfw.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [12:22:22] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10500289 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=892c37cf-859a-4da6-8f59-c75b5d153219) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and th... 
[12:23:14] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1004.eqiad.wmnet with reason: host reimage [12:23:26] (03PS1) 10Slyngshede: Revert "Move to CAS 7.1 for debugging" [dns] - 10https://gerrit.wikimedia.org/r/1114705 [12:24:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T384592)', diff saved to https://phabricator.wikimedia.org/P72610 and previous config saved to /var/cache/conftool/dbconfig/20250128-122406-marostegui.json [12:24:11] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:24:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:24:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T384592)', diff saved to https://phabricator.wikimedia.org/P72611 and previous config saved to /var/cache/conftool/dbconfig/20250128-122428-marostegui.json [12:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10500301 (10phaultfinder) [12:25:25] (03CR) 10Slyngshede: [C:03+2] Revert "Move to CAS 7.1 for debugging" [dns] - 10https://gerrit.wikimedia.org/r/1114705 (owner: 10Slyngshede) [12:25:32] !log slyngshede@dns1004 START - running authdns-update [12:25:48] (03CR) 10Btullis: [C:03+1] "Good stuff, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [12:27:09] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2230.codfw.wmnet with reason: Index rebuild [12:27:21] !log slyngshede@dns1004 END - running authdns-update [12:27:32] (03CR) 10Filippo Giunchedi: [C:03+2] dumps: remove nfs port icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [12:27:38] (03PS2) 10Filippo Giunchedi: dumps: remove nfs port icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808) [12:27:59] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2230.codfw.wmnet with reason: Index rebuild [12:28:21] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] dumps: remove nfs port icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [12:30:25] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:30:44] (03PS1) 10Vgutierrez: liberica: Depool on liberica-cp.service stop [puppet] - 10https://gerrit.wikimedia.org/r/1114708 [12:31:15] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
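Note on the repooling entries: the dbctl commits above show db2155 and es2030 going back into rotation at 50%, 75% and 100%, roughly fifteen minutes apart. A minimal sketch of how such evenly spaced percentages could be derived from a step count follows. The function name `pooling_steps` is hypothetical, and this is not the code of the sre.mysql.pool cookbook or of dbctl; it only illustrates the arithmetic, assuming equal-sized steps that end at 100%.

```python
def pooling_steps(steps: int) -> list[int]:
    """Return evenly spaced pooling percentages, ending at 100.

    Hypothetical helper for illustration only; the real cookbook may
    choose its percentages differently.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Step i of n pools the host at i/n of full weight, rounded to a whole percent.
    return [round(100 * i / steps) for i in range(1, steps + 1)]


if __name__ == "__main__":
    for pct in pooling_steps(4):
        print(f"pool at {pct}%")
```

Under that assumption, a four-step repool would pass through 25%, 50%, 75% and 100%, which is consistent with the 50%, 75% and 100% commits visible in this part of the log.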
[12:32:43] !log root@cumin1002 START - Cookbook sre.mysql.pool db2190 gradually with 4 steps - Repooling after rebuild index [12:32:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500328 (10MoritzMuehlenhoff) [12:32:58] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [12:34:44] (03PS1) 10Marostegui: rebuild_tables.sh Add automatic repooling [software] - 10https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842) [12:36:13] (03CR) 10Marostegui: "FYI. This has been tested, just trying to make it less painful to rebuild tables." [software] - 10https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [12:36:25] (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh Add automatic repooling [software] - 10https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [12:36:51] (03Merged) 10jenkins-bot: rebuild_tables.sh Add automatic repooling [software] - 10https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [12:37:06] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114437 (owner: 10PipelineBot) [12:37:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T384592)', diff saved to https://phabricator.wikimedia.org/P72614 and previous config saved to /var/cache/conftool/dbconfig/20250128-123706-marostegui.json [12:37:11] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:37:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166 T382842', diff saved to https://phabricator.wikimedia.org/P72615 and previous config saved to 
/var/cache/conftool/dbconfig/20250128-123713-marostegui.json [12:37:19] T382842: Upgrade to 10.6.20 and rebuild recentchanges and pagelinks tables - https://phabricator.wikimedia.org/T382842 [12:38:14] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114437 (owner: 10PipelineBot) [12:38:42] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114433 (owner: 10PipelineBot) [12:39:10] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Index rebuild [12:40:00] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114433 (owner: 10PipelineBot) [12:40:42] 06SRE, 06Infrastructure-Foundations, 10netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10500343 (10MoritzMuehlenhoff) 05Open→03Resolved After running 0.14.1 for five days, we can confirm this fixed, disk usage of /var/lib/routinator/repository... 
[12:40:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:41:18] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [12:45:09] (03CR) 10JMeybohm: Support multiple helm versions (032 comments) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [12:45:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS bookworm [12:45:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bookworm [12:49:04] (03CR) 10Elukey: [C:03+1] kartotherian: disable icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1114650 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [12:50:21] !log andrew@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [12:50:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1004.eqiad.wmnet with OS bookworm [12:50:28] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [12:50:29] !log andrew@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host cloudgw1003.eqiad.wmnet with OS bookworm [12:50:42] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10500383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw... [12:51:05] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow3003.esams.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [12:51:11] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10500385 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b04d5bf-ab80-4626-96ba-3c376dfc52c2) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and th... [12:52:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P72617 and previous config saved to /var/cache/conftool/dbconfig/20250128-125213-marostegui.json [12:56:27] (03CR) 10Elukey: "I checked https://github.com/Wikia/poolcounter-prometheus-exporter/blob/master/collector.go and afaics it just pulls metrics from poolcoun" [puppet] - 10https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1300) [13:02:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2026.codfw.wmnet with OS bookworm [13:02:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500439 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for 
host ganeti2026.codfw.wmnet with OS bookworm executed with errors:... [13:02:35] (03CR) 10Filippo Giunchedi: [C:03+2] kartotherian: disable icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1114650 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:03:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS bookworm [13:03:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500442 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bookworm [13:03:53] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:04:30] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:04:52] (03PS2) 10Marostegui: wmnet: Promote es2028 to es1 master [dns] - 10https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905) [13:05:09] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:06:25] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:06:39] (03CR) 10Filippo Giunchedi: "Good point, not AFAIK. 
The alternative would be to deploy a blackbox exporter, or better yet add 'poolcounter_up' metric to the exporter w" [puppet] - 10https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:06:49] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:07:15] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:07:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P72619 and previous config saved to /var/cache/conftool/dbconfig/20250128-130720-marostegui.json [13:12:33] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:13:17] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:13:32] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:14:01] (03CR) 10Federico Ceratto: [C:03+1] "(+1, discussed on IRC)" [dns] - 10https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905) (owner: 10Marostegui) [13:14:39] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:15:04] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:15:46] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:15:48] (03CR) 10Federico Ceratto: [C:03+2] wmnet: Promote es2028 to es1 master [dns] - 10https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905) (owner: 10Marostegui) [13:17:48] (03CR) 10Fabfur: liberica: Depool on liberica-cp.service stop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114708 (owner: 10Vgutierrez) [13:17:53] !log fceratto@dns1004 START - running authdns-update [13:18:06] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2190 gradually with 4 steps - Repooling after rebuild index 
[13:19:39] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1166.eqiad.wmnet [13:19:50] !log fceratto@dns1004 END - running authdns-update [13:20:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:21:43] (03CR) 10Marostegui: Revert "db2155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114652 (owner: 10Marostegui) [13:21:44] (03CR) 10Marostegui: [C:03+2] Revert "db2155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114652 (owner: 10Marostegui) [13:22:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:22:22] (03PS2) 10Muehlenhoff: sre.ganeti.resource-report: Stop logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) [13:22:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T384592)', diff saved to https://phabricator.wikimedia.org/P72622 and previous config saved to /var/cache/conftool/dbconfig/20250128-132227-marostegui.json [13:22:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:22:32] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:22:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72623 and previous config saved to /var/cache/conftool/dbconfig/20250128-132238-marostegui.json [13:23:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2026.codfw.wmnet with reason: host reimage [13:23:34] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1114708 (owner: 10Vgutierrez) [13:25:41] (03PS3) 10Jelto: Support multiple helm versions [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 
(https://phabricator.wikimedia.org/T341984) [13:26:13] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1166.eqiad.wmnet [13:26:20] (03CR) 10Jelto: Support multiple helm versions (032 comments) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:26:44] (03PS1) 10FNegri: alertmanager: fix WMCS email address [puppet] - 10https://gerrit.wikimedia.org/r/1114723 [13:27:11] PROBLEM - MariaDB Replica Lag: s3 #page on db1166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2541.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:27:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2026.codfw.wmnet with reason: host reimage [13:27:52] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Index rebuild [13:27:57] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] alertmanager: fix WMCS email address [puppet] - 10https://gerrit.wikimedia.org/r/1114723 (owner: 10FNegri) [13:28:24] (03CR) 10FNegri: [C:03+2] alertmanager: fix WMCS email address [puppet] - 10https://gerrit.wikimedia.org/r/1114723 (owner: 10FNegri) [13:28:29] !incidents [13:28:29] 5639 (ACKED) db1166 (paged)/MariaDB Replica Lag: s3 (paged) [13:28:29] 5638 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [13:28:29] 5637 (RESOLVED) [4x] ProbeDown sre (probes/custom eqiad) [13:28:29] 5636 (RESOLVED) [4x] ProbeDown sre (probes/custom eqiad) [13:29:32] should I depool it? [13:29:59] claime: No, downtime expired! 
[13:30:01] Sorry :( [13:30:03] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [13:30:03] It is depooled [13:30:07] ah cool [13:30:08] happens [13:30:14] back to lunch then :p [13:33:30] !log installing/enabling haproxykafka on eqiad (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114417) (T378578) [13:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:35] T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578 [13:37:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72624 and previous config saved to /var/cache/conftool/dbconfig/20250128-133701-marostegui.json [13:37:07] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:38:38] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:39:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:40:52] (03CR) 10JMeybohm: Support multiple helm versions (032 comments) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:42:57] (03CR) 10Jelto: Support multiple helm versions (032 comments) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:43:38] RESOLVED: [2x] 
ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:01] (03PS4) 10Jelto: Support multiple helm versions [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) [13:45:50] (03PS1) 10Lucas Werkmeister (WMDE): Enable mul language code on Wikidata (full release) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) [13:46:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: 10Lucas Werkmeister (WMDE)) [13:47:06] (03PS1) 10Fabfur: hiera: consolidate haproxykafka into common profile [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) [13:47:28] (03CR) 10CI reject: [V:04-1] hiera: consolidate haproxykafka into common profile [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [13:49:18] (03PS2) 10Fabfur: hiera: consolidate haproxykafka into common profile [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) [13:49:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2026.codfw.wmnet with OS bookworm [13:49:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500606 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host 
ganeti2026.codfw.wmnet with OS bookworm completed: - ganeti202... [13:49:44] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [13:49:59] (03CR) 10JMeybohm: [C:03+1] Support multiple helm versions [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:50:22] (03CR) 10Arnaudb: [C:04-1] "Sure! I broke this patch down in 4 commits: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114726/1 the core firewall modification" [puppet] - 10https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb) [13:50:26] (03Abandoned) 10Arnaudb: gitlab_runner: migrate ferm rules to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb) [13:50:38] (03PS2) 10Vgutierrez: liberica: Depool on liberica-cp.service stop [puppet] - 10https://gerrit.wikimedia.org/r/1114708 [13:50:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:51:02] (03PS1) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [13:51:14] (03CR) 10Fabfur: [C:03+1] liberica: Depool on liberica-cp.service stop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114708 (owner: 10Vgutierrez) [13:51:15] (03PS3) 10Arnaudb: nftables: add nftable docker manifest [puppet] - 10https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677) [13:51:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:51:27] (03PS2) 10Arnaudb: nftables: add types and directories [puppet] - 10https://gerrit.wikimedia.org/r/1114717 (https://phabricator.wikimedia.org/T370677) [13:51:36] (03PS2) 10Arnaudb: nftables: add docker profile and 
forward chain [puppet] - 10https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677) [13:52:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P72625 and previous config saved to /var/cache/conftool/dbconfig/20250128-135208-marostegui.json [13:57:57] (03CR) 10Vgutierrez: [C:03+2] liberica: Depool on liberica-cp.service stop [puppet] - 10https://gerrit.wikimedia.org/r/1114708 (owner: 10Vgutierrez) [13:58:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1400). [14:00:05] Daimona, DreamRimmer, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:41] I have a meeting in half an hour, but I could deploy until then if nobody else is around… [14:00:48] o/ [14:00:53] o/ [14:00:59] (CR) Jelto: [C:+2] Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: Jelto) [14:01:56] (PS1) Brouberol: dse-k8s-eqiad: deploy the sidecar job controller [deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329) [14:02:47] let’s start with Daimona then [14:02:50] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: Daimona Eaytoy) [14:03:54] (Merged) jenkins-bot: prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: Daimona Eaytoy) [14:04:12] o/ [14:04:24] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114440|prod: Enable $wgCampaignEventsEnableEventTopics (T380818)]] [14:04:30] T380818: Enable the event topics feature in production - https://phabricator.wikimedia.org/T380818 [14:06:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [14:07:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P72626 and previous config saved to /var/cache/conftool/dbconfig/20250128-140715-marostegui.json [14:09:02] ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10500645 (RobH) a:RobH→MoritzMuehlenhoff >>! In T384838#10499694, @MoritzMuehlenhoff wrote: > @RobH Why 2046 onwards? Our highest Ganeti serve... 
[14:09:06] (PS2) Brouberol: dse-k8s-eqiad: deploy the sidecar job controller [deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329) [14:09:18] ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10500647 (RobH) [14:09:27] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1114440|prod: Enable $wgCampaignEventsEnableEventTopics (T380818)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:31] T380818: Enable the event topics feature in production - https://phabricator.wikimedia.org/T380818 [14:09:33] Daimona: can you test on mwdebug? [14:09:42] ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10500652 (RobH) [14:09:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2026.codfw.wmnet to cluster codfw and group D [14:09:54] * Lucas_WMDE sees a lot of tracing errors in mwdebug logstash [14:10:32] Testing [14:11:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2026.codfw.wmnet to cluster codfw and group D [14:15:12] (CR) Btullis: [C:+1] "Looks good, thanks." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol) [14:15:16] (PS10) Clément Goubert: admin_ng: add mwcron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 [14:15:41] (PS11) Clément Goubert: admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 [14:15:50] (CR) Vgutierrez: [C:-1] "change looks good, commit needs to be fixed (see inline comment)" [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) (owner: Fabfur) [14:16:19] Daimona: out of interest, do you know if cmelo was also here for the CampaignEvents config change or something else? ^^ [14:16:21] (CR) Brouberol: [C:+2] dse-k8s-eqiad: deploy the sidecar job controller [deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol) [14:16:22] ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10500680 (cmooney) >>! In T380893#10396432, @Andrew wrote: > These hosts have a somewhat unusual vlan setup, so my guess is something i... [14:16:44] (he quit before I could ask – SAL / deployments archive suggests he works roughly in this area, as far as I understand it anyway ^^) [14:16:57] Lucas_WMDE: you can go ahead. We found a DB error, seems like a recent schema change has not been applied. It's unrelated to the current config change though. 
[14:17:05] hm [14:17:06] ok ^^ [14:17:14] lemme just peek at logstash real quick [14:17:17] ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10500684 (cmooney) a:cmooney→None [14:17:27] (and filter out the damn tracing channel spam) [14:17:36] Yep, we're in the same call, testing together :) [14:17:39] “Unknown column 'event_is_test_event' in 'field list'” [14:17:39] ok :) [14:17:55] there’s also some PHP Notice: Undefined property: stdClass::$event_is_test_event [14:17:57] is that known? [14:18:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:18:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:18:10] oh, right [14:18:14] that’ll be the same error [14:18:22] 'event_is_test_event' column missing from a result set [14:18:39] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [14:18:43] ok then let’s try it [14:19:42] Yep, same thing. Trying to figure out why the column doesn't exist in prod. [14:20:06] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10500691 (MatthewVernon) @Papaul is this host likely to get some attention soon, please? 
[14:21:08] !log Imported helm311 | 3.11.3-3 to bookworm-wikimedia - T341984 [14:21:10] (03PS1) 10Brouberol: dse-k8s-eqiad: create the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114738 (https://phabricator.wikimedia.org/T384329) [14:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:12] T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984 [14:21:25] <_joe_> Lucas_WMDE: once you're done, let me know [14:21:38] (03CR) 10Muehlenhoff: sre.ganeti.resource-report: Stop logging to SAL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [14:22:13] (03PS3) 10Fabfur: hiera: consolidate haproxykafka into common profile [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578) [14:22:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72627 and previous config saved to /var/cache/conftool/dbconfig/20250128-142222-marostegui.json [14:22:26] _joe_: I’ll have to stop after this deployment anyway, meeting coming up [14:22:27] (03CR) 10Fabfur: hiera: consolidate haproxykafka into common profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:22:28] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:22:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:22:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T384592)', diff saved to https://phabricator.wikimedia.org/P72628 and previous config saved to /var/cache/conftool/dbconfig/20250128-142244-marostegui.json [14:22:52] <_joe_> Lucas_WMDE: so I can merge a 
patch of mine instead? :) [14:23:08] if you think it’s more important than the other schedule changes, I guess? ^^ [14:23:10] up to you :P [14:23:20] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10500704 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None >>! In T384838#10500645, @RobH wrote: >>>! In T384838#10499694, @MoritzMuehlenhoff wrote... [14:23:36] (idk if anyone else would volunteer to deploy those otherwise, I didn’t see anyone else speak up at the beginning of the window but I might have missed it) [14:23:37] (03CR) 10Volans: [C:03+1] "LGTM, I've also tested it with test-cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [14:23:55] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: create the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114738 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:23:58] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10500707 (10MoritzMuehlenhoff) [14:24:29] :( [14:25:28] * Lucas_WMDE looks up when CNY 2025 starts/ends [14:25:29] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114440|prod: Enable $wgCampaignEventsEnableEventTopics (T380818)]] (duration: 21m 04s) [14:25:30] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: create the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114738 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:25:34] T380818: Enable the event topics feature in production - https://phabricator.wikimedia.org/T380818 [14:25:36] _joe_: I’m done [14:26:01] 29 January… so preferably we shouldn’t postpone https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1114703 for *too* long I 
guess :/ [14:26:19] <_joe_> Lucas_WMDE: can you +1 koi's patch? [14:26:32] <_joe_> if you think it's good, I have no experience with logos [14:26:39] <_joe_> and I can deploy it given it's time sensitive [14:26:50] yes it will happen very soon [14:26:53] * Lucas_WMDE looks [14:27:10] I don’t know the logos stuff very well either, I assume we have CI that asserts PHP and YAML are in sync [14:27:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:27:26] the SVG… has a PNG embedded :S [14:27:30] but at least no sodipodi junk [14:28:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:28:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:28:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:29:14] !log Imported helm311 | 3.11.3-3 to bullseye-wikimedia - T341984 [14:29:16] Lucas: thanks! [14:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:18] T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984 [14:29:34] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Should be okay to deploy; the embedded PNG in the SVG isn’t super nice but I think for a temporary logo we can live with it. (It’s the “ma" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: 10Stang) [14:29:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500737 (10MoritzMuehlenhoff) [14:30:32] <_joe_> koi: given it's time sensitive, I'm deploying your patch [14:30:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [14:30:57] * Lucas_WMDE afk [14:31:04] _joe_: thanks! 
[14:31:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: 10Stang) [14:31:36] thanks for the help! [14:31:43] <_joe_> np [14:31:45] (03PS1) 10Muehlenhoff: Switch ganeti2028 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1114741 [14:31:48] <_joe_> my patch will wait :) [14:31:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500740 (10ops-monitoring-bot) Draining ganeti2028.codfw.wmnet of running VMs [14:32:30] (03Merged) 10jenkins-bot: zhwiki: Add 2025 CNY celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: 10Stang) [14:33:01] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1114703|zhwiki: Add 2025 CNY celebration logos (T384913)]] [14:33:05] T384913: Requesting temporary logo change for zhwiki - https://phabricator.wikimedia.org/T384913 [14:33:31] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1002.eqiad.wmnet [14:33:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10500746 (10ops-monitoring-bot) depool host wikikube-ctrl1002.eqiad.wmnet by jayme@cumin1002 with r... 
[14:33:45] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: Depooled via sre.k8s.pool-depool-node [14:33:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1002.eqiad.wmnet [14:33:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10500747 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1... [14:34:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [14:34:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10500751 (10phaultfinder) [14:36:08] (03PS2) 10Clément Goubert: kubernetes: Add mw-cron deploy config [puppet] - 10https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962) [14:36:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T384592)', diff saved to https://phabricator.wikimedia.org/P72629 and previous config saved to /var/cache/conftool/dbconfig/20250128-143616-marostegui.json [14:36:21] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:36:28] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10500778 (10Andrew) Thanks @cmooney ! @VRiley-WMF, you can give this another try at your convenience. 
[14:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:42] (03CR) 10Hnowlan: [C:03+1] kubernetes: Add mw-cron deploy config [puppet] - 10https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962) (owner: 10Clément Goubert) [14:37:47] <_joe_> koi: can you check you like how the logo is displayed using the wikimedia-debug extension? [14:37:47] !log oblivian@deploy2002 stang, oblivian: Backport for [[gerrit:1114703|zhwiki: Add 2025 CNY celebration logos (T384913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:38:04] _joe_, sure, looking [14:38:07] (03CR) 10JMeybohm: [C:03+1] admin_ng: add mw-cron namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087212 (owner: 10Clément Goubert) [14:38:17] (03CR) 10Hnowlan: [C:03+1] admin_ng: add mw-cron namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087212 (owner: 10Clément Goubert) [14:38:21] (03CR) 10JMeybohm: [C:03+1] kubernetes: Add mw-cron deploy config [puppet] - 10https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962) (owner: 10Clément Goubert) [14:39:22] (03CR) 10Arthur taylor: [C:03+1] "Looks good to me!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: 10Lucas Werkmeister (WMDE)) [14:39:46] !log upload liberica 0.6 to apt.wm.o (bookworm-wikimedia) [14:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:19] _joe_, tested and LGTM [14:40:34] !log updating to liberica 0.6 in lvs1013 [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:36] (03PS1) 10Elukey: custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114743 [14:41:50] <_joe_> koi: ok proceeding [14:41:53] !log oblivian@deploy2002 stang, oblivian: Continuing with sync [14:42:17] (03CR) 10CI reject: [V:04-1] custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114743 (owner: 10Elukey) [14:42:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd [14:42:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500818 (10ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk type to drbd [14:43:13] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[14:43:50] (03CR) 10Vgutierrez: [C:03+2] liberica: Use libericad instead of liberica binary [puppet] - 10https://gerrit.wikimedia.org/r/1108875 (owner: 10Vgutierrez) [14:44:06] (03PS1) 10Btullis: airflow: Update the default package version [puppet] - 10https://gerrit.wikimedia.org/r/1114745 (https://phabricator.wikimedia.org/T383430) [14:44:12] (03PS1) 10Brouberol: Enable the sidecar-controller in all airflow namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114746 (https://phabricator.wikimedia.org/T384329) [14:44:47] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:44:54] (03PS2) 10Elukey: custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114743 [14:45:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4875/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114745 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [14:45:48] (03CR) 10Elukey: "root@deploy2002:/home/elukey# istioctl-1.15.7 manifest diff /srv/deployment-charts/custom_deploy.d/istio/dse-k8s/config.yaml /tmp/new-dse-" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114743 (owner: 10Elukey) [14:47:32] (03CR) 10Btullis: [C:03+1] Enable the sidecar-controller in all airflow namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114746 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:47:42] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:47:58] (03CR) 10Btullis: [V:03+1 C:03+2] airflow: Update the default package version [puppet] - 10https://gerrit.wikimedia.org/r/1114745 
(https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [14:48:22] (03CR) 10Muehlenhoff: [C:03+2] sre.ganeti.resource-report: Stop logging to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: 10Muehlenhoff) [14:48:41] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114703|zhwiki: Add 2025 CNY celebration logos (T384913)]] (duration: 15m 40s) [14:48:46] T384913: Requesting temporary logo change for zhwiki - https://phabricator.wikimedia.org/T384913 [14:48:52] (03CR) 10Brouberol: [C:03+2] Enable the sidecar-controller in all airflow namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114746 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:49:02] (03PS3) 10Brouberol: airflow: include an envoy mesh sidecar in all the airflow task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) [14:49:05] (03PS6) 10Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) [14:49:56] (03CR) 10Btullis: [C:03+1] airflow: include an envoy mesh sidecar in all the airflow task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:50:14] <_joe_> koi: {{done}} [14:50:23] (03CR) 10Btullis: [C:03+1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:50:23] ty [14:51:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P72630 and previous config saved to /var/cache/conftool/dbconfig/20250128-145123-marostegui.json [14:51:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: 
Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:55] (03PS1) 10Vgutierrez: liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - 10https://gerrit.wikimedia.org/r/1114748 [14:52:10] PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100% [14:52:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114748 (owner: 10Vgutierrez) [14:52:22] (03CR) 10CI reject: [V:04-1] liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - 10https://gerrit.wikimedia.org/r/1114748 (owner: 10Vgutierrez) [14:54:26] (03PS1) 10Elukey: custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114749 [14:55:22] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2013,2036,2088].codfw.wmnet [14:55:31] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10500883 (10ops-monitoring-bot) depool host wikikube-worker[2013,2036,2088].codfw.wmnet by jayme@cumin1002 with... [14:55:38] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikikube-worker[2013,2036,2088].codfw.wmnet with reason: Depooled via sre.k8s.pool-depool-node [14:55:48] (03PS2) 10Vgutierrez: liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - 10https://gerrit.wikimedia.org/r/1114748 [14:57:09] (03CR) 10Ssingh: [C:03+1] "Looks good at a quick glance!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1114748 (owner: 10Vgutierrez) [14:57:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2013,2036,2088].codfw.wmnet [14:57:29] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10500889 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool fo... [14:58:06] (03CR) 10Vgutierrez: [C:03+2] liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - 10https://gerrit.wikimedia.org/r/1114748 (owner: 10Vgutierrez) [14:59:32] (03PS1) 10Urbanecm: [tests] Add ConfigWrapperTest [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114751 (https://phabricator.wikimedia.org/T383905) [14:59:33] (03PS1) 10Urbanecm: Remove BabelCategorizeNamespaces from CommunityConfiguration [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) [14:59:57] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10500895 (10JMeybohm) @Jhancock.wm wikikube-worker[2013,2036,2088].codfw.wmnet have been shut down, lmk when you... 
[15:01:43] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:47] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:42] FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:28] (03PS12) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [15:04:48] (03PS1) 10Elukey: kubernetes: remove ad-hoc CNI config from dse-k8s-worker [puppet] - 10https://gerrit.wikimedia.org/r/1114753 [15:05:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd [15:05:56] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4876/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114753 (owner: 10Elukey) [15:06:13] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: NIC port switch -t T383709 [15:06:17] T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709 [15:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to 
https://phabricator.wikimedia.org/P72631 and previous config saved to /var/cache/conftool/dbconfig/20250128-150630-marostegui.json [15:07:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:21] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10500925 (10RobH) Ok, progress. I had to provide 3 possible call back numbers, so I provided myself as primary, with Papaul and Willy as backup only i... [15:08:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:50] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10500926 (10RobH) > ** Por favor não alterar o título deste email. ** > > Prezado(a) Rob, > > Conforme plano de ação, foi aberto o chamado número 45... 
[15:09:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:30] (03PS1) 10Muehlenhoff: maps_bookworm: Initially disable replication/tile gen timers [puppet] - 10https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) [15:10:48] (03PS2) 10Muehlenhoff: maps_bookworm: Initially disable replication/tile gen timers [puppet] - 10https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) [15:10:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [15:11:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500945 (10ops-monitoring-bot) Draining ganeti2028.codfw.wmnet of running VMs [15:11:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [15:12:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:12:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [15:12:57] FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500950 (10ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk 
type to plain [15:13:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [15:14:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [15:14:49] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500955 (ops-monitoring-bot) Draining ganeti2028.codfw.wmnet of running VMs [15:15:11] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10500956 (MatthewVernon) @ovasileva any update on progress on this, please? I see a bunch of changes (e.g. Incoming -> Freezer) that suggests this is ma... [15:19:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10500977 (phaultfinder) [15:21:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T384592)', diff saved to https://phabricator.wikimedia.org/P72634 and previous config saved to /var/cache/conftool/dbconfig/20250128-152137-marostegui.json [15:21:42] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:21:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance [15:22:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72635 and previous config saved to /var/cache/conftool/dbconfig/20250128-152159-marostegui.json [15:22:26] (Abandoned) Dzahn: gerrit: remove UA-based blocking of some old bots/spiders [puppet] - https://gerrit.wikimedia.org/r/1114442 (owner: Dzahn) [15:24:49] (CR) Clément Goubert: [C:+2] kubernetes: Add mw-cron deploy config [puppet] -
https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962) (owner: Clément Goubert) [15:24:53] (CR) Elukey: [C:+1] maps_bookworm: Initially disable replication/tile gen timers [puppet] - https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:27:29] (CR) Clément Goubert: [C:+2] admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 (owner: Clément Goubert) [15:32:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan) [15:32:59] (CR) Clément Goubert: [C:+2] mediawiki: Add mwcron feature [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert) [15:35:27] jouncebot: nowandnext [15:35:27] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [15:35:27] In 0 hour(s) and 24 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1600) [15:35:52] (CR) Reedy: [C:+2] FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114701 (https://phabricator.wikimedia.org/T384879) (owner: Reedy) [15:35:54] (CR) Reedy: [C:+2] FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114702 (https://phabricator.wikimedia.org/T384879) (owner: Reedy) [15:36:26] (Merged) jenkins-bot: mediawiki: Add mwcron feature [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert) [15:37:49] (CR) Muehlenhoff: [C:+2] maps_bookworm: Initially
disable replication/tile gen timers [puppet] - https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:38:33] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2013 [15:38:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2013 [15:39:11] (Merged) jenkins-bot: admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 (owner: Clément Goubert) [15:39:49] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:40:33] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10501075 (elukey) Followed up with Supermicro to show our results, let's see what they say. [15:41:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:42:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:37] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:45:44] (PS1) Ottomata: beta - EventStreamConfig - Rename hoist_http_headers_to_fields setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1114767 (https://phabricator.wikimedia.org/T382173) [15:46:48] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:47:06] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:47:47] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:47:56] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114768 [15:47:56] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114768 (owner: TrainBranchBot) [15:47:57] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:48:30] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:48:30] !log About to deploy analytics/refinery/source 0.2.57 [15:48:31] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:48:35] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1002.eqiad.wmnet [15:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:36] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl1002.eqiad.wmnet [15:48:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1002.eqiad.wmnet [15:48:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl1002.eqiad.wmnet [15:48:48] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501105 (ops-monitoring-bot) pool host wikikube-ctrl1002.eqiad.wmnet by jayme@cumin1002 with rea... [15:48:50] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501106 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1... [15:49:03] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1003.eqiad.wmnet [15:49:11] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:49:11] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501108 (ops-monitoring-bot) depool host wikikube-ctrl1003.eqiad.wmnet by jayme@cumin1002 with r... [15:49:16] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:49:17] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: Depooled via sre.k8s.pool-depool-node [15:49:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1003.eqiad.wmnet [15:49:34] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501111 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1... [15:49:43] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:50:20] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2036 [15:50:25] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:50:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2036 [15:51:28] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:51:43] (PS1) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [15:52:04] (CR) CI reject: [V:-1] Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:52:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:52:31] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:52:54] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:53:10] (CR) Jelto: [C:+1] "then we should be good to remove the rsa-2048 key from Gerrit as well." [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall) [15:53:19] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:54:18] (PS1) Cathal Mooney: gnmic: use event-value-tag-v2 to improve performance [puppet] - https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384) [15:54:27] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:54:39] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:54:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501134 (phaultfinder) [15:55:47] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:56:01] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:56:26] (PS2) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [15:56:42] (CR) Hashar: [C:+1] "Thanks for the verification Valentin, very much appreciated. I reached out to Jelto we will roll it on our Wednesday morning and and monit" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall) [15:56:52] (CR) CI reject: [V:-1] Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:56:57] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:56:58] (CR) Ottomata: [C:+2] beta - EventStreamConfig - Rename hoist_http_headers_to_fields setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1114767 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata) [15:57:29] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:57:51] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:58:46] !log upload liberica 0.7 to apt.wm.o (bookworm-wikimedia) [15:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:24] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:59:31] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:59:48] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:59:58] (Merged) jenkins-bot: beta - EventStreamConfig - Rename hoist_http_headers_to_fields setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1114767 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata) [16:00:05] jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1600). Please do the needful. [16:00:33] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: NIC port switch -t T383709 [16:00:38] T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709 [16:01:52] (CR) Jelto: nftables: add docker profile and forward chain (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677) (owner: Arnaudb) [16:03:02] !log root@cumin1002 START - Cookbook sre.mysql.pool db1166 gradually with 4 steps - Repooling after rebuild index T384807 [16:03:06] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [16:04:52] (PS3) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [16:04:55] !log root@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1166 gradually with 4 steps - Repooling after rebuild index T384807 [16:05:06] (Merged) jenkins-bot: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114701 (https://phabricator.wikimedia.org/T384879) (owner: Reedy) [16:05:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P72637 and previous config saved to /var/cache/conftool/dbconfig/20250128-160518-marostegui.json [16:05:19] SRE, Infrastructure-Foundations, netops,
Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10501179 (cmooney) I'm very happy to say Karim Radhouani, one of the gnmic devs, has been extremely helpful in response to the github issue I poste... [16:06:03] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1003.eqiad.wmnet [16:06:05] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl1003.eqiad.wmnet [16:06:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1003.eqiad.wmnet [16:06:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl1003.eqiad.wmnet [16:06:09] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501182 (ops-monitoring-bot) pool host wikikube-ctrl1003.eqiad.wmnet by jayme@cumin1002 with rea... [16:06:12] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501183 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1...
[16:07:50] (Merged) jenkins-bot: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114702 (https://phabricator.wikimedia.org/T384879) (owner: Reedy) [16:08:44] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1114701|FormatMetadata: Prevent running preg_match() on null (T384879)]], [[gerrit:1114702|FormatMetadata: Prevent running preg_match() on null (T384879)]] [16:08:46] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [16:08:49] T384879: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384879 [16:08:50] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501198 (JMeybohm) Open→Resolved a:JMeybohm All done, thank @Papaul for your pat...
[16:09:59] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs4010 as a liberica LB [puppet] - https://gerrit.wikimedia.org/r/1113478 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [16:11:12] RECOVERY - MariaDB Replica Lag: s3 #page on db1166 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:11:40] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2088 [16:11:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72638 and previous config saved to /var/cache/conftool/dbconfig/20250128-161143-root.json [16:11:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2088 [16:12:47] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:13:54] !log reedy@deploy2002 reedy: Backport for [[gerrit:1114701|FormatMetadata: Prevent running preg_match() on null (T384879)]], [[gerrit:1114702|FormatMetadata: Prevent running preg_match() on null (T384879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:14:00] T384879: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384879 [16:14:09] !log reedy@deploy2002 reedy: Continuing with sync [16:14:43] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2013,2036,2088].codfw.wmnet [16:14:46] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker[2013,2036,2088].codfw.wmnet [16:14:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker[2013,2036,2088].codfw.wmnet [16:14:49] !log
jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2013,2036,2088].codfw.wmnet [16:14:54] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10501231 (ops-monitoring-bot) pool host wikikube-worker[2013,2036,2088].codfw.wmnet by jayme@cumin1002 with re... [16:14:56] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10501232 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for... [16:15:08] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bookworm [16:15:19] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114768 (owner: TrainBranchBot) [16:15:50] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [16:15:53] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [16:16:29] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:16:38] ^^ expected, lvs4010 is being reimaged [16:16:45] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:25] jouncebot: nowandnext [16:18:25] For the next 0 hour(s) and 41 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1600) [16:18:25] In 0 hour(s) and 41 minute(s): Puppet request window
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1700) [16:18:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72639 and previous config saved to /var/cache/conftool/dbconfig/20250128-161829-marostegui.json [16:18:34] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:18:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2177 T382842', diff saved to https://phabricator.wikimedia.org/P72640 and previous config saved to /var/cache/conftool/dbconfig/20250128-161857-marostegui.json [16:19:02] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2177.codfw.wmnet [16:19:03] T382842: Upgrade to 10.6.20 and rebuild recentchanges and pagelinks tables - https://phabricator.wikimedia.org/T382842 [16:19:53] (CR) Scott French: "Adding Hugh as well. Thanks in advance, Hugh!" [deployment-charts] - https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [16:20:57] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114701|FormatMetadata: Prevent running preg_match() on null (T384879)]], [[gerrit:1114702|FormatMetadata: Prevent running preg_match() on null (T384879)]] (duration: 12m 12s) [16:21:09] T384879: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384879 [16:21:30] (PS4) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [16:22:34] SRE, Commons, MediaWiki-Uploading, Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10501271 (Underbar_dk) I disabled IPv6 and the multiple uploads went through!
I then switched it back on and the uploads also went through no problem.... [16:22:39] (CR) Hnowlan: [C:+1] shellbox-video: 3 codfw replicas on 8.1 (change 1/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [16:24:01] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [16:24:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501277 (phaultfinder) [16:25:38] (CR) Bartosz Dziewoński: [C:+1] Update CentralAuth multi-DC rules for SUL3 [puppet] - https://gerrit.wikimedia.org/r/1114070 (https://phabricator.wikimedia.org/T363695) (owner: Gergő Tisza) [16:25:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: maintenance [16:25:51] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host gerrit2002 [16:25:52] tgr|away: happen to be around?
I think you linked the wrong patch in the puppet window [16:25:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host gerrit2002 [16:26:04] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2177.codfw.wmnet [16:26:45] oh got it, it should be https://gerrit.wikimedia.org/r/1114070 [16:26:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72641 and previous config saved to /var/cache/conftool/dbconfig/20250128-162649-root.json [16:26:57] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Index rebuild [16:29:26] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10501283 (Jhancock.wm) [16:33:33] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [16:33:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P72642 and previous config saved to /var/cache/conftool/dbconfig/20250128-163336-marostegui.json [16:33:59] (PS5) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [16:36:26] (PS1) Clément Goubert: mw-script: Add conftool state to helmfile.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118) [16:37:30] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [16:38:33] (CR) RLazarus: [C:+1] mw-script: Add conftool state to helmfile.yaml
[deployment-charts] - https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118) (owner: Clément Goubert) [16:38:41] (CR) Clément Goubert: [C:+2] mw-script: Add conftool state to helmfile.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118) (owner: Clément Goubert) [16:39:25] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [16:40:03] (Merged) jenkins-bot: mw-script: Add conftool state to helmfile.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118) (owner: Clément Goubert) [16:40:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:40:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:13] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72643 and previous config saved to /var/cache/conftool/dbconfig/20250128-164154-root.json [16:42:04] !log
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:42:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:42:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:42:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:42:54] (PS6) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [16:44:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501321 (phaultfinder) [16:46:46] (PS1) Sergio Gimeno: beta wgEventStreams: set enrich_fields_from_http_headers on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) [16:46:51] (PS1) Vgutierrez: hiera: Fix NIC name on liberica@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1114780 [16:47:13] (CR) CI reject: [V:-1] hiera: Fix NIC name on liberica@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1114780 (owner: Vgutierrez) [16:47:25] (CR) Ssingh: [C:+1] hiera: Fix NIC name on liberica@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1114780 (owner: Vgutierrez) [16:47:43] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [16:47:43] (PS2) Vgutierrez: hiera: Fix NIC name on liberica@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) [16:47:52] (CR) Ssingh: hiera: Fix NIC name on
liberica@ulsfo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [16:48:15] (CR) jenkins-bot: hiera: Fix NIC name on liberica@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [16:48:26] (CR) Ssingh: hiera: Fix NIC name on liberica@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [16:48:39] (CR) Vgutierrez: [C:+2] hiera: Fix NIC name on liberica@ulsfo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [16:48:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P72644 and previous config saved to /var/cache/conftool/dbconfig/20250128-164843-marostegui.json [16:49:25] (PS2) Sergio Gimeno: beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) [16:51:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:51:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:52:33] !log restart kartotherian on maps1009 as test [16:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:38] (PS7) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [16:53:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:53:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:55:50] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114743 (owner: 10Elukey) [16:56:11] (03CR) 10Ottomata: beta wgEventStreams: opt out collecting user agent for HompageVisit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [16:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72645 and previous config saved to /var/cache/conftool/dbconfig/20250128-165700-root.json [16:57:12] (03CR) 10Vgutierrez: [C:03+1] hiera: consolidate haproxykafka into common profile [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [16:57:49] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10501367 (10ovasileva) a:03ovasileva [16:58:00] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10501368 (10ovasileva) a:05ovasileva→03None [16:58:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:59:00] (03CR) 10Btullis: [C:03+1] custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114749 (owner: 10Elukey) [16:59:55] (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1114753 (owner: 10Elukey) [17:00:04] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. 
Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1700). [17:00:04] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:10] o/ [17:01:46] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for cr[1-2]-magru,cr[1-2]-magru IPv6 [17:01:48] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr[1-2]-magru,cr[1-2]-magru IPv6 [17:02:58] o/ [17:03:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72646 and previous config saved to /var/cache/conftool/dbconfig/20250128-170350-marostegui.json [17:03:56] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:04:03] (03PS8) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [17:04:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance [17:04:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T384592)', diff saved to https://phabricator.wikimedia.org/P72647 and previous config saved to /var/cache/conftool/dbconfig/20250128-170412-marostegui.json [17:04:29] tgr|away: just as a heads up, ATS lua is moderately scary and I can't promise we can always do it in the puppet window :) but I chatted with traffic and I think we're all set [17:04:46] plan is I'll stop puppet on cp-text, deploy this to a host or two, we can test, and then deploy it everywhere [17:05:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10501392 (10Jhancock.wm) @elukey hey 
having a little bit of trouble with the provisioning on this one. i know it's a custom model and wanted to see if you had any i... [17:05:42] before we start, will you open an affected url, check the x-cache header, and let me know what it says? [17:05:46] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow3003.esams.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [17:05:56] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10501393 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e5ab529a-1fb4-461d-b85a-a2d5a66a020a) set by cmooney@cumin1002 for 1:00:... [17:06:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:06:42] rzl: ack. 
Let me know if there's a process to follow that would make it easier on the deployer (although I hope we are done with SUL3 puppet patches after this one) [17:06:57] !log stopping puppet on A:cp-text [17:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:47] the ideal thing would just be finding a reviewer on the traffic team who can deploy for you, but also I understand finding reviewers on each team can be challenging [17:08:35] curl -vo/dev/null 'http://auth.wikimedia.org/enwiki/wiki/Special:CentralLogin' |& grep x-cache [17:08:38] < x-cache: cp3070 int [17:08:41] < x-cache-status: int-tls [17:09:00] that's an URL that should go from not forced to primary DC to forced to primary DC [17:09:27] perfect thanks, I'll deploy to cp3070 for testing [17:09:27] will try to do that next time [17:09:37] (and cp4038, which is what I get here in san francisco) [17:09:51] (03CR) 10RLazarus: [C:03+2] Update CentralAuth multi-DC rules for SUL3 [puppet] - 10https://gerrit.wikimedia.org/r/1114070 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza) [17:10:10] tgr|away: https:// and not http:// BTW [17:10:37] on port http:// you're just getting a 301 to https:// [17:10:37] ah thanks, missed that [17:10:46] oops sorry [17:10:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bookworm [17:10:52] (see? 
reviewer on the traffic team) [17:10:56] < x-cache: cp3069 miss, cp3069 pass [17:10:56] < x-cache-status: pass [17:10:59] yeah good point, which is why you see the int-tls there [17:11:01] that'll affect the hashing, can you-- perfect [17:11:19] rzl: feel free to add me or fabfur or brett for the patches and one of us can triage it (and add vg where required) [17:12:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72648 and previous config saved to /var/cache/conftool/dbconfig/20250128-171205-root.json [17:13:46] (puppet's running) [17:13:53] sukhe: 👍 👍 [17:15:32] tgr|away: okay, deployed to cp3069 and cp4039 [17:16:05] I get the same response, not sure if that's good or bad [17:16:40] okay, maybe I should have opened with "is this testable" :) [17:16:56] but it doesn't look like ATS is failing on those hosts now, which is the good news [17:17:12] not easily, all the patch does is influence which URLs always get sent to the primary DC [17:17:29] (03CR) 10Urbanecm: [C:03+1] "lgtm, but let's wait with deployment until the GE counterpart is finalised" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime) [17:18:09] I thought that would be a different-first-digit cp host, but I only have very vague ideas of how multi-DC works [17:18:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T384592)', diff saved to https://phabricator.wikimedia.org/P72649 and previous config saved to /var/cache/conftool/dbconfig/20250128-171814-marostegui.json [17:18:20] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:19:04] *oh* [17:19:20] no, it goes to mediawiki either way, so that'll change the "server" header in the response [17:19:25] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install
cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10501461 (10Jhancock.wm) 05Open→03Resolved [17:19:31] < server: mw-web.eqiad.main-c544b8984-bwnqh [17:19:37] (03PS1) 10Vgutierrez: hiera: Fix BGP peers for liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477) [17:19:40] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10501464 (10Jhancock.wm) [17:19:46] you might have seen it change from mw-web.eqiad... to mw-web.codfw... if this worked, or change from codfw to codfw if you were already there :) [17:19:49] eqiad is the secondary now, right? [17:19:53] yeah [17:20:01] yeah, if it changed from eqiad to eqiad it means this didn't work as intended [17:20:12] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:23:38] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10501476 (10Jhancock.wm) [17:23:52] tgr|away: take your time digging but let me know what you'd like to do -- if we don't end up keeping this, I can roll back on those two hosts and resume puppet on the rest [17:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501477 (10phaultfinder) [17:24:55] thx, let me try a few more requests [17:25:21] sure -- note that they may hash to different hosts that don't have your patch, check the x-cache header [17:26:10] esams is a single_backend DC nowadays [17:26:28] so requests from the same client IP will hit the same ATS instances [17:26:32] *instance [17:26:36] possibly a naive question - `string.find(path, "/wiki/Special:CentralLogin") == 1 ` - that won't work if the wiki name is prefix of the path, right? 
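Note on the `string.find` question above: a minimal Lua sketch of why the `== 1` comparison fails for the test URL. Only `string.find` and the two path strings come from the discussion; the helper names `matches_at_start` and `matches_anywhere`, and the use of the plain-text flag (fourth argument `true`), are assumptions made for illustration and are not the contents of the actual ATS patch.

```lua
-- Hypothetical helpers for illustration; not the code from the puppet patch.
local SUFFIX = "/wiki/Special:CentralLogin"

-- Check as quoted in the channel: true only if the match begins at position 1.
-- The fourth argument (plain = true) disables pattern matching; that flag is
-- an assumption added here so the string is compared literally.
local function matches_at_start(path)
  return string.find(path, SUFFIX, 1, true) == 1
end

-- Relaxed check: true if the substring occurs anywhere in the path.
-- Note that Lua spells "not equal" as ~= rather than !=.
local function matches_anywhere(path)
  return string.find(path, SUFFIX, 1, true) ~= nil
end

-- Path portion of the test URL used in the channel.
local test_path = "/enwiki/wiki/Special:CentralLogin"

print(string.find(test_path, SUFFIX, 1, true)) -- start index is 8, not 1
print(matches_at_start(test_path))             -- false: the wiki name comes first
print(matches_anywhere(test_path))             -- true
print(matches_at_start(SUFFIX))                -- true: no wiki-name prefix
```

Whether a substring match is the right rule at all depends on which URL formats CentralAuth actually produces (index.php-style URLs, localized special page names), which this sketch does not address.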
[17:26:48] i.e., from the test URL above [17:27:11] vgutierrez: doh thanks [17:27:39] swfrench-wmf: you're right [17:27:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:27:54] lua string matching strikes again [17:27:54] that needs to be refactored using a regex with string.match [17:27:58] yeah, just realized that [17:28:05] swfrench-wmf++ [17:28:11] which is annoying because not all queries will have that prefix [17:28:14] gg swfrench-wmf [17:28:50] it could be refactored into string.find(...) != nil [17:29:35] > path = "/enwiki/foo" [17:29:35] > string.find(path, "/foo") [17:29:35] 8 11 [17:30:11] now I wonder if the current code even works. [17:30:32] what if the URL is index.php-style? What if some wiki localizes those special page names? [17:30:54] will have to double-check the MediaWiki code to see if that can happen [17:32:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501502 (10Papaul) you welcome [17:32:23] of course.. `~= nil` rather than `!= nil` :) [17:32:24] rzl: sorry, can we abandon the deploy for now? 
I'll need to read through the CentralAuth code, no point in fixing the string matching if it's not using this URL format [17:33:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P72650 and previous config saved to /var/cache/conftool/dbconfig/20250128-173321-marostegui.json [17:34:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10501511 (10Neobeta61) Ciao, try - { "Target": "/redfish/v1/Systems/{SystemId}/Storage/{StorageId}/Actions/StorageController.ClearFo... [17:35:06] tgr|away: no worries, rolling back [17:35:51] (03PS1) 10RLazarus: Revert "Update CentralAuth multi-DC rules for SUL3" [puppet] - 10https://gerrit.wikimedia.org/r/1114785 [17:36:16] (03PS2) 10Cathal Mooney: gnmic: use event-value-tag-v2 to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384) [17:37:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:38:03] (03CR) 10RLazarus: [C:03+2] Revert "Update CentralAuth multi-DC rules for SUL3" [puppet] - 10https://gerrit.wikimedia.org/r/1114785 (owner: 10RLazarus) [17:39:27] merged, deploying on our two test hosts first just for caution [17:42:43] done, re-enabling puppet [17:42:56] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:43:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:43:52] sukhe, vgutierrez: thanks <3 can tgr|away send you a revised patch directly, without waiting for a puppet window? [17:44:30] can confirm that cp3069 behaves as expected before the patch [17:46:05] (03CR) 10Dzahn: [C:03+2] gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [17:46:05] (03CR) 10Vgutierrez: [C:03+2] hiera: Fix BGP peers for liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:47:02] rzl: yes [17:47:23] thanks! [17:48:08] thanks all! will make a new patch, and make sure the requirements are documented on the PHP side [17:48:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P72651 and previous config saved to /var/cache/conftool/dbconfig/20250128-174828-marostegui.json [17:48:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:49:59] I am removing SSH config option "KexAlgorithms ecdh-sha2-nistp521" from gerrit. this is supposed to be not needed anymore since at least 2022. but in the unlikely event that someone says something.. I would expect it must be about some ancient client.
[17:50:40] there was literally a TODO to remove it once we are on Gerrit 3.6 and MINA 2.8.0 (that's the Gerrit sshd, not openssh) and we are many versions past that [17:51:23] (03PS3) 10Cathal Mooney: gnmic: use event-value-tag-v2 to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384) [17:52:33] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10501556 (10Papaul) replaced 1002 with {F58301781} [17:54:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:57:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10501586 (10Papaul) @MatthewVernon i will take a look at it, thanks [17:59:59] (03CR) 10Sergio Gimeno: beta wgEventStreams: opt out collecting user agent for HompageVisit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [18:00:05] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1800). 
[18:00:17] o/ [18:01:26] (03CR) 10Scott French: [C:03+2] shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [18:02:42] (03Merged) 10jenkins-bot: shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [18:03:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T384592)', diff saved to https://phabricator.wikimedia.org/P72652 and previous config saved to /var/cache/conftool/dbconfig/20250128-180335-marostegui.json [18:03:41] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:03:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:04:04] (03PS8) 10Btullis: mediawiki: Add support for dumps suspended job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: 10Giuseppe Lavagetto) [18:04:10] (03PS8) 10Btullis: mediwiki-dumps-legacy: Create helmfile deployment of a suspended job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114001 (https://phabricator.wikimedia.org/T352650) [18:04:23] !log starting shellbox-constraints pilot on PHP 8.1 (1 replica, eqiad only) - T377038 [18:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:27] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [18:04:32] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [18:05:08] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [18:05:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T383383#10501623 (10phaultfinder) [18:08:07] (03PS1) 10Jasmine: wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) [18:08:22] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10501630 (10Papaul) 05Open→03Resolved a:03Papaul This is complete [18:12:43] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384951 (10phaultfinder) 03NEW [18:13:14] (03CR) 10Kamila Součková: wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [18:15:20] (03CR) 10Scott French: [C:03+2] shellbox-video: 3 codfw replicas on 8.1 (change 1/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [18:16:50] (03Merged) 10jenkins-bot: shellbox-video: 3 codfw replicas on 8.1 (change 1/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [18:17:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [18:17:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T384592)', diff saved to https://phabricator.wikimedia.org/P72653 and previous config saved to /var/cache/conftool/dbconfig/20250128-181729-marostegui.json [18:17:35] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384281#10501677 (10Papaul) 05Open→03Resolved a:03Papaul working on this on T382984 [18:17:35] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:17:39] 
(03CR) 10Dzahn: [C:03+1] "that patch is merged now" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [18:18:10] !log starting shellbox-video pilot on PHP 8.1 (3 replicas, codfw only) - T377038 [18:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:14] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [18:18:21] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [18:18:24] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384415#10501686 (10Papaul) 05Open→03Resolved a:03Papaul Working on this in T382984 [18:21:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [18:21:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10501708 (10elukey) @Neobeta61 Ciao! Grazie :) I tested it but the Action is not available afaics: ` 'Actions': {'Oem': {'#SmcHARAIDC... [18:23:15] (03PS3) 10Ottomata: beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [18:23:53] swfrench-wmf: okay if I deploy a beta only mw config change? [18:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:26:22] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T384892#10501716 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Reseated cable and pinged the managment IP. Seems to be resolved now. 
[18:27:25] swfrench-wmf: i'm going to merge, and do the no-op prod scap when you verify it's okay. ty! [18:27:34] (03CR) 10Ottomata: [C:03+2] beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [18:28:28] (03Merged) 10jenkins-bot: beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [18:28:36] ottomata: thanks for checking. I think I'm at the point in my work where a deployment is unlikely to disrupt anything, so I think you're good to go. [18:32:00] okay thanks! [18:33:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:34:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501723 (10phaultfinder) [18:38:27] 10ops-eqiad, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T384861#10501742 (10VRiley-WMF) @Jgreen It looks like we are having an issue on this connection. Could we plan for a time for us to swap the transceiver? Let us know, thanks!
[18:38:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:38:55] (03PS2) 10Jasmine: wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) [18:39:31] (03CR) 10Jasmine: "ty!" [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [18:44:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501758 (10phaultfinder) [18:54:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:56:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [18:59:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1900) [19:01:03] PROBLEM - Checks that the local airflow scheduler for airflow 
@analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [19:01:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [19:02:03] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [19:06:18] (03PS1) 10Xcollazo: Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) [19:07:36] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [19:10:16] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114791 (https://phabricator.wikimedia.org/T382365) [19:10:18] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114791 
(https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [19:11:00] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114791 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [19:12:06] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:31] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10501811 (10Papaul) [19:12:36] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [19:12:51] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10501814 (10Papaul) 05Open→03Resolved a:03Papaul Complete [19:20:14] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.14 refs T382365 [19:20:18] T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365 [19:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:25:17] (03PS1) 10DLynch: Enable VisualEditor's EditCheck multiple-check mode on testwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658) [19:25:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch) [19:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:36:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [19:44:21] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966 (10RobH) 03NEW [19:45:07] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10501995 (10RobH) a:03bking @bking, As discussed in IRC, assigning this to you for further details on racking restrictions section of racking details. In addition to the a... 
[19:45:25] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10501999 (10RobH) [19:45:52] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10502002 (10RobH) [19:46:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72654 and previous config saved to /var/cache/conftool/dbconfig/20250128-194651-root.json [19:47:06] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:50:00] 06SRE, 06Infrastructure-Foundations, 06Traffic: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10502013 (10Scott_French) Tagging #traffic in hopes that someone (especially with expertise in our DNS configuration) may be able to help advance the request in T381904#10464... [19:51:14] 10ops-eqiad, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T384861#10502019 (10Jgreen) >>! In T384861#10501741, @VRiley-WMF wrote: > @Jgreen It looks like we are having an issue on this connection. Could we plan for a time for us to swap the transceiver? Let us know, thanks! The timin... 
[19:52:34] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10502027 (10RobH) [20:01:24] (03PS1) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) [20:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72655 and previous config saved to /var/cache/conftool/dbconfig/20250128-200157-root.json [20:02:16] (03CR) 10CI reject: [V:04-1] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [20:03:06] (03PS2) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) [20:03:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T384592)', diff saved to https://phabricator.wikimedia.org/P72656 and previous config saved to /var/cache/conftool/dbconfig/20250128-200346-marostegui.json [20:03:51] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [20:04:02] (03CR) 10CI reject: [V:04-1] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [20:04:38] (03PS3) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) [20:05:03] (03PS4) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) [20:09:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:09:35] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:11:33] (03PS1) 10Ottomata: eventgate-analytics - upgrade to v1.10.0 and NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114798 (https://phabricator.wikimedia.org/T383814) [20:15:39] (03Abandoned) 10Gergő Tisza: Add machine-readable markings for SUL3 extension denylist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114351 (owner: 10Gergő Tisza) [20:15:58] (03PS1) 10Bartosz Dziewoński: Fix PHP 7.4 issue [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905) [20:16:10] (03PS1) 10Bartosz Dziewoński: wikimedia/request-timeout: 2.0.1 -> 2.0.2 [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) [20:17:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72659 and previous config saved to /var/cache/conftool/dbconfig/20250128-201702-root.json [20:17:38] (03CR) 10Bartosz Dziewoński: "I don't know if the corresponding core patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1114760 also needs to be backported? 
(I kno" [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [20:18:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [20:18:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P72660 and previous config saved to /var/cache/conftool/dbconfig/20250128-201853-marostegui.json [20:26:46] (03CR) 10Bartosz Dziewoński: "It looks like it should, I see other cases where we've done both, e.g. https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1098581 and htt" [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [20:27:05] (03PS1) 10Bartosz Dziewoński: composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) [20:27:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [20:32:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72661 and previous config saved to /var/cache/conftool/dbconfig/20250128-203207-root.json [20:34:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P72662 and 
previous config saved to /var/cache/conftool/dbconfig/20250128-203400-marostegui.json [20:34:59] (03CR) 10TChin: [C:03+1] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [20:47:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72663 and previous config saved to /var/cache/conftool/dbconfig/20250128-204712-root.json [20:47:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian) [20:49:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T384592)', diff saved to https://phabricator.wikimedia.org/P72664 and previous config saved to /var/cache/conftool/dbconfig/20250128-204907-marostegui.json [20:49:12] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [20:49:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:49:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:49:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72665 and previous config saved to /var/cache/conftool/dbconfig/20250128-204933-marostegui.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T2100). [21:00:04] kemayo, MatmaRex, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] hi [21:00:20] o/ [21:00:37] my patches are not directly testable - i'm backporting them just to unbreak CI for future backports [21:02:01] Hi, I can start deploying if no backport deployer is available [21:03:15] I'll start with Kemayo's patch [21:03:26] 🎉 [21:03:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch) [21:04:34] (03Merged) 10jenkins-bot: Enable VisualEditor's EditCheck multiple-check mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch) [21:05:44] \o/ [21:06:05] all done, now MatmaRex, is it okay to deploy all yours together? [21:06:08] jeena: arlolra is also here for our config change. [21:06:25] 👍 [21:06:32] jeena: yes [21:07:04] our patch shouldn't have any visible effect -- `composer checkDiff` shows no output -- it's just a cleanup. Nevertheless we can smoke test it by checking that parsoid read views is still on/off on those wikis it should be on/off for. 
[21:08:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [21:08:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [21:08:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [21:18:17] (03Merged) 10jenkins-bot: Fix PHP 7.4 issue [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [21:30:53] jeena: For what it's worth, my one patch actually is testable if you want me to stick around to do it before it goes out. [21:31:39] (03Merged) 10jenkins-bot: wikimedia/request-timeout: 2.0.1 -> 2.0.2 [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [21:31:52] Kemayo: your patch was beta-only, right? So no production deployment happened. I think there is a job that updates beta? [21:32:55] jeena: Ah, I actually didn't realize that testwiki was on that sort of update schedule. I suppose I will check back on it in an hour or two and see whether said job has occurred. [21:33:15] (03CR) 10CI reject: [V:04-1] composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [21:34:02] i'm afk for 15 minutes, @subbu and @arlolra are here and can check the config patch when it deploys. 
[21:35:18] Kemayo: I think it's because the file you changed was InitialiseSettings-labs.php [21:36:30] MatmaRex: I'm going to try running tests again on your patch that failed [21:37:04] jeena: It makes sense, I just don't really think of testwiki as being part of the beta cluster. [21:37:46] testwiki isn't part of the betacluster [21:38:04] it's a normal wiki, unless something has changed [21:38:33] Ah, so a deployment would be needed after all? [21:38:51] (03CR) 10Jeena Huneidi: "recheck" [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński) [21:38:54] -labs.php changes will only go to the betacluster [21:39:09] jeena: please do. the failure seems unrelated to the changes, the error message is a database deadlock: https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74/33528/artifact/log/mw-error.log/*view*/ [21:39:10] if you want testwiki you need to update the non-labs file [21:41:23] Hm, I suppose I'm not the only one who was confused about that. There are other uses of testwiki in that -labs file, which is why I thought it was okay. [21:41:42] jeena: If I make a patch to fix that mixup, would you still be able to get it into this window? [21:42:06] Kemayo: that would probably be fine [21:42:26] jeena: okay, one sec [21:44:36] (03PS1) 10DLynch: Move VE EditCheck testwiki enabling into the correct file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114804 (https://phabricator.wikimedia.org/T384658) [21:45:29] jeena: ^ that should do it [21:46:52] Kemayo: are you missing adding testwiki to wgVisualEditorEditCheck? [21:47:15] jeena: It's not needed, since testwiki is a wikipedia and it's already enabled for all those.
[21:47:24] oh ok [21:51:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72666 and previous config saved to /var/cache/conftool/dbconfig/20250128-215109-marostegui.json [21:51:15] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:56:23] !log Deployed refinery-source using jenkins [21:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:01] !log About to deploy analytics/refinery [21:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:34] !log aqu@deploy2002 Started deploy [analytics/refinery@3959b36]: Regular analytics weekly train [analytics/refinery@3959b36b] [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T2200) [22:00:38] !log aqu@deploy2002 Finished deploy [analytics/refinery@3959b36]: Regular analytics weekly train [analytics/refinery@3959b36b] (duration: 02m 03s) [22:00:59] !log aqu@deploy2002 Started deploy [analytics/refinery@3959b36] (thin): Regular analytics weekly train THIN [analytics/refinery@3959b36b] [22:01:20] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1114799|Fix PHP 7.4 issue (T384905)]], [[gerrit:1114800|wikimedia/request-timeout: 2.0.1 -> 2.0.2 (T384905)]], [[gerrit:1114802|composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 (T384905)]] [22:01:25] T384905: Class Flow\Exception\InvalidDataException does not exist / Declaration of Flow\Exception\FlowException::__construct should be compatible with Wikimedia\NormalizedException\NormalizedException::normalizedConstructor - https://phabricator.wikimedia.org/T384905 [22:02:07] !log aqu@deploy2002 Finished deploy [analytics/refinery@3959b36] (thin): Regular analytics weekly train THIN [analytics/refinery@3959b36b] (duration: 01m 08s) [22:02:47] !log aqu@deploy2002 Started 
deploy [analytics/refinery@3959b36] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@3959b36b] [22:03:22] !log aqu@deploy2002 Finished deploy [analytics/refinery@3959b36] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@3959b36b] (duration: 00m 34s) [22:03:31] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:04:31] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:06:14] !log jhuneidi@deploy2002 jhuneidi, matmarex: Backport for [[gerrit:1114799|Fix PHP 7.4 issue (T384905)]], [[gerrit:1114800|wikimedia/request-timeout: 2.0.1 -> 2.0.2 (T384905)]], [[gerrit:1114802|composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 (T384905)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:06:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P72667 and previous config saved to /var/cache/conftool/dbconfig/20250128-220616-marostegui.json [22:06:21] !log jhuneidi@deploy2002 jhuneidi, matmarex: Continuing with sync [22:08:56] !log Deployed refinery-source using jenkins [22:10:22] (03PS5) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [22:11:00] (03PS5) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [22:12:48] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114799|Fix PHP 7.4 issue (T384905)]], [[gerrit:1114800|wikimedia/request-timeout: 2.0.1 -> 2.0.2 (T384905)]], [[gerrit:1114802|composer: 
wikimedia/request-timeout 2.0.1 -> 2.0.2 (T384905)]] (duration: 11m 27s) [22:13:55] cscott: Kemayo I'm going to do both of your config changes together if that's okay [22:14:04] jeena: Fine by me! [22:14:56] cscott may still be away, but go for it [22:15:00] cool thanks [22:15:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114804 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch) [22:15:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian) [22:15:20] thanks for deploying jeena [22:15:50] you're welcome! sorry i messed up with the recheck thing thinking it would gate-and-submit again [22:15:58] (03Merged) 10jenkins-bot: Move VE EditCheck testwiki enabling into the correct file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114804 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch) [22:16:00] (03Merged) 10jenkins-bot: Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: 10C. 
Scott Ananian) [22:16:12] (03PS1) 10Aqu: Refine: Bump jar version to 0.2.49.3 [puppet] - 10https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) [22:16:31] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1114804|Move VE EditCheck testwiki enabling into the correct file (T384658)]], [[gerrit:1114425|Condense wikivoyage configuration options for Parsoid Read Views (T365367)]] [22:19:30] !log jhuneidi@deploy2002 jhuneidi, cscott, kemayo: Backport for [[gerrit:1114804|Move VE EditCheck testwiki enabling into the correct file (T384658)]], [[gerrit:1114425|Condense wikivoyage configuration options for Parsoid Read Views (T365367)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:20:24] jeena: Tested mine on 2002, and it looks good. [22:20:51] arlolra: ready for you to test [22:21:01] ok, one sec [22:21:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P72670 and previous config saved to /var/cache/conftool/dbconfig/20250128-222123-marostegui.json [22:22:55] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: 10Aqu) [22:23:32] jeena: seems good [22:23:42] !log jhuneidi@deploy2002 jhuneidi, cscott, kemayo: Continuing with sync [22:26:31] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:26:31] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10502550 (10phaultfinder) [22:30:19] !log 
jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114804|Move VE EditCheck testwiki enabling into the correct file (T384658)]], [[gerrit:1114425|Condense wikivoyage configuration options for Parsoid Read Views (T365367)]] (duration: 13m 48s) [22:30:25] T384658: Conduct pre-deployment QA of showing multiple Reference Checks in a given edit - https://phabricator.wikimedia.org/T384658 [22:30:25] T365367: [EPIC] Deploy Parsoid Read Views for English Wikivoyage and Hebrew Wikivoyage - https://phabricator.wikimedia.org/T365367 [22:31:04] (03CR) 10Ottomata: [C:03+1] Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo) [22:31:48] thanks jeena [22:32:15] 👍 [22:32:28] backport window completed [22:35:30] (03PS6) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [22:35:55] (03PS6) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [22:36:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72672 and previous config saved to /var/cache/conftool/dbconfig/20250128-223630-marostegui.json [22:36:36] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [22:36:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:36:49] jeena: Thanks! And sorry about the misunderstanding leading to extra work. 
[22:36:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72673 and previous config saved to /var/cache/conftool/dbconfig/20250128-223652-marostegui.json [22:37:18] (Also, p858snake|cloud, thanks for letting me know what I'd misunderstood.) [22:37:37] (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe) [22:37:57] Kemayo: no worries, glad we could get it sorted, and yeah thanks p858snake|cloud for a better explanation :) [22:38:14] (03CR) 10CI reject: [V:04-1] [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) (owner: 10Raymond Ndibe) [22:39:02] thanks jeena, arlolra ! [22:42:53] (03PS7) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [22:46:49] (03PS7) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [23:12:07] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72675 and previous config saved to /var/cache/conftool/dbconfig/20250128-233130-marostegui.json [23:31:35] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [23:37:55] (03PS8) 10Raymond Ndibe: [toolforge::harbor] 
use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [23:39:57] (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe) [23:45:09] (03PS9) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [23:46:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P72676 and previous config saved to /var/cache/conftool/dbconfig/20250128-234637-marostegui.json [23:47:07] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:47:09] (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe) [23:49:31] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:52:13] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10502663 (10cmooney) I was able to run a manual 
poller command with the updated 'lmns' command and it shows errors pro... [23:53:04] (03PS10) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [23:53:26] (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe) [23:55:07] (03PS11) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [23:56:44] (03PS2) 10Scott French: Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) [23:56:44] (03CR) 10Scott French: "Thanks in advance for the review! I plan to move forward with this during the one-off infra window I've scheduled for Wednesday at 16:00 U" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [23:59:15] (03CR) 10Xcollazo: "@tchin@wikimedia.org can you please merge? I don't have +2 in this repo." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)