[00:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0000)
[00:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:00] <wikibugs>	 (CR) Jeena Huneidi: "recheck" [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114445 (owner: TrainBranchBot)
[00:30:02] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114445 (owner: TrainBranchBot)
[00:38:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114475
[00:38:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114475 (owner: TrainBranchBot)
[01:02:06] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114475 (owner: TrainBranchBot)
[01:05:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10499341 (Papaul) @VRiley-WMF not yet we have to work on this tomorrow.
[01:08:25] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114478
[01:08:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114478 (owner: TrainBranchBot)
[01:09:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10499343 (phaultfinder)
[01:17:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on ms-fe1014 - https://phabricator.wikimedia.org/T384297#10499346 (Papaul) Open→Resolved a:Papaul closing this since we have T384317
[01:21:26] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114478 (owner: TrainBranchBot)
[01:30:49] <wikibugs>	 (CR) Bartosz Dziewoński: "Build failure is unrelated, caused by build failure on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1112730, which prevented it from " [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114478 (owner: TrainBranchBot)
[01:40:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10499352 (Papaul) Create Dispatch: Service Tag: JJ3ZWP3
[02:08:12] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.14 [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114483 (https://phabricator.wikimedia.org/T382365)
[02:08:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.14 [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114483 (https://phabricator.wikimedia.org/T382365) (owner: TrainBranchBot)
[02:19:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[02:20:09] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[02:25:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.14 [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114483 (https://phabricator.wikimedia.org/T382365) (owner: TrainBranchBot)
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10499403 (phaultfinder)
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0300)
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:38] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:19:23] <wikibugs>	 ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T384892 (phaultfinder) NEW
[03:23:38] <jinxer-wm>	 FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:28:38] <jinxer-wm>	 RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:38:41] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (203889s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:46:38] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[04:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0400)
[04:02:09] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114488 (https://phabricator.wikimedia.org/T382365)
[04:02:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114488 (https://phabricator.wikimedia.org/T382365) (owner: TrainBranchBot)
[04:02:56] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.44.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114488 (https://phabricator.wikimedia.org/T382365) (owner: TrainBranchBot)
[04:03:24] <logmsgbot>	 !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.14  refs T382365
[04:03:28] <stashbot>	 T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365
[04:11:15] <wikibugs>	 (CR) AikoChou: [C:+2] "Thanks for the review!!" [deployment-charts] - https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172) (owner: AikoChou)
[04:12:46] <wikibugs>	 (Merged) jenkins-bot: ml-services: update reference-quality storage uri [deployment-charts] - https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172) (owner: AikoChou)
[04:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:53] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[04:50:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 218, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:58:24] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[05:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0500)
[05:03:02] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.14  refs T382365 (duration: 59m 38s)
[05:03:06] <stashbot>	 T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365
[05:04:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:27] <logmsgbot>	 !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.11 (duration: 06m 25s)
[05:11:52] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[05:17:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:59] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:35:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:35:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:41:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:41:59] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:02:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:03:01] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:04] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[06:12:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205 T384807', diff saved to https://phabricator.wikimedia.org/P72555 and previous config saved to /var/cache/conftool/dbconfig/20250128-061230-marostegui.json
[06:12:35] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[06:16:18] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Index rebuild
[06:16:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:42] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1175.eqiad.wmnet
[06:19:25] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2205.codfw.wmnet
[06:25:17] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1175.eqiad.wmnet
[06:25:24] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2205.codfw.wmnet
[06:25:52] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Index rebuild
[06:25:56] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Index rebuild
[06:28:27] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:28:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175 T384807', diff saved to https://phabricator.wikimedia.org/P72556 and previous config saved to /var/cache/conftool/dbconfig/20250128-062846-marostegui.json
[06:28:53] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[06:33:27] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:39:57] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:40:01] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:50:13] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0700).
[07:08:38] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:21:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2213.codfw.wmnet with reason: Maintenance
[07:25:16] <wikibugs>	 (PS1) Marostegui: es1024: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1114640 (https://phabricator.wikimedia.org/T384820)
[07:26:01] <wikibugs>	 (CR) Marostegui: [C:+2] es1024: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1114640 (https://phabricator.wikimedia.org/T384820) (owner: Marostegui)
[07:27:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es1024 from dbctl T384820', diff saved to https://phabricator.wikimedia.org/P72557 and previous config saved to /var/cache/conftool/dbconfig/20250128-072707-root.json
[07:27:13] <stashbot>	 T384820: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820
[07:29:44] <wikibugs>	 (PS1) Marostegui: mariadb: Remove es1024 [puppet] - https://gerrit.wikimedia.org/r/1114642 (https://phabricator.wikimedia.org/T384820)
[07:29:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1024.eqiad.wmnet
[07:30:53] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Remove es1024 [puppet] - https://gerrit.wikimedia.org/r/1114642 (https://phabricator.wikimedia.org/T384820) (owner: Marostegui)
[07:34:41] <wikibugs>	 (PS1) Marostegui: Revert "db2203: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114643
[07:35:05] <wikibugs>	 (PS1) Marostegui: Revert "db2207,db2148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114644
[07:35:19] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2203: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114643 (owner: Marostegui)
[07:35:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[07:35:52] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2207,db2148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114644 (owner: Marostegui)
[07:37:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:01] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:46:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:47:25] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2020.codfw.wmnet with reason: remove from cluster for reimage
[07:47:33] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499612 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e9f62dcb-2ecf-4d32-84ca-34c181e86093) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[07:48:01] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 77%, RTA = 0.32 ms
[07:50:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[07:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[07:51:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[07:51:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:51:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1024.eqiad.wmnet
[07:52:14] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10499614 (Marostegui) a:Marostegui→None
[07:52:25] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10499619 (Marostegui)
[07:52:46] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10499621 (Marostegui) This is ready for #dc-ops
[07:54:25] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:54:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2020.codfw.wmnet with OS bookworm
[07:54:57] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499624 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bookworm
[07:56:12] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[07:56:30] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:56:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72558 and previous config saved to /var/cache/conftool/dbconfig/20250128-075636-marostegui.json
[07:56:42] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[07:56:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
[07:57:02] <wikibugs>	 (CR) Slyngshede: [C:+2] Upgrade to CAS 7.1 [dns] - https://gerrit.wikimedia.org/r/1114388 (owner: Slyngshede)
[07:57:11] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[07:57:13] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499629 (ops-monitoring-bot) Draining ganeti2026.codfw.wmnet of running VMs
[07:58:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2155 T382842', diff saved to https://phabricator.wikimedia.org/P72559 and previous config saved to /var/cache/conftool/dbconfig/20250128-075857-marostegui.json
[07:59:00] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[07:59:03] <stashbot>	 T382842: Upgrade to 10.6.20 and rebuild recentchanges and pagelinks tables - https://phabricator.wikimedia.org/T382842
[07:59:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
[07:59:25] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM, but let's just have Moritz confirm that we're not actually using this. My memory is that this is for a previous issue on hardware we" [puppet] - https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: Filippo Giunchedi)
[07:59:36] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2155.codfw.wmnet
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2186-2187].codfw.wmnet with reason: Index rebuild + upgrade
[08:00:33] <wikibugs>	 (PS1) Marostegui: db2155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114646 (https://phabricator.wikimedia.org/T382842)
[08:01:54] <wikibugs>	 (CR) Marostegui: [C:+2] db2155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114646 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[08:01:59] <wikibugs>	 (PS1) Slyngshede: Revert "Upgrade to CAS 7.1" [dns] - https://gerrit.wikimedia.org/r/1114647
[08:04:28] <wikibugs>	 (CR) Slyngshede: [C:+2] Revert "Upgrade to CAS 7.1" [dns] - https://gerrit.wikimedia.org/r/1114647 (owner: Slyngshede)
[08:04:36] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[08:06:25] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[08:06:52] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2155.codfw.wmnet
[08:07:30] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Index rebuild
[08:09:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72560 and previous config saved to /var/cache/conftool/dbconfig/20250128-080945-marostegui.json
[08:09:51] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:13:10] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "We retire the check at this point: This was introduced to catch cases where the microcode updates to fix L1TF, SSBD and MDS were not corre" [puppet] - https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: Filippo Giunchedi)
[08:14:44] <wikibugs>	 (PS2) Urbanecm: [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551)
[08:14:48] <wikibugs>	 (CR) Urbanecm: [C:+2] [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) (owner: Urbanecm)
[08:15:30] <wikibugs>	 (Merged) jenkins-bot: [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) (owner: Urbanecm)
[08:16:46] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1114379|[Growth] enwiki: Release Add Link to 10% of newcomers (T384551)]]
[08:16:51] <stashbot>	 T384551: Add a link (Structured task): Increase rollout on English Wikipedia to 10% - https://phabricator.wikimedia.org/T384551
[08:17:17] <wikibugs>	 (CR) Muehlenhoff: [C:+1] base: absent check_microcode (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: Filippo Giunchedi)
[08:19:36] <Reedy>	 jouncebot: nowandnext
[08:19:36] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T0800)
[08:19:36] <jouncebot>	 In 2 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1100)
[08:19:57] <wikibugs>	 (PS2) Reedy: SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858) (owner: Jforrester)
[08:19:59] <wikibugs>	 (CR) Fabfur: hiera: enable haproxykafka on codfw [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[08:20:08] <wikibugs>	 (CR) Fabfur: hiera: enable haproxykafka on eqiad [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[08:21:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage
[08:21:29] <wikibugs>	 (CR) Reedy: [C:+2] SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858) (owner: Jforrester)
[08:23:10] <Reedy>	 urbanecm: Are you deploying many patches? :)
[08:23:19] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1114379|[Growth] enwiki: Release Add Link to 10% of newcomers (T384551)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:23:24] <stashbot>	 T384551: Add a link (Structured task): Increase rollout on English Wikipedia to 10% - https://phabricator.wikimedia.org/T384551
[08:23:24] <urbanecm>	 Reedy: no, just this one
[08:23:31] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[08:23:33] <Reedy>	 sweet
[08:23:53] <urbanecm>	 i'll ping you when done :)
[08:24:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
[08:24:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage
[08:24:39] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499648 (ops-monitoring-bot) Draining ganeti2026.codfw.wmnet of running VMs
[08:24:43] <wikibugs>	 (PS2) Reedy: UcfirstOverrides: Fix indenting of comment [mediawiki-config] - https://gerrit.wikimedia.org/r/1093387
[08:24:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P72561 and previous config saved to /var/cache/conftool/dbconfig/20250128-082452-marostegui.json
[08:25:01] <wikibugs>	 (PS2) Reedy: CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/1088655
[08:25:24] <wikibugs>	 (CR) Jelto: [C:+2] "this is configured at firewall level and can be removed from apache" [puppet] - https://gerrit.wikimedia.org/r/1114438 (owner: Dzahn)
[08:25:56] <wikibugs>	 (PS2) Reedy: Disable Dns Blacklist checks [mediawiki-config] - https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987)
[08:26:56] <wikibugs>	 (CR) Filippo Giunchedi: base: absent check_microcode (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: Filippo Giunchedi)
[08:27:00] <wikibugs>	 (PS2) Filippo Giunchedi: base: absent check_microcode [puppet] - https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694)
[08:27:31] <wikibugs>	 (CR) Filippo Giunchedi: base: absent check_microcode (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: Filippo Giunchedi)
[08:29:35] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] thanos: send sigkill as needed to stateless components [puppet] - https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: Filippo Giunchedi)
[08:30:05] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: enable haproxykafka on codfw [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[08:33:33] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114379|[Growth] enwiki: Release Add Link to 10% of newcomers (T384551)]] (duration: 16m 46s)
[08:33:37] <stashbot>	 T384551: Add a link (Structured task): Increase rollout on English Wikipedia to 10% - https://phabricator.wikimedia.org/T384551
[08:35:09] <urbanecm>	 Reedy: over to you!
[08:35:18] <Reedy>	 cheers :)
[08:35:30] <wikibugs>	 (03CR) 10Reedy: [C:03+2] UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093387 (owner: 10Reedy)
[08:35:32] <wikibugs>	 (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088655 (owner: 10Reedy)
[08:35:34] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Disable Dns Blacklist checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987) (owner: 10Reedy)
[08:36:17] <wikibugs>	 (03Merged) 10jenkins-bot: UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093387 (owner: 10Reedy)
[08:36:19] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings.php: Remove deprecated $wgOATHAuthDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088655 (owner: 10Reedy)
[08:36:22] <wikibugs>	 (03Merged) 10jenkins-bot: Disable Dns Blacklist checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108179 (https://phabricator.wikimedia.org/T382987) (owner: 10Reedy)
[08:37:15] <wikibugs>	 (03CR) 10Jelto: [C:04-1] "two of those UserAgents can be found in the access logs. So I'd say let's keep them for now and we can clean that up once this is in reque" [puppet] - 10https://gerrit.wikimedia.org/r/1114442 (owner: 10Dzahn)
[08:37:39] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10499657 (10MoritzMuehlenhoff)
[08:38:01] <wikibugs>	 (03PS2) 10Reedy: noc: Expose MobileUrlCallback.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091818
[08:38:05] <wikibugs>	 (03CR) 10Reedy: [C:03+2] noc: Expose MobileUrlCallback.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091818 (owner: 10Reedy)
[08:38:05] <icinga-wm>	 PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[08:38:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] base: absent check_microcode [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi)
[08:38:49] <icinga-wm>	 PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:17] <icinga-wm>	 PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:17] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:31] <wikibugs>	 (03Merged) 10jenkins-bot: noc: Expose MobileUrlCallback.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091818 (owner: 10Reedy)
[08:39:41] <icinga-wm>	 PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:50] <wikibugs>	 (03PS3) 10Reedy: CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501)
[08:39:58] <wikibugs>	 (03CR) 10Reedy: [C:03+2] CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501) (owner: 10Reedy)
[08:40:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P72562 and previous config saved to /var/cache/conftool/dbconfig/20250128-083959-marostegui.json
[08:40:40] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501) (owner: 10Reedy)
[08:41:23] <wikibugs>	 (03Merged) 10jenkins-bot: SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858) (owner: 10Jforrester)
[08:42:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:29] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1110053|CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons (T383501)]], [[gerrit:1114454|SimpleCaptcha: Don't look up captcha if no ID was given (T384858)]], [[gerrit:1091818|noc: Expose MobileUrlCallback.php]], [[gerrit:1108179|Disable Dns Blacklist checks (T382987)]], [[gerrit:1088655|CommonSettings.php: Remove deprecated $wgOATHAuthDatabase]], [[gerrit:1093387|UcfirstOverrides: Fix indenting of comment]]
[08:43:36] <stashbot>	 T383501: Add language to footer icons - https://phabricator.wikimedia.org/T383501
[08:43:37] <stashbot>	 T384858: PHP Deprecated: strtr(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384858
[08:43:37] <stashbot>	 T382987: Set the default of wgDnsBlacklistUrls to empty - https://phabricator.wikimedia.org/T382987
[08:44:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ganeti2045-ganeti2050 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1114649 (https://phabricator.wikimedia.org/T384838)
[08:47:45] <wikibugs>	 (03PS1) 10Filippo Giunchedi: kartotherian: disable icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1114650 (https://phabricator.wikimedia.org/T321808)
[08:48:17] <logmsgbot>	 !log reedy@deploy2002 reedy, jforrester: Backport for [[gerrit:1110053|CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons (T383501)]], [[gerrit:1114454|SimpleCaptcha: Don't look up captcha if no ID was given (T384858)]], [[gerrit:1091818|noc: Expose MobileUrlCallback.php]], [[gerrit:1108179|Disable Dns Blacklist checks (T382987)]], [[gerrit:1088655|CommonSettings.php: Remove deprecated $wgOATHAuthDatabase]], [[gerrit:1093387|UcfirstOverrides: Fix indenting of comment]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:48:31] <logmsgbot>	 !log reedy@deploy2002 reedy, jforrester: Continuing with sync
[08:48:38] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10499694 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH @RobH Why 2046 onwards? Our highest Ganeti server in codfw is 2044; I've filled in the r...
[08:51:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2020.codfw.wmnet with OS bookworm
[08:51:11] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bookworm completed: - ganeti202...
[08:52:43] <icinga-wm>	 RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.37 ms
[08:52:49] <icinga-wm>	 RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.29 ms
[08:52:49] <icinga-wm>	 RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.83 ms
[08:52:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add ganeti2045-ganeti2050 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1114649 (https://phabricator.wikimedia.org/T384838) (owner: 10Muehlenhoff)
[08:54:41] <icinga-wm>	 RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.81 ms
[08:54:41] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.18 ms
[08:55:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72563 and previous config saved to /var/cache/conftool/dbconfig/20250128-085506-marostegui.json
[08:55:11] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:55:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:55:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T384592)', diff saved to https://phabricator.wikimedia.org/P72564 and previous config saved to /var/cache/conftool/dbconfig/20250128-085528-marostegui.json
[08:56:27] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110053|CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons (T383501)]], [[gerrit:1114454|SimpleCaptcha: Don't look up captcha if no ID was given (T384858)]], [[gerrit:1091818|noc: Expose MobileUrlCallback.php]], [[gerrit:1108179|Disable Dns Blacklist checks (T382987)]], [[gerrit:1088655|CommonSettings.php: Remove deprecated $wgOATHAuthDatabase]], [[gerrit:1093387|UcfirstOverrides: Fix indenting of comment]] (duration: 12m 58s)
[08:56:34] <stashbot>	 T383501: Add language to footer icons - https://phabricator.wikimedia.org/T383501
[08:56:34] <stashbot>	 T384858: PHP Deprecated: strtr(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384858
[08:56:34] <stashbot>	 T382987: Set the default of wgDnsBlacklistUrls to empty - https://phabricator.wikimedia.org/T382987
[08:57:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
[08:57:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: remove obsolete poolcounter icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808)
[08:57:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:00:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036', diff saved to https://phabricator.wikimedia.org/P72565 and previous config saved to /var/cache/conftool/dbconfig/20250128-090000-marostegui.json
[09:00:10] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2036.codfw.wmnet
[09:00:54] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: reimage
[09:00:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:01:03] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:01:46] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114652
[09:02:16] <wikibugs>	 (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1114652 (owner: 10Marostegui)
[09:03:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti2026 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1114653
[09:04:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72566 and previous config saved to /var/cache/conftool/dbconfig/20250128-090439-root.json
[09:04:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2028 to es1 codfw master', diff saved to https://phabricator.wikimedia.org/P72567 and previous config saved to /var/cache/conftool/dbconfig/20250128-090454-marostegui.json
[09:05:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72568 and previous config saved to /var/cache/conftool/dbconfig/20250128-090525-root.json
[09:05:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
[09:05:50] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2036.codfw.wmnet
[09:06:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205', diff saved to https://phabricator.wikimedia.org/P72569 and previous config saved to /var/cache/conftool/dbconfig/20250128-090601-marostegui.json
[09:06:10] <wikibugs>	 (03PS3) 10Cyndywikime: Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714)
[09:06:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P72570 and previous config saved to /var/cache/conftool/dbconfig/20250128-090620-marostegui.json
[09:06:21] <wikibugs>	 (03CR) 10Cyndywikime: Add configurable MinimumTasksPerTopic (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime)
[09:06:34] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1150.eqiad.wmnet with OS bookworm
[09:08:36] <wikibugs>	 06SRE, 07SRE-Unowned, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10499724 (10hashar) 05Open→03Declined I had enough push back that I am not interested in pursuing. I will keep using...
[09:12:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P72571 and previous config saved to /var/cache/conftool/dbconfig/20250128-091242-marostegui.json
[09:13:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72572 and previous config saved to /var/cache/conftool/dbconfig/20250128-091302-root.json
[09:13:16] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Promote es2028 to es1 master [dns] - 10https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905)
[09:13:29] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es2030.codfw.wmnet
[09:13:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/1112171 (owner: 10Muehlenhoff)
[09:16:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: dumps: remove nfs port icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808)
[09:18:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72574 and previous config saved to /var/cache/conftool/dbconfig/20250128-091846-root.json
[09:18:52] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[09:22:35] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2030.codfw.wmnet
[09:22:38] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1150.eqiad.wmnet with reason: host reimage
[09:24:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10499743 (10phaultfinder)
[09:26:22] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: host reimage
[09:26:38] <wikibugs>	 (03CR) 10Fabfur: "tnx for the +1!" [puppet] - 10https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[09:26:40] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[09:27:52] <fabfur>	 !log installing/enabling haproxykafka on codfw (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114415) (T378578)
[09:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:57] <stashbot>	 T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578
[09:28:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72575 and previous config saved to /var/cache/conftool/dbconfig/20250128-092808-root.json
[09:28:57] <wikibugs>	 (03CR) 10Effie Mouzeli: "I am afraid I do not have any useful input here 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup)
[09:33:10] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 90%, RTA = 30.22 ms
[09:33:34] <wikibugs>	 (03PS1) 10Effie Mouzeli: Enroll 2% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845)
[09:33:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72576 and previous config saved to /var/cache/conftool/dbconfig/20250128-093352-root.json
[09:33:58] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[09:34:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72577 and previous config saved to /var/cache/conftool/dbconfig/20250128-093423-root.json
[09:34:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2190', diff saved to https://phabricator.wikimedia.org/P72578 and previous config saved to /var/cache/conftool/dbconfig/20250128-093446-marostegui.json
[09:34:56] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2190.codfw.wmnet
[09:39:34] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[09:39:57] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2190.codfw.wmnet
[09:41:13] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Index rebuild
[09:43:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72580 and previous config saved to /var/cache/conftool/dbconfig/20250128-094313-root.json
[09:48:00] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::docker::firewall: Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1114661
[09:48:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72581 and previous config saved to /var/cache/conftool/dbconfig/20250128-094857-root.json
[09:49:03] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[09:49:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72582 and previous config saved to /var/cache/conftool/dbconfig/20250128-094928-root.json
[09:49:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] Enroll 2% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[09:50:07] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1150.eqiad.wmnet with OS bookworm
[09:50:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T384592)', diff saved to https://phabricator.wikimedia.org/P72583 and previous config saved to /var/cache/conftool/dbconfig/20250128-095032-marostegui.json
[09:50:37] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:51:21] <wikibugs>	 (03PS1) 10Dreamrimmer: Change "$wgUploadMissingFileUrl" for svwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452)
[09:53:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114663 (https://phabricator.wikimedia.org/T383452) (owner: 10Dreamrimmer)
[09:58:06] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10499820 (10LSobanski) p:05Triage→03Medium
[09:58:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72584 and previous config saved to /var/cache/conftool/dbconfig/20250128-095818-root.json
[10:00:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:04-1] "This should be broken down to logical, dependant patches, each with their own commit message detailing the change (adding support for new " [puppet] - 10https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb)
[10:04:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72585 and previous config saved to /var/cache/conftool/dbconfig/20250128-100402-root.json
[10:04:08] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:04:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72586 and previous config saved to /var/cache/conftool/dbconfig/20250128-100434-root.json
[10:05:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P72587 and previous config saved to /var/cache/conftool/dbconfig/20250128-100539-marostegui.json
[10:07:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72588 and previous config saved to /var/cache/conftool/dbconfig/20250128-100754-root.json
[10:10:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:11:06] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:12:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2155 T384807', diff saved to https://phabricator.wikimedia.org/P72589 and previous config saved to /var/cache/conftool/dbconfig/20250128-101224-marostegui.json
[10:12:29] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:13:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72590 and previous config saved to /var/cache/conftool/dbconfig/20250128-101324-root.json
[10:14:58] <wikibugs>	 (03PS1) 10Jelto: Support multiple helm versions [debs/helm3] - 10https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984)
[10:19:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72591 and previous config saved to /var/cache/conftool/dbconfig/20250128-101908-root.json
[10:19:13] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:19:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:19:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72592 and previous config saved to /var/cache/conftool/dbconfig/20250128-101939-root.json
[10:19:42] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] service: Add scheduler_flag field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/1114356 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez)
[10:20:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P72593 and previous config saved to /var/cache/conftool/dbconfig/20250128-102046-marostegui.json
[10:21:34] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10499923 (10MatthewVernon) ` /dev/sda -d scsi # /dev/sda, SCSI device /dev/sdb -d scsi # /dev/sdb, SCSI device /dev/sdc -d scsi # /dev/sdc, SCSI device /dev/sdd -d scsi # /dev/sd...
[10:22:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:23:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:25:11] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] wmflib,pybal: Add scheduler_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez)
[10:28:16] <wikibugs>	 (03CR) 10MVernon: "How does that relate to the nginx and swift-fe services that are being used in confctl to pool/depool these systems,then?" [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez)
[10:28:55] <wikibugs>	 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10499926 (10cmooney) Everything remains stable since the upgrade/reset of the routers yesterday.  All protocol adjacencies, interfaces etc look good as are the gene...
[10:33:55] <wikibugs>	 (03CR) 10Vgutierrez: "the provided configuration sets the mapping between local services (envoy and swift-proxy) with conftool services, so the provided scripts" [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez)
[10:34:27] <wikibugs>	 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10499947 (10Vgutierrez) thanks @cmooney, I'll re-pool the site
[10:34:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72594 and previous config saved to /var/cache/conftool/dbconfig/20250128-103444-root.json
[10:34:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1112224 (https://phabricator.wikimedia.org/T383707) (owner: 10Slyngshede)
[10:35:35] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: no reason specified, T384774]
[10:35:39] <stashbot>	 T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774
[10:35:49] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: no reason specified, T384774]
[10:35:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T384592)', diff saved to https://phabricator.wikimedia.org/P72595 and previous config saved to /var/cache/conftool/dbconfig/20250128-103553-marostegui.json
[10:35:58] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:36:01] <urbanecm>	 jouncebot: nowandnext
[10:36:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[10:36:01] <jouncebot>	 In 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1100)
[10:36:09] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[10:36:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
[10:38:07] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2026.codfw.wmnet with reason: remove from cluster for reimage
[10:38:14] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10499969 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bc2c7bb0-3133-43fd-9040-c01d53f22d8f) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[10:39:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2026 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1114653 (owner: 10Muehlenhoff)
[10:39:50] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: enable haproxykafka on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[10:42:58] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:43:08] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:44:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72596 and previous config saved to /var/cache/conftool/dbconfig/20250128-104415-root.json
[10:44:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72597 and previous config saved to /var/cache/conftool/dbconfig/20250128-104436-root.json
[10:44:41] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:47:10] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v9.1.1 [software/spicerack] - https://gerrit.wikimedia.org/r/1114674
[10:54:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2020.codfw.wmnet to cluster codfw and group B
[10:54:50] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2020.codfw.wmnet to cluster codfw and group B
[10:57:25] <wikibugs>	 (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v9.1.1 [software/spicerack] - https://gerrit.wikimedia.org/r/1114674 (owner: Volans)
[10:59:07] <wikibugs>	 (PS1) Volans: Upstream release v9.1.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1114675
[10:59:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72598 and previous config saved to /var/cache/conftool/dbconfig/20250128-105920-root.json
[10:59:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72599 and previous config saved to /var/cache/conftool/dbconfig/20250128-105942-root.json
[10:59:47] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[11:00:05] <jouncebot>	 effie mouzeli: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC mid-day) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1100).
[11:01:50] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[11:02:18] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[11:02:39] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[11:03:03] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[11:04:36] <moritzm>	 !log installing runc security updates
[11:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jiji@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[11:06:25] <wikibugs>	 (Merged) jenkins-bot: Enroll 2% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114657 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[11:06:55] <logmsgbot>	 !log jiji@deploy2002 Started scap sync-world: Backport for [[gerrit:1114657|Enroll 2% of client sessions in PHP 8.1 (T383845)]]
[11:07:00] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[11:08:38] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:22] <wikibugs>	 (PS1) Btullis: Fix incompatibility between /mnt/hdfs and envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329)
[11:11:33] <logmsgbot>	 !log jiji@deploy2002 jiji: Backport for [[gerrit:1114657|Enroll 2% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:11:50] <logmsgbot>	 !log jiji@deploy2002 jiji: Continuing with sync
[11:12:22] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4873/console" [puppet] - https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[11:13:59] <wikibugs>	 (CR) Volans: [C:+2] Upstream release v9.1.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1114675 (owner: Volans)
[11:14:07] <wikibugs>	 (CR) MVernon: [C:+1] "Thank you for taking the time to explain all this to me again :)" [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez)
[11:14:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72600 and previous config saved to /var/cache/conftool/dbconfig/20250128-111425-root.json
[11:14:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72601 and previous config saved to /var/cache/conftool/dbconfig/20250128-111447-root.json
[11:14:53] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[11:15:58] <wikibugs>	 (PS2) Jelto: Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984)
[11:17:15] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[11:18:44] <logmsgbot>	 !log jiji@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114657|Enroll 2% of client sessions in PHP 8.1 (T383845)]] (duration: 11m 48s)
[11:18:49] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[11:19:00] <volans>	 !log uploaded spicerack_9.1.1 to apt.wikimedia.org bullseye-wikimedia
[11:19:01] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[11:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.eqiad.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:25:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:25:36] <icinga-wm>	 PROBLEM - SSH on prometheus2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:26:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[11:26:26] <icinga-wm>	 RECOVERY - SSH on prometheus2006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:26:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T384592)', diff saved to https://phabricator.wikimedia.org/P72602 and previous config saved to /var/cache/conftool/dbconfig/20250128-112631-marostegui.json
[11:26:37] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[11:26:47] <wikibugs>	 (CR) Clément Goubert: [C:+1] fc-list: update font list [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan)
[11:26:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.eqiad.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:27:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:27:57] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:28:10] <claime>	 here
[11:28:52] <claime>	 !incidents
[11:28:52] <sirenbot>	 5638 (UNACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[11:28:52] <sirenbot>	 5637 (RESOLVED)  [4x] ProbeDown sre (probes/custom eqiad)
[11:28:53] <sirenbot>	 5636 (RESOLVED)  [4x] ProbeDown sre (probes/custom eqiad)
[11:29:01] <claime>	 !ack 5638
[11:29:02] <sirenbot>	 5638 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[11:29:18] <jynus>	 impact: issues loading grafana
[11:29:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72603 and previous config saved to /var/cache/conftool/dbconfig/20250128-112931-root.json
[11:29:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:29:43] <wikibugs>	 (PS2) Btullis: Fix incompatibility between /mnt/hdfs and envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329)
[11:29:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72604 and previous config saved to /var/cache/conftool/dbconfig/20250128-112952-root.json
[11:29:58] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[11:30:31] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4874/co" [puppet] - https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[11:30:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:31:56] <jynus>	 grafana is back
[11:32:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:33:38] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:34:16] <volans>	 !log installed spicerack v9.1.1 on cumin2002
[11:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:50] <volans>	 !log installed spicerack v9.1.1 on cumin1002
[11:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:07] <claime>	 don't know what the trigger is; by the time I get to the host and check, envoy is up and listening
[11:36:08] <jinxer-wm>	 FIRING: UdpIRCStreamThroughput: irc1003:16667 has relayed less than 100 messages over past 5 minutes - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput
[11:38:08] <jynus>	 is that real or is it due to the prometheus down time?
[11:38:31] <claime>	 oomkill for prometheus k8s on prometheus1006
[11:38:38] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:38:38] <claime>	 godog: ^
[11:38:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T384592)', diff saved to https://phabricator.wikimedia.org/P72605 and previous config saved to /var/cache/conftool/dbconfig/20250128-113845-marostegui.json
[11:38:50] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[11:39:16] <jynus>	 maybe a restart, although if it is traffic volume-caused, it won't do much
[11:41:08] <jinxer-wm>	 RESOLVED: UdpIRCStreamThroughput: irc1003:16667 has relayed less than 100 messages over past 5 minutes - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput
[11:41:19] <godog>	 claime: ack thx, will take a look
[11:41:54] <claime>	 /api/v1/series spiked up to over 2min
[11:44:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72606 and previous config saved to /var/cache/conftool/dbconfig/20250128-114436-root.json
[11:44:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72607 and previous config saved to /var/cache/conftool/dbconfig/20250128-114458-root.json
[11:45:03] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[11:45:06] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Fix incompatibility between /mnt/hdfs and envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1114679 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[11:46:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:49:34] <wikibugs>	 (PS1) Slyngshede: Move to CAS 7.1 for debugging [dns] - https://gerrit.wikimedia.org/r/1114689
[11:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[11:51:50] <godog>	 claime: yeah prometheus@k8s-mlserve exploded in memory, I'm looking at https://grafana.wikimedia.org/goto/e5yDMQOHR?orgId=1 and https://grafana.wikimedia.org/goto/Qc6vMwOHR?orgId=1
[11:52:29] <godog>	 I mean other instances too
[11:52:44] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1003.eqiad.wmnet with OS bookworm
[11:53:00] <claime>	 coincides with the end of the deployment of php8.1 on k8s, but 2% of traffic shouldn't cause such an explosion
[11:53:07] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10500134 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host clo...
[11:53:18] <godog>	 also with a ml-serve apply yeah
[11:53:50] <godog>	 I'd think the same re: the php rollout, it can hardly be the cause
[11:53:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P72608 and previous config saved to /var/cache/conftool/dbconfig/20250128-115352-marostegui.json
[11:53:58] <wikibugs>	 (CR) Slyngshede: [C:+2] Move to CAS 7.1 for debugging [dns] - https://gerrit.wikimedia.org/r/1114689 (owner: Slyngshede)
[11:54:15] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[11:56:05] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[12:02:44] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1004.eqiad.wmnet with OS bookworm
[12:05:28] <wikibugs>	 (CR) Vgutierrez: "I think we should provide further documentation, because after merging this CR we won't be able to set single_backend to `false` on the im" [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[12:07:07] <wikibugs>	 (PS1) Reedy: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114701 (https://phabricator.wikimedia.org/T384879)
[12:07:15] <wikibugs>	 (PS1) Reedy: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114702 (https://phabricator.wikimedia.org/T384879)
[12:08:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P72609 and previous config saved to /var/cache/conftool/dbconfig/20250128-120859-marostegui.json
[12:09:37] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage
[12:11:09] <wikibugs>	 (PS1) Stang: zhwiki: Add 2025 CNY celebration logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913)
[12:12:36] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage
[12:12:42] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: disable PSP mutation for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1114423 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[12:18:45] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: Stang)
[12:19:01] <wikibugs>	 SRE, Commons, MediaWiki-Uploading, Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10500234 (Vgutierrez) {F58297395}  These high TTFB values make me suspect some kind of connectivity issue. Could you try to reproduce this behavior o...
[12:19:38] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1004.eqiad.wmnet with reason: host reimage
[12:22:16] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow2003.codfw.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[12:22:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10500289 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=892c37cf-859a-4da6-8f59-c75b5d153219) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and th...
[12:23:14] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1004.eqiad.wmnet with reason: host reimage
[12:23:26] <wikibugs>	 (PS1) Slyngshede: Revert "Move to CAS 7.1 for debugging" [dns] - https://gerrit.wikimedia.org/r/1114705
[12:24:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T384592)', diff saved to https://phabricator.wikimedia.org/P72610 and previous config saved to /var/cache/conftool/dbconfig/20250128-122406-marostegui.json
[12:24:11] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:24:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[12:24:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T384592)', diff saved to https://phabricator.wikimedia.org/P72611 and previous config saved to /var/cache/conftool/dbconfig/20250128-122428-marostegui.json
[12:24:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10500301 (phaultfinder)
[12:25:25] <wikibugs>	 (CR) Slyngshede: [C:+2] Revert "Move to CAS 7.1 for debugging" [dns] - https://gerrit.wikimedia.org/r/1114705 (owner: Slyngshede)
[12:25:32] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[12:25:48] <wikibugs>	 (CR) Btullis: [C:+1] "Good stuff, thanks." [puppet] - https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[12:27:09] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2230.codfw.wmnet with reason: Index rebuild
[12:27:21] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[12:27:32] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] dumps: remove nfs port icinga checks [puppet] - https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[12:27:38] <wikibugs>	 (PS2) Filippo Giunchedi: dumps: remove nfs port icinga checks [puppet] - https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808)
[12:27:59] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2230.codfw.wmnet with reason: Index rebuild
[12:28:21] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] dumps: remove nfs port icinga checks [puppet] - https://gerrit.wikimedia.org/r/1114655 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[12:30:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:30:44] <wikibugs>	 (PS1) Vgutierrez: liberica: Depool on liberica-cp.service stop [puppet] - https://gerrit.wikimedia.org/r/1114708
[12:31:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:32:43] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.pool db2190 gradually with 4 steps - Repooling after rebuild index
[12:32:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500328 (MoritzMuehlenhoff)
[12:32:58] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[12:34:44] <wikibugs>	 (PS1) Marostegui: rebuild_tables.sh Add automatic repooling [software] - https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842)
[12:36:13] <wikibugs>	 (CR) Marostegui: "FYI. This has been tested, just trying to make it less painful to rebuild tables." [software] - https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[12:36:25] <wikibugs>	 (CR) Marostegui: [C:+2] rebuild_tables.sh Add automatic repooling [software] - https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[12:36:51] <wikibugs>	 (Merged) jenkins-bot: rebuild_tables.sh Add automatic repooling [software] - https://gerrit.wikimedia.org/r/1114709 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[12:37:06] <wikibugs>	 (CR) Dbrant: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1114437 (owner: PipelineBot)
[12:37:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T384592)', diff saved to https://phabricator.wikimedia.org/P72614 and previous config saved to /var/cache/conftool/dbconfig/20250128-123706-marostegui.json
[12:37:11] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:37:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166 T382842', diff saved to https://phabricator.wikimedia.org/P72615 and previous config saved to /var/cache/conftool/dbconfig/20250128-123713-marostegui.json
[12:37:19] <stashbot>	 T382842: Upgrade to 10.6.20 and rebuild recentchanges and pagelinks tables - https://phabricator.wikimedia.org/T382842
[12:38:14] <wikibugs>	 (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1114437 (owner: PipelineBot)
[12:38:42] <wikibugs>	 (CR) Dbrant: [C:+2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1114433 (owner: PipelineBot)
[12:39:10] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Index rebuild
[12:40:00] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1114433 (owner: PipelineBot)
[12:40:42] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10500343 (MoritzMuehlenhoff) Open→Resolved After running 0.14.1 for five days, we can confirm this is fixed, disk usage of /var/lib/routinator/repository...
[12:40:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:41:18] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[12:45:09] <wikibugs>	 (CR) JMeybohm: Support multiple helm versions (2 comments) [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: Jelto)
[12:45:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS bookworm
[12:45:22] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500378 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bookworm
[12:49:04] <wikibugs>	 (CR) Elukey: [C:+1] kartotherian: disable icinga check [puppet] - https://gerrit.wikimedia.org/r/1114650 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[12:50:21] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[12:50:21] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1004.eqiad.wmnet with OS bookworm
[12:50:28] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[12:50:29] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1003.eqiad.wmnet with OS bookworm
[12:50:42] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10500383 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw...
[12:51:05] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow3003.esams.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[12:51:11] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10500385 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b04d5bf-ab80-4626-96ba-3c376dfc52c2) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and th...
[12:52:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P72617 and previous config saved to /var/cache/conftool/dbconfig/20250128-125213-marostegui.json
[12:56:27] <wikibugs>	 (CR) Elukey: "I checked https://github.com/Wikia/poolcounter-prometheus-exporter/blob/master/collector.go and afaics it just pulls metrics from poolcoun" [puppet] - https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1300)
[13:02:17] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2026.codfw.wmnet with OS bookworm
[13:02:22] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500439 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bookworm executed with errors:...
[13:02:35] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] kartotherian: disable icinga check [puppet] - https://gerrit.wikimedia.org/r/1114650 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[13:03:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS bookworm
[13:03:16] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500442 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bookworm
[13:03:53] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[13:04:30] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[13:04:52] <wikibugs>	 (PS2) Marostegui: wmnet: Promote es2028 to es1 master [dns] - https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905)
[13:05:09] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[13:06:25] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[13:06:39] <wikibugs>	 (CR) Filippo Giunchedi: "Good point, not AFAIK. The alternative would be to deploy a blackbox exporter, or better yet add 'poolcounter_up' metric to the exporter w" [puppet] - https://gerrit.wikimedia.org/r/1114651 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[13:06:49] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[13:07:15] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[13:07:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P72619 and previous config saved to /var/cache/conftool/dbconfig/20250128-130720-marostegui.json
[13:12:33] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:13:17] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[13:13:32] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[13:14:01] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "(+1, discussed on IRC)" [dns] - https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905) (owner: Marostegui)
[13:14:39] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[13:15:04] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[13:15:46] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[13:15:48] <wikibugs>	 (CR) Federico Ceratto: [C:+2] wmnet: Promote es2028 to es1 master [dns] - https://gerrit.wikimedia.org/r/1114654 (https://phabricator.wikimedia.org/T376905) (owner: Marostegui)
[13:17:48] <wikibugs>	 (CR) Fabfur: liberica: Depool on liberica-cp.service stop (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114708 (owner: Vgutierrez)
[13:17:53] <logmsgbot>	 !log fceratto@dns1004 START - running authdns-update
[13:18:06] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2190 gradually with 4 steps - Repooling after rebuild index
[13:19:39] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1166.eqiad.wmnet
[13:19:50] <logmsgbot>	 !log fceratto@dns1004 END - running authdns-update
[13:20:54] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:21:43] <wikibugs>	 (CR) Marostegui: Revert "db2155: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114652 (owner: Marostegui)
[13:21:44] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2155: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114652 (owner: Marostegui)
[13:22:06] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:22:22] <wikibugs>	 (PS2) Muehlenhoff: sre.ganeti.resource-report: Stop logging to SAL [cookbooks] - https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655)
[13:22:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T384592)', diff saved to https://phabricator.wikimedia.org/P72622 and previous config saved to /var/cache/conftool/dbconfig/20250128-132227-marostegui.json
[13:22:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[13:22:32] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:22:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72623 and previous config saved to /var/cache/conftool/dbconfig/20250128-132238-marostegui.json
[13:23:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2026.codfw.wmnet with reason: host reimage
[13:23:34] <wikibugs>	 (CR) Fabfur: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1114708 (owner: Vgutierrez)
[13:25:41] <wikibugs>	 (PS3) Jelto: Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984)
[13:26:13] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1166.eqiad.wmnet
[13:26:20] <wikibugs>	 (CR) Jelto: Support multiple helm versions (2 comments) [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: Jelto)
[13:26:44] <wikibugs>	 (PS1) FNegri: alertmanager: fix WMCS email address [puppet] - https://gerrit.wikimedia.org/r/1114723
[13:27:11] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 #page on db1166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2541.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:27:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2026.codfw.wmnet with reason: host reimage
[13:27:52] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Index rebuild
[13:27:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] alertmanager: fix WMCS email address [puppet] - https://gerrit.wikimedia.org/r/1114723 (owner: FNegri)
[13:28:24] <wikibugs>	 (CR) FNegri: [C:+2] alertmanager: fix WMCS email address [puppet] - https://gerrit.wikimedia.org/r/1114723 (owner: FNegri)
[13:28:29] <claime>	 !incidents
[13:28:29] <sirenbot>	 5639 (ACKED)  db1166 (paged)/MariaDB Replica Lag: s3 (paged)
[13:28:29] <sirenbot>	 5638 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[13:28:29] <sirenbot>	 5637 (RESOLVED)  [4x] ProbeDown sre (probes/custom eqiad)
[13:28:29] <sirenbot>	 5636 (RESOLVED)  [4x] ProbeDown sre (probes/custom eqiad)
[13:29:32] <claime>	 should I depool it?
[13:29:59] <marostegui>	 claime: No, downtime expired!
[13:30:01] <marostegui>	 Sorry :(
[13:30:03] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: enable haproxykafka on eqiad [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[13:30:03] <marostegui>	 It is depooled
[13:30:07] <claime>	 ah cool
[13:30:08] <claime>	 happens
[13:30:14] <claime>	 back to lunch then :p
[13:33:30] <fabfur>	 !log installing/enabling haproxykafka on eqiad (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114417) (T378578)
[13:33:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:35] <stashbot>	 T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578
[13:37:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72624 and previous config saved to /var/cache/conftool/dbconfig/20250128-133701-marostegui.json
[13:37:07] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:38:38] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:13] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:39:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:40:52] <wikibugs>	 (CR) JMeybohm: Support multiple helm versions (2 comments) [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: Jelto)
[13:42:57] <wikibugs>	 (CR) Jelto: Support multiple helm versions (2 comments) [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: Jelto)
[13:43:38] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:44:01] <wikibugs>	 (PS4) Jelto: Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984)
[13:45:50] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Enable mul language code on Wikidata (full release) [mediawiki-config] - https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176)
[13:46:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: Lucas Werkmeister (WMDE))
[13:47:06] <wikibugs>	 (PS1) Fabfur: hiera: consolidate haproxykafka into common profile [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931)
[13:47:28] <wikibugs>	 (CR) CI reject: [V:-1] hiera: consolidate haproxykafka into common profile [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) (owner: Fabfur)
[13:49:18] <wikibugs>	 (PS2) Fabfur: hiera: consolidate haproxykafka into common profile [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931)
[13:49:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2026.codfw.wmnet with OS bookworm
[13:49:30] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500606 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bookworm completed: - ganeti202...
[13:49:44] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) (owner: Fabfur)
[13:49:59] <wikibugs>	 (CR) JMeybohm: [C:+1] Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: Jelto)
[13:50:22] <wikibugs>	 (CR) Arnaudb: [C:-1] "Sure! I broke this patch down in 4 commits: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114726/1  the core firewall modification" [puppet] - https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: Arnaudb)
[13:50:26] <wikibugs>	 (Abandoned) Arnaudb: gitlab_runner: migrate ferm rules to nftables [puppet] - https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: Arnaudb)
[13:50:38] <wikibugs>	 (PS2) Vgutierrez: liberica: Depool on liberica-cp.service stop [puppet] - https://gerrit.wikimedia.org/r/1114708
[13:50:41] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:51:02] <wikibugs>	 (PS1) Arnaudb: gitlab_runner: add nftables logic [puppet] - https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677)
[13:51:14] <wikibugs>	 (CR) Fabfur: [C:+1] liberica: Depool on liberica-cp.service stop (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114708 (owner: Vgutierrez)
[13:51:15] <wikibugs>	 (PS3) Arnaudb: nftables: add nftable docker manifest [puppet] - https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677)
[13:51:17] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:51:27] <wikibugs>	 (PS2) Arnaudb: nftables: add types and directories [puppet] - https://gerrit.wikimedia.org/r/1114717 (https://phabricator.wikimedia.org/T370677)
[13:51:36] <wikibugs>	 (PS2) Arnaudb: nftables: add docker profile and forward chain [puppet] - https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677)
[13:52:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P72625 and previous config saved to /var/cache/conftool/dbconfig/20250128-135208-marostegui.json
[13:57:57] <wikibugs>	 (CR) Vgutierrez: [C:+2] liberica: Depool on liberica-cp.service stop [puppet] - https://gerrit.wikimedia.org/r/1114708 (owner: Vgutierrez)
[13:58:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1400).
[14:00:05] <jouncebot>	 Daimona, DreamRimmer, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:41] <Lucas_WMDE>	 I have a meeting in half an hour, but I could deploy until then if nobody else is around…
[14:00:48] <Daimona>	 o/
[14:00:53] <koi>	 o/
[14:00:59] <wikibugs>	 (CR) Jelto: [C:+2] Support multiple helm versions [debs/helm3] - https://gerrit.wikimedia.org/r/1114666 (https://phabricator.wikimedia.org/T341984) (owner: Jelto)
[14:01:56] <wikibugs>	 (PS1) Brouberol: dse-k8s-eqiad: deploy the sidecar job controller [deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329)
[14:02:47] <Lucas_WMDE>	 let’s start with Daimona then
[14:02:50] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: Daimona Eaytoy)
[14:03:54] <wikibugs>	 (Merged) jenkins-bot: prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: Daimona Eaytoy)
[14:04:12] <cmelo>	 o/
[14:04:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114440|prod: Enable $wgCampaignEventsEnableEventTopics (T380818)]]
[14:04:30] <stashbot>	 T380818: Enable the event topics feature in production - https://phabricator.wikimedia.org/T380818
[14:06:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
[14:07:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P72626 and previous config saved to /var/cache/conftool/dbconfig/20250128-140715-marostegui.json
[14:09:02] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10500645 (RobH) a:RobH→MoritzMuehlenhoff >>! In T384838#10499694, @MoritzMuehlenhoff wrote: > @RobH Why 2046 onwards? Our highest Ganeti serve...
[14:09:06] <wikibugs>	 (PS2) Brouberol: dse-k8s-eqiad: deploy the sidecar job controller [deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329)
[14:09:18] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10500647 (RobH)
[14:09:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1114440|prod: Enable $wgCampaignEventsEnableEventTopics (T380818)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:09:31] <stashbot>	 T380818: Enable the event topics feature in production - https://phabricator.wikimedia.org/T380818
[14:09:33] <Lucas_WMDE>	 Daimona: can you test on mwdebug?
[14:09:42] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10500652 (RobH)
[14:09:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2026.codfw.wmnet to cluster codfw and group D
[14:09:54] * Lucas_WMDE sees a lot of tracing errors in mwdebug logstash
[14:10:32] <Daimona>	 Testing
[14:11:25] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2026.codfw.wmnet to cluster codfw and group D
[14:15:12] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:15:16] <wikibugs>	 (PS10) Clément Goubert: admin_ng: add mwcron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212
[14:15:41] <wikibugs>	 (PS11) Clément Goubert: admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212
[14:15:50] <wikibugs>	 (CR) Vgutierrez: [C:-1] "change looks good, commit needs to be fixed (see inline comment)" [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T377931) (owner: Fabfur)
[14:16:19] <Lucas_WMDE>	 Daimona: out of interest, do you know if cmelo was also here for the CampaignEvents config change or something else? ^^
[14:16:21] <wikibugs>	 (CR) Brouberol: [C:+2] dse-k8s-eqiad: deploy the sidecar job controller [deployment-charts] - https://gerrit.wikimedia.org/r/1114732 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:16:22] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10500680 (cmooney) >>! In T380893#10396432, @Andrew wrote: > These hosts have a somewhat unusual vlan setup, so my guess is something i...
[14:16:44] <Lucas_WMDE>	 (he quit before I could ask – SAL / deployments archive suggests he works roughly in this area, as far as I understand it anyway ^^)
[14:16:57] <Daimona>	 Lucas_WMDE: you can go ahead. We found a DB error, seems like a recent schema change has not been applied. It's unrelated to the current config change though.
[14:17:05] <Lucas_WMDE>	 hm
[14:17:06] <Lucas_WMDE>	 ok ^^
[14:17:14] <Lucas_WMDE>	 lemme just peek at logstash real quick
[14:17:17] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10500684 (cmooney) a:cmooney→None
[14:17:27] <Lucas_WMDE>	 (and filter out the damn tracing channel spam)
[14:17:36] <Daimona>	 Yep, we're in the same call, testing together :)
[14:17:39] <Lucas_WMDE>	 “Unknown column 'event_is_test_event' in 'field list'”
[14:17:39] <Lucas_WMDE>	 ok :)
[14:17:55] <Lucas_WMDE>	 there’s also some PHP Notice: Undefined property: stdClass::$event_is_test_event
[14:17:57] <Lucas_WMDE>	 is that known?
[14:18:00] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:18:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:18:10] <Lucas_WMDE>	 oh, right
[14:18:14] <Lucas_WMDE>	 that’ll be the same error
[14:18:22] <Lucas_WMDE>	 'event_is_test_event' column missing from a result set
[14:18:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync
[14:18:43] <Lucas_WMDE>	 ok then let’s try it
[14:19:42] <Daimona>	 Yep, same thing. Trying to figure out why the column doesn't exist in prod.
[14:20:06] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10500691 (MatthewVernon) @Papaul is this host likely to get some attention soon, please?
[14:21:08] <jelto>	 !log Imported helm311 | 3.11.3-3 to bookworm-wikimedia - T341984
[14:21:10] <wikibugs>	 (PS1) Brouberol: dse-k8s-eqiad: create the sidecar-controller ns [deployment-charts] - https://gerrit.wikimedia.org/r/1114738 (https://phabricator.wikimedia.org/T384329)
[14:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:12] <stashbot>	 T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984
[14:21:25] <_joe_>	 Lucas_WMDE: once you're done, let me know
[14:21:38] <wikibugs>	 (CR) Muehlenhoff: sre.ganeti.resource-report: Stop logging to SAL (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: Muehlenhoff)
[14:22:13] <wikibugs>	 (PS3) Fabfur: hiera: consolidate haproxykafka into common profile [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578)
[14:22:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72627 and previous config saved to /var/cache/conftool/dbconfig/20250128-142222-marostegui.json
[14:22:26] <Lucas_WMDE>	 _joe_: I’ll have to stop after this deployment anyway, meeting coming up
[14:22:27] <wikibugs>	 (CR) Fabfur: hiera: consolidate haproxykafka into common profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[14:22:28] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:22:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[14:22:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T384592)', diff saved to https://phabricator.wikimedia.org/P72628 and previous config saved to /var/cache/conftool/dbconfig/20250128-142244-marostegui.json
[14:22:52] <_joe_>	 Lucas_WMDE: so I can merge a patch of mine instead? :)
[14:23:08] <Lucas_WMDE>	 if you think it’s more important than the other schedule changes, I guess? ^^
[14:23:10] <Lucas_WMDE>	 up to you :P
[14:23:20] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10500704 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→None >>! In T384838#10500645, @RobH wrote: >>>! In T384838#10499694, @MoritzMuehlenhoff wrote...
[14:23:36] <Lucas_WMDE>	 (idk if anyone else would volunteer to deploy those otherwise, I didn’t see anyone else speak up at the beginning of the window but I might have missed it)
[14:23:37] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, I've also tested it with test-cookbook" [cookbooks] - https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: Muehlenhoff)
[14:23:55] <wikibugs>	 (CR) Btullis: [C:+1] dse-k8s-eqiad: create the sidecar-controller ns [deployment-charts] - https://gerrit.wikimedia.org/r/1114738 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:23:58] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10500707 (MoritzMuehlenhoff)
[14:24:29] <koi>	 :(
[14:25:28] * Lucas_WMDE looks up when CNY 2025 starts/ends
[14:25:29] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114440|prod: Enable $wgCampaignEventsEnableEventTopics (T380818)]] (duration: 21m 04s)
[14:25:30] <wikibugs>	 (CR) Brouberol: [C:+2] dse-k8s-eqiad: create the sidecar-controller ns [deployment-charts] - https://gerrit.wikimedia.org/r/1114738 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:25:34] <stashbot>	 T380818: Enable the event topics feature in production - https://phabricator.wikimedia.org/T380818
[14:25:36] <Lucas_WMDE>	 _joe_: I’m done
[14:26:01] <Lucas_WMDE>	 29 January… so preferably we shouldn’t postpone https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1114703 for *too* long I guess :/
[14:26:19] <_joe_>	 Lucas_WMDE: can you +1 koi's patch?
[14:26:32] <_joe_>	 if you think it's good, I have no experience with logos
[14:26:39] <_joe_>	 and I can deploy it given it's time sensitive
[14:26:50] <koi>	 yes it will happen very soon
[14:26:53] * Lucas_WMDE looks
[14:27:10] <Lucas_WMDE>	 I don’t know the logos stuff very well either, I assume we have CI that asserts PHP and YAML are in sync
[14:27:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:27:26] <Lucas_WMDE>	 the SVG… has a PNG embedded :S
[14:27:30] <Lucas_WMDE>	 but at least no sodipodi junk
[14:28:11] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:28:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:28:47] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:29:14] <jelto>	 !log Imported helm311 | 3.11.3-3 to bullseye-wikimedia - T341984
[14:29:16] <Daimona>	 Lucas: thanks!
[14:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:18] <stashbot>	 T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984
[14:29:34] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "Should be okay to deploy; the embedded PNG in the SVG isn’t super nice but I think for a temporary logo we can live with it. (It’s the “ma" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: Stang)
[14:29:54] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500737 (MoritzMuehlenhoff)
[14:30:32] <_joe_>	 koi: given it's time sensitive, I'm deploying your patch 
[14:30:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[14:30:57] * Lucas_WMDE afk
[14:31:04] <Lucas_WMDE>	 _joe_: thanks!
[14:31:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: Stang)
[14:31:36] <koi>	 thanks for the help!
[14:31:43] <_joe_>	 np
[14:31:45] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti2028 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114741
[14:31:48] <_joe_>	 my patch will wait :)
[14:31:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500740 (ops-monitoring-bot) Draining ganeti2028.codfw.wmnet of running VMs
[14:32:30] <wikibugs>	 (Merged) jenkins-bot: zhwiki: Add 2025 CNY celebration logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1114703 (https://phabricator.wikimedia.org/T384913) (owner: Stang)
[14:33:01] <logmsgbot>	 !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1114703|zhwiki: Add 2025 CNY celebration logos (T384913)]]
[14:33:05] <stashbot>	 T384913: Requesting temporary logo change for zhwiki - https://phabricator.wikimedia.org/T384913
[14:33:31] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1002.eqiad.wmnet
[14:33:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10500746 (ops-monitoring-bot) depool host wikikube-ctrl1002.eqiad.wmnet by jayme@cumin1002 with r...
[14:33:45] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: Depooled via sre.k8s.pool-depool-node
[14:33:47] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1002.eqiad.wmnet
[14:33:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10500747 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1...
[14:34:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[14:34:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10500751 (phaultfinder)
[14:36:08] <wikibugs>	 (PS2) Clément Goubert: kubernetes: Add mw-cron deploy config [puppet] - https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962)
[14:36:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T384592)', diff saved to https://phabricator.wikimedia.org/P72629 and previous config saved to /var/cache/conftool/dbconfig/20250128-143616-marostegui.json
[14:36:21] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:36:28] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10500778 (Andrew) Thanks @cmooney !  @VRiley-WMF, you can give this another try at your convenience.
[14:37:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:42] <wikibugs>	 (CR) Hnowlan: [C:+1] kubernetes: Add mw-cron deploy config [puppet] - https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962) (owner: Clément Goubert)
[14:37:47] <_joe_>	 koi: can you check you like how the logo is displayed using the wikimedia-debug extension?
[14:37:47] <logmsgbot>	 !log oblivian@deploy2002 stang, oblivian: Backport for [[gerrit:1114703|zhwiki: Add 2025 CNY celebration logos (T384913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:38:04] <koi>	 _joe_, sure, looking
[14:38:07] <wikibugs>	 (CR) JMeybohm: [C:+1] admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 (owner: Clément Goubert)
[14:38:17] <wikibugs>	 (CR) Hnowlan: [C:+1] admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 (owner: Clément Goubert)
[14:38:21] <wikibugs>	 (CR) JMeybohm: [C:+1] kubernetes: Add mw-cron deploy config [puppet] - https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962) (owner: Clément Goubert)
[14:39:22] <wikibugs>	 (CR) Arthur taylor: [C:+1] "Looks good to me!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: Lucas Werkmeister (WMDE))
[14:39:46] <vgutierrez>	 !log upload liberica 0.6 to apt.wm.o (bookworm-wikimedia)
[14:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:19] <koi>	 _joe_, tested and LGTM
[14:40:34] <vgutierrez>	 !log updating to liberica 0.6 in lvs1013
[14:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:36] <wikibugs>	 (PS1) Elukey: custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - https://gerrit.wikimedia.org/r/1114743
[14:41:50] <_joe_>	 koi: ok proceeding
[14:41:53] <logmsgbot>	 !log oblivian@deploy2002 stang, oblivian: Continuing with sync
[14:42:17] <wikibugs>	 (CR) CI reject: [V:-1] custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - https://gerrit.wikimedia.org/r/1114743 (owner: Elukey)
[14:42:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd
[14:42:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:48] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500818 (ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk type to drbd
[14:43:13] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[14:43:50] <wikibugs>	 (CR) Vgutierrez: [C:+2] liberica: Use libericad instead of liberica binary [puppet] - https://gerrit.wikimedia.org/r/1108875 (owner: Vgutierrez)
[14:44:06] <wikibugs>	 (PS1) Btullis: airflow: Update the default package version [puppet] - https://gerrit.wikimedia.org/r/1114745 (https://phabricator.wikimedia.org/T383430)
[14:44:12] <wikibugs>	 (PS1) Brouberol: Enable the sidecar-controller in all airflow namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1114746 (https://phabricator.wikimedia.org/T384329)
[14:44:47] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[14:44:54] <wikibugs>	 (PS2) Elukey: custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - https://gerrit.wikimedia.org/r/1114743
[14:45:33] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4875/co" [puppet] - https://gerrit.wikimedia.org/r/1114745 (https://phabricator.wikimedia.org/T383430) (owner: Btullis)
[14:45:48] <wikibugs>	 (CR) Elukey: "root@deploy2002:/home/elukey# istioctl-1.15.7 manifest diff /srv/deployment-charts/custom_deploy.d/istio/dse-k8s/config.yaml /tmp/new-dse-" [deployment-charts] - https://gerrit.wikimedia.org/r/1114743 (owner: Elukey)
[14:47:32] <wikibugs>	 (CR) Btullis: [C:+1] Enable the sidecar-controller in all airflow namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1114746 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:47:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:47:58] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] airflow: Update the default package version [puppet] - https://gerrit.wikimedia.org/r/1114745 (https://phabricator.wikimedia.org/T383430) (owner: Btullis)
[14:48:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] sre.ganeti.resource-report: Stop logging to SAL [cookbooks] - https://gerrit.wikimedia.org/r/1113084 (https://phabricator.wikimedia.org/T324655) (owner: Muehlenhoff)
[14:48:41] <logmsgbot>	 !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114703|zhwiki: Add 2025 CNY celebration logos (T384913)]] (duration: 15m 40s)
[14:48:46] <stashbot>	 T384913: Requesting temporary logo change for zhwiki - https://phabricator.wikimedia.org/T384913
[14:48:52] <wikibugs>	 (CR) Brouberol: [C:+2] Enable the sidecar-controller in all airflow namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1114746 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:49:02] <wikibugs>	 (PS3) Brouberol: airflow: include an envoy mesh sidecar in all the airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329)
[14:49:05] <wikibugs>	 (PS6) Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329)
[14:49:56] <wikibugs>	 (CR) Btullis: [C:+1] airflow: include an envoy mesh sidecar in all the airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:50:14] <_joe_>	 koi: {{done}}
[14:50:23] <wikibugs>	 (CR) Btullis: [C:+1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:50:23] <koi>	 ty
[14:51:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P72630 and previous config saved to /var/cache/conftool/dbconfig/20250128-145123-marostegui.json
[14:51:26] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:51:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:51:55] <wikibugs>	 (PS1) Vgutierrez: liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - https://gerrit.wikimedia.org/r/1114748
[14:52:10] <icinga-wm>	 PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:19] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114748 (owner: Vgutierrez)
[14:52:22] <wikibugs>	 (CR) CI reject: [V:-1] liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - https://gerrit.wikimedia.org/r/1114748 (owner: Vgutierrez)
[14:54:26] <wikibugs>	 (PS1) Elukey: custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - https://gerrit.wikimedia.org/r/1114749
[14:55:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2013,2036,2088].codfw.wmnet
[14:55:31] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10500883 (ops-monitoring-bot) depool host wikikube-worker[2013,2036,2088].codfw.wmnet by jayme@cumin1002 with...
[14:55:38] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikikube-worker[2013,2036,2088].codfw.wmnet with reason: Depooled via sre.k8s.pool-depool-node
[14:55:48] <wikibugs>	 (PS2) Vgutierrez: liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - https://gerrit.wikimedia.org/r/1114748
[14:57:09] <wikibugs>	 (CR) Ssingh: [C:+1] "Looks good at a quick glance!" [puppet] - https://gerrit.wikimedia.org/r/1114748 (owner: Vgutierrez)
[14:57:22] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2013,2036,2088].codfw.wmnet
[14:57:29] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10500889 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool fo...
[14:58:06] <wikibugs>	 (CR) Vgutierrez: [C:+2] liberica,hiera: Provide grpc endpoint config for liberica-cp [puppet] - https://gerrit.wikimedia.org/r/1114748 (owner: Vgutierrez)
[14:59:32] <wikibugs>	 (PS1) Urbanecm: [tests] Add ConfigWrapperTest [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114751 (https://phabricator.wikimedia.org/T383905)
[14:59:33] <wikibugs>	 (PS1) Urbanecm: Remove BabelCategorizeNamespaces from CommunityConfiguration [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905)
[14:59:57] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10500895 (JMeybohm) @Jhancock.wm wikikube-worker[2013,2036,2088].codfw.wmnet have been shut down, lmk when you...
[15:01:43] <icinga-wm>	 PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:01:47] <icinga-wm>	 PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:02:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:28] <wikibugs>	 (PS12) Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565)
[15:04:48] <wikibugs>	 (PS1) Elukey: kubernetes: remove ad-hoc CNI config from dse-k8s-worker [puppet] - https://gerrit.wikimedia.org/r/1114753
[15:05:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd
[15:05:56] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4876/co" [puppet] - https://gerrit.wikimedia.org/r/1114753 (owner: Elukey)
[15:06:13] <logmsgbot>	 !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: NIC port switch -t T383709
[15:06:17] <stashbot>	 T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709
[15:06:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P72631 and previous config saved to /var/cache/conftool/dbconfig/20250128-150630-marostegui.json
[15:07:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:21] <wikibugs>	 ops-magru, DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10500925 (RobH) Ok, progress.  I had to provide 3 possible call back numbers, so I provided myself as primary, with Papaul and Willy as backup only i...
[15:08:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:50] <wikibugs>	 ops-magru, DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10500926 (RobH) > ** Please do not change the subject of this email. **  >  > Dear Rob, >  > In line with the action plan, we have opened ticket number 45...
[15:09:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:30] <wikibugs>	 (PS1) Muehlenhoff: maps_bookworm: Initially disable replication/tile gen timers [puppet] - https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565)
[15:10:48] <wikibugs>	 (PS2) Muehlenhoff: maps_bookworm: Initially disable replication/tile gen timers [puppet] - https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565)
[15:10:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[15:11:13] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500945 (ops-monitoring-bot) Draining ganeti2028.codfw.wmnet of running VMs
[15:11:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[15:12:10] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:12:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain
[15:12:57] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:13:09] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500950 (ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk type to plain
[15:13:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain
[15:14:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[15:14:49] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10500955 (ops-monitoring-bot) Draining ganeti2028.codfw.wmnet of running VMs
[15:15:11] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10500956 (MatthewVernon) @ovasileva any update on progress on this, please? I see a bunch of changes (e.g. Incoming -> Freezer) that suggests this is ma...
[15:19:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10500977 (phaultfinder)
[15:21:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T384592)', diff saved to https://phabricator.wikimedia.org/P72634 and previous config saved to /var/cache/conftool/dbconfig/20250128-152137-marostegui.json
[15:21:42] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:21:53] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[15:22:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72635 and previous config saved to /var/cache/conftool/dbconfig/20250128-152159-marostegui.json
[15:22:26] <wikibugs>	 (Abandoned) Dzahn: gerrit: remove UA-based blocking of some old bots/spiders [puppet] - https://gerrit.wikimedia.org/r/1114442 (owner: Dzahn)
[15:24:49] <wikibugs>	 (CR) Clément Goubert: [C:+2] kubernetes: Add mw-cron deploy config [puppet] - https://gerrit.wikimedia.org/r/1077001 (https://phabricator.wikimedia.org/T377962) (owner: Clément Goubert)
[15:24:53] <wikibugs>	 (CR) Elukey: [C:+1] maps_bookworm: Initially disable replication/tile gen timers [puppet] - https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:27:29] <wikibugs>	 (CR) Clément Goubert: [C:+2] admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 (owner: Clément Goubert)
[15:32:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan)
[15:32:59] <wikibugs>	 (CR) Clément Goubert: [C:+2] mediawiki: Add mwcron feature [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[15:35:27] <Reedy>	 jouncebot: nowandnext
[15:35:27] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 24 minute(s)
[15:35:27] <jouncebot>	 In 0 hour(s) and 24 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1600)
[15:35:52] <wikibugs>	 (CR) Reedy: [C:+2] FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114701 (https://phabricator.wikimedia.org/T384879) (owner: Reedy)
[15:35:54] <wikibugs>	 (CR) Reedy: [C:+2] FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114702 (https://phabricator.wikimedia.org/T384879) (owner: Reedy)
[15:36:26] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Add mwcron feature [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[15:37:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] maps_bookworm: Initially disable replication/tile gen timers [puppet] - https://gerrit.wikimedia.org/r/1114755 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:38:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2013
[15:38:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2013
[15:39:11] <wikibugs>	 (Merged) jenkins-bot: admin_ng: add mw-cron namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1087212 (owner: Clément Goubert)
[15:39:49] <icinga-wm>	 RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:40:33] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10501075 (elukey) Followed up with Supermicro to show our results, let's see what they say.
[15:41:29] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:41:29] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:42:57] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:45:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:45:44] <wikibugs>	 (PS1) Ottomata: beta - EventStreamConfig - Rename hoist_http_headers_to_fields setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1114767 (https://phabricator.wikimedia.org/T382173)
[15:46:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:47:06] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:47:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:47:56] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114768
[15:47:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114768 (owner: TrainBranchBot)
[15:47:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:48:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:48:30] <aqu>	 !log About to deploy analytics/refinery/source 0.2.57
[15:48:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:48:35] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1002.eqiad.wmnet
[15:48:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:36] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl1002.eqiad.wmnet
[15:48:37] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1002.eqiad.wmnet
[15:48:38] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl1002.eqiad.wmnet
[15:48:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501105 (ops-monitoring-bot) pool host wikikube-ctrl1002.eqiad.wmnet by jayme@cumin1002 with rea...
[15:48:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501106 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1...
[15:49:03] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1003.eqiad.wmnet
[15:49:11] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:49:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501108 (ops-monitoring-bot) depool host wikikube-ctrl1003.eqiad.wmnet by jayme@cumin1002 with r...
[15:49:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:49:17] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: Depooled via sre.k8s.pool-depool-node
[15:49:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1003.eqiad.wmnet
[15:49:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501111 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1...
[15:49:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:50:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2036
[15:50:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:50:29] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2036
[15:51:28] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:51:43] <wikibugs>	 (PS1) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[15:52:04] <wikibugs>	 (CR) CI reject: [V:-1] Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:52:31] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:52:31] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:52:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:53:10] <wikibugs>	 (CR) Jelto: [C:+1] "then we should be good to remove the rsa-2048 key from Gerrit as well." [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[15:53:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:54:18] <wikibugs>	 (PS1) Cathal Mooney: gnmic: use event-value-tag-v2 to improve performance [puppet] - https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384)
[15:54:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:54:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:54:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501134 (phaultfinder)
[15:55:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:56:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:56:26] <wikibugs>	 (PS2) Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[15:56:42] <wikibugs>	 (CR) Hashar: [C:+1] "Thanks for the verification Valentin, very much appreciated. I reached out to Jelto we will roll it on our Wednesday morning and monit" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[15:56:52] <wikibugs>	 (CR) CI reject: [V:-1] Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:56:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:56:58] <wikibugs>	 (CR) Ottomata: [C:+2] beta - EventStreamConfig - Rename hoist_http_headers_to_fields setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1114767 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata)
[15:57:29] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:57:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:58:46] <vgutierrez>	 !log upload liberica 0.7 to apt.wm.o (bookworm-wikimedia)
[15:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:59:31] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:59:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:59:58] <wikibugs>	 (Merged) jenkins-bot: beta - EventStreamConfig - Rename hoist_http_headers_to_fields setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1114767 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata)
[16:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1600). Please do the needful.
[16:00:33] <logmsgbot>	 !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: NIC port switch -t T383709
[16:00:38] <stashbot>	 T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709
[16:01:52] <wikibugs>	 (03CR) 10Jelto: nftables: add docker profile and forward chain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb)
[16:03:02] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.pool db1166 gradually with 4 steps - Repooling after rebuild index T384807
[16:03:06] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[16:04:52] <wikibugs>	 (03PS3) 10Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[16:04:55] <logmsgbot>	 !log root@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1166 gradually with 4 steps - Repooling after rebuild index T384807
[16:05:06] <wikibugs>	 (03Merged) 10jenkins-bot: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114701 (https://phabricator.wikimedia.org/T384879) (owner: 10Reedy)
[16:05:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P72637 and previous config saved to /var/cache/conftool/dbconfig/20250128-160518-marostegui.json
[16:05:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10501179 (10cmooney) I'm very happy to say Karim Radhouani, one of the gnmic devs, has been extremely helpful in response to the github issue I poste...
[16:06:03] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1003.eqiad.wmnet
[16:06:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl1003.eqiad.wmnet
[16:06:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1003.eqiad.wmnet
[16:06:06] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl1003.eqiad.wmnet
[16:06:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501182 (10ops-monitoring-bot) pool host wikikube-ctrl1003.eqiad.wmnet by jayme@cumin1002 with rea...
[16:06:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501183 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1...
[16:07:50] <wikibugs>	 (03Merged) 10jenkins-bot: FormatMetadata: Prevent running preg_match() on null [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114702 (https://phabricator.wikimedia.org/T384879) (owner: 10Reedy)
[16:08:44] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1114701|FormatMetadata: Prevent running preg_match() on null (T384879)]], [[gerrit:1114702|FormatMetadata: Prevent running preg_match() on null (T384879)]]
[16:08:46] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:08:49] <stashbot>	 T384879: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384879
[16:08:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501198 (10JMeybohm) 05Open→03Resolved a:03JMeybohm All done, thank @Papaul for your pat...
[16:09:59] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs4010 as a liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1113478 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:11:12] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 #page on db1166 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:11:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2088
[16:11:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72638 and previous config saved to /var/cache/conftool/dbconfig/20250128-161143-root.json
[16:11:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2088
[16:12:47] <icinga-wm>	 RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:13:54] <logmsgbot>	 !log reedy@deploy2002 reedy: Backport for [[gerrit:1114701|FormatMetadata: Prevent running preg_match() on null (T384879)]], [[gerrit:1114702|FormatMetadata: Prevent running preg_match() on null (T384879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:14:00] <stashbot>	 T384879: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384879
[16:14:09] <logmsgbot>	 !log reedy@deploy2002 reedy: Continuing with sync
[16:14:43] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2013,2036,2088].codfw.wmnet
[16:14:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker[2013,2036,2088].codfw.wmnet
[16:14:48] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker[2013,2036,2088].codfw.wmnet
[16:14:49] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2013,2036,2088].codfw.wmnet
[16:14:54] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10501231 (10ops-monitoring-bot) pool host wikikube-worker[2013,2036,2088].codfw.wmnet by jayme@cumin1002 with re...
[16:14:56] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10501232 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for...
[16:15:08] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bookworm
[16:15:19] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114768 (owner: 10TrainBranchBot)
[16:15:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[16:15:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[16:16:29] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:16:38] <vgutierrez>	 ^^ expected, lvs4010 is being reimaged
[16:16:45] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:18:25] <hnowlan>	 jouncebot: nowandnext
[16:18:25] <jouncebot>	 For the next 0 hour(s) and 41 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1600)
[16:18:25] <jouncebot>	 In 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1700)
[16:18:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72639 and previous config saved to /var/cache/conftool/dbconfig/20250128-161829-marostegui.json
[16:18:34] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:18:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2177 T382842', diff saved to https://phabricator.wikimedia.org/P72640 and previous config saved to /var/cache/conftool/dbconfig/20250128-161857-marostegui.json
[16:19:02] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2177.codfw.wmnet
[16:19:03] <stashbot>	 T382842: Upgrade to 10.6.20 and rebuild recentchanges and pagelinks tables - https://phabricator.wikimedia.org/T382842
[16:19:53] <wikibugs>	 (03CR) 10Scott French: "Adding Hugh as well. Thanks in advance, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French)
[16:20:57] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114701|FormatMetadata: Prevent running preg_match() on null (T384879)]], [[gerrit:1114702|FormatMetadata: Prevent running preg_match() on null (T384879)]] (duration: 12m 12s)
[16:21:09] <stashbot>	 T384879: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384879
[16:21:30] <wikibugs>	 (03PS4) 10Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[16:22:34] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Uploading, 06Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10501271 (10Underbar_dk) I disabled IPv6 and the multiple uploads went through! I then switched it back on and the uploads also went through no problem....
[16:22:39] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] shellbox-video: 3 codfw replicas on 8.1 (change 1/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French)
[16:24:01] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:24:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501277 (10phaultfinder)
[16:25:38] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] Update CentralAuth multi-DC rules for SUL3 [puppet] - 10https://gerrit.wikimedia.org/r/1114070 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza)
[16:25:50] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: maintenance
[16:25:51] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host gerrit2002
[16:25:52] <rzl>	 tgr|away: happen to be around? I think you linked the wrong patch in the puppet window
[16:25:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host gerrit2002
[16:26:04] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2177.codfw.wmnet
[16:26:45] <rzl>	 oh got it, it should be https://gerrit.wikimedia.org/r/1114070
[16:26:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72641 and previous config saved to /var/cache/conftool/dbconfig/20250128-162649-root.json
[16:26:57] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Index rebuild
[16:29:26] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10501283 (10Jhancock.wm)
[16:33:33] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[16:33:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P72642 and previous config saved to /var/cache/conftool/dbconfig/20250128-163336-marostegui.json
[16:33:59] <wikibugs>	 (03PS5) 10Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[16:36:26] <wikibugs>	 (03PS1) 10Clément Goubert: mw-script: Add conftool state to helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118)
[16:37:30] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[16:38:33] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mw-script: Add conftool state to helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118) (owner: 10Clément Goubert)
[16:38:41] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-script: Add conftool state to helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118) (owner: 10Clément Goubert)
[16:39:25] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:40:03] <wikibugs>	 (03Merged) 10jenkins-bot: mw-script: Add conftool state to helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114777 (https://phabricator.wikimedia.org/T367118) (owner: 10Clément Goubert)
[16:40:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:40:45] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:41:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:41:13] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:41:34] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:41:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:41:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72643 and previous config saved to /var/cache/conftool/dbconfig/20250128-164154-root.json
[16:42:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:42:06] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:42:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:42:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:42:54] <wikibugs>	 (03PS6) 10Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[16:44:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501321 (10phaultfinder)
[16:46:46] <wikibugs>	 (03PS1) 10Sergio Gimeno: beta wgEventStreams: set enrich_fields_from_http_headers on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173)
[16:46:51] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Fix NIC name on liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114780
[16:47:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: Fix NIC name on liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114780 (owner: 10Vgutierrez)
[16:47:25] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Fix NIC name on liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114780 (owner: 10Vgutierrez)
[16:47:43] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:47:43] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Fix NIC name on liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477)
[16:47:52] <wikibugs>	 (03CR) 10Ssingh: hiera: Fix NIC name on liberica@ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:48:15] <wikibugs>	 (03CR) 10jenkins-bot: hiera: Fix NIC name on liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:48:26] <wikibugs>	 (03CR) 10Ssingh: hiera: Fix NIC name on liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:48:39] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Fix NIC name on liberica@ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114780 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:48:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P72644 and previous config saved to /var/cache/conftool/dbconfig/20250128-164843-marostegui.json
[16:49:25] <wikibugs>	 (03PS2) 10Sergio Gimeno: beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173)
[16:51:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:51:45] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:52:33] <elukey>	 !log restart kartotherian on maps1009 as test
[16:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:38] <wikibugs>	 (03PS7) 10Muehlenhoff: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[16:53:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:53:42] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:55:50] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114743 (owner: 10Elukey)
[16:56:11] <wikibugs>	 (03CR) 10Ottomata: beta wgEventStreams: opt out collecting user agent for HompageVisit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno)
[16:57:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72645 and previous config saved to /var/cache/conftool/dbconfig/20250128-165700-root.json
[16:57:12] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: consolidate haproxykafka into common profile [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[16:57:49] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10501367 (10ovasileva) a:03ovasileva
[16:58:00] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10501368 (10ovasileva) a:05ovasileva→03None
[16:58:19] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:59:00] <wikibugs>	 (03CR) 10Btullis: [C:03+1] custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114749 (owner: 10Elukey)
[16:59:55] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1114753 (owner: 10Elukey)
[17:00:04] <jouncebot>	 jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1700).
[17:00:04] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:10] <rzl>	 o/
[17:01:46] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for cr[1-2]-magru,cr[1-2]-magru IPv6
[17:01:48] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr[1-2]-magru,cr[1-2]-magru IPv6
[17:02:58] <tgr|away>	 o/
[17:03:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72646 and previous config saved to /var/cache/conftool/dbconfig/20250128-170350-marostegui.json
[17:03:56] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:04:03] <wikibugs>	 (03PS8) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942)
[17:04:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[17:04:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T384592)', diff saved to https://phabricator.wikimedia.org/P72647 and previous config saved to /var/cache/conftool/dbconfig/20250128-170412-marostegui.json
[17:04:29] <rzl>	 tgr|away: just as a heads up, ATS lua is moderately scary and I can't promise we can always do it in the puppet window :) but I chatted with traffic and I think we're all set
[17:04:46] <rzl>	 plan is I'll stop puppet on cp-text, deploy this to a host or two, we can test, and then deploy it everywhere
[17:05:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10501392 (10Jhancock.wm) @elukey hey having a little bit of trouble with the provisioning on this one. i know it's a custom model and wanted to see if you had any i...
[17:05:42] <rzl>	 before we start, will you open an affected url, check the x-cache header, and let me know what it says?
[17:05:46] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow3003.esams.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[17:05:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10501393 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e5ab529a-1fb4-461d-b85a-a2d5a66a020a) set by cmooney@cumin1002 for 1:00:...
[17:06:33] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:06:42] <tgr|away>	 rzl: ack. Let me know if there's a process to follow that would make it easier on the deployer (although I hope we are done with SUL3 puppet patches after this one)
[17:06:57] <rzl>	 !log stopping puppet on A:cp-text
[17:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:47] <rzl>	 the ideal thing would just be finding a reviewer on the traffic team who can deploy for you, but also I understand finding reviewers on each team can be challenging
[17:08:35] <tgr|away>	 curl -vo/dev/null 'http://auth.wikimedia.org/enwiki/wiki/Special:CentralLogin' |& grep x-cache
[17:08:38] <tgr|away>	 < x-cache: cp3070 int
[17:08:41] <tgr|away>	 < x-cache-status: int-tls
[17:09:00] <tgr|away>	 that's a URL that should go from not forced to primary DC to forced to primary DC
[17:09:27] <rzl>	 perfect thanks, I'll deploy to cp3070 for testing
[17:09:27] <tgr|away>	 will try to do that next time
[17:09:37] <rzl>	 (and cp4038, which is what I get here in san francisco)
[17:09:51] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] Update CentralAuth multi-DC rules for SUL3 [puppet] - 10https://gerrit.wikimedia.org/r/1114070 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza)
[17:10:10] <vgutierrez>	 tgr|away: https:// and not http:// BTW
[17:10:37] <vgutierrez>	 on http:// you're just getting a 301 to https://
[17:10:37] <rzl>	 ah thanks, missed that
[17:10:46] <tgr|away>	 oops sorry
[17:10:46] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bookworm
[17:10:52] <rzl>	 (see? reviewer on the traffic team)
[17:10:56] <tgr|away>	 < x-cache: cp3069 miss, cp3069 pass
[17:10:56] <tgr|away>	 < x-cache-status: pass
[17:10:59] <sukhe>	 yeah good point, which is why you see the int-tls there
[17:11:01] <rzl>	 that'll affect the hashing, can you-- perfect
[17:11:19] <sukhe>	 rzl: feel free to add me or fabfur or brett for the patches and one of us can triage it (and add vg where required)
[17:12:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72648 and previous config saved to /var/cache/conftool/dbconfig/20250128-171205-root.json
[17:13:46] <rzl>	 (puppet's running)
[17:13:53] <rzl>	 sukhe: 👍 👍
[17:15:32] <rzl>	 tgr|away: okay, deployed to cp3069 and cp4039
[17:16:05] <tgr|away>	 I get the same response, not sure if that's good or bad
[17:16:40] <rzl>	 okay, maybe I should have opened with "is this testable" :)
[17:16:56] <rzl>	 but it doesn't look like ATS is failing on those hosts now, which is the good news
[17:17:12] <tgr|away>	 not easily, all the patch does is influence which URLs get always sent to the primary DC
[17:17:29] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "lgtm, but let's wait with deployment until the GE counterpart is finalised" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime)
[17:18:09] <tgr|away>	 I thought that would be a different-first-digit cp host, but I only have very vague ideas of how multi-DC works
[17:18:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T384592)', diff saved to https://phabricator.wikimedia.org/P72649 and previous config saved to /var/cache/conftool/dbconfig/20250128-171814-marostegui.json
[17:18:20] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:19:04] <rzl>	 *oh*
[17:19:20] <rzl>	 no, it goes to mediawiki either way, so that'll change the "server" header in the response
[17:19:25] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10501461 (10Jhancock.wm) 05Open→03Resolved
[17:19:31] <tgr|away>	 < server: mw-web.eqiad.main-c544b8984-bwnqh
[17:19:37] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Fix BGP peers for liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477)
[17:19:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10501464 (10Jhancock.wm)
[17:19:46] <rzl>	 you might have seen it change from mw-web.eqiad... to mw-web.codfw... if this worked, or change from codfw to codfw if you were already there :)
[17:19:49] <tgr|away>	 eqiad is the secondary now, right?
[17:19:53] <claime>	 yeah
[17:20:01] <rzl>	 yeah, if it changed from eqiad to eqiad it means this didn't work as intended
[17:20:12] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[17:23:38] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10501476 (10Jhancock.wm)
[17:23:52] <rzl>	 tgr|away: take your time digging but let me know what you'd like to do -- if we don't end up keeping this, I can roll back on those two hosts and resume puppet on the rest
[17:24:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501477 (10phaultfinder)
[17:24:55] <tgr|away>	 thx, let me try a few more requests
[17:25:21] <rzl>	 sure -- note that they may hash to different hosts that don't have your patch, check the x-cache header
[17:26:10] <vgutierrez>	 esams is a single_backend DC nowadays
[17:26:28] <vgutierrez>	 so requests from the same client IP will hit the same ATS instances 
[17:26:32] <vgutierrez>	 *instance
[17:26:36] <swfrench-wmf>	 possibly a naive question - `string.find(path, "/wiki/Special:CentralLogin") == 1` - that won't work if the wiki name is a prefix of the path, right?
[17:26:48] <swfrench-wmf>	 i.e., from the test URL above
[17:27:11] <rzl>	 vgutierrez: doh thanks
[17:27:39] <vgutierrez>	 swfrench-wmf: you're right
[17:27:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:27:54] <claime>	 lua string matching strikes again
[17:27:54] <vgutierrez>	 that needs to be refactored using a regex with string.match
[17:27:58] <tgr|away>	 yeah, just realized that
[17:28:05] <rzl>	 swfrench-wmf++
[17:28:11] <tgr|away>	 which is annoying because not all queries will have that prefix
[17:28:14] <claime>	 gg swfrench-wmf 
[17:28:50] <vgutierrez>	 it could be refactored into string.find(...) != nil
[17:29:35] <vgutierrez>	 > path = "/enwiki/foo"
[17:29:35] <vgutierrez>	 > string.find(path, "/foo")
[17:29:35] <vgutierrez>	 8       11
[17:30:11] <tgr|away>	 now I wonder if the current code even works.
[17:30:32] <tgr|away>	 what if the URL is index.php-style? What if some wiki localizes those special page names?
[17:30:54] <tgr|away>	 will have to double-check the MediaWiki code to see if that can happen
[17:32:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10501502 (10Papaul) you're welcome
[17:32:23] <vgutierrez>	 of course.. `~= nil` rather than `!= nil` :)
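Annotation: the string-matching point raised above can be reproduced in a standalone Lua sketch. This is only an illustration, not the deployed ATS rule: the helper names and the prefixed test path are hypothetical, and passing `plain = true` to `string.find` is an assumption made here to rule out pattern magic. Only `string.find`, the `/wiki/Special:CentralLogin` literal, and the `/enwiki/foo` example come from the discussion.

```lua
-- Literal taken from the check quoted in the channel.
local TARGET = "/wiki/Special:CentralLogin"

-- Anchored form, as in the quoted check: true only when the
-- match begins at the first character of the path.
local function starts_with_target(path)
  -- init = 1, plain = true: plain substring search (assumption, see lead-in)
  return string.find(path, TARGET, 1, true) == 1
end

-- Unanchored form, as suggested with `~= nil`: true when the
-- target occurs anywhere in the path.
local function contains_target(path)
  return string.find(path, TARGET, 1, true) ~= nil
end

-- Hypothetical paths: one bare, one with a wiki name in front.
local bare     = "/wiki/Special:CentralLogin/start"
local prefixed = "/enwiki/wiki/Special:CentralLogin/start"

print(starts_with_target(bare), starts_with_target(prefixed))
print(contains_target(bare), contains_target(prefixed))
```

The anchored form rejects the prefixed path because the match starts at index 8 rather than 1, which is the same offset shown in the `/enwiki/foo` example. The unanchored form accepts both, at the cost of also matching the target anywhere else in the path.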
[17:32:24] <tgr|away>	 rzl: sorry, can we abandon the deploy for now? I'll need to read through the CentralAuth code, no point in fixing the string matching if it's not using this URL format
[17:33:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P72650 and previous config saved to /var/cache/conftool/dbconfig/20250128-173321-marostegui.json
[17:34:33] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10501511 (10Neobeta61) Ciao, try - {     "Target": "/redfish/v1/Systems/{SystemId}/Storage/{StorageId}/Actions/StorageController.ClearFo...
[17:35:06] <rzl>	 tgr|away: no worries, rolling back
[17:35:51] <wikibugs>	 (03PS1) 10RLazarus: Revert "Update CentralAuth multi-DC rules for SUL3" [puppet] - 10https://gerrit.wikimedia.org/r/1114785
[17:36:16] <wikibugs>	 (03PS2) 10Cathal Mooney: gnmic: use event-value-tag-v2 to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384)
[17:37:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:38:03] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] Revert "Update CentralAuth multi-DC rules for SUL3" [puppet] - 10https://gerrit.wikimedia.org/r/1114785 (owner: 10RLazarus)
[17:39:27] <rzl>	 merged, deploying on our two test hosts first just for caution
[17:42:43] <rzl>	 done, re-enabling puppet
[17:42:56] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[17:43:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:43:52] <rzl>	 sukhe, vgutierrez: thanks <3 can tgr|away send you a revised patch directly, without waiting for a puppet window?
[17:44:30] <tgr|away>	 can confirm that cp3069 behaves as expected before the patch
[17:46:05] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn)
[17:46:05] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Fix BGP peers for liberica@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1114783 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[17:47:02] <vgutierrez>	 rzl: yes
[17:47:23] <rzl>	 thanks!
[17:48:08] <tgr|away>	 thanks all! will make a new patch, and make sure the requirements are documented on the PHP side
[17:48:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P72651 and previous config saved to /var/cache/conftool/dbconfig/20250128-174828-marostegui.json
[17:48:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:49:59] <mutante>	 I am removing SSH config option "KexAlgorithms ecdh-sha2-nistp521" from gerrit. this is supposed to be not needed anymore since at least 2022. but in the unlikely event that someone says something.. I would expect it must be about some ancient client.
[17:50:40] <mutante>	 there was literally a TODO to remove it once we are on Gerrit 3.6 and MINA 2.8.0 (that's the Gerrit sshd, not openssh) and we are many versions past that
[17:51:23] <wikibugs>	 (03PS3) 10Cathal Mooney: gnmic: use event-value-tag-v2 to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384)
[17:52:33] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10501556 (10Papaul) replaced 1002 with  {F58301781}
[17:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:57:47] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10501586 (10Papaul) @MatthewVernon i will take a look at it, thanks
[17:59:59] <wikibugs>	 (03CR) 10Sergio Gimeno: beta wgEventStreams: opt out collecting user agent for HompageVisit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno)
[18:00:05] <jouncebot>	 swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1800).
[18:00:17] <swfrench-wmf>	 o/
[18:01:26] <wikibugs>	 (03CR) 10Scott French: [C:03+2] shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French)
[18:02:42] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French)
[18:03:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T384592)', diff saved to https://phabricator.wikimedia.org/P72652 and previous config saved to /var/cache/conftool/dbconfig/20250128-180335-marostegui.json
[18:03:41] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:03:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[18:04:04] <wikibugs>	 (03PS8) 10Btullis: mediawiki: Add support for dumps suspended job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: 10Giuseppe Lavagetto)
[18:04:10] <wikibugs>	 (03PS8) 10Btullis: mediwiki-dumps-legacy: Create helmfile deployment of a suspended job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114001 (https://phabricator.wikimedia.org/T352650)
[18:04:23] <swfrench-wmf>	 !log starting shellbox-constraints pilot on PHP 8.1 (1 replica, eqiad only) - T377038
[18:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:27] <stashbot>	 T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038
[18:04:32] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[18:05:08] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[18:05:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501623 (10phaultfinder)
[18:08:07] <wikibugs>	 (03PS1) 10Jasmine: wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227)
[18:08:22] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10501630 (10Papaul) 05Open→03Resolved a:03Papaul This is complete
[18:12:43] <wikibugs>	 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384951 (10phaultfinder) 03NEW
[18:13:14] <wikibugs>	 (03CR) 10Kamila Součková: wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine)
[18:15:20] <wikibugs>	 (03CR) 10Scott French: [C:03+2] shellbox-video: 3 codfw replicas on 8.1 (change 1/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French)
[18:16:50] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: 3 codfw replicas on 8.1 (change 1/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French)
[18:17:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:17:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T384592)', diff saved to https://phabricator.wikimedia.org/P72653 and previous config saved to /var/cache/conftool/dbconfig/20250128-181729-marostegui.json
[18:17:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384281#10501677 (10Papaul) 05Open→03Resolved a:03Papaul working on this on T382984
[18:17:35] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:17:39] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "that patch is merged now" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff)
[18:18:10] <swfrench-wmf>	 !log starting shellbox-video pilot on PHP 8.1 (3 replicas, codfw only) - T377038
[18:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:14] <stashbot>	 T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038
[18:18:21] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[18:18:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384415#10501686 (10Papaul) 05Open→03Resolved a:03Papaul Working on this in T382984
[18:21:29] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[18:21:40] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10501708 (10elukey) @Neobeta61 Ciao! Grazie :)  I tested it but the Action is not available afaics:  `  'Actions': {'Oem': {'#SmcHARAIDC...
[18:23:15] <wikibugs>	 (03PS3) 10Ottomata: beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno)
[18:23:53] <ottomata>	 swfrench-wmf: okay if I deploy a beta only mw config change?
[18:24:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:26:22] <wikibugs>	 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T384892#10501716 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Reseated cable and pinged the management IP. Seems to be resolved now.
[18:27:25] <ottomata>	 swfrench-wmf: i'm going to merge, and do the no-op prod scap when you verify it's okay.  ty!
[18:27:34] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno)
[18:28:28] <wikibugs>	 (03Merged) 10jenkins-bot: beta wgEventStreams: opt out collecting user agent for HompageVisit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114779 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno)
[18:28:36] <swfrench-wmf>	 ottomata: thanks for checking. I think I'm at the point in my work where a deployment is unlikely to disrupt anything, so I think you're good to go.
[18:32:00] <ottomata>	 okay thanks!
[18:33:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:34:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501723 (10phaultfinder)
[18:38:27] <wikibugs>	 10ops-eqiad, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T384861#10501742 (10VRiley-WMF) @Jgreen It looks like we are having an issue on this connection. Could we plan for a time for us to swap the transceiver? Let us know, thanks!
[18:38:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:38:55] <wikibugs>	 (03PS2) 10Jasmine: wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227)
[18:39:31] <wikibugs>	 (03CR) 10Jasmine: "ty!" [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine)
[18:44:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10501758 (10phaultfinder)
[18:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:56:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[18:59:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1245:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1245 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:00:04] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T1900)
[19:01:03] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[19:01:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[19:02:03] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[19:06:18] <wikibugs>	 (03PS1) 10Xcollazo: Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953)
[19:07:36] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[19:10:16] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114791 (https://phabricator.wikimedia.org/T382365)
[19:10:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114791 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot)
[19:11:00] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114791 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot)
[19:12:06] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:31] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10501811 (10Papaul)
[19:12:36] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[19:12:51] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820#10501814 (10Papaul) 05Open→03Resolved a:03Papaul Complete
[19:20:14] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.14  refs T382365
[19:20:18] <stashbot>	 T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365
[19:22:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:25:17] <wikibugs>	 (03PS1) 10DLynch: Enable VisualEditor's EditCheck multiple-check mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658)
[19:25:36] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch)
[19:27:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:36:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[19:44:21] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966 (10RobH) 03NEW
[19:45:07] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10501995 (10RobH) a:03bking @bking,  As discussed in IRC, assigning this to you for further details on racking restrictions section of racking details.  In addition to the a...
[19:45:25] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10501999 (10RobH)
[19:45:52] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10502002 (10RobH)
[19:46:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72654 and previous config saved to /var/cache/conftool/dbconfig/20250128-194651-root.json
[19:47:06] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:50:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10502013 (10Scott_French) Tagging #traffic in hopes that someone (especially with expertise in our DNS configuration) may be able to help advance the request in T381904#10464...
[19:51:14] <wikibugs>	 10ops-eqiad, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T384861#10502019 (10Jgreen) >>! In T384861#10501741, @VRiley-WMF wrote: > @Jgreen It looks like we are having an issue on this connection. Could we plan for a time for us to swap the transceiver? Let us know, thanks!  The timin...
[19:52:34] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10502027 (10RobH)
[20:01:24] <wikibugs>	 (03PS1) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814)
[20:01:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72655 and previous config saved to /var/cache/conftool/dbconfig/20250128-200157-root.json
[20:02:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata)
[20:03:06] <wikibugs>	 (03PS2) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814)
[20:03:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T384592)', diff saved to https://phabricator.wikimedia.org/P72656 and previous config saved to /var/cache/conftool/dbconfig/20250128-200346-marostegui.json
[20:03:51] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:04:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata)
[20:04:38] <wikibugs>	 (03PS3) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814)
[20:05:03] <wikibugs>	 (03PS4) 10Ottomata: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814)
[20:09:09] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:09:35] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:11:33] <wikibugs>	 (03PS1) 10Ottomata: eventgate-analytics - upgrade to v1.10.0 and NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114798 (https://phabricator.wikimedia.org/T383814)
[20:15:39] <wikibugs>	 (03Abandoned) 10Gergő Tisza: Add machine-readable markings for SUL3 extension denylist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114351 (owner: 10Gergő Tisza)
[20:15:58] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Fix PHP 7.4 issue [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905)
[20:16:10] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: wikimedia/request-timeout: 2.0.1 -> 2.0.2 [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905)
[20:17:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72659 and previous config saved to /var/cache/conftool/dbconfig/20250128-201702-root.json
[20:17:38] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "I don't know if the corresponding core patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1114760 also needs to be backported? (I kno" [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[20:18:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[20:18:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P72660 and previous config saved to /var/cache/conftool/dbconfig/20250128-201853-marostegui.json
[20:26:46] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "It looks like it should, I see other cases where we've done both, e.g. https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1098581 and htt" [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[20:27:05] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905)
[20:27:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[20:32:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72661 and previous config saved to /var/cache/conftool/dbconfig/20250128-203207-root.json
[20:34:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P72662 and previous config saved to /var/cache/conftool/dbconfig/20250128-203400-marostegui.json
[20:34:59] <wikibugs>	 (03CR) 10TChin: [C:03+1] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata)
[20:47:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72663 and previous config saved to /var/cache/conftool/dbconfig/20250128-204712-root.json
[20:47:30] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian)
[20:49:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T384592)', diff saved to https://phabricator.wikimedia.org/P72664 and previous config saved to /var/cache/conftool/dbconfig/20250128-204907-marostegui.json
[20:49:12] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:49:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[20:49:27] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[20:49:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72665 and previous config saved to /var/cache/conftool/dbconfig/20250128-204933-marostegui.json
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T2100).
[21:00:04] <jouncebot>	 kemayo, MatmaRex, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:11] <MatmaRex>	 hi
[21:00:20] <Kemayo>	 o/
[21:00:37] <MatmaRex>	 my patches are not directly testable - i'm backporting them just to unbreak CI for future backports
[21:02:01] <jeena>	 Hi, I can start deploying if no backport deployer is available
[21:03:15] <jeena>	 I'll start with Kemayo's patch
[21:03:26] <Kemayo>	 🎉
[21:03:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch)
[21:04:34] <wikibugs>	 (03Merged) 10jenkins-bot: Enable VisualEditor's EditCheck multiple-check mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114792 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch)
[21:05:44] <cscott>	 \o/
[21:06:05] <jeena>	 all done, now MatmaRex, is it okay to deploy all yours together?
[21:06:08] <cscott>	 jeena: arlolra is also here for our config change. 
[21:06:25] <jeena>	 👍
[21:06:32] <MatmaRex>	 jeena: yes
[21:07:04] <cscott>	 our patch shouldn't have any visible effect -- `composer checkDiff` shows no output -- it's just a cleanup.  Nevertheless we can smoke test it by checking that parsoid read views is still on/off on those wikis it should be on/off for.
[21:08:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[21:08:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[21:08:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[21:18:17] <wikibugs>	 (03Merged) 10jenkins-bot: Fix PHP 7.4 issue [extensions/Flow] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114799 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[21:30:53] <Kemayo>	 jeena: For what it's worth, my one patch actually is testable if you want me to stick around to do it before it goes out.
[21:31:39] <wikibugs>	 (03Merged) 10jenkins-bot: wikimedia/request-timeout: 2.0.1 -> 2.0.2 [vendor] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114800 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[21:31:52] <jeena>	 Kemayo: your patch was beta-only, right? So no production deployment happened. I think there is a job that updates beta?
[21:32:55] <Kemayo>	 jeena: Ah, I actually didn't realize that testwiki was on that sort of update schedule. I suppose I will check back on it in an hour or two and see whether said job has occurred.
[21:33:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[21:34:02] <cscott>	 i'm afk for 15 minutes, @subbu and @arlolra are here and can check the config patch when it deploys.
[21:35:18] <jeena>	 Kemayo: I think it's because the file you changed was InitialiseSettings-labs.php
[21:36:30] <jeena>	 MatmaRex: I'm going to try running tests again on your patch that failed
[21:37:04] <Kemayo>	 jeena: It makes sense, I just don't really think of testwiki as being part of the beta cluster.
[21:37:46] <p858snake|cloud>	 testwiki isn't part of the betacluster
[21:38:04] <p858snake|cloud>	 it's a normal wiki, unless something has changed
[21:38:33] <Kemayo>	 Ah, so a deployment would be needed after all?
[21:38:51] <wikibugs>	 (03CR) 10Jeena Huneidi: "recheck" [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114802 (https://phabricator.wikimedia.org/T384905) (owner: 10Bartosz Dziewoński)
[21:38:54] <p858snake|cloud>	 -labs.php changes will only go to the betacluster
[21:39:09] <MatmaRex>	 jeena: please do. the failure seems unrelated to the changes, the error message is a database deadlock: https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74/33528/artifact/log/mw-error.log/*view*/
[21:39:10] <p858snake|cloud>	 if you want testwiki you need to update the non-labs file
[21:41:23] <Kemayo>	 Hm, I suppose I'm not the only one who was confused about that. There are other uses of testwiki in that -labs file, which is why I thought it was okay.
[21:41:42] <Kemayo>	 jeena: If I make a patch to fix that mixup, would you still be able to get it into this window?
[21:42:06] <jeena>	 Kemayo: that would probably be fine
[21:42:26] <Kemayo>	 jeena: okay, one sec
[21:44:36] <wikibugs>	 (03PS1) 10DLynch: Move VE EditCheck testwiki enabling into the correct file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114804 (https://phabricator.wikimedia.org/T384658)
[21:45:29] <Kemayo>	 jeena: ^ that should do it
[21:46:52] <jeena>	 Kemayo: are you missing adding testwiki to wgVisualEditorEditCheck?
[21:47:15] <Kemayo>	 jeena: It's not needed, since testwiki is a wikipedia and it's already enabled for all those.
[21:47:24] <jeena>	 oh ok
[21:51:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72666 and previous config saved to /var/cache/conftool/dbconfig/20250128-215109-marostegui.json
[21:51:15] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[21:56:23] <aqu>	 !log Deployed refinery-source using jenkins
[21:56:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:01] <aqu>	 !log About to deploy analytics/refinery
[21:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:34] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@3959b36]: Regular analytics weekly train [analytics/refinery@3959b36b]
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250128T2200)
[22:00:38] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@3959b36]: Regular analytics weekly train [analytics/refinery@3959b36b] (duration: 02m 03s)
[22:00:59] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@3959b36] (thin): Regular analytics weekly train THIN [analytics/refinery@3959b36b]
[22:01:20] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1114799|Fix PHP 7.4 issue (T384905)]], [[gerrit:1114800|wikimedia/request-timeout: 2.0.1 -> 2.0.2 (T384905)]], [[gerrit:1114802|composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 (T384905)]]
[22:01:25] <stashbot>	 T384905: Class Flow\Exception\InvalidDataException does not exist / Declaration of Flow\Exception\FlowException::__construct should be compatible with Wikimedia\NormalizedException\NormalizedException::normalizedConstructor - https://phabricator.wikimedia.org/T384905
[22:02:07] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@3959b36] (thin): Regular analytics weekly train THIN [analytics/refinery@3959b36b] (duration: 01m 08s)
[22:02:47] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@3959b36] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@3959b36b]
[22:03:22] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@3959b36] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@3959b36b] (duration: 00m 34s)
[22:03:31] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:04:31] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:06:14] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi, matmarex: Backport for [[gerrit:1114799|Fix PHP 7.4 issue (T384905)]], [[gerrit:1114800|wikimedia/request-timeout: 2.0.1 -> 2.0.2 (T384905)]], [[gerrit:1114802|composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 (T384905)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:06:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P72667 and previous config saved to /var/cache/conftool/dbconfig/20250128-220616-marostegui.json
[22:06:21] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi, matmarex: Continuing with sync
[22:08:56] <aqu>	 !log Deployed refinery-source using jenkins
[22:10:22] <wikibugs>	 (03PS5) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[22:11:00] <wikibugs>	 (03PS5) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[22:12:48] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114799|Fix PHP 7.4 issue (T384905)]], [[gerrit:1114800|wikimedia/request-timeout: 2.0.1 -> 2.0.2 (T384905)]], [[gerrit:1114802|composer: wikimedia/request-timeout 2.0.1 -> 2.0.2 (T384905)]] (duration: 11m 27s)
[22:13:55] <jeena>	 cscott: Kemayo I'm going to do both of your config changes together if that's okay 
[22:14:04] <Kemayo>	 jeena: Fine by me!
[22:14:56] <arlolra>	 cscott may still be away, but go for it
[22:15:00] <jeena>	 cool thanks
[22:15:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114804 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch)
[22:15:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian)
[22:15:20] <MatmaRex>	 thanks for deploying jeena
[22:15:50] <jeena>	 you're welcome! sorry i messed up with the recheck thing thinking it would gate-and-submit again
[22:15:58] <wikibugs>	 (03Merged) 10jenkins-bot: Move VE EditCheck testwiki enabling into the correct file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114804 (https://phabricator.wikimedia.org/T384658) (owner: 10DLynch)
[22:16:00] <wikibugs>	 (03Merged) 10jenkins-bot: Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: 10C. Scott Ananian)
[22:16:12] <wikibugs>	 (03PS1) 10Aqu: Refine: Bump jar version to 0.2.49.3 [puppet] - 10https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914)
[22:16:31] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1114804|Move VE EditCheck testwiki enabling into the correct file (T384658)]], [[gerrit:1114425|Condense wikivoyage configuration options for Parsoid Read Views (T365367)]]
[22:19:30] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi, cscott, kemayo: Backport for [[gerrit:1114804|Move VE EditCheck testwiki enabling into the correct file (T384658)]], [[gerrit:1114425|Condense wikivoyage configuration options for Parsoid Read Views (T365367)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:20:24] <Kemayo>	 jeena: Tested mine on 2002, and it looks good.
[22:20:51] <jeena>	 arlolra: ready for you to test
[22:21:01] <arlolra>	 ok, one sec
[22:21:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P72670 and previous config saved to /var/cache/conftool/dbconfig/20250128-222123-marostegui.json
[22:22:55] <wikibugs>	 (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: 10Aqu)
[22:23:32] <arlolra>	 jeena: seems good
[22:23:42] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi, cscott, kemayo: Continuing with sync
[22:26:31] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:26:31] <icinga-wm>	 RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:29:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10502550 (10phaultfinder)
[22:30:19] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114804|Move VE EditCheck testwiki enabling into the correct file (T384658)]], [[gerrit:1114425|Condense wikivoyage configuration options for Parsoid Read Views (T365367)]] (duration: 13m 48s)
[22:30:25] <stashbot>	 T384658: Conduct pre-deployment QA of showing multiple Reference Checks in a given edit - https://phabricator.wikimedia.org/T384658
[22:30:25] <stashbot>	 T365367: [EPIC] Deploy Parsoid Read Views for English Wikivoyage and Hebrew Wikivoyage - https://phabricator.wikimedia.org/T365367
[22:31:04] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[22:31:48] <arlolra>	 thanks jeena 
[22:32:15] <jeena>	 👍
[22:32:28] <jeena>	 backport window completed
[22:35:30] <wikibugs>	 (03PS6) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[22:35:55] <wikibugs>	 (03PS6) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[22:36:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72672 and previous config saved to /var/cache/conftool/dbconfig/20250128-223630-marostegui.json
[22:36:36] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[22:36:46] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:36:49] <Kemayo>	 jeena: Thanks! And sorry about the misunderstanding leading to extra work.
[22:36:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72673 and previous config saved to /var/cache/conftool/dbconfig/20250128-223652-marostegui.json
[22:37:18] <Kemayo>	 (Also, p858snake|cloud, thanks for letting me know what I'd misunderstood.)
[22:37:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe)
[22:37:57] <jeena>	 Kemayo: no worries, glad we could get it sorted, and yeah thanks p858snake|cloud for a better explanation :)
[22:38:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) (owner: 10Raymond Ndibe)
[22:39:02] <cscott>	 thanks jeena, arlolra !
[22:42:53] <wikibugs>	 (03PS7) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[22:46:49] <wikibugs>	 (03PS7) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[23:12:07] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:31:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72675 and previous config saved to /var/cache/conftool/dbconfig/20250128-233130-marostegui.json
[23:31:35] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[23:37:55] <wikibugs>	 (03PS8) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[23:39:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe)
[23:45:09] <wikibugs>	 (03PS9) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[23:46:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P72676 and previous config saved to /var/cache/conftool/dbconfig/20250128-234637-marostegui.json
[23:47:07] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:47:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe)
[23:49:31] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:52:13] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10502663 (10cmooney) I was able to run a manual poller command with the updated 'lnms' command and it shows errors pro...
[23:53:04] <wikibugs>	 (03PS10) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[23:53:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe)
[23:55:07] <wikibugs>	 (03PS11) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[23:56:44] <wikibugs>	 (03PS2) 10Scott French: Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845)
[23:56:44] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review! I plan to move forward with this during the one-off infra window I've scheduled for Wednesday at 16:00 U" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[23:59:15] <wikibugs>	 (03CR) 10Xcollazo: "@tchin@wikimedia.org can you please merge? I don't have +2 in this repo." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)