[00:01:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P72677 and previous config saved to /var/cache/conftool/dbconfig/20250129-000144-marostegui.json
[00:11:05] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 657.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:16:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72678 and previous config saved to /var/cache/conftool/dbconfig/20250129-001651-marostegui.json
[00:16:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance
[00:16:56] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[00:17:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72679 and previous config saved to /var/cache/conftool/dbconfig/20250129-001702-marostegui.json
[00:17:23] <wikibugs>	 (PS2) Scott French: shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038)
[00:17:23] <wikibugs>	 (PS2) Scott French: shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038)
[00:17:56] <wikibugs>	 (PS2) Scott French: shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038)
[00:17:56] <wikibugs>	 (PS2) Scott French: shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038)
[00:17:56] <wikibugs>	 (PS2) Scott French: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038)
[00:26:21] <wikibugs>	 (PS8) Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[00:27:07] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:38:11] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114812
[00:40:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:47:10] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:57:10] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:58:35] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114812 (owner: TrainBranchBot)
[01:00:02] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114478 (owner: TrainBranchBot)
[01:08:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114815
[01:08:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114815 (owner: TrainBranchBot)
[01:11:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72680 and previous config saved to /var/cache/conftool/dbconfig/20250129-011157-marostegui.json
[01:12:02] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[01:27:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P72681 and previous config saved to /var/cache/conftool/dbconfig/20250129-012703-marostegui.json
[01:27:07] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:28:27] <wikibugs>	 (CR) Cathal Mooney: [C:+1] netbox: use asctime in the logs [puppet] - https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) (owner: Volans)
[01:28:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114815 (owner: TrainBranchBot)
[01:29:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:42:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P72682 and previous config saved to /var/cache/conftool/dbconfig/20250129-014210-marostegui.json
[01:46:27] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/814cb09f4ce883829fb9195053b3ab127bbf1c8c1935c70f205fae91cb4fbf7b/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:57:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72683 and previous config saved to /var/cache/conftool/dbconfig/20250129-015717-marostegui.json
[01:57:22] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[01:57:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance
[02:06:27] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:10:05] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:27:07] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:29:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10502803 (andrea.denisse) Hi @cmooney,  I was reviewing the [[ https://github.com/librenms/librenms/releases/tag/25....
[02:56:21] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.1.0 - T384258
[02:56:26] <stashbot>	 T384258: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258
[02:56:34] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.1.0 - T384258 (duration: 00m 13s)
[03:02:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:34] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance
[03:18:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, SRE Observability (FY2024/2025-Q3): LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10502812 (andrea.denisse) Open→Resolved After upgrading to v25.1...
[03:27:07] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:29:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:37:01] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:37:53] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:42:01] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:42:51] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:47:41] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:47:51] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:52:07] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:53:01] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:59:53] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:08:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance
[04:08:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T384592)', diff saved to https://phabricator.wikimedia.org/P72685 and previous config saved to /var/cache/conftool/dbconfig/20250129-040822-marostegui.json
[04:08:27] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[04:56:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T384592)', diff saved to https://phabricator.wikimedia.org/P72686 and previous config saved to /var/cache/conftool/dbconfig/20250129-045600-marostegui.json
[04:56:06] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[05:11:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P72687 and previous config saved to /var/cache/conftool/dbconfig/20250129-051108-marostegui.json
[05:26:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P72688 and previous config saved to /var/cache/conftool/dbconfig/20250129-052615-marostegui.json
[05:41:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T384592)', diff saved to https://phabricator.wikimedia.org/P72689 and previous config saved to /var/cache/conftool/dbconfig/20250129-054121-marostegui.json
[05:41:28] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[05:41:39] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance
[05:41:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T384592)', diff saved to https://phabricator.wikimedia.org/P72690 and previous config saved to /var/cache/conftool/dbconfig/20250129-054145-marostegui.json
[05:47:46] <wikibugs>	 (PS14) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[05:49:30] <wikibugs>	 (PS15) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[05:50:02] <wikibugs>	 (CR) AOkoth: "Acknowledged" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[06:12:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2156', diff saved to https://phabricator.wikimedia.org/P72691 and previous config saved to /var/cache/conftool/dbconfig/20250129-061214-marostegui.json
[06:12:45] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: maintenance
[06:12:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: maintenance
[06:13:02] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2156.codfw.wmnet
[06:14:17] <wikibugs>	 (PS1) Marostegui: rebuild_tables.sh: Add sleep [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842)
[06:18:06] <wikibugs>	 (CR) Marostegui: "FYI" [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[06:18:07] <wikibugs>	 (CR) Marostegui: [C:+2] rebuild_tables.sh: Add sleep [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[06:18:35] <wikibugs>	 (Merged) jenkins-bot: rebuild_tables.sh: Add sleep [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[06:19:45] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2156.codfw.wmnet
[06:20:29] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Index rebuild
[06:21:56] <wikibugs>	 (PS1) Marostegui: installserver: Reimage db1257 [puppet] - https://gerrit.wikimedia.org/r/1114833 (https://phabricator.wikimedia.org/T384979)
[06:29:29] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Reimage db1257 [puppet] - https://gerrit.wikimedia.org/r/1114833 (https://phabricator.wikimedia.org/T384979) (owner: Marostegui)
[06:30:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T384592)', diff saved to https://phabricator.wikimedia.org/P72692 and previous config saved to /var/cache/conftool/dbconfig/20250129-063015-marostegui.json
[06:30:20] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[06:34:47] <wikibugs>	 (PS1) Marostegui: mariadb: Add new future host [puppet] - https://gerrit.wikimedia.org/r/1114900 (https://phabricator.wikimedia.org/T384979)
[06:45:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1230 db2157 T384994', diff saved to https://phabricator.wikimedia.org/P72693 and previous config saved to /var/cache/conftool/dbconfig/20250129-064545-marostegui.json
[06:45:52] <stashbot>	 T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[06:45:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P72694 and previous config saved to /var/cache/conftool/dbconfig/20250129-064555-marostegui.json
[06:46:11] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1230.eqiad.wmnet
[06:46:18] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2157.codfw.wmnet
[06:49:39] <wikibugs>	 (PS1) Marostegui: db1230,db2157: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114904 (https://phabricator.wikimedia.org/T384994)
[06:49:43] <wikibugs>	 (CR) Marostegui: [C:+2] db1230,db2157: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114904 (https://phabricator.wikimedia.org/T384994) (owner: Marostegui)
[06:49:47] <wikibugs>	 (PS1) Marostegui: Revert "db1230,db2157: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114905
[06:49:51] <wikibugs>	 (CR) Marostegui: [C:-2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/1114905 (owner: Marostegui)
[06:51:49] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1230.eqiad.wmnet
[06:52:37] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2157.codfw.wmnet
[06:52:43] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Index rebuild
[06:53:12] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Index rebuild
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0700)
[07:01:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P72695 and previous config saved to /var/cache/conftool/dbconfig/20250129-070103-marostegui.json
[07:16:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T384592)', diff saved to https://phabricator.wikimedia.org/P72696 and previous config saved to /var/cache/conftool/dbconfig/20250129-071610-marostegui.json
[07:16:15] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[07:16:26] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2221.codfw.wmnet with reason: Maintenance
[07:16:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T384592)', diff saved to https://phabricator.wikimedia.org/P72697 and previous config saved to /var/cache/conftool/dbconfig/20250129-071632-marostegui.json
[07:32:07] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:19] <wikibugs>	 (CR) Muehlenhoff: nftables: add types and directories (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1114717 (https://phabricator.wikimedia.org/T370677) (owner: Arnaudb)
[07:33:48] <moritzm>	 !log installing Tomcat security updates
[07:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:47] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2028.codfw.wmnet with reason: remove from cluster for reimage
[07:34:56] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503006 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=160bb060-4ed1-4784-9312-c60a5421c725) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[07:36:44] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti2028 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114741 (owner: Muehlenhoff)
[07:40:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1004.wikimedia.org
[07:40:45] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.pool db1230 gradually with 4 steps - Repooling after rebuild index T384994
[07:40:49] <logmsgbot>	 !log root@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1230 gradually with 4 steps - Repooling after rebuild index T384994
[07:40:49] <stashbot>	 T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[07:42:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2028.codfw.wmnet with OS bookworm
[07:42:08] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503015 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm
[07:44:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1004.wikimedia.org
[07:52:07] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:54:40] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:55:30] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti2028.codfw.wmnet with OS bookworm
[07:55:35] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503023 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm executed with errors:...
[07:55:38] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:55:43] <wikibugs>	 (CR) Volans: [C:+2] netbox: use asctime in the logs [puppet] - https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) (owner: Volans)
[07:56:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2028.codfw.wmnet with OS bookworm
[07:56:17] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503025 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm
[07:58:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
[07:58:38] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503026 (ops-monitoring-bot) Draining ganeti2031.codfw.wmnet of running VMs
[07:58:40] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.pool db2157 gradually with 4 steps - Repooling after rebuild index T384994
[07:58:41] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney)
[07:58:44] <stashbot>	 T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:01:14] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10503032 (Volans) Open→Resolved a:Volans The patch has been deployed and this is now fixed.
[08:04:06] <wikibugs>	 (PS16) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[08:06:11] <wikibugs>	 (PS1) Slyngshede: P:idm add logstash to requestable permission [puppet] - https://gerrit.wikimedia.org/r/1114949
[08:08:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
[08:08:26] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: Remove rsa-2048 certs from apache config [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[08:08:52] <wikibugs>	 (PS2) BCornwall: gerrit: Remove rsa-2048 certs from apache config [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569)
[08:12:20] <wikibugs>	 (CR) Muehlenhoff: "This looks good, but before merging let me add a description to the group." [puppet] - https://gerrit.wikimedia.org/r/1114949 (owner: Slyngshede)
[08:12:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage
[08:14:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
[08:14:23] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503036 (ops-monitoring-bot) Draining ganeti2031.codfw.wmnet of running VMs
[08:15:52] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti2031 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114950
[08:16:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage
[08:16:43] <wikibugs>	 (PS17) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[08:17:42] <wikibugs>	 (CR) AOkoth: miscweb: support os-reports deployment (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[08:17:50] <wikibugs>	 (PS18) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[08:17:55] <wikibugs>	 (CR) Jelto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[08:26:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db1230', diff saved to https://phabricator.wikimedia.org/P72700 and previous config saved to /var/cache/conftool/dbconfig/20250129-082606-marostegui.json
[08:26:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T384592)', diff saved to https://phabricator.wikimedia.org/P72701 and previous config saved to /var/cache/conftool/dbconfig/20250129-082613-marostegui.json
[08:26:18] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:28:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72702 and previous config saved to /var/cache/conftool/dbconfig/20250129-082841-root.json
[08:29:42] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[08:30:20] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:30:26] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:30:38] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[08:30:38] <wikibugs>	 (CR) Hashar: "After the replica got updated:" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[08:30:42] <vgutierrez>	 that's me testing lvs4010 
[08:31:33] <vgutierrez>	 !log depooled lvs4009 during 60s to test lvs4010 running liberica - T384477
[08:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:38] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[08:33:26] <wikibugs>	 (CR) Jelto: [C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[08:34:50] <wikibugs>	 (CR) Fabfur: [C:+1] Refine: Bump jar version to 0.2.49.3 [puppet] - https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: Aqu)
[08:36:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2028.codfw.wmnet with OS bookworm
[08:36:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503068 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm completed: - ganeti202...
[08:41:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P72704 and previous config saved to /var/cache/conftool/dbconfig/20250129-084120-marostegui.json
[08:42:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti2028.codfw.wmnet
[08:42:15] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999 (LSobanski) NEW
[08:43:07] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: consolidate haproxykafka into common profile [puppet] - https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[08:43:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72705 and previous config saved to /var/cache/conftool/dbconfig/20250129-084347-root.json
[08:44:01] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2157 gradually with 4 steps - Repooling after rebuild index T384994
[08:44:07] <stashbot>	 T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[08:45:08] <wikibugs>	 (CR) Marostegui: Revert "db1230,db2157: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114905 (owner: Marostegui)
[08:45:10] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1230,db2157: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114905 (owner: Marostegui)
[08:45:32] <wikibugs>	 (CR) Gmodena: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: Xcollazo)
[08:46:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1210 db2223 T384994', diff saved to https://phabricator.wikimedia.org/P72707 and previous config saved to /var/cache/conftool/dbconfig/20250129-084611-marostegui.json
[08:46:17] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2223.codfw.wmnet
[08:46:24] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1210.eqiad.wmnet
[08:47:02] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999#10503094 (JMeybohm)
[08:48:03] <_joe_>	 jouncebot: now
[08:48:03] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0800)
[08:48:30] <_joe_>	 well I'll slip in my change I couldn't deploy yesterday
[08:51:08] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1210.eqiad.wmnet
[08:51:38] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Index rebuild
[08:51:47] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2223.codfw.wmnet
[08:52:08] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2223.codfw.wmnet with reason: Index rebuild
[08:52:24] <wikibugs>	 (CR) Arnaudb: nftables: add docker profile and forward chain (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677) (owner: Arnaudb)
[08:52:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: Giuseppe Lavagetto)
[08:53:36] <wikibugs>	 (Merged) jenkins-bot: DBRecordCache: handle default section [mediawiki-config] - https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: Giuseppe Lavagetto)
[08:54:24] <logmsgbot>	 !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1113788|DBRecordCache: handle default section (T382947)]]
[08:54:29] <stashbot>	 T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947
[08:56:25] <wikibugs>	 (CR) Jelto: "looks mostly good, one comment in-line" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[08:56:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P72710 and previous config saved to /var/cache/conftool/dbconfig/20250129-085627-marostegui.json
[08:57:36] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1113788|DBRecordCache: handle default section (T382947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:58:23] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Continuing with sync
[08:58:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72711 and previous config saved to /var/cache/conftool/dbconfig/20250129-085852-root.json
[08:59:50] <wikibugs>	 (CR) Fabfur: [C:+2] Refine: Bump jar version to 0.2.49.3 [puppet] - https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: Aqu)
[09:00:04] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0900)
[09:05:04] <logmsgbot>	 !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113788|DBRecordCache: handle default section (T382947)]] (duration: 10m 39s)
[09:05:10] <stashbot>	 T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947
[09:11:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T384592)', diff saved to https://phabricator.wikimedia.org/P72713 and previous config saved to /var/cache/conftool/dbconfig/20250129-091134-marostegui.json
[09:11:40] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:11:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2222.codfw.wmnet with reason: Maintenance
[09:11:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72714 and previous config saved to /var/cache/conftool/dbconfig/20250129-091156-marostegui.json
[09:12:18] <wikibugs>	 (PS1) Vgutierrez: prometheus::ops: Scrape ipip-mq-optimizer metrics on liberica nodes [puppet] - https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001)
[09:13:05] <wikibugs>	 (PS2) Vgutierrez: prometheus::ops: Scrape ipip-mq-optimizer metrics on liberica nodes [puppet] - https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001)
[09:13:23] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) (owner: Vgutierrez)
[09:13:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72715 and previous config saved to /var/cache/conftool/dbconfig/20250129-091357-root.json
[09:15:57] <wikibugs>	 (PS1) DCausse: cirrus: add v1 stream for the search update pipeline [mediawiki-config] - https://gerrit.wikimedia.org/r/1114955 (https://phabricator.wikimedia.org/T375821)
[09:15:59] <wikibugs>	 (PS1) DCausse: cirrus: drop rc0 streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821)
[09:16:14] <wikibugs>	 (CR) DCausse: [C:-2] cirrus: drop rc0 streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[09:25:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
[09:29:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72716 and previous config saved to /var/cache/conftool/dbconfig/20250129-092902-root.json
[09:31:01] <wikibugs>	 (PS3) Arnaudb: nftables: add docker profile and forward chain [puppet] - https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677)
[09:31:08] <wikibugs>	 (PS3) Arnaudb: nftables: add types and directories [puppet] - https://gerrit.wikimedia.org/r/1114717 (https://phabricator.wikimedia.org/T370677)
[09:31:14] <wikibugs>	 (PS4) Arnaudb: nftables: add nftable docker manifest [puppet] - https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677)
[09:31:24] <wikibugs>	 (PS2) Arnaudb: gitlab_runner: add nftables logic [puppet] - https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677)
[09:31:29] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.pool db2156 gradually with 4 steps - Repooling after rebuild index T384807
[09:31:33] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[09:36:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
[09:36:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti2028.codfw.wmnet
[09:39:31] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, SRE Observability (FY2024/2025-Q3): LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10503188 (cmooney) >>! In T384258#10502812, @andrea.denisse wrote: > Aft...
[09:41:20] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: maintenance
[09:42:07] <marostegui>	 !log Upgrade mariadb on an-redacteddb1001
[09:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:13] <wikibugs>	 (Abandoned) Marostegui: dbproxy: switch CNAMEs [dns] - https://gerrit.wikimedia.org/r/1087374 (https://phabricator.wikimedia.org/T368874) (owner: Arnaudb)
[09:48:42] <wikibugs>	 (CR) Fabfur: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) (owner: Vgutierrez)
[09:57:44] <wikibugs>	 (CR) Vgutierrez: [C:+2] prometheus::ops: Scrape ipip-mq-optimizer metrics on liberica nodes [puppet] - https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) (owner: Vgutierrez)
[10:00:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
[10:00:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72719 and previous config saved to /var/cache/conftool/dbconfig/20250129-100037-marostegui.json
[10:00:43] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:00:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72720 and previous config saved to /var/cache/conftool/dbconfig/20250129-100048-root.json
[10:01:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72721 and previous config saved to /var/cache/conftool/dbconfig/20250129-100104-root.json
[10:01:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet with reason: maintenance
[10:04:15] <wikibugs>	 (CR) Cathal Mooney: [C:+2] gnmic: use event-value-tag-v2 to improve performance [puppet] - https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney)
[10:08:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
[10:08:53] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1114900 (https://phabricator.wikimedia.org/T384979) (owner: Marostegui)
[10:09:07] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Add new future host [puppet] - https://gerrit.wikimedia.org/r/1114900 (https://phabricator.wikimedia.org/T384979) (owner: Marostegui)
[10:10:19] <wikibugs>	 ops-eqiad, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10503318 (Marostegui) a:Marostegui→None
[10:15:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P72723 and previous config saved to /var/cache/conftool/dbconfig/20250129-101544-marostegui.json
[10:15:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72724 and previous config saved to /var/cache/conftool/dbconfig/20250129-101553-root.json
[10:16:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72725 and previous config saved to /var/cache/conftool/dbconfig/20250129-101609-root.json
[10:16:51] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2156 gradually with 4 steps - Repooling after rebuild index T384807
[10:16:55] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[10:17:23] <wikibugs>	 (PS2) Muehlenhoff: openssh: Remove code to disable NIST key exchange [puppet] - https://gerrit.wikimedia.org/r/1074381
[10:18:42] <moritzm>	 !log installing git-lfs security updates on bullseye
[10:18:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:48] <wikibugs>	 (CR) AOkoth: miscweb: support os-reports deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[10:19:13] <wikibugs>	 (PS19) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[10:21:34] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: maintenance
[10:21:46] <marostegui>	 !log Upgrade and reboot db1154 (s1, s3, s5, s8 wikireplicas will get lag)
[10:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:29] <wikibugs>	 (PS20) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[10:25:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1016.eqiad.wmnet with reason: maintenance
[10:25:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1020.eqiad.wmnet with reason: maintenance
[10:26:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1013.eqiad.wmnet with reason: maintenance
[10:26:15] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1017.eqiad.wmnet with reason: maintenance
[10:30:00] <wikibugs>	 (CR) Jelto: [C:+1] "looks good to me now 🚢" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[10:30:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P72727 and previous config saved to /var/cache/conftool/dbconfig/20250129-103051-marostegui.json
[10:30:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72728 and previous config saved to /var/cache/conftool/dbconfig/20250129-103059-root.json
[10:31:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72729 and previous config saved to /var/cache/conftool/dbconfig/20250129-103115-root.json
[10:36:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es1025 T384912', diff saved to https://phabricator.wikimedia.org/P72730 and previous config saved to /var/cache/conftool/dbconfig/20250129-103652-fceratto.json
[10:36:58] <stashbot>	 T384912: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912
[10:37:15] <wikibugs>	 (CR) AOkoth: [C:+2] miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[10:38:02] <wikibugs>	 (CR) Muehlenhoff: P:idm add logstash to requestable permission (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114949 (owner: Slyngshede)
[10:39:49] <wikibugs>	 (PS1) Federico Ceratto: instances.yaml: remove es1025 [puppet] - https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912)
[10:40:05] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:40:24] <wikibugs>	 (PS2) Slyngshede: P:idm add logstash to requestable permission [puppet] - https://gerrit.wikimedia.org/r/1114949
[10:40:27] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:40:54] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1114949 (owner: Slyngshede)
[10:41:22] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[10:41:26] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[10:42:17] <wikibugs>	 (CR) Marostegui: [C:+1] instances.yaml: remove es1025 [puppet] - https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912) (owner: Federico Ceratto)
[10:43:38] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[10:43:43] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[10:45:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72731 and previous config saved to /var/cache/conftool/dbconfig/20250129-104558-marostegui.json
[10:46:04] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:46:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72732 and previous config saved to /var/cache/conftool/dbconfig/20250129-104604-root.json
[10:46:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72733 and previous config saved to /var/cache/conftool/dbconfig/20250129-104620-root.json
[10:46:24] <wikibugs>	 (CR) Federico Ceratto: [C:+1] instances.yaml: remove es1025 [puppet] - https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912) (owner: Federico Ceratto)
[10:48:49] <wikibugs>	 (CR) Federico Ceratto: [C:+2] instances.yaml: remove es1025 [puppet] - https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912) (owner: Federico Ceratto)
[10:49:03] <wikibugs>	 (PS1) JMeybohm: Add restricted users to deployment_server [puppet] - https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429)
[10:51:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:51:26] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4881/co" [puppet] - https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm)
[10:51:34] <wikibugs>	 (PS1) AOkoth: misweb: fix type error and service account [deployment-charts] - https://gerrit.wikimedia.org/r/1114965 (https://phabricator.wikimedia.org/T350794)
[10:52:12] <wikibugs>	 ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10503555 (Clement_Goubert)
[10:52:28] <wikibugs>	 (CR) JMeybohm: "This should probably do the trick already" [puppet] - https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm)
[10:52:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove es1025 from dbctl T384912', diff saved to https://phabricator.wikimedia.org/P72734 and previous config saved to /var/cache/conftool/dbconfig/20250129-105232-fceratto.json
[10:52:37] <stashbot>	 T384912: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912
[10:52:43] <wikibugs>	 (CR) JMeybohm: [V:+1] Add restricted users to deployment_server [puppet] - https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm)
[10:52:45] <wikibugs>	 (PS1) Hashar: Do not copy Code-Review +2 [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/1114966
[10:53:06] <wikibugs>	 (PS4) Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329)
[10:53:06] <wikibugs>	 (PS7) Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329)
[10:53:18] <wikibugs>	 ops-magru, Infrastructure-Foundations, netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10503562 (cmooney) Open→Resolved Gonna close this one, all is stable after ~24h.
[10:53:44] <wikibugs>	 (CR) Hashar: [V:+2 C:+2] Do not copy Code-Review +2 [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/1114966 (owner: Hashar)
[10:53:57] <wikibugs>	 (PS1) Cathal Mooney: Prometheus: change gnmi label rewrite from 'target' to 'source' [puppet] - https://gerrit.wikimedia.org/r/1114967 (https://phabricator.wikimedia.org/T369384)
[10:54:16] <wikibugs>	 (CR) CI reject: [V:-1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[10:54:30] <wikibugs>	 (CR) CI reject: [V:-1] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[10:54:47] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1114965 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[10:56:04] <wikibugs>	 (PS5) Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329)
[10:56:04] <wikibugs>	 (PS8) Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329)
[10:56:56] <wikibugs>	 (PS6) Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329)
[10:56:56] <wikibugs>	 (PS9) Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329)
[10:57:52] <wikibugs>	 (CR) AOkoth: [C:+2] misweb: fix type error and service account [deployment-charts] - https://gerrit.wikimedia.org/r/1114965 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[10:58:03] <wikibugs>	 (CR) CI reject: [V:-1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[10:58:12] <wikibugs>	 (CR) CI reject: [V:-1] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[10:59:23] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[10:59:28] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[11:00:05] <jouncebot>	 effie and swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1100).
[11:00:19] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[11:00:23] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[11:01:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72735 and previous config saved to /var/cache/conftool/dbconfig/20250129-110109-root.json
[11:01:24] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] Prometheus: change gnmi label rewrite from 'target' to 'source' [puppet] - https://gerrit.wikimedia.org/r/1114967 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney)
[11:01:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72736 and previous config saved to /var/cache/conftool/dbconfig/20250129-110125-root.json
[11:01:32] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:01:36] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:02:33] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[11:02:38] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[11:02:41] <wikibugs>	 (PS2) Cathal Mooney: Prometheus: change gnmi label rewrite from 'target' to 'source' [puppet] - https://gerrit.wikimedia.org/r/1114967 (https://phabricator.wikimedia.org/T369384)
[11:04:09] <wikibugs>	 (PS7) Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329)
[11:04:09] <wikibugs>	 (PS10) Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329)
[11:05:09] <wikibugs>	 (CR) CI reject: [V:-1] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:05:17] <wikibugs>	 (CR) CI reject: [V:-1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:05:30] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2218.codfw.wmnet with reason: Maintenance
[11:05:59] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[11:06:03] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[11:07:11] <wikibugs>	 (PS1) Federico Ceratto: es1025.yaml, site.pp, backup1002.cnf.erb: Remove es102 [puppet] - https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912)
[11:07:36] <wikibugs>	 (PS1) JMeybohm: Allow to install multiple kubectl versions [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984)
[11:07:45] <wikibugs>	 (PS8) Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329)
[11:07:45] <wikibugs>	 (PS11) Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329)
[11:07:57] <wikibugs>	 (CR) Marostegui: "Typo in the commit message, "Remove es102"" [puppet] - https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) (owner: Federico Ceratto)
[11:10:07] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[11:10:15] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[11:10:36] <wikibugs>	 (PS1) MVernon: swift: remove drained eqiad nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056)
[11:11:50] <wikibugs>	 (Merged) jenkins-bot: shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[11:12:26] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4882/console" [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[11:13:20] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[11:13:26] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:13:31] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[11:14:12] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:14:59] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:15:41] <wikibugs>	 (CR) Btullis: [C:+1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:15:55] <wikibugs>	 (PS2) JMeybohm: Allow to install multiple kubectl versions [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984)
[11:17:08] <wikibugs>	 (PS2) Federico Ceratto: es1025.yaml, site.pp, backup1002.cnf.erb: Remove es1025 [puppet] - https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912)
[11:18:27] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:18:30] <wikibugs>	 (CR) Brouberol: [C:+2] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:19:25] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[11:19:50] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4883/console" [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[11:20:01] <wikibugs>	 (Merged) jenkins-bot: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:20:04] <wikibugs>	 (Merged) jenkins-bot: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[11:20:35] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "FWIW, the diff gets somewhat shorter if the list is sorted before and after the change – there’s still a fair amount of changes but also p" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan)
[11:21:11] <wikibugs>	 (Merged) jenkins-bot: shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[11:21:36] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[11:21:39] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[11:21:51] <wikibugs>	 (PS1) Urbanecm: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941)
[11:22:29] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[11:23:08] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[11:24:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
[11:24:50] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2031.codfw.wmnet with reason: remove from cluster for reimage
[11:24:55] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503653 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7af53928-134c-4589-9808-e36a2bde4422) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[11:25:14] <wikibugs>	 (PS1) Urbanecm: [tests] Add MigrateConfigToCommunityTest [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905)
[11:25:16] <wikibugs>	 (PS1) Urbanecm: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941)
[11:26:20] <wikibugs>	 (CR) Kamila Součková: "LGTM but I haven't checked folder permissions on the hosts" [puppet] - https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm)
[11:27:17] <wikibugs>	 (CR) Jcrespo: [V:+1] "I have checked syntax is right, I have not checked they finished draining." [puppet] - https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:28:49] <wikibugs>	 (PS2) Hnowlan: fc-list: update font list [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718)
[11:29:47] <wikibugs>	 (CR) Hnowlan: "Fair point, pushed a sorted list and looking much neater." [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan)
[11:31:56] <wikibugs>	 (PS1) Btullis: dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947)
[11:32:00] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[11:32:07] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:33:05] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[11:33:13] <wikibugs>	 (Merged) jenkins-bot: shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[11:33:14] <wikibugs>	 (CR) Btullis: dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[11:34:29] <wikibugs>	 (PS1) Jelto: Revert "Do not copy Code-Review +2" [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/1114980
[11:35:25] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow2003.codfw.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[11:35:33] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[11:35:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10503694 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36d26c8a-4d30-4345-8682-54b6b4882e38) set by cmooney@cumin1002 for 3:00:...
[11:35:58] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[11:36:32] <wikibugs>	 (CR) Jelto: [V:+2] Revert "Do not copy Code-Review +2" [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/1114980 (owner: Jelto)
[11:37:15] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#10503706 (MatthewVernon)
[11:40:13] <wikibugs>	 (CR) Marostegui: [C:+1] swift: remove drained eqiad nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:40:26] <wikibugs>	 (CR) Marostegui: [C:+1] es1025.yaml, site.pp, backup1002.cnf.erb: Remove es1025 [puppet] - https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) (owner: Federico Ceratto)
[11:40:41] <wikibugs>	 (CR) MVernon: [C:+2] swift: remove drained eqiad nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:41:04] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "(Just to be clear, the list already wasn’t sorted before, so the ”fully” neat diff I had in mind would’ve required a separate change just " [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan)
[11:42:09] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1025.eqiad.wmnet
[11:49:01] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:51:10] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.dns.netbox
[11:52:07] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:57:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1200', diff saved to https://phabricator.wikimedia.org/P72737 and previous config saved to /var/cache/conftool/dbconfig/20250129-115700-marostegui.json
[11:57:09] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1200.eqiad.wmnet
[12:00:04] <jouncebot>	 mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1200).
[12:00:29] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:00:51] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:02:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2211 T384994', diff saved to https://phabricator.wikimedia.org/P72738 and previous config saved to /var/cache/conftool/dbconfig/20250129-120213-marostegui.json
[12:02:19] <stashbot>	 T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[12:03:04] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2211.codfw.wmnet
[12:03:08] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002"
[12:03:11] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:03:21] <wikibugs>	 (PS3) Jcrespo: dbbackups: Remove set user permissions from m1 backup user grants [puppet] - https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902)
[12:03:22] <wikibugs>	 (PS1) Jcrespo: installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977)
[12:03:29] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1200.eqiad.wmnet
[12:03:38] <wikibugs>	 (PS4) Jcrespo: dbbackups: Remove set user permissions from m1 backup user grants [puppet] - https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902)
[12:03:46] <wikibugs>	 (PS2) Jcrespo: installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977)
[12:04:01] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:04:15] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Index rebuild
[12:04:29] <wikibugs>	 (CR) Jcrespo: [C:-1] "We need to remove read only admin, too." [puppet] - https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) (owner: Jcrespo)
[12:06:05] <urbanecm>	 jouncebot: nowandnext
[12:06:06] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1200)
[12:06:06] <jouncebot>	 In 1 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1400)
[12:06:44] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[12:07:14] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[12:07:31] <wikibugs>	 (CR) Urbanecm: [C:+2] [tests] Add ConfigWrapperTest [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114751 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:07:33] <wikibugs>	 (CR) Urbanecm: [C:+2] Remove BabelCategorizeNamespaces from CommunityConfiguration [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:07:39] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2211.codfw.wmnet
[12:07:42] <wikibugs>	 (CR) Urbanecm: [C:+2] [tests] Add MigrateConfigToCommunityTest [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:07:44] <wikibugs>	 (CR) Urbanecm: [C:+2] migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:07:59] <wikibugs>	 (CR) Urbanecm: [C:+2] migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:08:18] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Index rebuild
[12:08:55] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002"
[12:08:55] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:08:56] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1025.eqiad.wmnet
[12:09:52] <wikibugs>	 (CR) Federico Ceratto: [C:+2] es1025.yaml, site.pp, backup1002.cnf.erb: Remove es1025 [puppet] - https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) (owner: Federico Ceratto)
[12:20:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:20:46] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:20:49] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:25:39] <wikibugs>	 (PS1) Btullis: dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947)
[12:25:59] <wikibugs>	 (CR) CI reject: [V:-1] dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[12:26:10] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10503810 (FCeratto-WMF) In progress→Open a:FCeratto-WMF→None
[12:26:26] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10503816 (FCeratto-WMF) The host is ready for DC-ops
[12:27:00] <wikibugs>	 (PS2) Btullis: dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947)
[12:27:04] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:27:08] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:27:41] <wikibugs>	 (Merged) jenkins-bot: [tests] Add ConfigWrapperTest [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114751 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:27:43] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4886/co" [puppet] - https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[12:28:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:28:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:28:35] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:28:35] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:29:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:29:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:29:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:29:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:30:06] <wikibugs>	 (PS1) AOkoth: miscweb: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1114992 (https://phabricator.wikimedia.org/T350794)
[12:30:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:32:01] <wikibugs>	 (CR) Raymond Ndibe: "Reverting back on this, there is currently no way that I know of to stop kubeadm from regenerating this file David. The best approach righ" [puppet] - https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: Raymond Ndibe)
[12:32:14] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, this was missing in Iba37c095353b76bfaf1ee19228a4ec783b6239f9" [deployment-charts] - https://gerrit.wikimedia.org/r/1114992 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[12:32:57] <wikibugs>	 (CR) Slyngshede: [C:+2] C:idm remove associate_by_email pipeline [puppet] - https://gerrit.wikimedia.org/r/1112224 (https://phabricator.wikimedia.org/T383707) (owner: Slyngshede)
[12:33:16] <marostegui>	 !log Rebuild tables on dbstore1007 (s2, s3, s4) T384818
[12:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:21] <stashbot>	 T384818: Upgrade dbstore* hosts to 10.6.20 and rebuild tables - https://phabricator.wikimedia.org/T384818
[12:33:25] <wikibugs>	 (CR) AOkoth: [C:+2] miscweb: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1114992 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[12:34:08] <marostegui>	 !log Rebuild tables on dbstore1009 (s6 s8) T384818
[12:34:11] <wikibugs>	 (Merged) jenkins-bot: Remove BabelCategorizeNamespaces from CommunityConfiguration [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:12] <wikibugs>	 (Merged) jenkins-bot: [tests] Add MigrateConfigToCommunityTest [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: Urbanecm)
[12:34:14] <wikibugs>	 (Merged) jenkins-bot: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:35:02] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867)
[12:35:15] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867)
[12:35:32] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: Lucas Werkmeister (WMDE))
[12:35:40] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867) (owner: Lucas Werkmeister (WMDE))
[12:35:41] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:35:44] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:37:08] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:37:11] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:37:53] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:37:57] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:41:26] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1001 [puppet] - https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356)
[12:42:30] <icinga-wm>	 RECOVERY - Host ms-fe1014 is UP: PING WARNING - Packet loss = 33%, RTA = 0.38 ms
[12:43:00] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:44:02] <jynus>	 👀
[12:44:04] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:45:18] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1002 [puppet] - https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356)
[12:45:43] <jynus>	 that's a table rebuilding
[12:46:13] <marostegui>	 That's strange
[12:46:18] <marostegui>	 I downtimed it
[12:46:23] <marostegui>	 I will do it again
[12:46:51] <marostegui>	 Ah it failed apparently, anyway, doing it!
[12:47:10] <wikibugs>	 (CR) Hashar: [V:+2 C:+2] "I think the issue is the section overrides all properties from the parent All-Projects when I guess I assumed it would extend it. So as th" [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/1114966 (owner: Hashar)
[12:47:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: maintenance
[12:47:41] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1009.eqiad.wmnet with reason: maintenance
[12:48:54] <icinga-wm>	 PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100%
[12:49:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[1154,1212].eqiad.wmnet with reason: maintenance
[12:49:58] <wikibugs>	 (Merged) jenkins-bot: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: Urbanecm)
[12:50:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1212 T384807', diff saved to https://phabricator.wikimedia.org/P72741 and previous config saved to /var/cache/conftool/dbconfig/20250129-125015-marostegui.json
[12:50:20] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[12:50:35] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1114751|[tests] Add ConfigWrapperTest (T383905)]], [[gerrit:1114752|Remove BabelCategorizeNamespaces from CommunityConfiguration (T383905)]], [[gerrit:1114975|[tests] Add MigrateConfigToCommunityTest (T383905)]], [[gerrit:1114976|migrateConfigToCommunity: Deal with false category names (T384941)]], [[gerrit:1114973|migrateConfigToCommunity: Deal with
[12:50:35] <logmsgbot>	 false category names (T384941)]]
[12:50:40] <stashbot>	 T383905: Running extensions/Babel/maintenance/migrateConfigToCommunity.php with the default configuration fails on validation error - https://phabricator.wikimedia.org/T383905
[12:50:40] <stashbot>	 T384941: Setting wgBabelCategoryNames[level] to false is not supported by the migration script - https://phabricator.wikimedia.org/T384941
[12:50:41] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet with reason: maintenance
[12:50:47] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1212.eqiad.wmnet
[12:52:05] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1001 [puppet] - https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356)
[12:52:05] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1002 [puppet] - https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356)
[12:52:33] <wikibugs>	 (CR) Brouberol: [C:+1] dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[12:52:38] <wikibugs>	 (CR) Elukey: [C:+2] custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - https://gerrit.wikimedia.org/r/1114743 (owner: Elukey)
[12:54:33] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:54:38] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:55:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE))
[12:56:05] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1212.eqiad.wmnet
[12:56:36] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Index rebuild
[12:57:56] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:58:00] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:00:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72742 and previous config saved to /var/cache/conftool/dbconfig/20250129-130031-root.json
[13:00:44] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format es104* [puppet] - 10https://gerrit.wikimedia.org/r/1115000
[13:01:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2028.codfw.wmnet to cluster codfw and group A
[13:02:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2028.codfw.wmnet to cluster codfw and group A
[13:02:29] <wikibugs>	 (03CR) 10MVernon: [C:03+1] installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - 10https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977) (owner: 10Jcrespo)
[13:03:23] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - 10https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977) (owner: 10Jcrespo)
[13:03:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 10%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72743 and previous config saved to /var/cache/conftool/dbconfig/20250129-130358-root.json
[13:07:24] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Add a separate Hiera option to control the waterlines import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[13:07:48] <urbanecm>	 scap's build-and-push-container-images is taking quite some time
[13:07:54] <urbanecm>	 since 12:51:48
[13:08:24] <urbanecm>	 scap-image-build-and-push-log is not updating
[13:10:10] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10503959 (10Clement_Goubert) Based on the calculation in my [[ https://docs.google.com/spreadsheets/d/18BokLsimZj-7XdQfTGLIP__11aDIJnbL0cqBNdLRXuY/edit?usp=sharing | balancing sheet...
[13:10:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:11:37] <wikibugs>	 (03PS1) 10Cathal Mooney: gNMIc: Add BGP stats collection for network devices [puppet] - 10https://gerrit.wikimedia.org/r/1115002 (https://phabricator.wikimedia.org/T369384)
[13:11:42] <wikibugs>	 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999#10503966 (10Clement_Goubert) →14Duplicate dup:03T383032
[13:12:08] <wikibugs>	 (03PS8) 10Muehlenhoff: maps: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565)
[13:12:13] <wikibugs>	 06SRE, 10Phabricator, 06Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228#10503972 (10Aklapper) Does anyone have sufficient understanding to outline the next potential steps in the blurry territories between undermaintained Phorge upstr...
[13:12:40] <wikibugs>	 (03CR) 10Muehlenhoff: maps: Add a separate Hiera option to control the waterlines import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[13:13:00] <moritzm>	 !log installing runc security updates
[13:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:26] <wikibugs>	 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999#10503978 (10Clement_Goubert) Doing the dupe the other way around as T383032 for #abstract_wikipedia_team has been triaged by them already.
[13:15:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:15:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72744 and previous config saved to /var/cache/conftool/dbconfig/20250129-131537-root.json
[13:16:42] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503998 (10MoritzMuehlenhoff)
[13:16:52] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Puppet compiler seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe)
[13:17:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2031 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1114950 (owner: 10Muehlenhoff)
[13:18:53] <urbanecm>	 ...pulling to testservers now...
[13:18:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10504014 (10cmooney) Moving to //event-value-tag-v2// has been pushed out to all our Netflow VMs and we've seen a nice reduction in CPU usage, plus a...
[13:19:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 25%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72745 and previous config saved to /var/cache/conftool/dbconfig/20250129-131903-root.json
[13:20:35] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Not much to review here, did you review the harbor changelog between the 2 versions to ensure there is no backward incompatible change? LG" [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) (owner: 10Raymond Ndibe)
[13:23:11] <wikibugs>	 (03PS1) 10Arthur taylor: Remove `tmpAlwaysShowMulLanguageCode` temporary setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115006 (https://phabricator.wikimedia.org/T330217)
[13:23:48] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:23:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72746 and previous config saved to /var/cache/conftool/dbconfig/20250129-132354-marostegui.json
[13:24:00] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:27:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2031.codfw.wmnet with OS bookworm
[13:27:14] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm
[13:28:55] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff)
[13:30:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72747 and previous config saved to /var/cache/conftool/dbconfig/20250129-133042-root.json
[13:31:44] <wikibugs>	 (03PS5) 10Arnaudb: nftables: add nftable docker manifest [puppet] - 10https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677)
[13:31:52] <wikibugs>	 (03PS3) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677)
[13:34:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 50%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72748 and previous config saved to /var/cache/conftool/dbconfig/20250129-133408-root.json
[13:35:44] <wikibugs>	 (03CR) 10Jforrester: "recheck" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE))
[13:36:22] <wikibugs>	 (03PS2) 10Elukey: custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114749
[13:36:22] <wikibugs>	 (03PS1) 10Elukey: custom_deploy.d: rework Istio ML's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115008 (https://phabricator.wikimedia.org/T369493)
[13:39:34] <wikibugs>	 (03CR) 10Klausman: [C:03+1] custom_deploy.d: rework Istio ML's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115008 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[13:41:41] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114751|[tests] Add ConfigWrapperTest (T383905)]], [[gerrit:1114752|Remove BabelCategorizeNamespaces from CommunityConfiguration (T383905)]], [[gerrit:1114975|[tests] Add MigrateConfigToCommunityTest (T383905)]], [[gerrit:1114976|migrateConfigToCommunity: Deal with false category names (T384941)]], [[gerrit:1114973|migrateConfigToCommunity: Deal with
[13:41:41] <logmsgbot>	 false category names (T384941)]] (duration: 51m 06s)
[13:41:47] <urbanecm>	 finally
[13:41:47] <stashbot>	 T383905: Running extensions/Babel/maintenance/migrateConfigToCommunity.php with the default configuration fails on validation error - https://phabricator.wikimedia.org/T383905
[13:41:48] <stashbot>	 T384941: Setting wgBabelCategoryNames[level] to false is not supported by the migration script - https://phabricator.wikimedia.org/T384941
[13:41:55] <urbanecm>	 almost an hour...
[13:41:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] maps: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[13:42:25] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963)
[13:42:28] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2031.codfw.wmnet with OS bookworm
[13:42:32] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504127 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm executed with errors:...
[13:42:41] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963)
[13:42:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) (owner: 10Lucas Werkmeister (WMDE))
[13:42:57] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) (owner: 10Lucas Werkmeister (WMDE))
[13:43:06] <Lucas_WMDE>	 let’s see how many of these backports I actually get through ^^
[13:43:19] <wikibugs>	 (03PS1) 10Arthur taylor: Add `enableMulLanguageCode` to replace `tmpEnableMulLanguageCode` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115013 (https://phabricator.wikimedia.org/T330217)
[13:43:50] <wikibugs>	 (03PS1) 10Filippo Giunchedi: vopsbot: sync db when needed [puppet] - 10https://gerrit.wikimedia.org/r/1115014 (https://phabricator.wikimedia.org/T375143)
[13:44:06] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:45:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72749 and previous config saved to /var/cache/conftool/dbconfig/20250129-134547-root.json
[13:45:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2031.codfw.wmnet with OS bookworm
[13:45:58] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm
[13:47:05] <wikibugs>	 (03PS13) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565)
[13:47:49] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10504173 (10RobH)
[13:48:20] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10504174 (10MoritzMuehlenhoff)
[13:48:26] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10504176 (10RobH) Copying over the explanation of hostname breakdown from the purchasing task.  >>! In T382899#10503772, @Clement_Goubert wrote: > Updated list of hostnames because...
[13:49:02] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10504188 (10RobH)
[13:49:12] <wikibugs>	 (03PS1) 10Arthur taylor: Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217)
[13:49:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72750 and previous config saved to /var/cache/conftool/dbconfig/20250129-134912-root.json
[13:49:17] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[13:49:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72751 and previous config saved to /var/cache/conftool/dbconfig/20250129-134927-marostegui.json
[13:49:32] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:53:33] <wikibugs>	 (03CR) 10Volans: "Reading the backlog of the code review and the task this seems quite a rabbit hole. I'm not sure I'm familiar enough to judge if this is s" [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: 10Raymond Ndibe)
[13:55:11] <wikibugs>	 (03PS1) 10Jcrespo: backup: Temporary setup of backup101[34], backup201[34] [puppet] - 10https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977)
[13:57:51] <wikibugs>	 (03PS2) 10Jcrespo: backup: Temporary setup of backup101[34], backup201[34] [puppet] - 10https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977)
[13:58:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
[13:58:38] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504236 (10ops-monitoring-bot) Draining ganeti2029.codfw.wmnet of running VMs
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1400)
[14:00:05] <jouncebot>	 Lucas_WMDE and hnowlan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:07] <Lucas_WMDE>	 o/
[14:00:10] <Lucas_WMDE>	 I can deploy!
[14:00:35] <Lucas_WMDE>	 I’ll start with my config change and then do hnowlan’s before continuing with my backports, the backports aren’t urgent
[14:00:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1111336 (owner: 10JHathaway)
[14:00:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: 10Lucas Werkmeister (WMDE))
[14:00:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72752 and previous config saved to /var/cache/conftool/dbconfig/20250129-140052-root.json
[14:01:15] <hnowlan>	 o/
[14:01:30] <hnowlan>	 my change is more or less cosmetic, no impact
[14:01:34] <wikibugs>	 (03Merged) 10jenkins-bot: Enable mul language code on Wikidata (full release) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: 10Lucas Werkmeister (WMDE))
[14:01:43] <Lucas_WMDE>	 ok
[14:02:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114727|Enable mul language code on Wikidata (full release) (T312176)]]
[14:02:10] <stashbot>	 T312176: MUL - Phased rollout on Wikidata.org (Stage 3 of 3: Full release) - https://phabricator.wikimedia.org/T312176
[14:02:14] <Lucas_WMDE>	 then let’s say I +2 my four backports, and we’ll see if they make it through gate-and-submit before or after we get to your config change? ^^
[14:02:19] <hnowlan>	 sgtm 
[14:02:38] <Lucas_WMDE>	 oh, I suppose I also need to rebase them on one another anyway ^^
[14:02:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
[14:02:51] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963)
[14:02:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
[14:03:04] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963)
[14:03:09] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504306 (10ops-monitoring-bot) Draining ganeti2029.codfw.wmnet of running VMs
[14:03:23] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE))
[14:03:33] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE))
[14:03:43] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) (owner: 10Lucas Werkmeister (WMDE))
[14:03:53] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) (owner: 10Lucas Werkmeister (WMDE))
[14:04:10] <wikibugs>	 (03CR) 10TChin: [C:03+2] Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[14:04:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72753 and previous config saved to /var/cache/conftool/dbconfig/20250129-140418-root.json
[14:04:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Can't say I fully understand what's going on but LGTM to my untrained eye" [puppet] - 10https://gerrit.wikimedia.org/r/1115002 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney)
[14:04:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P72754 and previous config saved to /var/cache/conftool/dbconfig/20250129-140434-marostegui.json
[14:04:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
[14:06:39] <wikibugs>	 (03Merged) 10jenkins-bot: Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[14:07:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[14:07:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1114727|Enable mul language code on Wikidata (full release) (T312176)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:07:46] <stashbot>	 T312176: MUL - Phased rollout on Wikidata.org (Stage 3 of 3: Full release) - https://phabricator.wikimedia.org/T312176
[14:07:52] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504374 (10ops-monitoring-bot) Draining ganeti2030.codfw.wmnet of running VMs
[14:08:10] <Lucas_WMDE>	 working for me on https://www.wikidata.org/wiki/Q107133815 – with k8s-mwdebug I see the mul row \o/
[14:08:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[14:09:20] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:09:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2031.codfw.wmnet with reason: host reimage
[14:09:57] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:11:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet
[14:13:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2031.codfw.wmnet with reason: host reimage
[14:14:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[14:15:13] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504384 (10ops-monitoring-bot) Draining ganeti2030.codfw.wmnet of running VMs
[14:16:32] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] "TY" [puppet] - 10https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: 10Aqu)
[14:16:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] cloudgw1003: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez)
[14:17:46] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:17:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] cloudgw1004: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez)
[14:18:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:19:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114727|Enable mul language code on Wikidata (full release) (T312176)]] (duration: 16m 58s)
[14:19:08] <stashbot>	 T312176: MUL - Phased rollout on Wikidata.org (Stage 3 of 3: Full release) - https://phabricator.wikimedia.org/T312176
[14:19:21] <Lucas_WMDE>	 zuul says 8 more minutes for my backports
[14:19:26] <Lucas_WMDE>	 hnowlan: want to self-service your config change?
[14:19:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P72755 and previous config saved to /var/cache/conftool/dbconfig/20250129-141941-marostegui.json
[14:19:50] <Lucas_WMDE>	 (also, I just filed T385037 for an issue I mentioned in here yesterday [possibly Monday, not sure])
[14:19:50] <stashbot>	 T385037: mwdebug dashboard on logstash is full of "Failed to connect to exporter" messages (tracing channel) since 7 January - https://phabricator.wikimedia.org/T385037
[14:20:56] <wikibugs>	 (03PS3) 10Jcrespo: backup: Temporary setup of backup101[34], backup201[34] [puppet] - 10https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977)
[14:21:13] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) (owner: 10Jcrespo)
[14:22:54] <wikibugs>	 (03PS1) 10Brouberol: airflow: fix envoy service port names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115025
[14:23:08] <hnowlan>	 Lucas_WMDE: sure, thanks
[14:23:17] <Lucas_WMDE>	 ok :)
[14:24:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hnowlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: 10Hnowlan)
[14:25:04] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: fix envoy service port names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115025 (owner: 10Brouberol)
[14:25:04] <wikibugs>	 (03Merged) 10jenkins-bot: fc-list: update font list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: 10Hnowlan)
[14:25:31] <logmsgbot>	 !log hnowlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1114398|fc-list: update font list (T280718)]]
[14:25:36] <stashbot>	 T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718
[14:25:40] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] "LGTM!!" [puppet] - 10https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[14:26:48] <wikibugs>	 (03CR) 10Jcrespo: "noop: https://puppet-compiler.wmflabs.org/output/1115020/2838/" [puppet] - 10https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) (owner: 10Jcrespo)
[14:27:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:27:39] <wikibugs>	 (03CR) 10Xcollazo: dumps: Re-enable the enwiki dumps on snapshot1012 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[14:27:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:28:00] <wikibugs>	 (03Merged) 10jenkins-bot: Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE))
[14:28:03] <wikibugs>	 (03Merged) 10jenkins-bot: Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE))
[14:28:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:28:39] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115028 (https://phabricator.wikimedia.org/T380103)
[14:28:45] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115029 (https://phabricator.wikimedia.org/T359562)
[14:28:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:32:18] <logmsgbot>	 !log hnowlan@deploy2002 hnowlan: Backport for [[gerrit:1114398|fc-list: update font list (T280718)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:32:23] <stashbot>	 T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718
[14:32:38] <wikibugs>	 (03PS1) 10Ssingh: varnish: add schoolwiki.in to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210)
[14:32:50] <Lucas_WMDE>	 (fyi, I’m testing repro steps for my backports, so sorry for a bit of noise in the mwdebug logstash during your deploy)
[14:32:55] <wikibugs>	 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10504478 (10ssingh) @MSantos: Hi! This is pending your approval but otherwise is a simple patch to merge.
[14:33:02] <Lucas_WMDE>	 (hopefully not bad enough to trip the scap canaries or anything ^^)
[14:33:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1185 db2178 T384994', diff saved to https://phabricator.wikimedia.org/P72756 and previous config saved to /var/cache/conftool/dbconfig/20250129-143317-marostegui.json
[14:33:22] <stashbot>	 T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[14:33:25] <wikibugs>	 (03Merged) 10jenkins-bot: Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) (owner: 10Lucas Werkmeister (WMDE))
[14:33:25] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10504480 (10Papaul) @Andrew anything dc-ops need to do on this task?
[14:33:25] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4889/co" [puppet] - 10https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210) (owner: 10Ssingh)
[14:33:27] <wikibugs>	 (03Merged) 10jenkins-bot: Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) (owner: 10Lucas Werkmeister (WMDE))
[14:33:33] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1185.eqiad.wmnet
[14:33:39] <wikibugs>	 (03PS1) 10Cory Massaro: wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115032 (https://phabricator.wikimedia.org/T139010)
[14:33:39] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2178.codfw.wmnet
[14:33:59] <logmsgbot>	 !log hnowlan@deploy2002 hnowlan: Continuing with sync
[14:34:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2031.codfw.wmnet with OS bookworm
[14:34:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72757 and previous config saved to /var/cache/conftool/dbconfig/20250129-143448-marostegui.json
[14:34:54] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:35:04] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[14:35:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T384592)', diff saved to https://phabricator.wikimedia.org/P72758 and previous config saved to /var/cache/conftool/dbconfig/20250129-143510-marostegui.json
[14:35:35] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4888/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[14:36:09] <wikibugs>	 (03Abandoned) 10Cory Massaro: wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115032 (https://phabricator.wikimedia.org/T139010) (owner: 10Cory Massaro)
[14:36:23] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504514 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm completed: - ganeti203...
[14:37:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:04] <wikibugs>	 (03CR) 10Jforrester: wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115032 (https://phabricator.wikimedia.org/T139010) (owner: 10Cory Massaro)
[14:39:44] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: update release version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1115035 (https://phabricator.wikimedia.org/T380081)
[14:39:55] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1185.eqiad.wmnet
[14:40:14] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2178.codfw.wmnet
[14:40:34] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Horizon: update release version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1115035 (https://phabricator.wikimedia.org/T380081) (owner: 10Andrew Bogott)
[14:40:40] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Index rebuild
[14:40:58] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Index rebuild
[14:41:14] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:41:31] <logmsgbot>	 !log hnowlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114398|fc-list: update font list (T280718)]] (duration: 16m 00s)
[14:41:36] <stashbot>	 T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice - https://phabricator.wikimedia.org/T280718
[14:42:26] <Lucas_WMDE>	 hnowlan: can I continue with the backports?
[14:42:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti2031.codfw.wmnet
[14:42:49] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:42:59] <wikibugs>	 (03PS1) 10MVernon: swift: remove ms-be105[1-9] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1115038 (https://phabricator.wikimedia.org/T382056)
[14:44:51] <wikibugs>	 (03PS3) 10JMeybohm: Allow to install multiple kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984)
[14:45:16] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] swift: remove ms-be105[1-9] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1115038 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon)
[14:45:20] <Lucas_WMDE>	 I’ll assume it’s okay for me to continue deploying
[14:45:29] <hnowlan>	 please do, sorry! 
[14:45:33] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10504734 (10Andrew) >>! In T380805#10504480, @Papaul wrote: > @Andrew anything dc-ops need to do on this task?  Not immediately! Valerie has already moved and set up two of them, we...
[14:45:35] <Lucas_WMDE>	 ok thanks!
[14:45:48] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: remove ms-be105[1-9] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1115038 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon)
[14:46:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114995|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1114994|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1115012|Handle null date format in MwDateFormatParserFactory (T384963)]], [[gerrit:1115010|Handle null date format in MwDateFormatParserFactory (T384963)]]
[14:46:11] <stashbot>	 T384867: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384867
[14:46:11] <stashbot>	 T384963: PHP Deprecated: strlen(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384963
[14:46:46] <wikibugs>	 (03CR) 10Jcrespo: "@btullis I moved an-redacteddb1001 as it looked weird." [puppet] - 10https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) (owner: 10Jcrespo)
[14:49:19] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "Horizon: update release version for codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/1115041
[14:50:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10504749 (10VRiley-WMF)
[14:50:25] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:50:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1114995|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1114994|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1115012|Handle null date format in MwDateFormatParserFactory (T384963)]], [[gerrit:1115010|Handle null date format in MwDateFormatParserFactory (T384963)]] synced to the
[14:50:33] <logmsgbot>	 testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:50:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Revert "Horizon: update release version for codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/1115041 (owner: 10Andrew Bogott)
[14:51:21] <Lucas_WMDE>	 looks good to me \o/
[14:51:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[14:51:42] <wikibugs>	 (03PS1) 10Ottomata: EventStreamConfig - prep for per stream user agent collection config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115042 (https://phabricator.wikimedia.org/T382173)
[14:52:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:52:15] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.580 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:52:57] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[1051-1059].eqiad.wmnet
[14:53:33] <wikibugs>	 (03PS4) 10JMeybohm: Allow to install multiple kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984)
[14:54:42] <vgutierrez>	 !log repool ncredir4002
[14:54:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114995|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1114994|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1115012|Handle null date format in MwDateFormatParserFactory (T384963)]], [[gerrit:1115010|Handle null date format in MwDateFormatParserFactory (T384963)]] (duration:
[14:58:25] <logmsgbot>	 12m 20s)
[14:58:30] <stashbot>	 T384867: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384867
[14:58:31] <stashbot>	 T384963: PHP Deprecated: strlen(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384963
[14:58:40] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:58:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:44] <Lucas_WMDE>	 even just before the end of the window \o/
[14:59:52] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 7 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4891/" [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1500)
[15:02:01] <logmsgbot>	 !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: upgrade kernel and rebuilding tables
[15:02:14] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10504784 (10VRiley-WMF)
[15:02:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T384592)', diff saved to https://phabricator.wikimedia.org/P72761 and previous config saved to /var/cache/conftool/dbconfig/20250129-150236-marostegui.json
[15:02:41] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:02:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:58] <wikibugs>	 (03PS1) 10AOkoth: miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794)
[15:03:06] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2095,2175,2186].codfw.wmnet
[15:03:15] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10504805 (10ops-monitoring-bot) depool host wikikube-worker[2095,2175,2186].codfw.wmnet by jayme@cumin1002 with...
[15:03:22] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikikube-worker[2095,2175,2186].codfw.wmnet with reason: Depooled via sre.k8s.pool-depool-node
[15:03:57] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115028 (https://phabricator.wikimedia.org/T380103) (owner: 10Jforrester)
[15:05:03] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115028 (https://phabricator.wikimedia.org/T380103) (owner: 10Jforrester)
[15:05:09] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2095,2175,2186].codfw.wmnet
[15:05:47] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10504823 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool fo...
[15:05:52] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-02-12 15:04:52. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:06:16] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-02-12 15:06:05. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:06:25] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:06:51] <wikibugs>	 (03PS1) 10Brouberol: Disable the sidecar controller from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115045 (https://phabricator.wikimedia.org/T384329)
[15:06:53] <wikibugs>	 (03PS1) 10Brouberol: dse-k8s-eqiad: delete the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115046 (https://phabricator.wikimedia.org/T384329)
[15:06:58] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:07:39] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10504828 (10JMeybohm) @Jhancock.wm wikikube-worker[2095,2175,2186].codfw.wmnet have been shut down, lmk when you...
[15:08:46] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:09:17] <icinga-wm>	 PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:09:17] <icinga-wm>	 PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:09:19] <icinga-wm>	 PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:09:44] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:10:00] <wikibugs>	 (03PS2) 10AOkoth: miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794)
[15:10:21] <wikibugs>	 (03CR) 10Jelto: miscweb: remove kubectl cronjob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[15:10:25] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:11:15] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115045 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[15:11:27] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:11:29] <wikibugs>	 (03PS3) 10AOkoth: miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794)
[15:11:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10504843 (10lmata)
[15:11:40] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: delete the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115046 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[15:11:53] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Disable the sidecar controller from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115045 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[15:12:00] <wikibugs>	 (03CR) 10AOkoth: miscweb: remove kubectl cronjob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[15:12:36] <wikibugs>	 (03CR) 10Hnowlan: [C:04-1] "linting URL pattern is not compliant with the gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1112815 (https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan)
[15:12:50] <wikibugs>	 (03CR) 10Hnowlan: [C:04-1] "linting URL pattern is not compliant with the gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1112800 (https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan)
[15:13:21] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:13:29] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:13:44] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: delete the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115046 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[15:14:32] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115029 (https://phabricator.wikimedia.org/T359562) (owner: 10Jforrester)
[15:14:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:15:35] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:15:46] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115029 (https://phabricator.wikimedia.org/T359562) (owner: 10Jforrester)
[15:15:49] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "LGTM in general but needs a rebase after I920abe8e23 ^^" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115006 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor)
[15:16:10] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:16:42] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:17:41] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:17:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P72762 and previous config saved to /var/cache/conftool/dbconfig/20250129-151743-marostegui.json
[15:18:05] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-fe105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049 (10MatthewVernon) 03NEW
[15:18:45] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:18:50] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:19:01] <wikibugs>	 (03PS4) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192)
[15:19:44] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#10504908 (10MatthewVernon)
[15:19:49] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:21:24] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Looks okay to me but could also be simplified further." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115013 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor)
[15:22:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[15:22:55] <wikibugs>	 (03Abandoned) 10Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[15:23:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
[15:24:00] <wikibugs>	 (03PS15) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[15:24:00] <wikibugs>	 (03PS5) 10JMeybohm: Allow to install multiple kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984)
[15:25:33] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:25:38] <wikibugs>	 (03PS2) 10Muehlenhoff: postgresql::server: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1108707
[15:25:40] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:26:25] <wikibugs>	 (03PS1) 10Elukey: services: set the Tegola's cluster local endpoint for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115049 (https://phabricator.wikimedia.org/T384530)
[15:26:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1051-1059].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[15:27:13] <wikibugs>	 (03CR) 10Elukey: [C:03+2] custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114749 (owner: 10Elukey)
[15:27:27] <wikibugs>	 (03CR) 10Elukey: [C:03+2] custom_deploy.d: rework Istio ML's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115008 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[15:27:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1051-1059].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[15:27:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:27:41] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1051-1059].eqiad.wmnet
[15:27:52] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#10504958 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[1051-1059].eqiad.wmnet` - ms-be1051.eqiad.wmnet (**PASS**)   - Dow...
[15:28:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72763 and previous config saved to /var/cache/conftool/dbconfig/20250129-152801-root.json
[15:28:56] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1115050 (https://phabricator.wikimedia.org/T385051)
[15:29:44] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor)
[15:29:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) (owner: 10Hnowlan)
[15:30:05] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4892/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:30:56] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] EventStreamConfig - prep for per stream user agent collection config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115042 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[15:31:38] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig - prep for per stream user agent collection config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115042 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[15:32:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P72764 and previous config saved to /var/cache/conftool/dbconfig/20250129-153250-marostegui.json
[15:33:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
[15:33:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti2031.codfw.wmnet
[15:36:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
[15:36:19] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[15:37:03] <ottomata>	 FYI, I am going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1115042 to prep for an eventgate deployment. Should be a no-op for now.
[15:37:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72765 and previous config saved to /var/cache/conftool/dbconfig/20250129-153708-root.json
[15:37:11] <wikibugs>	 10ops-magru, 06DC-Ops: Power supply failure (PSU) for cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10505017 (10RobH)
[15:37:31] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108707 (owner: 10Muehlenhoff)
[15:37:39] <logmsgbot>	 !log otto@deploy2002 Started scap sync-world: Backport for [[gerrit:1115042|EventStreamConfig - prep for per stream user agent collection config (T382173)]]
[15:37:44] <stashbot>	 T382173: Enable Event Platform instruments to opt out of collecting User-Agent data - https://phabricator.wikimedia.org/T382173
[15:37:58] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10505023 (10RobH)
[15:38:00] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Extend alerts to liberica cluster [alerts] - 10https://gerrit.wikimedia.org/r/1115054
[15:38:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72766 and previous config saved to /var/cache/conftool/dbconfig/20250129-153840-root.json
[15:40:24] <wikibugs>	 (03PS6) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984)
[15:42:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] lvs: Extend alerts to liberica cluster (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1115054 (owner: 10Vgutierrez)
[15:42:26] <logmsgbot>	 !log otto@deploy2002 otto: Backport for [[gerrit:1115042|EventStreamConfig - prep for per stream user agent collection config (T382173)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:42:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:04-1] "let's replace cloudgw1002 first (so, rebase this patch)" [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez)
[15:42:37] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] kubernetes: remove ad-hoc CNI config from dse-k8s-worker [puppet] - 10https://gerrit.wikimedia.org/r/1114753 (owner: 10Elukey)
[15:42:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:04-1] "let's replace cloudgw1002 first (so, rebase this patch)" [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez)
[15:43:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72767 and previous config saved to /var/cache/conftool/dbconfig/20250129-154306-root.json
[15:44:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
[15:45:47] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] lvs: Extend alerts to liberica cluster [alerts] - 10https://gerrit.wikimedia.org/r/1115054 (owner: 10Vgutierrez)
[15:47:21] <wikibugs>	 (03Merged) 10jenkins-bot: lvs: Extend alerts to liberica cluster [alerts] - 10https://gerrit.wikimedia.org/r/1115054 (owner: 10Vgutierrez)
[15:47:54] <logmsgbot>	 !log otto@deploy2002 otto: Continuing with sync
[15:47:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T384592)', diff saved to https://phabricator.wikimedia.org/P72768 and previous config saved to /var/cache/conftool/dbconfig/20250129-154757-marostegui.json
[15:48:01] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[15:48:03] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:48:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T384592)', diff saved to https://phabricator.wikimedia.org/P72769 and previous config saved to /var/cache/conftool/dbconfig/20250129-154807-marostegui.json
[15:50:21] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] installserver: Do not format es104* [puppet] - 10https://gerrit.wikimedia.org/r/1115000 (owner: 10Marostegui)
[15:51:20] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format es104* [puppet] - 10https://gerrit.wikimedia.org/r/1115000 (owner: 10Marostegui)
[15:52:07] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:52:13] <wikibugs>	 (03PS7) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984)
[15:52:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72770 and previous config saved to /var/cache/conftool/dbconfig/20250129-155213-root.json
[15:53:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72771 and previous config saved to /var/cache/conftool/dbconfig/20250129-155345-root.json
[15:54:48] <logmsgbot>	 !log otto@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115042|EventStreamConfig - prep for per stream user agent collection config (T382173)]] (duration: 17m 08s)
[15:54:52] <stashbot>	 T382173: Enable Event Platform instruments to opt out of collecting User-Agent data - https://phabricator.wikimedia.org/T382173
[15:57:09] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata)
[15:58:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72772 and previous config saved to /var/cache/conftool/dbconfig/20250129-155812-root.json
[15:58:32] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata)
[15:59:18] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: directly route to citoid on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1115056 (https://phabricator.wikimedia.org/T361576)
[15:59:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095
[15:59:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:00:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095
[16:00:04] <jouncebot>	 swfrench-wmf: That opportune time for a MediaWiki infrastructure (UTC late one-off) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1600).
[16:00:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095
[16:00:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095
[16:00:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:02:07] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:02:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095
[16:02:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095
[16:02:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] postgresql::server: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1108707 (owner: 10Muehlenhoff)
[16:02:43] <swfrench-wmf>	 o/
[16:02:50] <swfrench-wmf>	 I'll get started shortly
[16:04:19] <icinga-wm>	 RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:04:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti2030 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1115057
[16:05:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[16:06:29] <wikibugs>	 (03Merged) 10jenkins-bot: Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[16:06:46] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10505148 (10RobH) Picked this back up, it had gotten neglected due to not being assigned to me and not having the ops-ulsfo tag and I shou...
[16:06:46] <wikibugs>	 (03PS1) 10Urbanecm: migrateConfigToCommunity: Handle false BabelMainCategory [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115059 (https://phabricator.wikimedia.org/T384941)
[16:06:53] <wikibugs>	 06SRE, 06SRE Observability: logstash.rb uses deprecated Socket.gethostbyname - https://phabricator.wikimedia.org/T385058 (10MatthewVernon) 03NEW
[16:06:54] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10505158 (10RobH) a:05cmooney→03RobH
[16:06:58] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1114793|Enroll 5% of client sessions in PHP 8.1 (T383845)]]
[16:07:03] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[16:07:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72773 and previous config saved to /var/cache/conftool/dbconfig/20250129-160718-root.json
[16:08:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72774 and previous config saved to /var/cache/conftool/dbconfig/20250129-160850-root.json
[16:09:00] <wikibugs>	 (03PS1) 10Ottomata: eventgate-analytics-external - bump to v1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115061 (https://phabricator.wikimedia.org/T382173)
[16:09:30] <wikibugs>	 (03PS1) 10Urbanecm: migrateConfigToCommunity: Handle false BabelMainCategory [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115062 (https://phabricator.wikimedia.org/T384941)
[16:09:36] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: reimage two jobrunners to workers [puppet] - 10https://gerrit.wikimedia.org/r/1115063 (https://phabricator.wikimedia.org/T354791)
[16:10:09] <wikibugs>	 (03PS2) 10Muehlenhoff: postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - 10https://gerrit.wikimedia.org/r/1108710
[16:10:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff)
[16:11:02] <ottomata>	 swfrench-wmf: I'd like to do an eventgate-analytics-external deployment only to staging for now (meetings starting).  
[16:11:02] <ottomata>	 that okay with you?
[16:11:27] <wikibugs>	 (03PS3) 10Muehlenhoff: postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - 10https://gerrit.wikimedia.org/r/1108710
[16:12:19] <swfrench-wmf>	 ottomata: no objections to updating staging - thanks for checking!
[16:12:19] <ottomata>	 proceeding to deploy but only in staging
[16:12:22] <ottomata>	 thanks!
[16:12:25] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] eventgate-analytics-external - bump to v1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115061 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[16:13:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72775 and previous config saved to /var/cache/conftool/dbconfig/20250129-161317-root.json
[16:13:37] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1114793|Enroll 5% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:13:41] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[16:13:53] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate-analytics-external - bump to v1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115061 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[16:13:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T384592)', diff saved to https://phabricator.wikimedia.org/P72776 and previous config saved to /var/cache/conftool/dbconfig/20250129-161353-marostegui.json
[16:13:59] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:14:38] <moritzm>	 !log installing glib2.0 security updates
[16:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:48] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[16:15:08] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff)
[16:15:52] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[16:16:23] <wikibugs>	 (03CR) 10Muehlenhoff: postgresql::dirs: Use wmflib::debian_postgresql_version() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff)
[16:21:59] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114793|Enroll 5% of client sessions in PHP 8.1 (T383845)]] (duration: 15m 00s)
[16:22:04] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[16:22:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72777 and previous config saved to /var/cache/conftool/dbconfig/20250129-162224-root.json
[16:23:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72778 and previous config saved to /var/cache/conftool/dbconfig/20250129-162355-root.json
[16:25:46] <wikibugs>	 (03PS1) 10Hashar: Do not copy Code-Review +2 (take 2) [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1115068
[16:27:15] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734)
[16:28:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72780 and previous config saved to /var/cache/conftool/dbconfig/20250129-162822-root.json
[16:28:49] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "The output of docker-pkg -c config.yaml build images/ --select "*pytorch25*"" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734) (owner: 10Ilias Sarantopoulos)
[16:29:00] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] kubernetes: reimage two jobrunners to workers [puppet] - 10https://gerrit.wikimedia.org/r/1115063 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:29:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P72781 and previous config saved to /var/cache/conftool/dbconfig/20250129-162900-marostegui.json
[16:29:56] <swfrench-wmf>	 FYI, I'm done with the window
[16:30:05] <wikibugs>	 (03PS1) 10Muehlenhoff: wikilabels::db: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1115070
[16:30:49] <wikibugs>	 (03CR) 10Klausman: [C:03+1] amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734) (owner: 10Ilias Sarantopoulos)
[16:32:38] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115070 (owner: 10Muehlenhoff)
[16:32:56] <moritzm>	 !log installing util-linux bugfix updates from bookworm point release
[16:33:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:11] <wikibugs>	 (03PS1) 10Federico Ceratto: preseed.yaml: add comments around DB data safety [puppet] - 10https://gerrit.wikimedia.org/r/1115072
[16:34:52] <wikibugs>	 (03PS1) 10Elukey: kartotherian: update config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115073
[16:35:47] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] kubernetes: reimage two jobrunners to workers [puppet] - 10https://gerrit.wikimedia.org/r/1115063 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:36:17] <wikibugs>	 (03PS1) 10Sergio Gimeno: beta wgEventStreams: opt out collecting user agent for HelpPanel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115074 (https://phabricator.wikimedia.org/T382173)
[16:37:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72782 and previous config saved to /var/cache/conftool/dbconfig/20250129-163729-root.json
[16:39:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72783 and previous config saved to /var/cache/conftool/dbconfig/20250129-163901-root.json
[16:39:43] <wikibugs>	 (03CR) 10Btullis: [C:03+2] dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[16:41:02] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] postgresql::dirs: Use wmflib::debian_postgresql_version() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff)
[16:41:12] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.rename from mw2410 to wikikube-worker2242
[16:41:30] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.rename from mw2411 to wikikube-worker2243
[16:41:35] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox
[16:41:39] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] kafka_shipper: when disabled, don't render templates [puppet] - 10https://gerrit.wikimedia.org/r/1111336 (owner: 10JHathaway)
[16:43:06] <wikibugs>	 (03PS3) 10Btullis: dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947)
[16:44:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P72784 and previous config saved to /var/cache/conftool/dbconfig/20250129-164406-marostegui.json
[16:46:19] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2410 to wikikube-worker2242 - hnowlan@cumin2002"
[16:46:34] <wikibugs>	 (03PS1) 10Vgutierrez: site,hiera: Reimage lvs4009 as role(liberica) [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477)
[16:46:39] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2410 to wikikube-worker2242 - hnowlan@cumin2002"
[16:46:39] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:46:40] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2242
[16:46:50] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2242
[16:47:00] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2410 to wikikube-worker2242
[16:47:19] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin2002 from mw2410 to wikikube-worker2242 completed: - mw2410 (**PASS**)   - ✔️ Down...
[16:47:25] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox
[16:48:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10505336 (10MoritzMuehlenhoff)
[16:48:43] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2242.codfw.wmnet with OS bookworm
[16:48:53] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:48:53] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2242
[16:49:03] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505337 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host wikikube-worker2242.codfw.wmnet with OS bookworm
[16:50:04] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1115072 (owner: 10Federico Ceratto)
[16:51:17] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2411 to wikikube-worker2243 - hnowlan@cumin2002"
[16:51:22] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2411 to wikikube-worker2243 - hnowlan@cumin2002"
[16:51:22] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:51:23] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2243
[16:51:23] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox
[16:51:37] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:51:40] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] kartotherian: update config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115073 (owner: 10Elukey)
[16:51:50] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] dumps: Re-enable the enwiki dumps on snapshot1012 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[16:51:53] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:51:58] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-worker[2095,2175,2186].codfw.wmnet with reason: extending downtime
[16:52:09] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10505352 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c24ad8f7-3e57-4f83-8a1f-c507313e344...
[16:53:15] <wikibugs>	 (03CR) 10Vgutierrez: "experimental check fails cause the interface name for the main NIC doesn't match between bullseye and bookworm :facepalm:" [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:54:21] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: set the Tegola's cluster local endpoint for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115049 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey)
[16:54:27] <wikibugs>	 (03CR) 10Elukey: [C:03+2] kartotherian: update config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115073 (owner: 10Elukey)
[16:56:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[16:56:36] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2243
[16:56:46] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[16:56:46] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2411 to wikikube-worker2243
[16:57:05] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505376 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin2002 from mw2411 to wikikube-worker2243 completed: - mw2411 (**PASS**)   - ✔️ Down...
[16:57:30] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2242 - hnowlan@cumin2002"
[16:57:34] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2243.codfw.wmnet on all recursors
[16:57:35] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2242 - hnowlan@cumin2002"
[16:57:35] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:57:36] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2242.codfw.wmnet 113.0.192.10.in-addr.arpa 3.1.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:57:37] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2243.codfw.wmnet on all recursors
[16:57:39] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2242.codfw.wmnet 113.0.192.10.in-addr.arpa 3.1.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:57:40] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2242
[16:58:00] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2242
[16:58:00] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2242
[16:58:28] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2243.codfw.wmnet with OS bookworm
[16:58:39] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2243
[16:58:45] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505382 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host wikikube-worker2243.codfw.wmnet with OS bookworm
[16:58:56] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox
[16:59:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T384592)', diff saved to https://phabricator.wikimedia.org/P72785 and previous config saved to /var/cache/conftool/dbconfig/20250129-165913-marostegui.json
[16:59:18] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:59:28] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[16:59:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T384592)', diff saved to https://phabricator.wikimedia.org/P72786 and previous config saved to /var/cache/conftool/dbconfig/20250129-165935-marostegui.json
[16:59:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1159 T384994', diff saved to https://phabricator.wikimedia.org/P72787 and previous config saved to /var/cache/conftool/dbconfig/20250129-165951-marostegui.json
[16:59:57] <stashbot>	 T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[17:00:03] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1159.eqiad.wmnet
[17:02:46] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Ha yeah." [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[17:03:46] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2243 - hnowlan@cumin2002"
[17:03:51] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2243 - hnowlan@cumin2002"
[17:03:51] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:03:52] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2243.codfw.wmnet 122.0.192.10.in-addr.arpa 2.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:03:55] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2243.codfw.wmnet 122.0.192.10.in-addr.arpa 2.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:03:56] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2243
[17:04:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:05:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-2] "to be merged tomorrow 2025-01-30" [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[17:05:51] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1159.eqiad.wmnet
[17:06:26] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2243
[17:06:26] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2243
[17:06:29] <wikibugs>	 (CR) Federico Ceratto: [C:+2] preseed.yaml: add comments around DB data safety [puppet] - https://gerrit.wikimedia.org/r/1115072 (owner: Federico Ceratto)
[17:07:09] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Index rebuild
[17:09:21] <wikibugs>	 ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384951#10505458 (Jhancock.wm) Open→Resolved a:Jhancock.wm reseated ps1 cable.
[17:14:28] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2242.codfw.wmnet with reason: host reimage
[17:17:07] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10505507 (Papaul)
[17:17:19] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10505509 (Papaul) Open→Resolved a:Papaul Complete
[17:17:21] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2242.codfw.wmnet with reason: host reimage
[17:19:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:20:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2175
[17:20:29] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2175
[17:21:22] <icinga-wm>	 RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:23:08] <wikibugs>	 (PS1) Jgiannelos: kartotherian: Fix dependency in service config [deployment-charts] - https://gerrit.wikimedia.org/r/1115079
[17:23:19] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2243.codfw.wmnet with reason: host reimage
[17:24:04] <wikibugs>	 (PS2) Jgiannelos: kartotherian: Fix dependency in service config [deployment-charts] - https://gerrit.wikimedia.org/r/1115079
[17:24:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T384592)', diff saved to https://phabricator.wikimedia.org/P72789 and previous config saved to /var/cache/conftool/dbconfig/20250129-172455-marostegui.json
[17:25:02] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:26:20] <wikibugs>	 (PS3) Scott French: shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038)
[17:26:20] <wikibugs>	 (PS3) Scott French: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038)
[17:26:22] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2243.codfw.wmnet with reason: host reimage
[17:26:33] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734) (owner: Ilias Sarantopoulos)
[17:26:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10505577 (VRiley-WMF)
[17:27:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2186
[17:27:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2186
[17:28:24] <wikibugs>	 (CR) Btullis: [C:+2] dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[17:30:22] <icinga-wm>	 RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:32:55] <wikibugs>	 (CR) Elukey: [C:+2] kartotherian: Fix dependency in service config [deployment-charts] - https://gerrit.wikimedia.org/r/1115079 (owner: Jgiannelos)
[17:34:17] <wikibugs>	 (PS1) Jgiannelos: kartotherian: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1115082
[17:35:44] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10505596 (Jhancock.wm)
[17:35:47] <wikibugs>	 (CR) Elukey: [C:+2] kartotherian: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1115082 (owner: Jgiannelos)
[17:36:45] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[17:36:48] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[17:36:57] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[17:37:00] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[17:37:22] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[17:37:25] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[17:37:25] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2242.codfw.wmnet with OS bookworm
[17:37:28] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[17:37:31] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[17:37:37] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505607 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host wikikube-worker2242.codfw.wmnet with OS bookworm completed: - wikik...
[17:37:43] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[17:38:07] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[17:38:39] <wikibugs>	 (CR) AOkoth: [C:+2] miscweb: remove kubectl cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[17:38:40] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[17:40:01] <wikibugs>	 (Merged) jenkins-bot: miscweb: remove kubectl cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[17:40:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P72790 and previous config saved to /var/cache/conftool/dbconfig/20250129-174003-marostegui.json
[17:42:48] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[17:43:32] <wikibugs>	 (PS3) BCornwall: conftool: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074
[17:43:53] <wikibugs>	 (CR) BCornwall: "Done" [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[17:45:55] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2243.codfw.wmnet with OS bookworm
[17:46:12] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505639 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host wikikube-worker2243.codfw.wmnet with OS bookworm completed: - wikik...
[17:46:27] <wikibugs>	 (PS1) BCornwall: varnish: Enable single_backend by default [puppet] - https://gerrit.wikimedia.org/r/1115086
[17:47:52] <wikibugs>	 (CR) Ssingh: [C:+1] "Looks good! On perhaps an unrelated note, we don't do inbound TLS with ATS now so I was wondering if it would make sense to rename the che" [puppet] - https://gerrit.wikimedia.org/r/1099782 (owner: BCornwall)
[17:48:03] <hnowlan>	 !log homer 'lsw1-a5-codfw*' commit 
[17:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:27] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10505646 (RobH) Remote hands 01020815 scheduled for 2025-02-04 @ 0800 Pacific (1600 GMT).
[17:49:59] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10505661 (Jhancock.wm) unproductive update. the level 3 helpdesk is still going over the files and the TSR report. Will update when i hear back from them.
[17:50:36] <wikibugs>	 ops-eqiad, SRE-swift-storage, DC-Ops, decommission-hardware: decommission ms-fe105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10505664 (Papaul) @MatthewVernon these are ms-be105[1-9].eqiad.wmnet or ms-fe105[1-9].eqiad.wmnet
[17:50:45] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2242-2243].codfw.wmnet
[17:50:48] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2242-2243].codfw.wmnet
[17:50:58] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505666 (ops-monitoring-bot) pool host wikikube-worker[2242-2243].codfw.wmnet by hnowlan@cumin2002 with reason: None
[17:51:02] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505667 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by hnowlan@cumin2002 pool for host wikikube-worker[2242-2243].codfw.wmnet completed: - wik...
[17:51:04] <wikibugs>	 ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T385078 (hnowlan) NEW
[17:52:57] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4894/console" [puppet] - https://gerrit.wikimedia.org/r/1115086 (owner: BCornwall)
[17:53:17] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[17:53:32] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:54:02] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:54:11] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: maintenance
[17:55:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P72792 and previous config saved to /var/cache/conftool/dbconfig/20250129-175510-marostegui.json
[17:59:31] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:00:05] <jouncebot>	 swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1800).
[18:04:31] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:05:06] <swfrench-wmf>	 o/
[18:05:19] <swfrench-wmf>	 I'm holding for the moment while we're troubleshooting a separate issue
[18:10:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T384592)', diff saved to https://phabricator.wikimedia.org/P72794 and previous config saved to /var/cache/conftool/dbconfig/20250129-181017-marostegui.json
[18:10:22] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[18:10:23] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:10:30] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[18:10:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T384592)', diff saved to https://phabricator.wikimedia.org/P72795 and previous config saved to /var/cache/conftool/dbconfig/20250129-181037-marostegui.json
[18:10:53] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505737 (hnowlan)
[18:12:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72796 and previous config saved to /var/cache/conftool/dbconfig/20250129-181212-root.json
[18:20:56] <wikibugs>	 (PS1) AOkoth: miscweb: update os-reports version [deployment-charts] - https://gerrit.wikimedia.org/r/1115092 (https://phabricator.wikimedia.org/T350794)
[18:23:59] <wikibugs>	 (CR) AOkoth: [C:+2] miscweb: update os-reports version [deployment-charts] - https://gerrit.wikimedia.org/r/1115092 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[18:25:19] <wikibugs>	 (Merged) jenkins-bot: miscweb: update os-reports version [deployment-charts] - https://gerrit.wikimedia.org/r/1115092 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[18:26:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708#10505845 (Papaul) Replaced the optic on the msw2 side
[18:26:34] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[18:26:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet
[18:27:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72801 and previous config saved to /var/cache/conftool/dbconfig/20250129-182718-root.json
[18:29:31] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:37:07] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:37:18] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[18:40:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T384592)', diff saved to https://phabricator.wikimedia.org/P72804 and previous config saved to /var/cache/conftool/dbconfig/20250129-184055-marostegui.json
[18:41:00] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:42:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72805 and previous config saved to /var/cache/conftool/dbconfig/20250129-184223-root.json
[18:43:50] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[18:44:57] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@5b0aeae]: Deploying latest DAGs to the analytics Airflow instance. T358375.
[18:45:02] <stashbot>	 T358375: Declare wmf_content.mediawiki_content_history_v1 a production table - https://phabricator.wikimedia.org/T358375
[18:45:32] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@5b0aeae]: Deploying latest DAGs to the analytics Airflow instance. T358375. (duration: 00m 35s)
[18:53:59] <logmsgbot>	 !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[18:56:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P72809 and previous config saved to /var/cache/conftool/dbconfig/20250129-185602-marostegui.json
[18:57:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72810 and previous config saved to /var/cache/conftool/dbconfig/20250129-185729-root.json
[19:00:04] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1900)
[19:04:12] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:04:18] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:07:12] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:07:18] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:09:10] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:09:17] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[19:09:46] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[19:09:52] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:11:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P72811 and previous config saved to /var/cache/conftool/dbconfig/20250129-191108-marostegui.json
[19:11:52] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:12:10] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:12:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72812 and previous config saved to /var/cache/conftool/dbconfig/20250129-191234-root.json
[19:12:47] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:13:09] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[19:14:07] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[19:14:29] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[19:15:19] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[19:16:22] <wikibugs>	 (CR) Ottomata: "I have deployed eventgate-analytics-external in production!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1115074 (https://phabricator.wikimedia.org/T382173) (owner: Sergio Gimeno)
[19:16:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  ganeti1053 - vriley@cumin1002"
[19:16:31] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  ganeti1053 - vriley@cumin1002"
[19:16:31] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:17:39] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:19:57] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:20:04] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014']
[19:24:16] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4895/co" [puppet] - https://gerrit.wikimedia.org/r/1099782 (owner: BCornwall)
[19:24:46] <wikibugs>	 ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T385096 (phaultfinder) NEW
[19:26:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T384592)', diff saved to https://phabricator.wikimedia.org/P72813 and previous config saved to /var/cache/conftool/dbconfig/20250129-192615-marostegui.json
[19:26:21] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[19:26:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[19:26:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72814 and previous config saved to /var/cache/conftool/dbconfig/20250129-192637-marostegui.json
[19:32:07] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:32:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:34:05] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:34:17] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10506113 (Neobeta61) What redfish API version are you running?
[19:36:14] <wikibugs>	 (PS1) CDanis: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055)
[19:36:41] <wikibugs>	 (PS1) CDanis: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115099 (https://phabricator.wikimedia.org/T385055)
[19:39:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708#10506123 (cmooney) >>! In T384708#10505845, @Papaul wrote: > Replaced the optic on the msw2 side   Cool, looks ok so far but will...
[19:42:34] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:43:51] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:44:10] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:47:35] <wikibugs>	 (CR) Catrope: [C:+2] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: CDanis)
[19:47:39] <wikibugs>	 (CR) Catrope: [C:+2] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115099 (https://phabricator.wikimedia.org/T385055) (owner: CDanis)
[19:47:52] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-fe1014']
[19:48:31] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[19:48:50] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[19:49:06] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014']
[19:52:07] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:54:49] <wikibugs>	 (CR) BCornwall: [V:+1 C:+2] icinga: Remove unused check_ssl_unified config [puppet] - https://gerrit.wikimedia.org/r/1099782 (owner: BCornwall)
[19:54:58] <inflatador>	 !log bking@apt1002 publish new opensearch_1.3.20 pkg to thirdparty/opensearch1
[19:54:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72815 and previous config saved to /var/cache/conftool/dbconfig/20250129-195459-marostegui.json
[19:55:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:44] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:56:32] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-fe1014']
[19:56:33] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[19:58:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[19:59:23] <wikibugs>	 (PS1) Bartosz Dziewoński: Add 'auth' docroot with custom files [mediawiki-config] - https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952)
[19:59:50] <wikibugs>	 (PS1) Bartosz Dziewoński: Add 'auth' docroot with custom files (beta) [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952)
[20:00:55] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[20:03:29] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[20:03:29] <wikibugs>	 (PS1) D3r1ck01: SUL3: Allow temp users to authenticate (login/signup) via the API [extensions/CentralAuth] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115106 (https://phabricator.wikimedia.org/T384523)
[20:04:49] <wikibugs>	 (CR) CI reject: [V:-1] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: CDanis)
[20:10:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P72816 and previous config saved to /var/cache/conftool/dbconfig/20250129-201006-marostegui.json
[20:13:54] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014']
[20:14:51] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[20:16:57] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:18:51] <wikibugs>	 SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#10506235 (Dzahn) Open→Stall...
[20:21:01] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[20:24:56] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1250 - vriley@cumin1002"
[20:25:01] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1250 - vriley@cumin1002"
[20:25:01] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:25:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P72817 and previous config saved to /var/cache/conftool/dbconfig/20250129-202513-marostegui.json
[20:25:17] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[20:27:12] <wikibugs>	 (CR) Catrope: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: CDanis)
[20:27:16] <wikibugs>	 (CR) Catrope: [C:+2] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: CDanis)
[20:27:32] <wikibugs>	 (Merged) jenkins-bot: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115099 (https://phabricator.wikimedia.org/T385055) (owner: CDanis)
[20:27:37] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:29:31] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:30:56] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1250.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:31:15] <wikibugs>	 (PS1) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549)
[20:32:19] <logmsgbot>	 !log pt1979@cumin1002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-fe1014']
[20:32:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1251.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:32:27] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014']
[20:32:44] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:36:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10506260 (cmooney) The above patch I believe will do what we need.  Needs some testing I will work with dc-ops...
[20:36:07] <wikibugs>	 (CR) Ottomata: [C:-1] "Let's wait until DPE is back from an offsite before this is deployed." [deployment-charts] - https://gerrit.wikimedia.org/r/1114798 (https://phabricator.wikimedia.org/T383814) (owner: Ottomata)
[20:38:06] <wikibugs>	 (PS2) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549)
[20:38:28] <wikibugs>	 (Merged) jenkins-bot: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: CDanis)
[20:38:38] <wikibugs>	 (CR) RLazarus: [C:+1] Add restricted users to deployment_server [puppet] - https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: JMeybohm)
[20:40:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72818 and previous config saved to /var/cache/conftool/dbconfig/20250129-204020-marostegui.json
[20:40:25] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:40:35] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[20:42:14] <wikibugs>	 (PS1) Ottomata: mediawiki.org/beacon/event - don't raise error on failure [mediawiki-config] - https://gerrit.wikimedia.org/r/1115111 (https://phabricator.wikimedia.org/T383939)
[20:44:35] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirt2006-dev.codfw.wmnet
[20:48:40] <wikibugs>	 SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10506296 (Dzahn) We need a follow-up task to _actually start using_ this new server and failover gerrit to it.
[20:51:11] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt2006-dev.codfw.wmnet
[20:51:30] <wikibugs>	 (PS2) BCornwall: varnish: Fix claim obj.hits isn't known in vcl_hit [puppet] - https://gerrit.wikimedia.org/r/1113591 (https://phabricator.wikimedia.org/T378737)
[20:51:55] <wikibugs>	 (PS7) BCornwall: varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 [puppet] - https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737)
[20:52:22] <wikibugs>	 (CR) BCornwall: "`" [puppet] - https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[20:55:04] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[20:55:17] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[20:58:45] <logmsgbot>	 !log pt1979@cumin1002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-fe1014']
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2100).
[21:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:07:19] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1250.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:09:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250']
[21:10:34] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1250']
[21:10:41] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250']
[21:11:04] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1250']
[21:11:40] <wikibugs>	 (PS1) Fabfur: benthos: send data to eventgate too [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392)
[21:11:59] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250']
[21:12:01] <wikibugs>	 (CR) CI reject: [V:-1] benthos: send data to eventgate too [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[21:12:22] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1250']
[21:13:27] <RoanKattouw>	 I'm finally going to deploy the UBN fixes now 
[21:13:59] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250']
[21:14:50] <wikibugs>	 (PS2) Fabfur: benthos: send data to eventgate too [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392)
[21:15:06] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db1250']
[21:15:12] <wikibugs>	 (CR) CI reject: [V:-1] benthos: send data to eventgate too [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[21:17:32] <wikibugs>	 (CR) Ottomata: [C:+1] "Nice. It's preferred if producers can do all of this:" [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[21:17:49] <wikibugs>	 (PS3) Fabfur: benthos: send data to eventgate too [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392)
[21:21:21] <wikibugs>	 (CR) Fabfur: "👍" [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[21:33:32] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10506471 (RKemper)
[21:33:40] <icinga-wm>	 RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[21:33:43] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10506472 (RKemper) Racking details are up. Working on the puppet patches today.
[21:33:58] <icinga-wm>	 RECOVERY - SSH on ms-fe1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:33:58] <icinga-wm>	 RECOVERY - Memcached on ms-fe1014 is OK: TCP OK - 0.021 second response time on 10.64.134.13 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[21:33:58] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:33:58] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:35:03] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1115099|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]], [[gerrit:1115098|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]]
[21:35:08] <stashbot>	 T385055: Search disappearing on focus (t.useId is not a function) - https://phabricator.wikimedia.org/T385055
[21:38:12] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10506500 (VRiley-WMF)
[21:38:32] <icinga-wm>	 PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100%
[21:40:06] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1250.eqiad.wmnet with OS bookworm
[21:40:15] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10506503 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1250.eqiad.wmnet with OS bookworm
[21:42:00] <icinga-wm>	 RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[21:42:07] <logmsgbot>	 !log catrope@deploy2002 cdanis, catrope: Backport for [[gerrit:1115099|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]], [[gerrit:1115098|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:42:12] <stashbot>	 T385055: Search disappearing on focus (t.useId is not a function) - https://phabricator.wikimedia.org/T385055
[21:45:58] <logmsgbot>	 !log catrope@deploy2002 cdanis, catrope: Continuing with sync
[21:48:18] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1251.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:48:32] <icinga-wm>	 PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100%
[21:49:08] <icinga-wm>	 RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[21:50:38] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1251.eqiad.wmnet with OS bookworm
[21:50:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10506574 (Papaul) still waiting for the part.
[21:50:43] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10506575 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1251.eqiad.wmnet with OS bookworm
[21:51:10] <icinga-wm>	 RECOVERY - MD RAID on ms-fe1014 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:52:33] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115099|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]], [[gerrit:1115098|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]] (duration: 17m 29s)
[21:52:34] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10506579 (Papaul) upgrade BIOS and IDRAC on the server, Server is back up, I will leave the task open for now to see if we do have the same error again .
[21:52:38] <stashbot>	 T385055: Search disappearing on focus (t.useId is not a function) - https://phabricator.wikimedia.org/T385055
[21:55:11] <RoanKattouw>	 FYI this scap run did print an error:
[21:55:12] <RoanKattouw>	 21:52:27 sudo -u mwdeploy -n -- /usr/bin/rsync -l deployment.codfw.wmnet::common/wikiversions*.{json,php} /srv/mediawiki (ran as mwdeploy@mw2410.codfw.wmnet) returned [255]: ssh: Could not resolve hostname mw2410.codfw.wmnet: Name or service not known
[21:55:49] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1250.eqiad.wmnet with reason: host reimage
[21:58:35] <Reedy>	 RoanKattouw: Yeah, pretty sure you can mostly ignore that as the host was renamed
[21:58:40] <Reedy>	 Sounds like some list somewhere isn't in sync though
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2200)
[22:00:17] <rzl>	 hm yeah that was just renamed earlier today, https://phabricator.wikimedia.org/T354791#10505334
[22:00:40] <Reedy>	 The other host renamed at roughly the same time hasn't seemingly given an error
[22:02:16] <rzl>	 aha it's still listed as a scap proxy, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/%2B/refs/heads/production/hieradata/common/scap/dsh.yaml#6
[22:02:22] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1250.eqiad.wmnet with reason: host reimage
[22:02:48] <Reedy>	 heh
[22:02:54] <Reedy>	 That'd probably explain it
[22:03:01] <Reedy>	 And that it wasn't blocking in any way
[22:04:56] <rzl>	 I see h.nowlan also has https://phabricator.wikimedia.org/T384196 and https://gerrit.wikimedia.org/r/1112714 so if it's not hurting anybody I'm inclined to let him know, but leave it until he can look at it tomorrow
[22:06:15] <Reedy>	 Just worth probably leaving a comment (on the task?) to point out that if we're not removing the rest just yet, we should at least remove the one that's erroring
[22:06:20] <Reedy>	 to stop people repeatedly reporting it
[22:06:43] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1251.eqiad.wmnet with reason: host reimage
[22:06:46] * Reedy does that
[22:07:05] <RoanKattouw>	 Unfortunately it causes scap backport to exit with a nonzero exit status
[22:07:20] <Reedy>	 but otherwise completed/finished?
[22:07:27] <RoanKattouw>	 So it's probably fine, everything works and the deploy gets logged, it's just the exit status at the very end
[22:07:43] <RoanKattouw>	 A little confusing for the deployer but not terrible
[22:07:54] <Reedy>	 it's a good job we've got a human running it not an AI ;)
[22:08:44] <rzl>	 yeah, sorry for the confusion
[22:09:28] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:10:26] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1251.eqiad.wmnet with reason: host reimage
[22:10:45] <rzl>	 separately I have an apache config change to deploy if the current window isn't in use -- RoanKattouw let me know if you're finished, no rush
[22:11:02] <RoanKattouw>	 Yeah I'm done, go ahead
[22:11:07] <rzl>	 thanks!
[22:11:27] <wikibugs>	 (PS4) RLazarus: mediawiki: Restrict /wiki RewriteRule [puppet] - https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595)
[22:13:01] <wikibugs>	 (CR) Scott French: [C:+1] mediawiki: Restrict /wiki RewriteRule [puppet] - https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[22:14:25] <wikibugs>	 (CR) RLazarus: [C:+2] mediawiki: Restrict /wiki RewriteRule [puppet] - https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[22:15:06] <logmsgbot>	 !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919
[22:19:13] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[22:20:37] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10506693 (RKemper)
[22:21:22] <wikibugs>	 (PS1) RLazarus: mediawiki: Restrict /wiki RewriteRule [deployment-charts] - https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595)
[22:23:01] <wikibugs>	 (PS1) Ryan Kemper: elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966)
[22:23:03] <wikibugs>	 (PS2) RLazarus: mediawiki: Restrict /wiki RewriteRule [deployment-charts] - https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595)
[22:23:14] <wikibugs>	 (PS2) Bartosz Dziewoński: Add 'auth' docroot with custom files (beta) [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952)
[22:27:13] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Reuven!" [deployment-charts] - https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[22:27:20] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[22:27:43] <wikibugs>	 (CR) RLazarus: [C:+2] mediawiki: Restrict /wiki RewriteRule [deployment-charts] - https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[22:27:44] <wikibugs>	 (PS1) BCornwall: Varnish: Upgrade test container to bullseye [puppet] - https://gerrit.wikimedia.org/r/1115123
[22:28:42] <wikibugs>	 (CR) BCornwall: "FWIW:" [puppet] - https://gerrit.wikimedia.org/r/1115123 (owner: BCornwall)
[22:28:58] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:29:09] <rzl>	 ^ known, fixing with that chart update
[22:29:52] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Restrict /wiki RewriteRule [deployment-charts] - https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[22:31:12] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:31:25] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10506736 (RKemper)
[22:32:07] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:32:48] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:33:02] <wikibugs>	 (CR) Bartosz Dziewoński: "Cherry-picked this on the beta cluster, seems to work, I'm a bit surprised I got it right the first time. I'm not sure if I should make th" [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński)
[22:33:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:33:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:36:06] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:42:43] <wikibugs>	 (PS2) Ryan Kemper: elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966)
[22:42:57] <wikibugs>	 (PS1) Cwhite: puppetmaster: remove use of deprecated method in logstash.rb [puppet] - https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058)
[22:43:17] <logmsgbot>	 !log rzl@deploy2002 Started scap sync-world: T357595
[22:43:22] <stashbot>	 T357595: Investigate restricting match pattern on /wiki RewriteRule - https://phabricator.wikimedia.org/T357595
[22:43:36] <wikibugs>	 (CR) CI reject: [V:-1] puppetmaster: remove use of deprecated method in logstash.rb [puppet] - https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058) (owner: Cwhite)
[22:43:45] <wikibugs>	 (PS3) Bartosz Dziewoński: Add 'auth' docroot with custom files (beta) [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952)
[22:44:33] <wikibugs>	 (PS2) Cwhite: puppetmaster: remove use of deprecated method in logstash.rb [puppet] - https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058)
[22:44:43] <wikibugs>	 (CR) Bking: [C:+1] elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: Ryan Kemper)
[22:44:59] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10506754 (RKemper) Open→In progress a:RKemper→None
[22:45:00] <wikibugs>	 (CR) CI reject: [V:-1] elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: Ryan Kemper)
[22:46:18] <logmsgbot>	 !log rzl@deploy2002 rzl: T357595 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:46:48] <wikibugs>	 (PS3) Ryan Kemper: elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966)
[22:47:00] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10506774 (RKemper) Okay, I think our work here is done so we have removed ourselves as assignees.  Wasn't sure whether task status should be `Open` or...
[22:49:05] <logmsgbot>	 !log rzl@deploy2002 rzl: Continuing with sync
[22:50:23] <wikibugs>	 (PS4) Bartosz Dziewoński: Add 'auth' docroot with custom files [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952)
[22:50:57] <wikibugs>	 (CR) Bartosz Dziewoński: "I added the prod changes too, hope they work. Let me know if I should split them to a separate patch." [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński)
[22:51:05] <wikibugs>	 (CR) Bking: [C:+1] elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: Ryan Kemper)
[22:54:43] <logmsgbot>	 !log rzl@deploy2002 Finished scap sync-world: T357595 (duration: 11m 57s)
[22:54:48] <stashbot>	 T357595: Investigate restricting match pattern on /wiki RewriteRule - https://phabricator.wikimedia.org/T357595
[22:55:21] <rzl>	 \o/
[22:55:31] <rzl>	 those httpbb alerts will self-resolve
[22:55:51] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113476 (https://phabricator.wikimedia.org/T383916) (owner: Bartosz Dziewoński)
[22:56:31] <rzl>	 I'm through deploying, and I think swfrench-wmf is up next if the 22:00 window is unused today
[22:56:33] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński)
[22:56:46] <rzl>	 *23:00
[22:57:12] <swfrench-wmf>	 rzl: thanks!
[22:57:21] <swfrench-wmf>	 jouncebot: nowandnext
[22:57:21] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2200)
[22:57:21] <jouncebot>	 In 0 hour(s) and 2 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2300)
[22:58:06] <swfrench-wmf>	 I'll give the web team a few minutes to convene for a deployment before proceeding
[22:58:28] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:58:50] <swfrench-wmf>	 my change does not require a mediawiki deployment, but would be preferable to isolate from other changes, if possibe
[22:58:55] <swfrench-wmf>	 *possible
[22:59:24] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2300)
[23:02:14] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:02:18] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:05:00] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095
[23:06:43] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095
[23:09:07] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2186
[23:09:22] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2186
[23:11:28] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1251
[23:12:22] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:13:42] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:14:02] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:14:18] <swfrench-wmf>	 seems quiet, so I'm going to move ahead with my change shortly
[23:14:28] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:14:40] <wikibugs>	 (CR) Scott French: [C:+2] shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[23:15:42] <wikibugs>	 (Merged) jenkins-bot: shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[23:15:52] <wikibugs>	 (PS3) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549)
[23:17:25] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:18:21] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1251
[23:20:55] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[23:21:45] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[23:22:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:24:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:24:51] <rzl>	 hmm, gerrit seems actually-down
[23:25:24] <swfrench-wmf>	 hmmm ... that's not good =/
[23:25:33] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Support creating phab tasks in wmflib.phabricator - https://phabricator.wikimedia.org/T366470#10506845 (Aklapper) > Unfortunately wmflib currently only supports creating comments.  I guess this is about expanding the `transactions` handling for the `self._client.man...
[23:25:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:26:02] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:26:28] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:28:49] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1251
[23:28:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:29:00] <wikibugs>	 (CR) CI reject: [V:-1] Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: Cathal Mooney)
[23:29:11] <rzl>	 back now, poking around a little
[23:29:31] <jinxer-wm>	 RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:29:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:30:20] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10506853 (cmooney) As a test I ran this for an existing host that had been configured with the current live co...
[23:30:33] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1251
[23:31:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:31:56] <wikibugs>	 (PS4) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549)
[23:32:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:32:48] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:33:44] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:33:44] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:34:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:36:06] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:36:53] <wikibugs>	 (CR) Scott French: [C:+2] shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[23:37:01] <wikibugs>	 (CR) CI reject: [V:-1] shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[23:37:25] <wikibugs>	 (PS4) Scott French: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038)
[23:37:51] <wikibugs>	 (PS5) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549)
[23:38:08] <wikibugs>	 (CR) Scott French: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[23:39:56] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1251
[23:40:30] <wikibugs>	 (Merged) jenkins-bot: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[23:41:20] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1251
[23:43:15] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host backup1010
[23:43:38] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1010
[23:43:53] <wikibugs>	 (CR) CI reject: [V:-1] Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: Cathal Mooney)
[23:44:46] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[23:45:32] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[23:50:14] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[23:50:56] <wikibugs>	 (PS6) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549)
[23:52:07] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:52:20] <wikibugs>	 (CR) Raymond Ndibe: "Yeaa I did. There are no backwards incompatible change as far as I know of" [puppet] - https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) (owner: Raymond Ndibe)
[23:53:12] <wikibugs>	 (CR) Raymond Ndibe: "Yes I've already tested this on toolseta-harbor-1 node. This is currently running on that node right now." [puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: Raymond Ndibe)
[23:53:53] <wikibugs>	 (PS1) Cathal Mooney: Class-of-service: don't insert comment with host name under cos/ints [homer/public] - https://gerrit.wikimedia.org/r/1115134 (https://phabricator.wikimedia.org/T379549)
[23:54:09] <wikibugs>	 (CR) Raymond Ndibe: "I think the next step is to announce that toolforge will be down for maybe 1hr for maintenance. Will use that window to perform the upgrad" [puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: Raymond Ndibe)
[23:59:28] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:59:42] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down