[00:01:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P72677 and previous config saved to /var/cache/conftool/dbconfig/20250129-000144-marostegui.json [00:11:05] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 657.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:16:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72678 and previous config saved to /var/cache/conftool/dbconfig/20250129-001651-marostegui.json [00:16:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [00:16:56] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [00:17:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72679 and previous config saved to /var/cache/conftool/dbconfig/20250129-001702-marostegui.json [00:17:23] (03PS2) 10Scott French: shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038) [00:17:23] (03PS2) 10Scott French: shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038) [00:17:56] (03PS2) 10Scott French: shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) [00:17:56] (03PS2) 10Scott French: shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113215 
(https://phabricator.wikimedia.org/T377038) [00:17:56] (03PS2) 10Scott French: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) [00:26:21] (03PS8) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [00:27:07] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:29:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:38:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114812 [00:40:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:47:10] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:57:10] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:58:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114812 (owner: 10TrainBranchBot) [01:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114478 (owner: 10TrainBranchBot) [01:08:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114815 [01:08:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114815 (owner: 10TrainBranchBot) [01:11:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72680 and previous config saved to /var/cache/conftool/dbconfig/20250129-011157-marostegui.json [01:12:02] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [01:27:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P72681 and previous config saved to /var/cache/conftool/dbconfig/20250129-012703-marostegui.json [01:27:07] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:28:27] (03CR) 10Cathal Mooney: [C:03+1] netbox: use asctime in the logs 
[puppet] - 10https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) (owner: 10Volans) [01:28:40] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1114815 (owner: 10TrainBranchBot) [01:29:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:42:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P72682 and previous config saved to /var/cache/conftool/dbconfig/20250129-014210-marostegui.json [01:46:27] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/814cb09f4ce883829fb9195053b3ab127bbf1c8c1935c70f205fae91cb4fbf7b/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:57:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72683 and previous config saved to /var/cache/conftool/dbconfig/20250129-015717-marostegui.json [01:57:22] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [01:57:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [02:06:27] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:10:05] RECOVERY - MariaDB Replica Lag: s1 on 
db2141 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:27:07] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:29:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:21] SRE, Infrastructure-Foundations, netops, observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10502803 (andrea.denisse) Hi @cmooney, I was reviewing the [[ https://github.com/librenms/librenms/releases/tag/25....
[02:56:21] !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.1.0 - T384258
[02:56:26] T384258: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258
[02:56:34] !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.1.0 - T384258 (duration: 00m 13s)
[03:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance
[03:18:36] SRE, Infrastructure-Foundations, netops, observability, SRE Observability (FY2024/2025-Q3): LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10502812 (andrea.denisse) Open→Resolved After upgrading to v25.1...
[03:27:07] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:29:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:37:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:42:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:42:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:47:41] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:47:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:53:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:59:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:08:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance [04:08:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T384592)', diff saved to https://phabricator.wikimedia.org/P72685 and previous config saved to /var/cache/conftool/dbconfig/20250129-040822-marostegui.json [04:08:27] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [04:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T384592)', diff saved to https://phabricator.wikimedia.org/P72686 and previous config saved to /var/cache/conftool/dbconfig/20250129-045600-marostegui.json [04:56:06] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [05:11:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P72687 and previous config saved to /var/cache/conftool/dbconfig/20250129-051108-marostegui.json [05:26:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P72688 and previous config saved to /var/cache/conftool/dbconfig/20250129-052615-marostegui.json [05:41:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T384592)', diff saved to https://phabricator.wikimedia.org/P72689 and previous config saved to 
/var/cache/conftool/dbconfig/20250129-054121-marostegui.json
[05:41:28] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[05:41:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance
[05:41:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T384592)', diff saved to https://phabricator.wikimedia.org/P72690 and previous config saved to /var/cache/conftool/dbconfig/20250129-054145-marostegui.json
[05:47:46] (PS14) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[05:49:30] (PS15) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[05:50:02] (CR) AOkoth: "Acknowledged" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[06:12:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2156', diff saved to https://phabricator.wikimedia.org/P72691 and previous config saved to /var/cache/conftool/dbconfig/20250129-061214-marostegui.json
[06:12:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: maintenance
[06:12:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: maintenance
[06:13:02] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2156.codfw.wmnet
[06:14:17] (PS1) Marostegui: rebuild_tables.sh: Add sleep [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842)
[06:18:06] (CR) Marostegui: "FYI" [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842)
(owner: Marostegui)
[06:18:07] (CR) Marostegui: [C:+2] rebuild_tables.sh: Add sleep [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[06:18:35] (Merged) jenkins-bot: rebuild_tables.sh: Add sleep [software] - https://gerrit.wikimedia.org/r/1114832 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[06:19:45] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2156.codfw.wmnet
[06:20:29] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Index rebuild
[06:21:56] (PS1) Marostegui: installserver: Reimage db1257 [puppet] - https://gerrit.wikimedia.org/r/1114833 (https://phabricator.wikimedia.org/T384979)
[06:29:29] (CR) Marostegui: [C:+2] installserver: Reimage db1257 [puppet] - https://gerrit.wikimedia.org/r/1114833 (https://phabricator.wikimedia.org/T384979) (owner: Marostegui)
[06:30:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T384592)', diff saved to https://phabricator.wikimedia.org/P72692 and previous config saved to /var/cache/conftool/dbconfig/20250129-063015-marostegui.json
[06:30:20] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[06:34:47] (PS1) Marostegui: mariadb: Add new future host [puppet] - https://gerrit.wikimedia.org/r/1114900 (https://phabricator.wikimedia.org/T384979)
[06:45:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1230 db2157 T384994', diff saved to https://phabricator.wikimedia.org/P72693 and previous config saved to /var/cache/conftool/dbconfig/20250129-064545-marostegui.json
[06:45:52] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[06:45:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to
https://phabricator.wikimedia.org/P72694 and previous config saved to /var/cache/conftool/dbconfig/20250129-064555-marostegui.json [06:46:11] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1230.eqiad.wmnet [06:46:18] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2157.codfw.wmnet [06:49:39] (03PS1) 10Marostegui: db1230,db2157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114904 (https://phabricator.wikimedia.org/T384994) [06:49:43] (03CR) 10Marostegui: [C:03+2] db1230,db2157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114904 (https://phabricator.wikimedia.org/T384994) (owner: 10Marostegui) [06:49:47] (03PS1) 10Marostegui: Revert "db1230,db2157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114905 [06:49:51] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1114905 (owner: 10Marostegui) [06:51:49] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1230.eqiad.wmnet [06:52:37] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2157.codfw.wmnet [06:52:43] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Index rebuild [06:53:12] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Index rebuild [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0700) [07:01:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P72695 and previous config saved to /var/cache/conftool/dbconfig/20250129-070103-marostegui.json [07:16:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T384592)', diff saved to https://phabricator.wikimedia.org/P72696 and 
previous config saved to /var/cache/conftool/dbconfig/20250129-071610-marostegui.json [07:16:15] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [07:16:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2221.codfw.wmnet with reason: Maintenance [07:16:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T384592)', diff saved to https://phabricator.wikimedia.org/P72697 and previous config saved to /var/cache/conftool/dbconfig/20250129-071632-marostegui.json [07:32:07] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:32:19] (03CR) 10Muehlenhoff: nftables: add types and directories (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1114717 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb) [07:33:48] !log installing Tomcat security updates [07:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:47] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2028.codfw.wmnet with reason: remove from cluster for reimage [07:34:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503006 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=160bb060-4ed1-4784-9312-c60a5421c725) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[07:36:44] (CR) Muehlenhoff: [C:+2] Switch ganeti2028 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114741 (owner: Muehlenhoff)
[07:40:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1004.wikimedia.org
[07:40:45] !log root@cumin1002 START - Cookbook sre.mysql.pool db1230 gradually with 4 steps - Repooling after rebuild index T384994
[07:40:49] !log root@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1230 gradually with 4 steps - Repooling after rebuild index T384994
[07:40:49] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[07:42:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2028.codfw.wmnet with OS bookworm
[07:42:08] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503015 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm
[07:44:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1004.wikimedia.org
[07:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:54:40] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:55:30] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti2028.codfw.wmnet with OS bookworm
[07:55:35] SRE, Ganeti, Infrastructure-Foundations:
Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503023 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm executed with errors:...
[07:55:38] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:55:43] (CR) Volans: [C:+2] netbox: use asctime in the logs [puppet] - https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) (owner: Volans)
[07:56:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2028.codfw.wmnet with OS bookworm
[07:56:17] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503025 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm
[07:58:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
[07:58:38] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503026 (ops-monitoring-bot) Draining ganeti2031.codfw.wmnet of running VMs
[07:58:40] !log root@cumin1002 START - Cookbook sre.mysql.pool db2157 gradually with 4 steps - Repooling after rebuild index T384994
[07:58:41] (CR) Filippo Giunchedi: [C:+1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney)
[07:58:44] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994
[08:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb.
Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:01:14] SRE, Infrastructure-Foundations, Patch-For-Review: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10503032 (Volans) Open→Resolved a:Volans The patch has been deployed and this is now fixed.
[08:04:06] (PS16) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[08:06:11] (PS1) Slyngshede: P:idm add logstash to requestable permission [puppet] - https://gerrit.wikimedia.org/r/1114949
[08:08:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
[08:08:26] (CR) Jelto: [C:+2] gerrit: Remove rsa-2048 certs from apache config [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[08:08:52] (PS2) BCornwall: gerrit: Remove rsa-2048 certs from apache config [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569)
[08:12:20] (CR) Muehlenhoff: "This looks good, but before merging let me add a description to the group."
[puppet] - https://gerrit.wikimedia.org/r/1114949 (owner: Slyngshede)
[08:12:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage
[08:14:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
[08:14:23] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503036 (ops-monitoring-bot) Draining ganeti2031.codfw.wmnet of running VMs
[08:15:52] (PS1) Muehlenhoff: Switch ganeti2031 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114950
[08:16:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2028.codfw.wmnet with reason: host reimage
[08:16:43] (PS17) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[08:17:42] (CR) AOkoth: miscweb: support os-reports deployment (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[08:17:50] (PS18) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[08:17:55] (CR) Jelto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[08:26:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db1230', diff saved to https://phabricator.wikimedia.org/P72700 and previous config saved to /var/cache/conftool/dbconfig/20250129-082606-marostegui.json
[08:26:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T384592)', diff saved to https://phabricator.wikimedia.org/P72701 and previous config saved to
/var/cache/conftool/dbconfig/20250129-082613-marostegui.json [08:26:18] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [08:28:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72702 and previous config saved to /var/cache/conftool/dbconfig/20250129-082841-root.json [08:29:42] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [08:30:20] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:30:26] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:30:38] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [08:30:38] (03CR) 10Hashar: "After the replica got updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [08:30:42] that's me testing lvs4010 [08:31:33] !log depooled lvs4009 during 60s to test lvs4010 running liberica - T384477 [08:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:38] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [08:33:26] (03CR) 10Jelto: [C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [08:34:50] (03CR) 10Fabfur: [C:03+1] Refine: Bump jar version to 0.2.49.3 [puppet] - 10https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: 10Aqu) [08:36:38] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2028.codfw.wmnet with OS bookworm [08:36:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2028.codfw.wmnet with OS bookworm completed: - ganeti202... [08:41:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P72704 and previous config saved to /var/cache/conftool/dbconfig/20250129-084120-marostegui.json [08:42:01] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti2028.codfw.wmnet [08:42:15] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999 (10LSobanski) 03NEW [08:43:07] (03CR) 10Fabfur: [C:03+2] hiera: consolidate haproxykafka into common profile [puppet] - 10https://gerrit.wikimedia.org/r/1114728 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:43:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72705 and previous config saved to /var/cache/conftool/dbconfig/20250129-084347-root.json [08:44:01] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2157 gradually with 4 steps - Repooling after rebuild index T384994 [08:44:07] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994 [08:45:08] (03CR) 10Marostegui: Revert "db1230,db2157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114905 (owner: 10Marostegui) [08:45:10] (03CR) 10Marostegui: [C:03+2] Revert "db1230,db2157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114905 (owner: 10Marostegui) [08:45:32] (03CR) 10Gmodena: 
[C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo) [08:46:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1210 db2223 T384994', diff saved to https://phabricator.wikimedia.org/P72707 and previous config saved to /var/cache/conftool/dbconfig/20250129-084611-marostegui.json [08:46:17] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2223.codfw.wmnet [08:46:24] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1210.eqiad.wmnet [08:47:02] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999#10503094 (10JMeybohm) [08:48:03] <_joe_> jouncebot: now [08:48:03] For the next 0 hour(s) and 11 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0800) [08:48:30] <_joe_> well I'll slip in my change I couldn't deploy yesterday [08:51:08] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1210.eqiad.wmnet [08:51:38] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Index rebuild [08:51:47] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2223.codfw.wmnet [08:52:08] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2223.codfw.wmnet with reason: Index rebuild [08:52:24] (03CR) 10Arnaudb: nftables: add docker profile and forward chain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb) [08:52:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [08:53:36] (03Merged) 10jenkins-bot: 
DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [08:54:24] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1113788|DBRecordCache: handle default section (T382947)]] [08:54:29] T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [08:56:25] (03CR) 10Jelto: "looks mostly good, one comment in-line" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [08:56:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P72710 and previous config saved to /var/cache/conftool/dbconfig/20250129-085627-marostegui.json [08:57:36] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1113788|DBRecordCache: handle default section (T382947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:58:23] !log oblivian@deploy2002 oblivian: Continuing with sync [08:58:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72711 and previous config saved to /var/cache/conftool/dbconfig/20250129-085852-root.json [08:59:50] (03CR) 10Fabfur: [C:03+2] Refine: Bump jar version to 0.2.49.3 [puppet] - 10https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: 10Aqu) [09:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T0900) [09:05:04] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113788|DBRecordCache: handle default section (T382947)]] (duration: 10m 39s) [09:05:10] T382947: Switch dumps 1.0 processes to use the analytics 
MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [09:11:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T384592)', diff saved to https://phabricator.wikimedia.org/P72713 and previous config saved to /var/cache/conftool/dbconfig/20250129-091134-marostegui.json [09:11:40] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [09:11:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2222.codfw.wmnet with reason: Maintenance [09:11:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72714 and previous config saved to /var/cache/conftool/dbconfig/20250129-091156-marostegui.json [09:12:18] (03PS1) 10Vgutierrez: prometheus::ops: Scrape ipip-mq-optimizer metrics on liberica nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) [09:13:05] (03PS2) 10Vgutierrez: prometheus::ops: Scrape ipip-mq-optimizer metrics on liberica nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) [09:13:23] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) (owner: 10Vgutierrez) [09:13:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72715 and previous config saved to /var/cache/conftool/dbconfig/20250129-091357-root.json [09:15:57] (03PS1) 10DCausse: cirrus: add v1 stream for the search update pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114955 (https://phabricator.wikimedia.org/T375821) [09:15:59] (03PS1) 10DCausse: cirrus: drop rc0 streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) [09:16:14] 
(03CR) 10DCausse: [C:04-2] cirrus: drop rc0 streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [09:25:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [09:29:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72716 and previous config saved to /var/cache/conftool/dbconfig/20250129-092902-root.json [09:31:01] (03PS3) 10Arnaudb: nftables: add docker profile and forward chain [puppet] - 10https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677) [09:31:08] (03PS3) 10Arnaudb: nftables: add types and directories [puppet] - 10https://gerrit.wikimedia.org/r/1114717 (https://phabricator.wikimedia.org/T370677) [09:31:14] (03PS4) 10Arnaudb: nftables: add nftable docker manifest [puppet] - 10https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677) [09:31:24] (03PS2) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [09:31:29] !log root@cumin1002 START - Cookbook sre.mysql.pool db2156 gradually with 4 steps - Repooling after rebuild index T384807 [09:31:33] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [09:36:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [09:36:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti2028.codfw.wmnet [09:39:31] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, 10SRE Observability (FY2024/2025-Q3): LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10503188 (10cmooney) >>! In T384258#10502812, @andrea.denisse wrote: > Aft... 
[09:41:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: maintenance [09:42:07] !log Upgrade mariadb on an-redacteddb1001 [09:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:13] (03Abandoned) 10Marostegui: dbproxy: switch CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1087374 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [09:48:42] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) (owner: 10Vgutierrez) [09:57:44] (03CR) 10Vgutierrez: [C:03+2] prometheus::ops: Scrape ipip-mq-optimizer metrics on liberica nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114953 (https://phabricator.wikimedia.org/T385001) (owner: 10Vgutierrez) [10:00:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [10:00:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72719 and previous config saved to /var/cache/conftool/dbconfig/20250129-100037-marostegui.json [10:00:43] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [10:00:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72720 and previous config saved to /var/cache/conftool/dbconfig/20250129-100048-root.json [10:01:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72721 and previous config saved to /var/cache/conftool/dbconfig/20250129-100104-root.json [10:01:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet with reason: maintenance [10:04:15] (03CR) 
10Cathal Mooney: [C:03+2] gnmic: use event-value-tag-v2 to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/1114770 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [10:08:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [10:08:53] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1114900 (https://phabricator.wikimedia.org/T384979) (owner: 10Marostegui) [10:09:07] (03CR) 10Marostegui: [C:03+2] mariadb: Add new future host [puppet] - 10https://gerrit.wikimedia.org/r/1114900 (https://phabricator.wikimedia.org/T384979) (owner: 10Marostegui) [10:10:19] 10ops-eqiad, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10503318 (10Marostegui) a:05Marostegui→03None [10:15:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P72723 and previous config saved to /var/cache/conftool/dbconfig/20250129-101544-marostegui.json [10:15:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72724 and previous config saved to /var/cache/conftool/dbconfig/20250129-101553-root.json [10:16:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72725 and previous config saved to /var/cache/conftool/dbconfig/20250129-101609-root.json [10:16:51] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2156 gradually with 4 steps - Repooling after rebuild index T384807 [10:16:55] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [10:17:23] (03PS2) 10Muehlenhoff: openssh: Remove code to disable NIST key exchange [puppet] - 10https://gerrit.wikimedia.org/r/1074381 [10:18:42] !log installing 
git-lfs security updates on bullseye [10:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:48] (03CR) 10AOkoth: miscweb: support os-reports deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:19:13] (03PS19) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [10:21:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: maintenance [10:21:46] !log Upgrade and reboot db1154 (s1, s3, s5, s8 wikireplicas will get lag) [10:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:29] (03PS20) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [10:25:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1016.eqiad.wmnet with reason: maintenance [10:25:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1020.eqiad.wmnet with reason: maintenance [10:26:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1013.eqiad.wmnet with reason: maintenance [10:26:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb1017.eqiad.wmnet with reason: maintenance [10:30:00] (03CR) 10Jelto: [C:03+1] "looks good to me now 🚢" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:30:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to 
https://phabricator.wikimedia.org/P72727 and previous config saved to /var/cache/conftool/dbconfig/20250129-103051-marostegui.json [10:30:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72728 and previous config saved to /var/cache/conftool/dbconfig/20250129-103059-root.json [10:31:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72729 and previous config saved to /var/cache/conftool/dbconfig/20250129-103115-root.json [10:36:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es1025 T384912', diff saved to https://phabricator.wikimedia.org/P72730 and previous config saved to /var/cache/conftool/dbconfig/20250129-103652-fceratto.json [10:36:58] T384912: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912 [10:37:15] (03CR) 10AOkoth: [C:03+2] miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:38:02] (03CR) 10Muehlenhoff: P:idm add logstash to requestable permission (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114949 (owner: 10Slyngshede) [10:39:49] (03PS1) 10Federico Ceratto: instances.yaml: remove es1025 [puppet] - 10https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912) [10:40:05] !log aokoth@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:40:24] (03PS2) 10Slyngshede: P:idm add logstash to requestable permission [puppet] - 10https://gerrit.wikimedia.org/r/1114949 [10:40:27] !log aokoth@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[10:40:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1114949 (owner: 10Slyngshede) [10:41:22] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [10:41:26] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [10:42:17] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es1025 [puppet] - 10https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912) (owner: 10Federico Ceratto) [10:43:38] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [10:43:43] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [10:45:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72731 and previous config saved to /var/cache/conftool/dbconfig/20250129-104558-marostegui.json [10:46:04] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [10:46:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72732 and previous config saved to /var/cache/conftool/dbconfig/20250129-104604-root.json [10:46:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72733 and previous config saved to /var/cache/conftool/dbconfig/20250129-104620-root.json [10:46:24] (03CR) 10Federico Ceratto: [C:03+1] instances.yaml: remove es1025 [puppet] - 10https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912) (owner: 10Federico Ceratto) [10:48:49] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es1025 [puppet] - 10https://gerrit.wikimedia.org/r/1114962 (https://phabricator.wikimedia.org/T384912) (owner: 10Federico Ceratto) [10:49:03] (03PS1) 10JMeybohm: Add restricted users to 
deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) [10:51:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:51:26] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4881/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:51:34] (03PS1) 10AOkoth: misweb: fix type error and service account [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114965 (https://phabricator.wikimedia.org/T350794) [10:52:12] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10503555 (10Clement_Goubert) [10:52:28] (03CR) 10JMeybohm: "This should probably do the trick already" [puppet] - 10https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:52:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove es1025 from dbctl T384912', diff saved to https://phabricator.wikimedia.org/P72734 and previous config saved to /var/cache/conftool/dbconfig/20250129-105232-fceratto.json [10:52:37] T384912: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912 [10:52:43] (03CR) 10JMeybohm: [V:03+1] Add restricted users to deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:52:45] (03PS1) 10Hashar: Do not copy Code-Review +2 [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1114966 [10:53:06] (03PS4) 10Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) [10:53:06] (03PS7) 10Brouberol: Add 
discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) [10:53:18] 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10503562 (10cmooney) 05Open→03Resolved Gonna close this one, all is stable after ~24h. [10:53:44] (03CR) 10Hashar: [V:03+2 C:03+2] Do not copy Code-Review +2 [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1114966 (owner: 10Hashar) [10:53:57] (03PS1) 10Cathal Mooney: Prometheus: change gnmi label rewrite from 'target' to 'source' [puppet] - 10https://gerrit.wikimedia.org/r/1114967 (https://phabricator.wikimedia.org/T369384) [10:54:16] (03CR) 10CI reject: [V:04-1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [10:54:30] (03CR) 10CI reject: [V:04-1] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [10:54:47] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114965 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:56:04] (03PS5) 10Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) [10:56:04] (03PS8) 10Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) [10:56:56] (03PS6) 10Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) [10:56:56] (03PS9) 10Brouberol: 
Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) [10:57:52] (03CR) 10AOkoth: [C:03+2] misweb: fix type error and service account [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114965 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:58:03] (03CR) 10CI reject: [V:04-1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [10:58:12] (03CR) 10CI reject: [V:04-1] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [10:59:23] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [10:59:28] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:00:05] effie and swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1100). 
[11:00:19] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[11:00:23] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[11:01:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72735 and previous config saved to /var/cache/conftool/dbconfig/20250129-110109-root.json
[11:01:24] (CR) Filippo Giunchedi: [V:+1] Prometheus: change gnmi label rewrite from 'target' to 'source' [puppet] - https://gerrit.wikimedia.org/r/1114967 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney)
[11:01:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72736 and previous config saved to /var/cache/conftool/dbconfig/20250129-110125-root.json
[11:01:32] !log aokoth@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:01:36] !log aokoth@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:02:33] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:02:38] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:02:41] (03PS2) 10Cathal Mooney: Prometheus: change gnmi label rewrite from 'target' to 'source' [puppet] - 10https://gerrit.wikimedia.org/r/1114967 (https://phabricator.wikimedia.org/T369384) [11:04:09] (03PS7) 10Brouberol: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) [11:04:09] (03PS10) 10Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) [11:05:09] (03CR) 10CI reject: [V:04-1] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:05:17] (03CR) 10CI reject: [V:04-1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:05:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2218.codfw.wmnet with reason: Maintenance [11:05:59] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:06:03] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:07:11] (03PS1) 10Federico Ceratto: es1025.yaml, site.pp, backup1002.cnf.erb: Remove es102 [puppet] - 10https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) [11:07:36] (03PS1) 10JMeybohm: Allow to install multiple kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) [11:07:45] (03PS8) 10Brouberol: airflow: deploy an envoy proxy alongside 
each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) [11:07:45] (03PS11) 10Brouberol: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) [11:07:57] (03CR) 10Marostegui: "Typo in the commit message, "Remove es102"" [puppet] - 10https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) (owner: 10Federico Ceratto) [11:10:07] (03CR) 10Effie Mouzeli: [C:03+1] shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:10:15] (03CR) 10Effie Mouzeli: [C:03+2] shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:10:36] (03PS1) 10MVernon: swift: remove drained eqiad nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056) [11:11:50] (03Merged) 10jenkins-bot: shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:12:26] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4882/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:13:20] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [11:13:26] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [11:13:31] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [11:14:12] !log 
jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [11:14:59] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:15:41] (03CR) 10Btullis: [C:03+1] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:15:55] (03PS2) 10JMeybohm: Allow to install multiple kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) [11:17:08] (03PS2) 10Federico Ceratto: es1025.yaml, site.pp, backup1002.cnf.erb: Remove es1025 [puppet] - 10https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) [11:18:27] (03CR) 10Brouberol: [C:03+2] airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:18:30] (03CR) 10Brouberol: [C:03+2] Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:19:25] (03CR) 10Effie Mouzeli: [C:03+2] shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:19:50] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4883/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:20:01] (03Merged) 10jenkins-bot: airflow: deploy an envoy proxy alongside each airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114386 
(https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:20:04] (03Merged) 10jenkins-bot: Add discovery listeners to airflow-analytics(-test) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114387 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [11:20:35] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "FWIW, the diff gets somewhat shorter if the list is sorted before and after the change – there’s still a fair amount of changes but also p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: 10Hnowlan) [11:21:11] (03Merged) 10jenkins-bot: shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:21:36] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [11:21:39] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [11:21:51] (03PS1) 10Urbanecm: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) [11:22:29] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [11:23:08] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [11:24:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [11:24:50] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2031.codfw.wmnet with reason: remove from cluster for reimage [11:24:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503653 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=7af53928-134c-4589-9808-e36a2bde4422) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [11:25:14] (03PS1) 10Urbanecm: [tests] Add MigrateConfigToCommunityTest [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) [11:25:16] (03PS1) 10Urbanecm: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) [11:26:20] (03CR) 10Kamila Součková: "LGTM but I haven't checked folder permissions on the hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [11:27:17] (03CR) 10Jcrespo: [V:03+1] "I have checked syntax is right, I have not checked they finished draining." [puppet] - 10https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [11:28:49] (03PS2) 10Hnowlan: fc-list: update font list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) [11:29:47] (03CR) 10Hnowlan: "Fair point, pushed a sorted list and looking much neater." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: 10Hnowlan) [11:31:56] (03PS1) 10Btullis: dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) [11:32:00] (03CR) 10Effie Mouzeli: [C:03+2] shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:32:07] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:05] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [11:33:13] (03Merged) 10jenkins-bot: shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:33:14] (03CR) 10Btullis: dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [11:34:29] (03PS1) 10Jelto: Revert "Do not copy Code-Review +2" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1114980 [11:35:25] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow2003.codfw.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [11:35:33] !log jiji@deploy2002 helmfile [codfw] START 
helmfile.d/services/shellbox-constraints: apply [11:35:40] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10503694 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36d26c8a-4d30-4345-8682-54b6b4882e38) set by cmooney@cumin1002 for 3:00:... [11:35:58] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [11:36:32] (03CR) 10Jelto: [V:03+2] Revert "Do not copy Code-Review +2" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1114980 (owner: 10Jelto) [11:37:15] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#10503706 (10MatthewVernon) [11:40:13] (03CR) 10Marostegui: [C:03+1] swift: remove drained eqiad nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [11:40:26] (03CR) 10Marostegui: [C:03+1] es1025.yaml, site.pp, backup1002.cnf.erb: Remove es1025 [puppet] - 10https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) (owner: 10Federico Ceratto) [11:40:41] (03CR) 10MVernon: [C:03+2] swift: remove drained eqiad nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/1114971 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [11:41:04] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "(Just to be clear, the list already wasn’t sorted before, so the ”fully” neat diff I had in mind would’ve required a separate change just " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: 10Hnowlan) [11:42:09] !log fceratto@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1025.eqiad.wmnet [11:49:01] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: 
Maintenance [11:51:10] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [11:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1200', diff saved to https://phabricator.wikimedia.org/P72737 and previous config saved to /var/cache/conftool/dbconfig/20250129-115700-marostegui.json [11:57:09] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1200.eqiad.wmnet [12:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1200). [12:00:29] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:00:51] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:02:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2211 T384994', diff saved to https://phabricator.wikimedia.org/P72738 and previous config saved to /var/cache/conftool/dbconfig/20250129-120213-marostegui.json [12:02:19] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994 [12:03:04] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2211.codfw.wmnet [12:03:08] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [12:03:11] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:03:21] (03PS3) 10Jcrespo: dbbackups: Remove set user permissions from m1 backup user grants [puppet] - 
10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) [12:03:22] (03PS1) 10Jcrespo: installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - 10https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977) [12:03:29] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1200.eqiad.wmnet [12:03:38] (03PS4) 10Jcrespo: dbbackups: Remove set user permissions from m1 backup user grants [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) [12:03:46] (03PS2) 10Jcrespo: installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - 10https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977) [12:04:01] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:04:15] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Index rebuild [12:04:29] (03CR) 10Jcrespo: [C:04-1] "We need to remove read only admin, too." 
[puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [12:06:05] jouncebot: nowandnext [12:06:06] For the next 0 hour(s) and 53 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1200) [12:06:06] In 1 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1400) [12:06:44] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:07:14] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:07:31] (03CR) 10Urbanecm: [C:03+2] [tests] Add ConfigWrapperTest [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114751 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:07:33] (03CR) 10Urbanecm: [C:03+2] Remove BabelCategorizeNamespaces from CommunityConfiguration [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:07:39] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2211.codfw.wmnet [12:07:42] (03CR) 10Urbanecm: [C:03+2] [tests] Add MigrateConfigToCommunityTest [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:07:44] (03CR) 10Urbanecm: [C:03+2] migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:07:59] (03CR) 10Urbanecm: [C:03+2] migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:08:18] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Index rebuild [12:08:55] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [12:08:55] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:08:56] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1025.eqiad.wmnet [12:09:52] (03CR) 10Federico Ceratto: [C:03+2] es1025.yaml, site.pp, backup1002.cnf.erb: Remove es1025 [puppet] - 10https://gerrit.wikimedia.org/r/1114969 (https://phabricator.wikimedia.org/T384912) (owner: 10Federico Ceratto) [12:20:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:20:46] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:20:49] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:25:39] (03PS1) 10Btullis: dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) [12:25:59] (03CR) 10CI reject: [V:04-1] dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [12:26:10] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10503810 (10FCeratto-WMF) 05In progress→03Open a:05FCeratto-WMF→03None [12:26:26] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10503816 
(10FCeratto-WMF) The host is ready for DC-ops [12:27:00] (03PS2) 10Btullis: dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) [12:27:04] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:27:08] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:27:41] (03Merged) 10jenkins-bot: [tests] Add ConfigWrapperTest [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114751 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:27:43] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4886/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [12:28:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:28:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:28:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:28:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:29:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" 
[extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:29:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:29:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:29:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:30:06] (03PS1) 10AOkoth: miscweb: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114992 (https://phabricator.wikimedia.org/T350794) [12:30:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:32:01] (03CR) 10Raymond Ndibe: "Reverting back on this, there is currently no way that I know of to stop kubeadm from regenerating this file David. 
The best approach righ" [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: 10Raymond Ndibe) [12:32:14] (03CR) 10Jelto: [C:03+1] "lgtm, this was missing in Iba37c095353b76bfaf1ee19228a4ec783b6239f9" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114992 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:32:57] (03CR) 10Slyngshede: [C:03+2] C:idm remove associate_by_email pipeline [puppet] - 10https://gerrit.wikimedia.org/r/1112224 (https://phabricator.wikimedia.org/T383707) (owner: 10Slyngshede) [12:33:16] !log Rebuild tables on dbstore1007 (s2, s3, s4) T384818 [12:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:21] T384818: Upgrade dbstore* hosts to 10.6.20 and rebuild tables - https://phabricator.wikimedia.org/T384818 [12:33:25] (03CR) 10AOkoth: [C:03+2] miscweb: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114992 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:34:08] !log Rebuild tables on dbstore1009 (s6 s8) T384818 [12:34:11] (03Merged) 10jenkins-bot: Remove BabelCategorizeNamespaces from CommunityConfiguration [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114752 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:12] (03Merged) 10jenkins-bot: [tests] Add MigrateConfigToCommunityTest [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114975 (https://phabricator.wikimedia.org/T383905) (owner: 10Urbanecm) [12:34:14] (03Merged) 10jenkins-bot: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114976 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:35:02] (03PS1) 10Lucas Werkmeister (WMDE): Handle missing `monthonly` format in MwTimeIsoFormatter 
[extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867) [12:35:15] (03PS1) 10Lucas Werkmeister (WMDE): Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) [12:35:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE)) [12:35:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE)) [12:35:41] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:35:44] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:37:08] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:37:11] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:37:53] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:37:57] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:41:26] (03PS1) 10Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) [12:42:30] RECOVERY - Host ms-fe1014 is UP: PING WARNING - Packet loss = 33%, RTA = 0.38 ms [12:43:00] PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication 
lag: 635.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:44:02] 👀 [12:44:04] PROBLEM - MariaDB Replica Lag: s6 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:45:18] (03PS1) 10Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) [12:45:43] that's a table rebuilding [12:46:13] That's strange [12:46:18] I downtimed it [12:46:23] I will do it again [12:46:51] Ah it failed apparently, anyway, doing it! [12:47:10] (03CR) 10Hashar: [V:03+2 C:03+2] "I think the issue is the section overrides all properties from the parent All-Projects when I guess I assumed it would extend it. So as th" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1114966 (owner: 10Hashar) [12:47:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: maintenance [12:47:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1009.eqiad.wmnet with reason: maintenance [12:48:54] PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[1154,1212].eqiad.wmnet with reason: maintenance [12:49:58] (03Merged) 10jenkins-bot: migrateConfigToCommunity: Deal with false category names [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1114973 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [12:50:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1212 T384807', diff saved to https://phabricator.wikimedia.org/P72741 and previous config saved to /var/cache/conftool/dbconfig/20250129-125015-marostegui.json [12:50:20] T384807: 
Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [12:50:35] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1114751|[tests] Add ConfigWrapperTest (T383905)]], [[gerrit:1114752|Remove BabelCategorizeNamespaces from CommunityConfiguration (T383905)]], [[gerrit:1114975|[tests] Add MigrateConfigToCommunityTest (T383905)]], [[gerrit:1114976|migrateConfigToCommunity: Deal with false category names (T384941)]], [[gerrit:1114973|migrateConfigToCommunity: Deal with [12:50:35] false category names (T384941)]] [12:50:40] T383905: Running extensions/Babel/maintenance/migrateConfigToCommunity.php with the default configuration fails on validation error - https://phabricator.wikimedia.org/T383905 [12:50:40] T384941: Setting wgBabelCategoryNames[level] to false is not supported by the migration script - https://phabricator.wikimedia.org/T384941 [12:50:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet with reason: maintenance [12:50:47] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1212.eqiad.wmnet [12:52:05] (03PS2) 10Arturo Borrero Gonzalez: cloudgw1003: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) [12:52:05] (03PS2) 10Arturo Borrero Gonzalez: cloudgw1004: take over cloudgw1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) [12:52:33] (03CR) 10Brouberol: [C:03+1] dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [12:52:38] (03CR) 10Elukey: [C:03+2] custom_deploy.d: rework dse-k8s-eqiad's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114743 (owner: 10Elukey) [12:54:33] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: 
apply [12:54:38] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:55:08] (03CR) 10CI reject: [V:04-1] Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE)) [12:56:05] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1212.eqiad.wmnet [12:56:36] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Index rebuild [12:57:56] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:58:00] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:00:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72742 and previous config saved to /var/cache/conftool/dbconfig/20250129-130031-root.json [13:00:44] (03PS1) 10Marostegui: installserver: Do not format es104* [puppet] - 10https://gerrit.wikimedia.org/r/1115000 [13:01:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2028.codfw.wmnet to cluster codfw and group A [13:02:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2028.codfw.wmnet to cluster codfw and group A [13:02:29] (03CR) 10MVernon: [C:03+1] installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - 10https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977) (owner: 10Jcrespo) [13:03:23] (03CR) 10Jcrespo: [C:03+2] installserver: Enable reimage of backup1013, backup1014, backup2013, backup2014 [puppet] - 10https://gerrit.wikimedia.org/r/1114986 (https://phabricator.wikimedia.org/T384977) (owner: 10Jcrespo) [13:03:58] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2211 (re)pooling @ 10%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72743 and previous config saved to /var/cache/conftool/dbconfig/20250129-130358-root.json [13:07:24] (03CR) 10Elukey: [C:03+1] Add a separate Hiera option to control the waterlines import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:07:48] scap's build-and-push-container-images is taking quite some time [13:07:54] since 12:51:48 [13:08:24] scap-image-build-and-push-log is not updating [13:10:10] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10503959 (10Clement_Goubert) Based on the calculation in my [[ https://docs.google.com/spreadsheets/d/18BokLsimZj-7XdQfTGLIP__11aDIJnbL0cqBNdLRXuY/edit?usp=sharing | balancing sheet... [13:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:37] (03PS1) 10Cathal Mooney: gNMIc: Add BGP stats collection for network devices [puppet] - 10https://gerrit.wikimedia.org/r/1115002 (https://phabricator.wikimedia.org/T369384) [13:11:42] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999#10503966 (10Clement_Goubert) →14Duplicate dup:03T383032 [13:12:08] (03PS8) 10Muehlenhoff: maps: Add a separate Hiera option to control the waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) [13:12:13] 06SRE, 10Phabricator, 06Traffic: Phabricator should cache tasks for a few minutes for 
logged-out users - https://phabricator.wikimedia.org/T274228#10503972 (10Aklapper) Does anyone have sufficient understanding to outline the next potential steps in the blurry territories between undermaintained Phorge upstr... [13:12:40] (03CR) 10Muehlenhoff: maps: Add a separate Hiera option to control the waterlines import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:13:00] !log installing runc security updates [13:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:26] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T384999#10503978 (10Clement_Goubert) Doing the dupe the other way around as T383032 for #abstract_wikipedia_team has been triaged by them already. [13:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72744 and previous config saved to /var/cache/conftool/dbconfig/20250129-131537-root.json [13:16:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10503998 (10MoritzMuehlenhoff) [13:16:52] (03CR) 10Volans: [C:03+1] "Puppet compiler seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe) [13:17:23] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2031 to nftables 
[puppet] - 10https://gerrit.wikimedia.org/r/1114950 (owner: 10Muehlenhoff) [13:18:53] ...pulling to testservers now... [13:18:54] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10504014 (10cmooney) Moving to //event-value-tag-v2// has been pushed out to all our Netflow VMs and we've seen a nice reduction in CPU usage, plus a... [13:19:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 25%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72745 and previous config saved to /var/cache/conftool/dbconfig/20250129-131903-root.json [13:20:35] (03CR) 10Volans: [C:03+1] "Not much to review here, did you review the harbor changelog between the 2 versions to ensure there is no backward incompatible change? LG" [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) (owner: 10Raymond Ndibe) [13:23:11] (03PS1) 10Arthur taylor: Remove `tmpAlwaysShowMulLanguageCode` temporary setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115006 (https://phabricator.wikimedia.org/T330217) [13:23:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:23:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72746 and previous config saved to /var/cache/conftool/dbconfig/20250129-132354-marostegui.json [13:24:00] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:27:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2031.codfw.wmnet with OS bookworm [13:27:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504057 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm [13:28:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [13:30:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72747 and previous config saved to /var/cache/conftool/dbconfig/20250129-133042-root.json [13:31:44] (03PS5) 10Arnaudb: nftables: add nftable docker manifest [puppet] - 10https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677) [13:31:52] (03PS3) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [13:34:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 50%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72748 and previous config saved to /var/cache/conftool/dbconfig/20250129-133408-root.json [13:35:44] (03CR) 10Jforrester: "recheck" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: 10Lucas Werkmeister (WMDE)) [13:36:22] (03PS2) 10Elukey: custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114749 [13:36:22] (03PS1) 10Elukey: custom_deploy.d: rework Istio ML's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115008 (https://phabricator.wikimedia.org/T369493) [13:39:34] (03CR) 10Klausman: [C:03+1] custom_deploy.d: rework Istio ML's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115008 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [13:41:41] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114751|[tests] Add ConfigWrapperTest 
(T383905)]], [[gerrit:1114752|Remove BabelCategorizeNamespaces from CommunityConfiguration (T383905)]], [[gerrit:1114975|[tests] Add MigrateConfigToCommunityTest (T383905)]], [[gerrit:1114976|migrateConfigToCommunity: Deal with false category names (T384941)]], [[gerrit:1114973|migrateConfigToCommunity: Deal with false category names (T384941)]] (duration: 51m 06s) [13:41:47] finally [13:41:47] T383905: Running extensions/Babel/maintenance/migrateConfigToCommunity.php with the default configuration fails on validation error - https://phabricator.wikimedia.org/T383905 [13:41:48] T384941: Setting wgBabelCategoryNames[level] to false is not supported by the migration script - https://phabricator.wikimedia.org/T384941 [13:41:55] almost an hour... [13:41:56] (CR) Muehlenhoff: [C:+2] maps: Add a separate Hiera option to control the waterlines import [puppet] - https://gerrit.wikimedia.org/r/1114769 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [13:42:25] (PS1) Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) [13:42:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2031.codfw.wmnet with OS bookworm [13:42:32] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504127 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm executed with errors:... 
[13:42:41] (PS1) Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) [13:42:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) (owner: Lucas Werkmeister (WMDE)) [13:42:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) (owner: Lucas Werkmeister (WMDE)) [13:43:06] let’s see how many of these backports I actually get through ^^ [13:43:19] (PS1) Arthur taylor: Add `enableMulLanguageCode` to replace `tmpEnableMulLanguageCode` [mediawiki-config] - https://gerrit.wikimedia.org/r/1115013 (https://phabricator.wikimedia.org/T330217) [13:43:50] (PS1) Filippo Giunchedi: vopsbot: sync db when needed [puppet] - https://gerrit.wikimedia.org/r/1115014 (https://phabricator.wikimedia.org/T375143) [13:44:06] RECOVERY - MariaDB Replica Lag: s6 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:45:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72749 and previous config saved to /var/cache/conftool/dbconfig/20250129-134547-root.json [13:45:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2031.codfw.wmnet with OS bookworm [13:45:58] SRE, Ganeti, 
Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504147 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm [13:47:05] (PS13) Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [13:47:49] ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10504173 (RobH) [13:48:20] SRE, Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10504174 (MoritzMuehlenhoff) [13:48:26] ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10504176 (RobH) Copying over the explanation of hostname breakdown from the purchasing task. >>! In T382899#10503772, @Clement_Goubert wrote: > Updated list of hostnames because... 
[13:49:02] ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10504188 (RobH) [13:49:12] (PS1) Arthur taylor: Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) [13:49:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72750 and previous config saved to /var/cache/conftool/dbconfig/20250129-134912-root.json [13:49:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [13:49:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72751 and previous config saved to /var/cache/conftool/dbconfig/20250129-134927-marostegui.json [13:49:32] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:53:33] (CR) Volans: "Reading the backlog of the code review and the task this seems quite a rabbit hole. 
I'm not sure I'm familiar enough to judge if this is s" [puppet] - https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: Raymond Ndibe) [13:55:11] (PS1) Jcrespo: backup: Temporary setup of backup101[34], backup201[34] [puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) [13:57:51] (PS2) Jcrespo: backup: Temporary setup of backup101[34], backup201[34] [puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) [13:58:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [13:58:38] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504236 (ops-monitoring-bot) Draining ganeti2029.codfw.wmnet of running VMs [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1400) [14:00:05] Lucas_WMDE and hnowlan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] o/ [14:00:10] I can deploy! 
[14:00:35] I’ll start with my config change and then do hnowlan’s before continuing with my backports, the backports aren’t urgent [14:00:42] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1111336 (owner: JHathaway) [14:00:46] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: Lucas Werkmeister (WMDE)) [14:00:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72752 and previous config saved to /var/cache/conftool/dbconfig/20250129-140052-root.json [14:01:15] o/ [14:01:30] my change is more or less cosmetic, no impact [14:01:34] (Merged) jenkins-bot: Enable mul language code on Wikidata (full release) [mediawiki-config] - https://gerrit.wikimedia.org/r/1114727 (https://phabricator.wikimedia.org/T312176) (owner: Lucas Werkmeister (WMDE)) [14:01:43] ok [14:02:05] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114727|Enable mul language code on Wikidata (full release) (T312176)]] [14:02:10] T312176: MUL - Phased rollout on Wikidata.org (Stage 3 of 3: Full release) - https://phabricator.wikimedia.org/T312176 [14:02:14] then let’s say I +2 my four backports, and we’ll see if they make it through gate-and-submit before or after we get to your config change? 
^^ [14:02:19] sgtm [14:02:38] oh, I suppose I also need to rebase them on one another anyway ^^ [14:02:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [14:02:51] (PS2) Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) [14:02:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [14:03:04] (PS2) Lucas Werkmeister (WMDE): Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) [14:03:09] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504306 (ops-monitoring-bot) Draining ganeti2029.codfw.wmnet of running VMs [14:03:23] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: Lucas Werkmeister (WMDE)) [14:03:33] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867) (owner: Lucas Werkmeister (WMDE)) [14:03:43] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) (owner: Lucas Werkmeister (WMDE)) [14:03:53] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.14) - 
https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) (owner: Lucas Werkmeister (WMDE)) [14:04:10] (CR) TChin: [C:+2] Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: Xcollazo) [14:04:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72753 and previous config saved to /var/cache/conftool/dbconfig/20250129-140418-root.json [14:04:34] (CR) Filippo Giunchedi: [C:+1] "Can't say I fully understand what's going on but LGTM to my untrained eye" [puppet] - https://gerrit.wikimedia.org/r/1115002 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney) [14:04:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P72754 and previous config saved to /var/cache/conftool/dbconfig/20250129-140434-marostegui.json [14:04:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [14:06:39] (Merged) jenkins-bot: Scale down mw-content-history-reconcile-enrich for nominal events intake [deployment-charts] - https://gerrit.wikimedia.org/r/1114790 (https://phabricator.wikimedia.org/T382953) (owner: Xcollazo) [14:07:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [14:07:42] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1114727|Enable mul language code on Wikidata (full release) (T312176)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:46] T312176: MUL - Phased rollout on Wikidata.org (Stage 3 of 3: Full release) - https://phabricator.wikimedia.org/T312176 [14:07:52] SRE, Ganeti, 
Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504374 (ops-monitoring-bot) Draining ganeti2030.codfw.wmnet of running VMs [14:08:10] working for me on https://www.wikidata.org/wiki/Q107133815 – with k8s-mwdebug I see the mul row \o/ [14:08:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:09:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:09:36] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2031.codfw.wmnet with reason: host reimage [14:09:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:11:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [14:13:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2031.codfw.wmnet with reason: host reimage [14:14:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [14:15:13] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504384 (ops-monitoring-bot) Draining ganeti2030.codfw.wmnet of running VMs [14:16:32] (CR) Ottomata: [C:+1] "TY" [puppet] - https://gerrit.wikimedia.org/r/1114806 (https://phabricator.wikimedia.org/T383914) (owner: Aqu) [14:16:42] (CR) Andrew Bogott: [C:+1] cloudgw1003: take over cloudgw1001 [puppet] - https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: Arturo Borrero Gonzalez) [14:17:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:17:47] (CR) Andrew Bogott: [C:+1] cloudgw1004: take over cloudgw1002 [puppet] - 
https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: Arturo Borrero Gonzalez) [14:18:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:19:04] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114727|Enable mul language code on Wikidata (full release) (T312176)]] (duration: 16m 58s) [14:19:08] T312176: MUL - Phased rollout on Wikidata.org (Stage 3 of 3: Full release) - https://phabricator.wikimedia.org/T312176 [14:19:21] zuul says 8 more minutes for my backports [14:19:26] hnowlan: want to self-service your config change? [14:19:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P72755 and previous config saved to /var/cache/conftool/dbconfig/20250129-141941-marostegui.json [14:19:50] (also, I just filed T385037 for an issue I mentioned in here yesterday [possibly Monday, not sure]) [14:19:50] T385037: mwdebug dashboard on logstash is full of "Failed to connect to exporter" messages (tracing channel) since 7 January - https://phabricator.wikimedia.org/T385037 [14:20:56] (PS3) Jcrespo: backup: Temporary setup of backup101[34], backup201[34] [puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) [14:21:13] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) (owner: Jcrespo) [14:22:54] (PS1) Brouberol: airflow: fix envoy service port names [deployment-charts] - https://gerrit.wikimedia.org/r/1115025 [14:23:08] Lucas_WMDE: sure, thanks [14:23:17] ok :) [14:24:09] (CR) TrainBranchBot: [C:+2] "Approved by hnowlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan) [14:25:04] (CR) Brouberol: 
[C:+2] airflow: fix envoy service port names [deployment-charts] - https://gerrit.wikimedia.org/r/1115025 (owner: Brouberol) [14:25:04] (Merged) jenkins-bot: fc-list: update font list [mediawiki-config] - https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) (owner: Hnowlan) [14:25:31] !log hnowlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1114398|fc-list: update font list (T280718)]] [14:25:36] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 [14:25:40] (CR) Xcollazo: [C:+1] "LGTM!!" [puppet] - https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: Btullis) [14:26:48] (CR) Jcrespo: "noop: https://puppet-compiler.wmflabs.org/output/1115020/2838/" [puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) (owner: Jcrespo) [14:27:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:27:39] (CR) Xcollazo: dumps: Re-enable the enwiki dumps on snapshot1012 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: Btullis) [14:27:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:28:00] (Merged) jenkins-bot: Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114995 (https://phabricator.wikimedia.org/T384867) (owner: Lucas Werkmeister (WMDE)) [14:28:03] (Merged) jenkins-bot: Handle missing `monthonly` format in MwTimeIsoFormatter [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1114994 (https://phabricator.wikimedia.org/T384867) (owner: Lucas Werkmeister (WMDE)) [14:28:15] !log brouberol@deploy2002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:28:39] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - https://gerrit.wikimedia.org/r/1115028 (https://phabricator.wikimedia.org/T380103) [14:28:45] (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344 [deployment-charts] - https://gerrit.wikimedia.org/r/1115029 (https://phabricator.wikimedia.org/T359562) [14:28:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:32:18] !log hnowlan@deploy2002 hnowlan: Backport for [[gerrit:1114398|fc-list: update font list (T280718)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:32:23] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 [14:32:38] (PS1) Ssingh: varnish: add schoolwiki.in to allowed maps domains [puppet] - https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210) [14:32:50] (fyi, I’m testing repro steps for my backports, so sorry for a bit of noise in the mwdebug logstash during your deploy) [14:32:55] SRE, Maps, Traffic, Patch-For-Review: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10504478 (ssingh) @MSantos: Hi! This is pending your approval but otherwise is a simple patch to merge. 
[14:33:02] (hopefully not bad enough to trip the scap canaries or anything ^^) [14:33:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1185 db2178 T384994', diff saved to https://phabricator.wikimedia.org/P72756 and previous config saved to /var/cache/conftool/dbconfig/20250129-143317-marostegui.json [14:33:22] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994 [14:33:25] (Merged) jenkins-bot: Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1115012 (https://phabricator.wikimedia.org/T384963) (owner: Lucas Werkmeister (WMDE)) [14:33:25] ops-eqiad, SRE, cloud-services-team, DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10504480 (Papaul) @Andrew anything dc-ops need to do on this task? [14:33:25] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4889/co" [puppet] - https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210) (owner: Ssingh) [14:33:27] (Merged) jenkins-bot: Handle null date format in MwDateFormatParserFactory [extensions/Wikibase] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1115010 (https://phabricator.wikimedia.org/T384963) (owner: Lucas Werkmeister (WMDE)) [14:33:33] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1185.eqiad.wmnet [14:33:39] (PS1) Cory Massaro: wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - https://gerrit.wikimedia.org/r/1115032 (https://phabricator.wikimedia.org/T139010) [14:33:39] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2178.codfw.wmnet [14:33:59] !log hnowlan@deploy2002 hnowlan: Continuing with sync [14:34:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ganeti2031.codfw.wmnet with OS bookworm [14:34:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72757 and previous config saved to /var/cache/conftool/dbconfig/20250129-143448-marostegui.json [14:34:54] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:35:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:35:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T384592)', diff saved to https://phabricator.wikimedia.org/P72758 and previous config saved to /var/cache/conftool/dbconfig/20250129-143510-marostegui.json [14:35:35] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4888/co" [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm) [14:36:09] (Abandoned) Cory Massaro: wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - https://gerrit.wikimedia.org/r/1115032 (https://phabricator.wikimedia.org/T139010) (owner: Cory Massaro) [14:36:23] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10504514 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2031.codfw.wmnet with OS bookworm completed: - ganeti203... 
[14:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:04] (CR) Jforrester: wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1115032 (https://phabricator.wikimedia.org/T139010) (owner: Cory Massaro) [14:39:44] (PS1) Andrew Bogott: Horizon: update release version for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1115035 (https://phabricator.wikimedia.org/T380081) [14:39:55] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1185.eqiad.wmnet [14:40:14] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2178.codfw.wmnet [14:40:34] (CR) Andrew Bogott: [C:+2] Horizon: update release version for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1115035 (https://phabricator.wikimedia.org/T380081) (owner: Andrew Bogott) [14:40:40] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Index rebuild [14:40:58] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Index rebuild [14:41:14] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:41:31] !log hnowlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114398|fc-list: update font list (T280718)]] (duration: 16m 00s) [14:41:36] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 [14:42:26] hnowlan: can I continue with the backports? 
[14:42:26] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti2031.codfw.wmnet [14:42:49] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:42:59] (PS1) MVernon: swift: remove ms-be105[1-9] from profile::swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/1115038 (https://phabricator.wikimedia.org/T382056) [14:44:51] (PS3) JMeybohm: Allow to install multiple kubectl versions [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) [14:45:16] (CR) Marostegui: [C:+1] swift: remove ms-be105[1-9] from profile::swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/1115038 (https://phabricator.wikimedia.org/T382056) (owner: MVernon) [14:45:20] I’ll assume it’s okay for me to continue deploying [14:45:29] please do, sorry! [14:45:33] ops-eqiad, SRE, cloud-services-team, DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10504734 (Andrew) >>! In T380805#10504480, @Papaul wrote: > @Andrew anything dc-ops need to do on this task? Not immediately! Valerie has already moved and set up two of them, we... [14:45:35] ok thanks! 
[14:45:48] (CR) MVernon: [C:+2] swift: remove ms-be105[1-9] from profile::swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/1115038 (https://phabricator.wikimedia.org/T382056) (owner: MVernon) [14:46:04] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1114995|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1114994|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1115012|Handle null date format in MwDateFormatParserFactory (T384963)]], [[gerrit:1115010|Handle null date format in MwDateFormatParserFactory (T384963)]] [14:46:11] T384867: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384867 [14:46:11] T384963: PHP Deprecated: strlen(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384963 [14:46:46] (CR) Jcrespo: "@btullis I moved an-redacteddb1001 as it looked weird." 
[puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) (owner: Jcrespo) [14:49:19] (PS1) Andrew Bogott: Revert "Horizon: update release version for codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/1115041 [14:50:18] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10504749 (VRiley-WMF) [14:50:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:50:32] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1114995|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1114994|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1115012|Handle null date format in MwDateFormatParserFactory (T384963)]], [[gerrit:1115010|Handle null date format in MwDateFormatParserFactory (T384963)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:50:55] (CR) Andrew Bogott: [C:+2] Revert "Horizon: update release version for codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/1115041 (owner: Andrew Bogott) [14:51:21] looks good to me \o/ [14:51:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:51:42] (PS1) Ottomata: EventStreamConfig - prep for per stream user agent collection config [mediawiki-config] - https://gerrit.wikimedia.org/r/1115042 (https://phabricator.wikimedia.org/T382173) [14:52:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.580 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:57] RECOVERY - mailman 
archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[1051-1059].eqiad.wmnet [14:53:33] (PS4) JMeybohm: Allow to install multiple kubectl versions [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) [14:54:42] !log repool ncredir4002 [14:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:25] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114995|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1114994|Handle missing `monthonly` format in MwTimeIsoFormatter (T384867)]], [[gerrit:1115012|Handle null date format in MwDateFormatParserFactory (T384963)]], [[gerrit:1115010|Handle null date format in MwDateFormatParserFactory (T384963)]] (duration: 12m 20s) [14:58:30] T384867: PHP Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T384867 [14:58:31] T384963: PHP Deprecated: strlen(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384963 [14:58:40] !log UTC afternoon backport+config window done [14:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:44] even just before the end of the window \o/ [14:59:52] (CR) JMeybohm: [V:+1] "PCC SUCCESS (NOOP 7 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4891/" [puppet] - https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm) [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1500) [15:02:01] !log 
jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: upgrade kernel and rebuilding tables [15:02:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10504784 (10VRiley-WMF) [15:02:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T384592)', diff saved to https://phabricator.wikimedia.org/P72761 and previous config saved to /var/cache/conftool/dbconfig/20250129-150236-marostegui.json [15:02:41] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:58] (03PS1) 10AOkoth: miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) [15:03:06] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2095,2175,2186].codfw.wmnet [15:03:15] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10504805 (10ops-monitoring-bot) depool host wikikube-worker[2095,2175,2186].codfw.wmnet by jayme@cumin1002 with... 
[15:03:22] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikikube-worker[2095,2175,2186].codfw.wmnet with reason: Depooled via sre.k8s.pool-depool-node [15:03:57] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115028 (https://phabricator.wikimedia.org/T380103) (owner: 10Jforrester) [15:05:03] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115028 (https://phabricator.wikimedia.org/T380103) (owner: 10Jforrester) [15:05:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2095,2175,2186].codfw.wmnet [15:05:47] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10504823 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool fo... [15:05:52] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-02-12 15:04:52. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:16] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-02-12 15:06:05. 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:25] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:06:51] (03PS1) 10Brouberol: Disable the sidecar controller from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115045 (https://phabricator.wikimedia.org/T384329) [15:06:53] (03PS1) 10Brouberol: dse-k8s-eqiad: delete the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115046 (https://phabricator.wikimedia.org/T384329) [15:06:58] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:07:39] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10504828 (10JMeybohm) @Jhancock.wm wikikube-worker[2095,2175,2186].codfw.wmnet have been shut down, lmk when you... [15:08:46] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:09:17] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:09:17] PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:09:19] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:09:44] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:10:00] (03PS2) 10AOkoth: miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 
(https://phabricator.wikimedia.org/T350794) [15:10:21] (03CR) 10Jelto: miscweb: remove kubectl cronjob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:10:25] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:11:15] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115045 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [15:11:27] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:11:29] (03PS3) 10AOkoth: miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) [15:11:31] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10504843 (10lmata) [15:11:40] (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: delete the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115046 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [15:11:53] (03CR) 10Brouberol: [C:03+2] Disable the sidecar controller from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115045 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [15:12:00] (03CR) 10AOkoth: miscweb: remove kubectl cronjob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:12:36] (03CR) 10Hnowlan: [C:04-1] "linting URL pattern is not compliant with the gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1112815 (https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan) [15:12:50] (03CR) 10Hnowlan: [C:04-1] "linting URL pattern is not compliant with the gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1112800 
(https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan) [15:13:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:13:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:13:44] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: delete the sidecar-controller ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115046 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [15:14:32] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115029 (https://phabricator.wikimedia.org/T359562) (owner: 10Jforrester) [15:14:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:15:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:15:46] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115029 (https://phabricator.wikimedia.org/T359562) (owner: 10Jforrester) [15:15:49] (03CR) 10Lucas Werkmeister (WMDE): "LGTM in general but needs a rebase after I920abe8e23 ^^" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115006 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [15:16:10] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:16:42] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:17:41] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:17:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P72762 and previous config saved to /var/cache/conftool/dbconfig/20250129-151743-marostegui.json [15:18:05] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: 
decommission ms-fe105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049 (10MatthewVernon) 03NEW [15:18:45] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:18:50] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:19:01] (03PS4) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) [15:19:44] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#10504908 (10MatthewVernon) [15:19:49] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:21:24] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Looks okay to me but could also be simplified further." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115013 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [15:22:44] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [15:22:55] (03Abandoned) 10Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:23:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [15:24:00] (03PS15) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [15:24:00] (03PS5) 10JMeybohm: Allow to install multiple kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) [15:25:33] (03CR) 10Clément Goubert: [C:03+1] Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:25:38] (03PS2) 10Muehlenhoff: postgresql::server: 
Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1108707 [15:25:40] (03CR) 10Hnowlan: [C:03+1] Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:26:25] (03PS1) 10Elukey: services: set the Tegola's cluster local endpoint for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115049 (https://phabricator.wikimedia.org/T384530) [15:26:43] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1051-1059].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [15:27:13] (03CR) 10Elukey: [C:03+2] custom_deploy.d: remove ML-specific bits from DSE's istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114749 (owner: 10Elukey) [15:27:27] (03CR) 10Elukey: [C:03+2] custom_deploy.d: rework Istio ML's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115008 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:27:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1051-1059].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [15:27:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1051-1059].eqiad.wmnet [15:27:52] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#10504958 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[1051-1059].eqiad.wmnet` - ms-be1051.eqiad.wmnet (**PASS**) - Dow... 
[15:28:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72763 and previous config saved to /var/cache/conftool/dbconfig/20250129-152801-root.json [15:28:56] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1115050 (https://phabricator.wikimedia.org/T385051) [15:29:44] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [15:29:44] (03CR) 10CI reject: [V:04-1] svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) (owner: 10Hnowlan) [15:30:05] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4892/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:30:56] (03CR) 10Ottomata: [C:03+2] EventStreamConfig - prep for per stream user agent collection config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115042 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata) [15:31:38] (03Merged) 10jenkins-bot: EventStreamConfig - prep for per stream user agent collection config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115042 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata) [15:32:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P72764 and previous config saved to /var/cache/conftool/dbconfig/20250129-153250-marostegui.json [15:33:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ganeti2031.codfw.wmnet [15:33:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti2031.codfw.wmnet [15:36:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [15:36:19] (03CR) 10Jelto: [C:03+1] "lgtm now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:37:03] FYI am going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1115042 to prep for an eventgate deployment. Should be a no-op for now. [15:37:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72765 and previous config saved to /var/cache/conftool/dbconfig/20250129-153708-root.json [15:37:11] 10ops-magru, 06DC-Ops: Power supply failure (PSU) for cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10505017 (10RobH) [15:37:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108707 (owner: 10Muehlenhoff) [15:37:39] !log otto@deploy2002 Started scap sync-world: Backport for [[gerrit:1115042|EventStreamConfig - prep for per stream user agent collection config (T382173)]] [15:37:44] T382173: Enable Event Platform instruments to opt out of collecting User-Agent data - https://phabricator.wikimedia.org/T382173 [15:37:58] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10505023 (10RobH) [15:38:00] (03PS1) 10Vgutierrez: lvs: Extend alerts to liberica cluster [alerts] - 10https://gerrit.wikimedia.org/r/1115054 [15:38:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72766 and previous config saved to
/var/cache/conftool/dbconfig/20250129-153840-root.json [15:40:24] (03PS6) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) [15:42:06] (03CR) 10Ssingh: [C:03+1] lvs: Extend alerts to liberica cluster (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1115054 (owner: 10Vgutierrez) [15:42:26] !log otto@deploy2002 otto: Backport for [[gerrit:1115042|EventStreamConfig - prep for per stream user agent collection config (T382173)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:42:31] (03CR) 10Arturo Borrero Gonzalez: [C:04-1] "let's replace cloudgw1002 first (so, rebase this patch)" [puppet] - 10https://gerrit.wikimedia.org/r/1114998 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [15:42:37] (03CR) 10Elukey: [V:03+1 C:03+2] kubernetes: remove ad-hoc CNI config from dse-k8s-worker [puppet] - 10https://gerrit.wikimedia.org/r/1114753 (owner: 10Elukey) [15:42:38] (03CR) 10Arturo Borrero Gonzalez: [C:04-1] "let's replace cloudgw1002 first (so, rebase this patch)" [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [15:43:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72767 and previous config saved to /var/cache/conftool/dbconfig/20250129-154306-root.json [15:44:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [15:45:47] (03CR) 10Vgutierrez: [C:03+2] lvs: Extend alerts to liberica cluster [alerts] - 10https://gerrit.wikimedia.org/r/1115054 (owner: 10Vgutierrez) [15:47:21] (03Merged) 10jenkins-bot: lvs: Extend alerts to liberica cluster [alerts] - 10https://gerrit.wikimedia.org/r/1115054 (owner: 10Vgutierrez) [15:47:54] !log 
otto@deploy2002 otto: Continuing with sync [15:47:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T384592)', diff saved to https://phabricator.wikimedia.org/P72768 and previous config saved to /var/cache/conftool/dbconfig/20250129-154757-marostegui.json [15:48:01] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:48:03] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:48:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T384592)', diff saved to https://phabricator.wikimedia.org/P72769 and previous config saved to /var/cache/conftool/dbconfig/20250129-154807-marostegui.json [15:50:21] (03CR) 10Federico Ceratto: [C:03+1] installserver: Do not format es104* [puppet] - 10https://gerrit.wikimedia.org/r/1115000 (owner: 10Marostegui) [15:51:20] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es104* [puppet] - 10https://gerrit.wikimedia.org/r/1115000 (owner: 10Marostegui) [15:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:52:13] (03PS7) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) [15:52:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72770 and previous config saved to /var/cache/conftool/dbconfig/20250129-155213-root.json [15:53:46] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2178 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72771 and previous config saved to /var/cache/conftool/dbconfig/20250129-155345-root.json [15:54:48] !log otto@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115042|EventStreamConfig - prep for per stream user agent collection config (T382173)]] (duration: 17m 08s) [15:54:52] T382173: Enable Event Platform instruments to opt out of collecting User-Agent data - https://phabricator.wikimedia.org/T382173 [15:57:09] (03CR) 10Ottomata: [C:03+2] eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [15:58:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72772 and previous config saved to /var/cache/conftool/dbconfig/20250129-155812-root.json [15:58:32] (03Merged) 10jenkins-bot: eventgate - templatize module name, default to @eventgate/wikimedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114795 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [15:59:18] (03PS1) 10Hnowlan: trafficserver: directly route to citoid on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1115056 (https://phabricator.wikimedia.org/T361576) [15:59:58] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095 [15:59:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:00:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095 [16:00:04] swfrench-wmf: That opportune time for a MediaWiki infrastructure (UTC late one-off) deploy is upon us again. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1600). [16:00:32] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095 [16:00:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095 [16:00:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:02:07] RECOVERY - MariaDB Replica Lag: s2 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:02:09] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095 [16:02:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095 [16:02:36] (03CR) 10Muehlenhoff: [C:03+2] postgresql::server: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1108707 (owner: 10Muehlenhoff) [16:02:43] o/ [16:02:50] I'll get started shortly [16:04:19] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:04:37] (03PS1) 10Muehlenhoff: Switch ganeti2030 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1115057 [16:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:06:29] (03Merged) 10jenkins-bot: Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:06:46] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - 
https://phabricator.wikimedia.org/T384288#10505148 (10RobH) Picked this back up, it had gotten neglected due to not being assigned to me and not having the ops-ulsfo tag and I shou... [16:06:46] (03PS1) 10Urbanecm: migrateConfigToCommunity: Handle false BabelMainCategory [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115059 (https://phabricator.wikimedia.org/T384941) [16:06:53] 06SRE, 06SRE Observability: logstash.rb uses deprecated Socket.gethostbyname - https://phabricator.wikimedia.org/T385058 (10MatthewVernon) 03NEW [16:06:54] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10505158 (10RobH) a:05cmooney→03RobH [16:06:58] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1114793|Enroll 5% of client sessions in PHP 8.1 (T383845)]] [16:07:03] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:07:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72773 and previous config saved to /var/cache/conftool/dbconfig/20250129-160718-root.json [16:08:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72774 and previous config saved to /var/cache/conftool/dbconfig/20250129-160850-root.json [16:09:00] (03PS1) 10Ottomata: eventgate-analytics-external - bump to v1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115061 (https://phabricator.wikimedia.org/T382173) [16:09:30] (03PS1) 10Urbanecm: migrateConfigToCommunity: Handle false BabelMainCategory [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115062 (https://phabricator.wikimedia.org/T384941) [16:09:36] (03PS1) 10Hnowlan: kubernetes: reimage two 
jobrunners to workers [puppet] - 10https://gerrit.wikimedia.org/r/1115063 (https://phabricator.wikimedia.org/T354791) [16:10:09] (03PS2) 10Muehlenhoff: postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - 10https://gerrit.wikimedia.org/r/1108710 [16:10:32] (03CR) 10CI reject: [V:04-1] postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff) [16:11:02] swfrench-wmf: I'd like to do an eventgate-analytics-external deployment only to staging for now (meetings starting). [16:11:02] that okay with you? [16:11:27] (03PS3) 10Muehlenhoff: postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - 10https://gerrit.wikimedia.org/r/1108710 [16:12:19] ottomata: no objections to update staging - thanks for checking! [16:12:19] proceeding to deploy but only in staging [16:12:22] thanks! [16:12:25] (03CR) 10Ottomata: [C:03+2] eventgate-analytics-external - bump to v1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115061 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata) [16:13:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72775 and previous config saved to /var/cache/conftool/dbconfig/20250129-161317-root.json [16:13:37] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1114793|Enroll 5% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:13:41] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:13:53] (03Merged) 10jenkins-bot: eventgate-analytics-external - bump to v1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115061 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata) [16:13:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T384592)', diff 
saved to https://phabricator.wikimedia.org/P72776 and previous config saved to /var/cache/conftool/dbconfig/20250129-161353-marostegui.json [16:13:59] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:14:38] !log installing glib2.0 security updates [16:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:48] !log swfrench@deploy2002 swfrench: Continuing with sync [16:15:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff) [16:15:52] (03CR) 10Effie Mouzeli: [C:03+1] Enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114793 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:16:23] (03CR) 10Muehlenhoff: postgresql::dirs: Use wmflib::debian_postgresql_version() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff) [16:21:59] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114793|Enroll 5% of client sessions in PHP 8.1 (T383845)]] (duration: 15m 00s) [16:22:04] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:22:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72777 and previous config saved to /var/cache/conftool/dbconfig/20250129-162224-root.json [16:23:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72778 and previous config saved to /var/cache/conftool/dbconfig/20250129-162355-root.json [16:25:46] (03PS1) 10Hashar: Do not copy Code-Review +2 (take 2) [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1115068 [16:27:15] (03PS1) 10Ilias Sarantopoulos: amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734) [16:28:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Repooling after rebuild index $TASKID', diff saved to https://phabricator.wikimedia.org/P72780 and previous config saved to /var/cache/conftool/dbconfig/20250129-162822-root.json [16:28:49] (03CR) 10Ilias Sarantopoulos: "The output of docker-pkg -c config.yaml build images/ --select "*pytorch25*"" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734) (owner: 10Ilias Sarantopoulos) [16:29:00] (03CR) 10Kamila Součková: [C:03+1] kubernetes: reimage two jobrunners to workers [puppet] - 10https://gerrit.wikimedia.org/r/1115063 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:29:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P72781 and previous config saved to /var/cache/conftool/dbconfig/20250129-162900-marostegui.json [16:29:56] FYI, I'm done with the window [16:30:05] (03PS1) 10Muehlenhoff: wikilabels::db: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1115070 [16:30:49] (03CR) 10Klausman: [C:03+1] amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734) (owner: 10Ilias Sarantopoulos) [16:32:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115070 (owner: 10Muehlenhoff) [16:32:56] !log installing util-linux bugfix updates from bookworm point release [16:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:11] (03PS1) 10Federico Ceratto: preseed.yaml: add comments around DB data safety [puppet] - 10https://gerrit.wikimedia.org/r/1115072 [16:34:52] (03PS1) 10Elukey: 
kartotherian: update config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115073 [16:35:47] (03CR) 10Hnowlan: [C:03+2] kubernetes: reimage two jobrunners to workers [puppet] - 10https://gerrit.wikimedia.org/r/1115063 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:36:17] (03PS1) 10Sergio Gimeno: beta wgEventStreams: opt out collecting user agent for HelpPanel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115074 (https://phabricator.wikimedia.org/T382173) [16:37:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72782 and previous config saved to /var/cache/conftool/dbconfig/20250129-163729-root.json [16:39:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72783 and previous config saved to /var/cache/conftool/dbconfig/20250129-163901-root.json [16:39:43] (03CR) 10Btullis: [C:03+2] dumps: Use the analytics replicas by default for dumps 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/1114978 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [16:41:02] (03CR) 10JHathaway: [C:03+1] postgresql::dirs: Use wmflib::debian_postgresql_version() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff) [16:41:12] !log hnowlan@cumin2002 START - Cookbook sre.hosts.rename from mw2410 to wikikube-worker2242 [16:41:30] !log hnowlan@cumin2002 START - Cookbook sre.hosts.rename from mw2411 to wikikube-worker2243 [16:41:35] !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox [16:41:39] (03CR) 10JHathaway: [C:03+2] kafka_shipper: when disabled, don't render templates [puppet] - 10https://gerrit.wikimedia.org/r/1111336 (owner: 10JHathaway) [16:43:06] (03PS3) 10Btullis: dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - 10https://gerrit.wikimedia.org/r/1114991 
(https://phabricator.wikimedia.org/T382947) [16:44:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P72784 and previous config saved to /var/cache/conftool/dbconfig/20250129-164406-marostegui.json [16:46:19] !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2410 to wikikube-worker2242 - hnowlan@cumin2002" [16:46:34] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs4009 as role(liberica) [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) [16:46:39] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2410 to wikikube-worker2242 - hnowlan@cumin2002" [16:46:39] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:40] !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2242 [16:46:50] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2242 [16:47:00] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2410 to wikikube-worker2242 [16:47:19] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin2002 from mw2410 to wikikube-worker2242 completed: - mw2410 (**PASS**) - ✔️ Down... 
[16:47:25] !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox [16:48:01] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10505336 (10MoritzMuehlenhoff) [16:48:43] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2242.codfw.wmnet with OS bookworm [16:48:53] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:48:53] !log hnowlan@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2242 [16:49:03] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505337 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host wikikube-worker2242.codfw.wmnet with OS bookworm [16:50:04] (03CR) 10Marostegui: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1115072 (owner: 10Federico Ceratto) [16:51:17] !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2411 to wikikube-worker2243 - hnowlan@cumin2002" [16:51:22] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2411 to wikikube-worker2243 - hnowlan@cumin2002" [16:51:22] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:51:23] !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2243 [16:51:23] !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox [16:51:37] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:51:40] (03CR) 10Jgiannelos: [C:03+1] kartotherian: update config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115073 (owner: 10Elukey) [16:51:50] (03CR) 10Xcollazo: [C:03+1] dumps: Re-enable the enwiki dumps on snapshot1012 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [16:51:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:51:58] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-worker[2095,2175,2186].codfw.wmnet with reason: extending downtime [16:52:09] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10505352 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c24ad8f7-3e57-4f83-8a1f-c507313e344... 
[16:53:15] (03CR) 10Vgutierrez: "experimental check fails cause the interface name for the main NIC doesn't match between bullseye and bookworm :facepalm:" [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:54:21] (03CR) 10Elukey: [C:03+2] services: set the Tegola's cluster local endpoint for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115049 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:54:27] (03CR) 10Elukey: [C:03+2] kartotherian: update config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115073 (owner: 10Elukey) [16:56:28] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:56:36] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2243 [16:56:46] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:56:46] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2411 to wikikube-worker2243 [16:57:05] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505376 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin2002 from mw2411 to wikikube-worker2243 completed: - mw2411 (**PASS**) - ✔️ Down... 
[16:57:30] !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2242 - hnowlan@cumin2002" [16:57:34] !log hnowlan@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2243.codfw.wmnet on all recursors [16:57:35] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2242 - hnowlan@cumin2002" [16:57:35] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:57:36] !log hnowlan@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2242.codfw.wmnet 113.0.192.10.in-addr.arpa 3.1.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:57:37] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2243.codfw.wmnet on all recursors [16:57:39] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2242.codfw.wmnet 113.0.192.10.in-addr.arpa 3.1.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:57:40] !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2242 [16:58:00] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2242 [16:58:00] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2242 [16:58:28] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2243.codfw.wmnet with OS bookworm [16:58:39] !log hnowlan@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2243 [16:58:45] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505382 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host wikikube-worker2243.codfw.wmnet with OS bookworm [16:58:56] !log hnowlan@cumin2002 START - Cookbook sre.dns.netbox [16:59:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T384592)', diff saved to https://phabricator.wikimedia.org/P72785 and previous config saved to /var/cache/conftool/dbconfig/20250129-165913-marostegui.json [16:59:18] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:59:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:59:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T384592)', diff saved to https://phabricator.wikimedia.org/P72786 and previous config saved to /var/cache/conftool/dbconfig/20250129-165935-marostegui.json [16:59:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1159 T384994', diff saved to https://phabricator.wikimedia.org/P72787 and previous config saved to /var/cache/conftool/dbconfig/20250129-165951-marostegui.json [16:59:57] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994 [17:00:03] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1159.eqiad.wmnet [17:02:46] (03CR) 10Ssingh: [C:03+1] "Ha yeah." 
[puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:03:46] !log hnowlan@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2243 - hnowlan@cumin2002" [17:03:51] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2243 - hnowlan@cumin2002" [17:03:51] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:03:52] !log hnowlan@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2243.codfw.wmnet 122.0.192.10.in-addr.arpa 2.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:03:55] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2243.codfw.wmnet 122.0.192.10.in-addr.arpa 2.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:03:56] !log hnowlan@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2243 [17:04:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:05:25] (03CR) 10Vgutierrez: [C:04-2] "to be merged tomorrow 2025-01-30" [puppet] - 10https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:05:51] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1159.eqiad.wmnet [17:06:26] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker2243 [17:06:26] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2243 [17:06:29] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: add comments around DB data safety [puppet] - 10https://gerrit.wikimedia.org/r/1115072 (owner: 10Federico Ceratto) [17:07:09] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Index rebuild [17:09:21] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384951#10505458 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated ps1 cable. [17:14:28] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2242.codfw.wmnet with reason: host reimage [17:17:07] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10505507 (10Papaul) [17:17:19] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission es1025.eqiad.wmnet - https://phabricator.wikimedia.org/T384912#10505509 (10Papaul) 05Open→03Resolved a:03Papaul Complete [17:17:21] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2242.codfw.wmnet with reason: host reimage [17:19:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:20:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2175 [17:20:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2175 [17:21:22] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:23:08] (03PS1) 10Jgiannelos: kartotherian: Fix dependency in service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115079 [17:23:19] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2243.codfw.wmnet with reason: host reimage [17:24:04] (03PS2) 10Jgiannelos: kartotherian: Fix dependency in service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115079 [17:24:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T384592)', diff saved to https://phabricator.wikimedia.org/P72789 and previous config saved to /var/cache/conftool/dbconfig/20250129-172455-marostegui.json [17:25:02] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:26:20] (03PS3) 10Scott French: shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) [17:26:20] (03PS3) 10Scott French: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) [17:26:22] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2243.codfw.wmnet with reason: host reimage [17:26:33] (03CR) 10Klausman: [V:03+2 C:03+2] amd-pytorch25: use ROCm 6.2 in torch 2.5.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115069 (https://phabricator.wikimedia.org/T384734) (owner: 10Ilias Sarantopoulos) [17:26:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10505577 (10VRiley-WMF) [17:27:30] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2186 [17:27:38] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2186 [17:28:24] (03CR) 10Btullis: [C:03+2] dumps: Re-enable the enwiki dumps on snapshot1012 [puppet] - 10https://gerrit.wikimedia.org/r/1114991 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [17:30:22] RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:32:55] (03CR) 10Elukey: [C:03+2] kartotherian: Fix dependency in service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115079 (owner: 10Jgiannelos) [17:34:17] (03PS1) 10Jgiannelos: kartotherian: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115082 [17:35:44] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10505596 (10Jhancock.wm) [17:35:47] (03CR) 10Elukey: [C:03+2] kartotherian: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115082 (owner: 10Jgiannelos) [17:36:45] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [17:36:48] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [17:36:57] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [17:37:00] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [17:37:22] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [17:37:25] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [17:37:25] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2242.codfw.wmnet with OS bookworm [17:37:28] !log jgiannelos@deploy2002 helmfile [staging] 
START helmfile.d/services/kartotherian: apply [17:37:31] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [17:37:37] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505607 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host wikikube-worker2242.codfw.wmnet with OS bookworm completed: - wikik... [17:37:43] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [17:38:07] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [17:38:39] (03CR) 10AOkoth: [C:03+2] miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:38:40] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [17:40:01] (03Merged) 10jenkins-bot: miscweb: remove kubectl cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115044 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:40:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P72790 and previous config saved to /var/cache/conftool/dbconfig/20250129-174003-marostegui.json [17:42:48] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:43:32] (03PS3) 10BCornwall: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [17:43:53] (03CR) 10BCornwall: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [17:45:55] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2243.codfw.wmnet with OS bookworm [17:46:12] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - 
https://phabricator.wikimedia.org/T354791#10505639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host wikikube-worker2243.codfw.wmnet with OS bookworm completed: - wikik... [17:46:27] (03PS1) 10BCornwall: varnish: Enable single_backend by default [puppet] - 10https://gerrit.wikimedia.org/r/1115086 [17:47:52] (03CR) 10Ssingh: [C:03+1] "Looks good! On perhaps an unrelated note, we don't do inbound TLS with ATS now so I was wondering if it would make sense to rename the che" [puppet] - 10https://gerrit.wikimedia.org/r/1099782 (owner: 10BCornwall) [17:48:03] !log homer 'lsw1-a5-codfw*' commit [17:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:27] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10505646 (10RobH) Remote hands 01020815 scheduled for 2025-02-04 @ 0800 Pacific (1600 GMT). [17:49:59] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10505661 (10Jhancock.wm) unproductive update. the level 3 helpdesk is still going over the files and the TSR report. Will update when i hear back from them. 
[17:50:36] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-fe105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10505664 (10Papaul) @MatthewVernon these are ms-be105[1-9].eqiad.wmnet or ms-fe105[1-9].eqiad.wmnet [17:50:45] !log hnowlan@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2242-2243].codfw.wmnet [17:50:48] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2242-2243].codfw.wmnet [17:50:58] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505666 (10ops-monitoring-bot) pool host wikikube-worker[2242-2243].codfw.wmnet by hnowlan@cumin2002 with reason: None [17:51:02] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505667 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by hnowlan@cumin2002 pool for host wikikube-worker[2242-2243].codfw.wmnet completed: - wik... 
[17:51:04] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T385078 (10hnowlan) 03NEW [17:52:57] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4894/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115086 (owner: 10BCornwall) [17:53:17] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:53:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:54:02] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:54:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: maintenance [17:55:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P72792 and previous config saved to /var/cache/conftool/dbconfig/20250129-175510-marostegui.json [17:59:31] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1800). 
[18:04:31] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:05:06] o/ [18:05:19] I'm holding for the moment while we're troubleshooting a separate issue [18:10:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T384592)', diff saved to https://phabricator.wikimedia.org/P72794 and previous config saved to /var/cache/conftool/dbconfig/20250129-181017-marostegui.json [18:10:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [18:10:23] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:10:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:10:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T384592)', diff saved to https://phabricator.wikimedia.org/P72795 and previous config saved to /var/cache/conftool/dbconfig/20250129-181037-marostegui.json [18:10:53] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10505737 (10hnowlan) [18:12:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72796 and previous config saved to /var/cache/conftool/dbconfig/20250129-181212-root.json [18:20:56] (03PS1) 10AOkoth: miscweb: update os-reports version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115092 
(https://phabricator.wikimedia.org/T350794) [18:23:59] (03CR) 10AOkoth: [C:03+2] miscweb: update os-reports version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115092 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:25:19] (03Merged) 10jenkins-bot: miscweb: update os-reports version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115092 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:26:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708#10505845 (10Papaul) Replaced the optic on the msw2 side [18:26:34] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:26:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [18:27:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72801 and previous config saved to /var/cache/conftool/dbconfig/20250129-182718-root.json [18:29:31] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:37:07] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391305 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:37:18] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:40:56] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1212 (T384592)', diff saved to https://phabricator.wikimedia.org/P72804 and previous config saved to /var/cache/conftool/dbconfig/20250129-184055-marostegui.json [18:41:00] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:42:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72805 and previous config saved to /var/cache/conftool/dbconfig/20250129-184223-root.json [18:43:50] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:44:57] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@5b0aeae]: Deploying latest DAGs to the analytics Airflow instance. T358375. [18:45:02] T358375: Declare wmf_content.mediawiki_content_history_v1 a production table - https://phabricator.wikimedia.org/T358375 [18:45:32] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@5b0aeae]: Deploying latest DAGs to the analytics Airflow instance. T358375. 
(duration: 00m 35s) [18:53:59] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:56:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P72809 and previous config saved to /var/cache/conftool/dbconfig/20250129-185602-marostegui.json [18:57:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72810 and previous config saved to /var/cache/conftool/dbconfig/20250129-185729-root.json [19:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T1900) [19:04:12] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:04:18] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:07:12] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:07:18] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:10] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:17] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [19:09:46] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [19:09:52] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[19:11:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P72811 and previous config saved to /var/cache/conftool/dbconfig/20250129-191108-marostegui.json [19:11:52] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:10] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:12:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72812 and previous config saved to /var/cache/conftool/dbconfig/20250129-191234-root.json [19:12:47] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:13:09] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [19:14:07] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [19:14:29] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [19:15:19] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [19:16:22] (03CR) 10Ottomata: "I have deployed eventgate-analytics-external in production!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115074 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [19:16:22] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ganeti1053 - vriley@cumin1002" [19:16:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ganeti1053 - vriley@cumin1002" [19:16:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:39] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:19:57] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:20:04] !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014'] [19:24:16] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4895/co" [puppet] - 10https://gerrit.wikimedia.org/r/1099782 (owner: 10BCornwall) [19:24:46] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T385096 (10phaultfinder) 03NEW [19:26:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T384592)', diff saved to https://phabricator.wikimedia.org/P72813 and previous config saved to /var/cache/conftool/dbconfig/20250129-192615-marostegui.json [19:26:21] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:26:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance [19:26:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72814 and previous config saved to 
/var/cache/conftool/dbconfig/20250129-192637-marostegui.json [19:32:07] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:34:05] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:34:17] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10506113 (10Neobeta61) What redfish API version are you running? [19:36:14] (03PS1) 10CDanis: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) [19:36:41] (03PS1) 10CDanis: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115099 (https://phabricator.wikimedia.org/T385055) [19:39:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708#10506123 (10cmooney) >>! In T384708#10505845, @Papaul wrote: > Replaced the optic on the msw2 side Cool, looks ok so far but will... 
[19:42:34] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:43:51] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:44:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:47:35] (03CR) 10Catrope: [C:03+2] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: 10CDanis) [19:47:39] (03CR) 10Catrope: [C:03+2] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115099 (https://phabricator.wikimedia.org/T385055) (owner: 10CDanis) [19:47:52] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-fe1014'] [19:48:31] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [19:48:50] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [19:49:06] !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014'] [19:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:54:49] (03CR) 10BCornwall: [V:03+1 C:03+2] icinga: Remove unused check_ssl_unified config [puppet] 
- 10https://gerrit.wikimedia.org/r/1099782 (owner: 10BCornwall) [19:54:58] !log bking@apt1002 publish new opensearch_1.3.20 pkg to thirdparty/opensearch1 [19:54:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72815 and previous config saved to /var/cache/conftool/dbconfig/20250129-195459-marostegui.json [19:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:44] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:56:32] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-fe1014'] [19:56:33] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:58:18] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:59:23] (03PS1) 10Bartosz Dziewoński: Add 'auth' docroot with custom files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) [19:59:50] (03PS1) 10Bartosz Dziewoński: Add 'auth' docroot with custom files (beta) [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) [20:00:55] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:03:29] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:03:29] (03PS1) 10D3r1ck01: SUL3: Allow temp users to authenticate (login/signup) via the API [extensions/CentralAuth] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115106 (https://phabricator.wikimedia.org/T384523) [20:04:49] (03CR) 10CI reject: 
[V:04-1] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: 10CDanis) [20:10:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P72816 and previous config saved to /var/cache/conftool/dbconfig/20250129-201006-marostegui.json [20:13:54] !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014'] [20:14:51] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:16:57] !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:18:51] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#10506235 (10Dzahn) 05Open→03Stall... 
[20:21:01] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:24:56] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1250 - vriley@cumin1002" [20:25:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1250 - vriley@cumin1002" [20:25:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:25:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P72817 and previous config saved to /var/cache/conftool/dbconfig/20250129-202513-marostegui.json [20:25:17] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:27:12] (03CR) 10Catrope: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: 10CDanis) [20:27:16] (03CR) 10Catrope: [C:03+2] resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: 10CDanis) [20:27:32] (03Merged) 10jenkins-bot: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115099 (https://phabricator.wikimedia.org/T385055) (owner: 10CDanis) [20:27:37] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:29:31] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:56] !log vriley@cumin1002 START - 
Cookbook sre.hosts.provision for host db1250.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:31:15] (03PS1) 10Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [20:32:19] !log pt1979@cumin1002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-fe1014'] [20:32:22] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1251.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:32:27] !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1014'] [20:32:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:36:01] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10506260 (10cmooney) The above patch I believe will do what we need. Needs some testing I will work with dc-ops... [20:36:07] (03CR) 10Ottomata: [C:04-1] "Let's wait until DPE is back from an offsite before this is deployed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1114798 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [20:38:06] (03PS2) 10Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [20:38:28] (03Merged) 10jenkins-bot: resourceloader: Fix hash computation for virtual files with versionFilePath [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115098 (https://phabricator.wikimedia.org/T385055) (owner: 10CDanis) [20:38:38] (03CR) 10RLazarus: [C:03+1] Add restricted users to deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [20:40:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72818 and previous config saved to /var/cache/conftool/dbconfig/20250129-204020-marostegui.json [20:40:25] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [20:40:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [20:42:14] (03PS1) 10Ottomata: mediawiki.org/beacon/event - don't raise error on failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115111 (https://phabricator.wikimedia.org/T383939) [20:44:35] !log andrew@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirt2006-dev.codfw.wmnet [20:48:40] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10506296 (10Dzahn) We need a follow-up task to _acutally start using_ this new server and failover gerrit to it. 
[20:51:11] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt2006-dev.codfw.wmnet [20:51:30] (03PS2) 10BCornwall: varnish: Fix claim obj.hits isn't known in vcl_hit [puppet] - 10https://gerrit.wikimedia.org/r/1113591 (https://phabricator.wikimedia.org/T378737) [20:51:55] (03PS7) 10BCornwall: varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) [20:52:22] (03CR) 10BCornwall: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:55:04] (03CR) 10Effie Mouzeli: [C:03+1] shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [20:55:17] (03CR) 10Effie Mouzeli: [C:03+1] shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [20:58:45] !log pt1979@cumin1002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-fe1014'] [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2100). [21:00:04] No Gerrit patches in the queue for this window AFAICS. 
[21:07:19] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1250.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:09:37] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250'] [21:10:34] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1250'] [21:10:41] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250'] [21:11:04] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1250'] [21:11:40] (03PS1) 10Fabfur: benthos: send data to eventgate too [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) [21:11:59] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250'] [21:12:01] (03CR) 10CI reject: [V:04-1] benthos: send data to eventgate too [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [21:12:22] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1250'] [21:13:27] I'm finally going to deploy the UBN fixes now [21:13:59] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1250'] [21:14:50] (03PS2) 10Fabfur: benthos: send data to eventgate too [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) [21:15:06] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db1250'] [21:15:12] (03CR) 10CI reject: [V:04-1] benthos: send data to eventgate too [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [21:17:32] (03CR) 
10Ottomata: [C:03+1] "Nice. Its preferred if producers can do all of this:" [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [21:17:49] (03PS3) 10Fabfur: benthos: send data to eventgate too [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) [21:21:21] (03CR) 10Fabfur: "👍" [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [21:33:32] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10506471 (10RKemper) [21:33:40] RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [21:33:43] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10506472 (10RKemper) Racking details are up. Working on the puppet patches today. [21:33:58] RECOVERY - SSH on ms-fe1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:33:58] RECOVERY - Memcached on ms-fe1014 is OK: TCP OK - 0.021 second response time on 10.64.134.13 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [21:33:58] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [21:33:58] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Swift [21:35:03] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1115099|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]], [[gerrit:1115098|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]] [21:35:08] T385055: Search disappearing on focus (t.useId is not a function) - 
https://phabricator.wikimedia.org/T385055 [21:38:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10506500 (10VRiley-WMF) [21:38:32] PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100% [21:40:06] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1250.eqiad.wmnet with OS bookworm [21:40:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10506503 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1250.eqiad.wmnet with OS bookworm [21:42:00] RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:42:07] !log catrope@deploy2002 cdanis, catrope: Backport for [[gerrit:1115099|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]], [[gerrit:1115098|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:42:12] T385055: Search disappearing on focus (t.useId is not a function) - https://phabricator.wikimedia.org/T385055 [21:45:58] !log catrope@deploy2002 cdanis, catrope: Continuing with sync [21:48:18] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1251.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:48:32] PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:08] RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [21:50:38] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1251.eqiad.wmnet with OS bookworm [21:50:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10506574 (10Papaul) still waiting for the part. 
[21:50:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10506575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1251.eqiad.wmnet with OS bookworm [21:51:10] RECOVERY - MD RAID on ms-fe1014 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:52:33] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115099|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]], [[gerrit:1115098|resourceloader: Fix hash computation for virtual files with versionFilePath (T385055)]] (duration: 17m 29s) [21:52:34] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10506579 (10Papaul) upgrade BIOS and IDRAC on the server, Server is back up, I will leave the task open for now to see if we do have the same error again . 
[21:52:38] T385055: Search disappearing on focus (t.useId is not a function) - https://phabricator.wikimedia.org/T385055 [21:55:11] FYI this scap run did print an error: [21:55:12] 21:52:27 sudo -u mwdeploy -n -- /usr/bin/rsync -l deployment.codfw.wmnet::common/wikiversions*.{json,php} /srv/mediawiki (ran as mwdeploy@mw2410.codfw.wmnet) returned [255]: ssh: Could not resolve hostname mw2410.codfw.wmnet: Name or service not known [21:55:49] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1250.eqiad.wmnet with reason: host reimage [21:58:35] RoanKattouw: Yeah, pretty sure you can mostly ignore that as the host was renamed [21:58:40] Sounds like some list somewhere isn't in sync though [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2200) [22:00:17] hm yeah that was just renamed earlier today, https://phabricator.wikimedia.org/T354791#10505334 [22:00:40] The other host renamed at roughly the same time hasn't seemingly given an error [22:02:16] aha it's still listed as a scap proxy, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/%2B/refs/heads/production/hieradata/common/scap/dsh.yaml#6 [22:02:22] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1250.eqiad.wmnet with reason: host reimage [22:02:48] heh [22:02:54] That'd probably explain it [22:03:01] And that it wasn't blocking in any way [22:04:56] I see h.nowlan also has https://phabricator.wikimedia.org/T384196 and https://gerrit.wikimedia.org/r/1112714 so if it's not hurting anybody I'm inclined to let him know, but leave it until he can look at it tomorrow [22:06:15] Just worth probably leaving a comment (on the task?) 
to point out that if we're not removing the rest just yet, we should at least remove the one that's erroring [22:06:20] to stop people repeatedly reporting it [22:06:43] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1251.eqiad.wmnet with reason: host reimage [22:06:46] * Reedy does that [22:07:05] Unfortunately it causes scap backport to exit with a nonzero exit status [22:07:20] but otherwise completed/finished? [22:07:27] So it's probably fine, everything works and the deploy gets logged, it's just the exit status at the very end [22:07:43] A little confusing for the deployer but not terrible [22:07:54] it's a good job we've got a human running it not an AI ;) [22:08:44] yeah, sorry for the confusion [22:09:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:10:26] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1251.eqiad.wmnet with reason: host reimage [22:10:45] separately I have an apache config change to deploy if the current window isn't in use -- RoanKattouw let me know if you're finished, no rush [22:11:02] Yeah I'm done, go ahead [22:11:07] thanks! 
[22:11:27] (03PS4) 10RLazarus: mediawiki: Restrict /wiki RewriteRule [puppet] - 10https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) [22:13:01] (03CR) 10Scott French: [C:03+1] mediawiki: Restrict /wiki RewriteRule [puppet] - 10https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: 10RLazarus) [22:14:25] (03CR) 10RLazarus: [C:03+2] mediawiki: Restrict /wiki RewriteRule [puppet] - 10https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: 10RLazarus) [22:15:06] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [22:19:13] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:20:37] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic1108-elastic1119 - https://phabricator.wikimedia.org/T384966#10506693 (10RKemper) [22:21:22] (03PS1) 10RLazarus: mediawiki: Restrict /wiki RewriteRule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) [22:23:01] (03PS1) 10Ryan Kemper: elastic: 15 refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) [22:23:03] (03PS2) 10RLazarus: mediawiki: Restrict /wiki RewriteRule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) [22:23:14] (03PS2) 10Bartosz Dziewoński: Add 'auth' docroot with custom files (beta) [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) [22:27:13] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) (owner: 10RLazarus) [22:27:20] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:27:43] (03CR) 10RLazarus: [C:03+2] mediawiki: Restrict /wiki RewriteRule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) (owner: 10RLazarus) [22:27:44] (03PS1) 10BCornwall: Varnish: Upgrade test container to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1115123 [22:28:42] (03CR) 10BCornwall: "FWIW:" [puppet] - 10https://gerrit.wikimedia.org/r/1115123 (owner: 10BCornwall) [22:28:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:29:09] ^ known, fixing with that chart update [22:29:52] (03Merged) 10jenkins-bot: mediawiki: Restrict /wiki RewriteRule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115121 (https://phabricator.wikimedia.org/T357595) (owner: 10RLazarus) [22:31:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:31:25] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10506736 (10RKemper) [22:32:07] FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:48] 
PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:33:02] (03CR) 10Bartosz Dziewoński: "Cherry-picked this on the beta cluster, seems to work, I'm a bit surprised I got it right the first time. I'm not sure if I should make th" [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [22:33:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:33:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:36:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:42:43] (03PS2) 10Ryan Kemper: elastic: 15 refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) [22:42:57] (03PS1) 10Cwhite: puppetmaster: remove use of deprecated method in logstash.rb [puppet] - 10https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058) [22:43:17] !log rzl@deploy2002 Started scap sync-world: T357595 [22:43:22] T357595: Investigate restricting match pattern on /wiki RewriteRule - https://phabricator.wikimedia.org/T357595 [22:43:36] (03CR) 10CI reject: [V:04-1] puppetmaster: remove use of deprecated method in logstash.rb [puppet] - 10https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058) 
(owner: Cwhite) [22:43:45] (PS3) Bartosz Dziewoński: Add 'auth' docroot with custom files (beta) [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) [22:44:33] (PS2) Cwhite: puppetmaster: remove use of deprecated method in logstash.rb [puppet] - https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058) [22:44:43] (CR) Bking: [C:+1] elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: Ryan Kemper) [22:44:59] ops-eqiad, Data-Platform-SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10506754 (RKemper) Open→In progress a:RKemper→None [22:45:00] (CR) CI reject: [V:-1] elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: Ryan Kemper) [22:46:18] !log rzl@deploy2002 rzl: T357595 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:46:48] (PS3) Ryan Kemper: elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) [22:47:00] ops-eqiad, Data-Platform-SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10506774 (RKemper) Okay, I think our work here is done so we have removed ourselves as assignees. Wasn't sure whether task status should be `Open` or... [22:49:05] !log rzl@deploy2002 rzl: Continuing with sync [22:50:23] (PS4) Bartosz Dziewoński: Add 'auth' docroot with custom files [puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) [22:50:57] (CR) Bartosz Dziewoński: "I added the prod changes too, hope they work. Let me know if I should split them to a separate patch." 
[puppet] - https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński) [22:51:05] (CR) Bking: [C:+1] elastic: 15 refresh hosts [puppet] - https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: Ryan Kemper) [22:54:43] !log rzl@deploy2002 Finished scap sync-world: T357595 (duration: 11m 57s) [22:54:48] T357595: Investigate restricting match pattern on /wiki RewriteRule - https://phabricator.wikimedia.org/T357595 [22:55:21] \o/ [22:55:31] those httpbb alerts will self-resolve [22:55:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113476 (https://phabricator.wikimedia.org/T383916) (owner: Bartosz Dziewoński) [22:56:31] I'm through deploying, and I think swfrench-wmf is up next if the 22:00 window is unused today [22:56:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński) [22:56:46] *23:00 [22:57:12] rzl: thanks! 
[22:57:21] jouncebot: nowandnext [22:57:21] For the next 0 hour(s) and 2 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2200) [22:57:21] In 0 hour(s) and 2 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2300) [22:58:06] I'll give the web team a few minutes to convene for a deployment before proceeding [22:58:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:50] my change does not require a mediawiki deployment, but would be preferable to isolate from other changes, if possibe [22:58:55] *possible [22:59:24] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250129T2300) [23:02:14] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:05:00] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095 [23:06:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095 [23:09:07] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2186 [23:09:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2186 [23:11:28] !log cmooney@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host db1251 [23:12:22] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:13:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:14:02] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:14:18] seems quiet, so I'm going to move ahead with my change shortly [23:14:28] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:14:40] (CR) Scott French: [C:+2] shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [23:15:42] (Merged) jenkins-bot: shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [23:15:52] (PS3) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [23:17:25] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:21] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1251 [23:20:55] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [23:21:45] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/shellbox-video: apply [23:22:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:24:42] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:24:51] hmm, gerrit seems actually-down [23:25:24] hmmm ... that's not good =/ [23:25:33] SRE-tools, Infrastructure-Foundations: Support creating phab tasks in wmflib.phabricator - https://phabricator.wikimedia.org/T366470#10506845 (Aklapper) > Unfortunately wmflib currently only supports creating comments. I guess this is about expanding the `transactions` handling for the `self._client.man... 
[23:25:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:26:02] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:26:28] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:28:49] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1251 [23:28:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:29:00] (CR) CI reject: [V:-1] Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: Cathal Mooney) [23:29:11] back now, poking around a little [23:29:31] RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:29:42] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:30:20] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - 
https://phabricator.wikimedia.org/T379549#10506853 (cmooney) As a test I ran this for an existing host that had been configured with the current live co... [23:30:33] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1251 [23:31:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:31:56] (PS4) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [23:32:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:32:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:33:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:33:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:34:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:36:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:36:53] (CR) Scott French: [C:+2] shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [23:37:01] (CR) CI reject: [V:-1] shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [23:37:25] (PS4) Scott French: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) [23:37:51] (PS5) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [23:38:08] (CR) Scott French: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [23:39:56] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1251 [23:40:30] (Merged) jenkins-bot: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [23:41:20] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1251 [23:43:15] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host backup1010 [23:43:38] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1010 
[23:43:53] (CR) CI reject: [V:-1] Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: Cathal Mooney) [23:44:46] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [23:45:32] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [23:50:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [23:50:56] (PS6) Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [23:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:52:20] (CR) Raymond Ndibe: "Yeaa I did. There are no backwards incompatible change as far as I know of" [puppet] - https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) (owner: Raymond Ndibe) [23:53:12] (CR) Raymond Ndibe: "Yes I've already tested this on toolseta-harbor-1 node. This is currently running on that node right now." 
[puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: Raymond Ndibe) [23:53:53] (PS1) Cathal Mooney: Class-of-service: don't insert comment with host name under cos/ints [homer/public] - https://gerrit.wikimedia.org/r/1115134 (https://phabricator.wikimedia.org/T379549) [23:54:09] (CR) Raymond Ndibe: "I think the next step is to announce that toolforge will be down for maybe 1hr for maintenance. Will use that window to perform the upgrad" [puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: Raymond Ndibe) [23:59:28] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:59:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down