[00:05:28] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:27] 06SRE, 10Domains, 06Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10439344 (10Dzahn) Looks like as the next step we need to add the 2 domain names to the wikimedia.com Letsencrypt TLS certificate. [00:12:07] (03PS1) 10Dzahn: certificates: add wiki[m|p]edia.ro to ncredir Letsencrypt cert 1 [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) [00:15:28] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439379 (10phaultfinder) [00:31:28] (03PS1) 10Zabe: snapshot: Remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/1108861 (https://phabricator.wikimedia.org/T378260) [00:31:48] (03CR) 10CI reject: [V:04-1] snapshot: Remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/1108861 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [00:33:02] (03PS2) 10Zabe: snapshot: Remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/1108861 (https://phabricator.wikimedia.org/T378260) [00:38:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1108862 [00:38:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1108862 (owner: 10TrainBranchBot) 
[00:55:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1108862 (owner: 10TrainBranchBot) [01:08:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1108866 [01:08:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1108866 (owner: 10TrainBranchBot) [01:09:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439444 (10phaultfinder) [01:29:27] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1108866 (owner: 10TrainBranchBot) [01:54:47] (03CR) 10Diskdance: varnish: Hide X-Client-IP on error page by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall) [02:14:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:13:50] (03PS1) 10Scott French: mw-videoscaler: enable access to logging 
cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108874 (https://phabricator.wikimedia.org/T382517) [04:15:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:51] (03PS1) 10Vgutierrez: liberica: Use libericad instead of liberica binary [puppet] - 10https://gerrit.wikimedia.org/r/1108875 [05:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439677 (10phaultfinder) [05:36:56] (03CR) 10Vgutierrez: [C:03+1] trafficserver: validate production config in tests [puppet] - 10https://gerrit.wikimedia.org/r/1101104 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [06:12:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 5%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71828 and previous config saved to /var/cache/conftool/dbconfig/20250108-061207-root.json [06:14:28] (03PS1) 10Marostegui: instances.yaml: Add es1041 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108878 (https://phabricator.wikimedia.org/T382569) [06:14:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:16:06] (03PS1) 10Marostegui: db_maint_mapper_sal.py: Update list of nicks [software] - 10https://gerrit.wikimedia.org/r/1108879 [06:17:43] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es1041 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108878 
(https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [06:19:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Switchover es4 eqiad master dbmaint T382569', diff saved to https://phabricator.wikimedia.org/P71829 and previous config saved to /var/cache/conftool/dbconfig/20250108-061914-marostegui.json [06:19:18] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [06:19:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1021 T382569', diff saved to https://phabricator.wikimedia.org/P71830 and previous config saved to /var/cache/conftool/dbconfig/20250108-061928-marostegui.json [06:21:06] (03PS1) 10Marostegui: mariadb: Productionize es1042 [puppet] - 10https://gerrit.wikimedia.org/r/1108880 (https://phabricator.wikimedia.org/T382569) [06:21:58] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10439707 (10Strainu) Awesome news, thank you very much! What are the next steps to redirect them to relevant wikis? Do we need a community consultation? 
[06:22:34] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1042 [puppet] - 10https://gerrit.wikimedia.org/r/1108880 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [06:24:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1041 to dbctl depooled T382569', diff saved to https://phabricator.wikimedia.org/P71831 and previous config saved to /var/cache/conftool/dbconfig/20250108-062447-marostegui.json [06:24:57] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [06:25:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1021.eqiad.wmnet with reason: cloning es1042 [06:25:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1021.eqiad.wmnet with reason: cloning es1042 [06:27:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 10%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71832 and previous config saved to /var/cache/conftool/dbconfig/20250108-062712-root.json [06:31:50] (03CR) 10Marostegui: "Is there any mechanism to avoid depooling ALL sections?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [06:34:42] (03PS1) 10Marostegui: dbproxy1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108881 (https://phabricator.wikimedia.org/T383033) [06:36:13] (03CR) 10Marostegui: [C:03+2] dbproxy1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108881 (https://phabricator.wikimedia.org/T383033) (owner: 10Marostegui) [06:40:28] (03PS1) 10Marostegui: production-m5.sql.erb: Replace dbproxy1021 with dbproxy1029 [puppet] - 10https://gerrit.wikimedia.org/r/1108882 (https://phabricator.wikimedia.org/T383033) [06:42:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71833 and previous config saved to /var/cache/conftool/dbconfig/20250108-064217-root.json [06:42:45] (03CR) 10Marostegui: [C:03+2] production-m5.sql.erb: Replace dbproxy1021 with dbproxy1029 [puppet] - 10https://gerrit.wikimedia.org/r/1108882 (https://phabricator.wikimedia.org/T383033) (owner: 10Marostegui) [06:44:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbproxy1021.eqiad.wmnet [06:46:19] (03PS1) 10Marostegui: mariadb: Remove dbproxy1021 [puppet] - 10https://gerrit.wikimedia.org/r/1108883 (https://phabricator.wikimedia.org/T383033) [06:47:03] (03CR) 10Marostegui: [C:03+2] mariadb: Remove dbproxy1021 [puppet] - 10https://gerrit.wikimedia.org/r/1108883 (https://phabricator.wikimedia.org/T383033) (owner: 10Marostegui) [06:49:01] (03PS1) 10Marostegui: report_users: Remove dbproxy1021 [software] - 10https://gerrit.wikimedia.org/r/1108925 (https://phabricator.wikimedia.org/T383033) [06:50:01] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [06:50:09] (03CR) 10Marostegui: [C:03+2] report_users: Remove dbproxy1021 [software] - 10https://gerrit.wikimedia.org/r/1108925 
(https://phabricator.wikimedia.org/T383033) (owner: 10Marostegui) [06:50:37] (03Merged) 10jenkins-bot: report_users: Remove dbproxy1021 [software] - 10https://gerrit.wikimedia.org/r/1108925 (https://phabricator.wikimedia.org/T383033) (owner: 10Marostegui) [06:53:26] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [06:53:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [06:53:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:53:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1021.eqiad.wmnet [06:53:54] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission dbproxy1021.eqiad.wmnet - https://phabricator.wikimedia.org/T383033#10439736 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `dbproxy1021.eqiad.wmnet` - dbproxy1021.eqiad.wmnet (**PASS**... 
[06:53:59] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission dbproxy1021.eqiad.wmnet - https://phabricator.wikimedia.org/T383033#10439737 (10Marostegui) a:05Marostegui→03None [06:54:08] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission dbproxy1021.eqiad.wmnet - https://phabricator.wikimedia.org/T383033#10439742 (10Marostegui) This is ready for #dc-ops [06:56:15] (03PS1) 10Marostegui: installserver: Do not format es1041 [puppet] - 10https://gerrit.wikimedia.org/r/1108971 (https://phabricator.wikimedia.org/T382569) [06:57:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71834 and previous config saved to /var/cache/conftool/dbconfig/20250108-065723-root.json [06:58:58] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1041 [puppet] - 10https://gerrit.wikimedia.org/r/1108971 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T0700) [07:12:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71836 and previous config saved to /var/cache/conftool/dbconfig/20250108-071228-root.json [07:12:38] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:13:30] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:14:26] PROBLEM - Disk space on seaborgium is CRITICAL: DISK CRITICAL - free space: / 468 MB (2% inode=92%): /tmp 468 MB (2% inode=92%): /var/tmp 468 MB (2% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops [07:27:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: Repooling after recloning', diff saved to https://phabricator.wikimedia.org/P71837 and previous config saved to /var/cache/conftool/dbconfig/20250108-072733-root.json [07:36:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2126 T373579', diff saved to https://phabricator.wikimedia.org/P71838 and previous config saved to /var/cache/conftool/dbconfig/20250108-073603-marostegui.json [07:36:07] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [07:37:54] (03PS1) 10Marostegui: mariadb: Productionize db2226 [puppet] - 10https://gerrit.wikimedia.org/r/1109020 (https://phabricator.wikimedia.org/T373579) [07:38:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: cloning [07:38:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: cloning [07:40:07] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2226 [puppet] - 10https://gerrit.wikimedia.org/r/1109020 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [07:41:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: cloning [07:41:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: cloning [07:44:37] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2024-2025].codfw.wmnet [07:45:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2024-2025].codfw.wmnet [07:46:16] (03PS1) 10Marostegui: instances.yaml: Add db2226 [puppet] - 
10https://gerrit.wikimedia.org/r/1109023 [07:47:01] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2024.codfw.wmnet with OS bookworm [07:47:03] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2025.codfw.wmnet with OS bookworm [07:47:13] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2226 [puppet] - 10https://gerrit.wikimedia.org/r/1109023 (owner: 10Marostegui) [07:47:22] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2025 [07:47:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2025 [07:47:28] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2024 [07:47:40] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [07:48:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2226 depooled', diff saved to https://phabricator.wikimedia.org/P71839 and previous config saved to /var/cache/conftool/dbconfig/20250108-074856-marostegui.json [07:49:06] !log root@cumin1002 START - Cookbook sre.mysql.clone of db2126.codfw.wmnet onto db2226.codfw.wmnet [07:50:19] !log truncate /var/log/debug on seaborgium to unblock some disk space [07:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:06] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:51:09] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2024 - jelto@cumin1002" [07:51:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2024 - jelto@cumin1002" [07:51:13] 
!log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:51:13] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2024.codfw.wmnet 214.32.192.10.in-addr.arpa 4.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:51:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2024.codfw.wmnet 214.32.192.10.in-addr.arpa 4.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:51:17] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2024 [07:51:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2024 [07:51:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2024 [07:52:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:52:54] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:54:26] RECOVERY - Disk space on seaborgium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=seaborgium&var-datasource=eqiad+prometheus/ops [07:54:36] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:56:44] (03PS1) 10Muehlenhoff: Record LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1109025 [07:57:31] (03CR) 
10Muehlenhoff: [C:03+2] Record LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1109025 (owner: 10Muehlenhoff) [08:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:01:54] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377#10439814 (10MoritzMuehlenhoff) 05In progress→03Resolved Access has been granted via Wikimedia IDM, resolving the task. [08:02:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:11] (03PS1) 10Marostegui: es1041: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109027 [08:07:28] (03CR) 10Marostegui: "Host green in icinga" [puppet] - 10https://gerrit.wikimedia.org/r/1109027 (owner: 10Marostegui) [08:07:42] (03CR) 10Marostegui: [C:03+2] es1041: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109027 (owner: 10Marostegui) [08:08:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71840 and previous config saved to /var/cache/conftool/dbconfig/20250108-080828-root.json [08:10:51] jouncebot: now and next [08:10:51] For the next 0 hour(s) and 49 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T0800) [08:12:42] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: deploy instances from a single configuration [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 
10Filippo Giunchedi) [08:15:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:17:49] (03CR) 10Elukey: maps::osm_master: Inline osm class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108773 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:19:24] (03CR) 10Elukey: "LGTM! Are you bumping to debhelper 12 or 13 though? From the commit msg I see 12 but control says = 13." [debs/osmborder] - 10https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:19:59] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1267-1269].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [08:21:17] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T382871#10439833 (10Gehel) p:05Triage→03High [08:21:20] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10439836 (10Gehel) p:05Triage→03High [08:21:38] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1267.eqiad.wmnet with OS bookworm [08:23:13] 07sre-alert-triage, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10439840 (10Gehel) [08:23:14] 07sre-alert-triage, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T382871#10439842 (10Gehel) [08:23:24] !log destroy puppet cert 
for cloudelastic1011.eqiad.wmnet on puppetmaster1001 (cruft from old/wrong reimages) [08:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71841 and previous config saved to /var/cache/conftool/dbconfig/20250108-082333-root.json [08:24:52] (03CR) 10Muehlenhoff: [C:03+2] Remove access for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1108810 (owner: 10Muehlenhoff) [08:26:31] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Aitolkyn out of all services on: 2309 hosts [08:26:32] (03PS1) 10Marostegui: instances.yaml: Remove es2024 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1109028 (https://phabricator.wikimedia.org/T383028) [08:27:02] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es2024 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1109028 (https://phabricator.wikimedia.org/T383028) (owner: 10Marostegui) [08:27:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Aitolkyn out of all services on: 2309 hosts [08:28:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2024 from dbctl T383028', diff saved to https://phabricator.wikimedia.org/P71842 and previous config saved to /var/cache/conftool/dbconfig/20250108-082807-marostegui.json [08:28:10] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:28:11] T383028: decommission es2024.codfw.wmnet - https://phabricator.wikimedia.org/T383028 [08:30:10] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10439878 (10Gehel) [08:30:55] 
(03PS4) 10Filippo Giunchedi: prometheus: migrate ops instance to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) [08:32:36] (03PS1) 10Marostegui: mariadb: Remove es2024 [puppet] - 10https://gerrit.wikimedia.org/r/1109029 (https://phabricator.wikimedia.org/T383028) [08:32:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2024.codfw.wmnet [08:33:25] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4747/console" [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [08:33:27] (03CR) 10Marostegui: [C:03+2] mariadb: Remove es2024 [puppet] - 10https://gerrit.wikimedia.org/r/1109029 (https://phabricator.wikimedia.org/T383028) (owner: 10Marostegui) [08:33:28] (03PS2) 10Muehlenhoff: osmborder: Build for Bookworm and bump debhelper compat to 13 [debs/osmborder] - 10https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565) [08:33:50] (03CR) 10Muehlenhoff: "Ah, yes. That was a typo, commit message has been updated." [debs/osmborder] - 10https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:34:26] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10439908 (10elukey) 05Open→03Resolved a:03elukey I think that we can declare this task completed, we are s... 
[08:36:05] !log kill hanging processes on stat1011 to allow puppet to properly clean up absented users [08:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:11] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:36:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [08:36:32] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10439924 (10ops-monitoring-bot) Draining ganeti2027.codfw.wmnet of running VMs [08:37:43] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [08:37:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [08:38:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [08:38:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10439933 (10ops-monitoring-bot) Draining ganeti2027.codfw.wmnet of running VMs [08:38:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71843 and previous config saved to /var/cache/conftool/dbconfig/20250108-083838-root.json [08:41:23] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2024.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [08:41:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
es2024.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [08:41:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:41:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2024.codfw.wmnet [08:42:13] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1267.eqiad.wmnet with reason: host reimage [08:44:01] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10439961 (10Gehel) [08:44:23] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2024.codfw.wmnet - https://phabricator.wikimedia.org/T383028#10439963 (10Marostegui) a:05Marostegui→03None [08:44:45] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2024.codfw.wmnet - https://phabricator.wikimedia.org/T383028#10439968 (10Marostegui) This is ready for #dc-ops [08:44:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1267.eqiad.wmnet with reason: host reimage [08:45:03] (03CR) 10Elukey: [C:03+1] "Nit in the changelog, please go ahead with merging after that!" [debs/osmborder] - 10https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:45:37] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10439976 (10WMDE-leszek) > I think we might need some coordination how to handle WMDE staff offboarding in... 
[08:45:58] (03PS3) 10Muehlenhoff: osmborder: Build for Bookworm and bump debhelper compat to 13 [debs/osmborder] - 10https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565) [08:46:12] (03CR) 10Filippo Giunchedi: [V:03+1] "The patch moves 'ops' instance to prometheus::instances configuration, essentially a no-op per PCC https://puppet-compiler.wmflabs.org/out" [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [08:46:16] (03CR) 10Muehlenhoff: osmborder: Build for Bookworm and bump debhelper compat to 13 (031 comment) [debs/osmborder] - 10https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:46:21] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] osmborder: Build for Bookworm and bump debhelper compat to 13 [debs/osmborder] - 10https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:53:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71844 and previous config saved to /var/cache/conftool/dbconfig/20250108-085344-root.json [09:00:21] (03CR) 10David Caro: [C:03+2] "NP, we should have a nicer way of not missing it" [puppet] - 10https://gerrit.wikimedia.org/r/1108798 (https://phabricator.wikimedia.org/T383114) (owner: 10David Caro) [09:04:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1267.eqiad.wmnet with OS bookworm [09:05:04] (03PS5) 10Filippo Giunchedi: WIP: prometheus: k8s instances migration [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) [09:05:04] (03CR) 10Filippo Giunchedi: [V:03+1] "The patch moves k8s/prometheus configuration to prometheus::instances, and it works as-is." 
[puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:05:45] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1268.eqiad.wmnet with OS bookworm [09:08:04] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2025.codfw.wmnet with OS bookworm [09:08:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71845 and previous config saved to /var/cache/conftool/dbconfig/20250108-090849-root.json [09:09:14] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: add ttl option to statsd-exporter, set to 30d [puppet] - 10https://gerrit.wikimedia.org/r/1105971 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [09:10:15] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2025.codfw.wmnet with OS bookworm [09:10:18] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2025 [09:10:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2025 [09:12:20] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2024.codfw.wmnet with OS bookworm [09:13:03] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2024.codfw.wmnet with OS bookworm [09:13:06] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2024 [09:13:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2024 [09:23:39] (03PS1) 10JMeybohm: Rename kubernetes[1059-1062] as wikikube-worker[1084-1087] [puppet] - 10https://gerrit.wikimedia.org/r/1109030 (https://phabricator.wikimedia.org/T377876) [09:23:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 10%: Repooling for the first time', 
diff saved to https://phabricator.wikimedia.org/P71846 and previous config saved to /var/cache/conftool/dbconfig/20250108-092354-root.json [09:25:11] (03CR) 10Jelto: [C:03+1] "lgtm 🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1109030 (https://phabricator.wikimedia.org/T377876) (owner: 10JMeybohm) [09:25:40] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1268.eqiad.wmnet with reason: host reimage [09:27:12] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage [09:28:25] (03PS2) 10TChin: mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) [09:29:05] (03CR) 10TChin: mw-content-history-reconcile-enrich: Enable K8 HA (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [09:30:48] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2024.codfw.wmnet with reason: host reimage [09:30:48] (03CR) 10David Caro: "I think this broke prometheus on cloud xd" [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:31:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1268.eqiad.wmnet with reason: host reimage [09:34:14] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 93324624 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:35:18] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7064 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:35:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on wikikube-worker2024.codfw.wmnet with reason: host reimage [09:38:58] (03PS1) 10Filippo Giunchedi: cloud: fix tools prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1109031 (https://phabricator.wikimedia.org/T371087) [09:39:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71847 and previous config saved to /var/cache/conftool/dbconfig/20250108-093900-root.json [09:39:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2025.codfw.wmnet with reason: host reimage [09:40:04] (03PS1) 10Muehlenhoff: Switch ganeti2027 to nftables for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/1109032 (https://phabricator.wikimedia.org/T382508) [09:41:27] (03CR) 10Clément Goubert: [C:03+1] mw-videoscaler: enable access to logging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108874 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [09:42:02] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2126.codfw.wmnet onto db2226.codfw.wmnet [09:49:09] (03PS2) 10David Caro: cloud: fix tools prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1109031 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:50:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1268.eqiad.wmnet with OS bookworm [09:52:29] (03CR) 10David Caro: "Tested in tools:" [puppet] - 10https://gerrit.wikimedia.org/r/1109031 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:52:32] (03CR) 10David Caro: [C:03+2] cloud: fix tools prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1109031 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:52:47] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1269.eqiad.wmnet with OS bookworm [09:54:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71848 and previous config saved to /var/cache/conftool/dbconfig/20250108-095405-root.json [09:55:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2024.codfw.wmnet with OS bookworm [09:59:20] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2025.codfw.wmnet with OS bookworm [10:00:11] !log sudo homer 'lsw1-c6-codfw*' commit 'T377877' [10:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:14] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [10:00:28] !log sudo homer 'cr*codfw*' commit 'T377877' [10:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:31] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2025.codfw.wmnet [10:02:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2025.codfw.wmnet [10:02:43] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2024.codfw.wmnet [10:02:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2024.codfw.wmnet [10:03:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [10:04:41] !log imported osmborder 0.1.0+wmf12u1 to apt.wikimedia.org/bookworm T381565 [10:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:43] T381565: 
Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [10:05:10] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 178, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71849 and previous config saved to /var/cache/conftool/dbconfig/20250108-100910-root.json [10:09:12] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2027 to nftables for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/1109032 (https://phabricator.wikimedia.org/T382508) (owner: 10Muehlenhoff) [10:10:54] PROBLEM - ganeti-noded running on ganeti2027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:11:04] PROBLEM - ganeti-confd running on ganeti2027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:12:16] FIRING: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:12:37] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1269.eqiad.wmnet with reason: host reimage [10:14:35] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2022-2023].codfw.wmnet [10:14:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:15:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1269.eqiad.wmnet with reason: host reimage [10:15:31] (03PS1) 10Marostegui: rebuild_tables.sh: Quick script to rebuild tables [software] - 10https://gerrit.wikimedia.org/r/1109035 (https://phabricator.wikimedia.org/T382842) [10:15:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2022-2023].codfw.wmnet [10:16:21] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10440092 (10elukey) @Volans is this currently an active issue? [10:17:27] ^ ganeti2027 is expected, it's being reimaged and I missed to downtime it in time [10:18:04] (03CR) 10Marostegui: "I've been using this for a few days" [software] - 10https://gerrit.wikimedia.org/r/1109035 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [10:18:06] (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh: Quick script to rebuild tables [software] - 10https://gerrit.wikimedia.org/r/1109035 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [10:19:05] (03Merged) 10jenkins-bot: rebuild_tables.sh: Quick script to rebuild tables [software] - 10https://gerrit.wikimedia.org/r/1109035 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [10:19:20] RESOLVED: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [10:19:24] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for 
host wikikube-worker2023.codfw.wmnet with OS bookworm [10:19:49] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [10:19:55] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host wikikube-worker2022 [10:19:55] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2022.codfw.wmnet with OS bookworm [10:19:56] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2023 [10:20:15] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:21:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [10:22:16] FIRING: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:23:38] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2023 - jelto@cumin1002" [10:23:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2023 - jelto@cumin1002" [10:23:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:23:42] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2023.codfw.wmnet 213.32.192.10.in-addr.arpa 3.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:23:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2023.codfw.wmnet 213.32.192.10.in-addr.arpa 3.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:23:46] !log 
jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2023 [10:24:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2023 [10:24:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2023 [10:24:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71850 and previous config saved to /var/cache/conftool/dbconfig/20250108-102416-root.json [10:24:28] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [10:24:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [10:24:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10440100 (10phaultfinder) [10:24:57] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [10:25:56] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [10:26:07] (03CR) 10JMeybohm: [C:03+2] Rename kubernetes[1059-1062] as wikikube-worker[1084-1087] [puppet] - 10https://gerrit.wikimedia.org/r/1109030 (https://phabricator.wikimedia.org/T377876) (owner: 10JMeybohm) [10:27:16] RESOLVED: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:28:51] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1059 to wikikube-worker1084 [10:29:12] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [10:29:22] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1060 to wikikube-worker1085 [10:29:54] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1061 to wikikube-worker1086 [10:30:12] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1062 to wikikube-worker1087 [10:31:20] !log jayme@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:31:21] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [10:31:31] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:31:39] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:31:59] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:32:16] FIRING: ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:34:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1269.eqiad.wmnet with OS bookworm [10:34:14] !log jayme@cumin1002 END (PASS) - 
Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1267-1269].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [10:36:33] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [10:36:38] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2022.codfw.wmnet with OS bookworm [10:36:38] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1062 to wikikube-worker1087 - jayme@cumin1002" [10:36:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on ganeti2027.codfw.wmnet with reason: reimage to bookworm [10:37:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1062 to wikikube-worker1087 - jayme@cumin1002" [10:37:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:37:08] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1087 [10:37:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ganeti2027.codfw.wmnet with reason: reimage to bookworm [10:37:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1087 [10:37:30] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [10:37:33] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [10:37:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [10:37:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1062 to wikikube-worker1087 [10:38:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [10:38:53] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1084 [10:38:59] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [10:39:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:40:33] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:40:46] (03CR) 10Muehlenhoff: maps::osm_master: Inline osm class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108773 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:41:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:41:18] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1086 [10:41:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1086 [10:41:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1084 [10:42:00] 06SRE, 06Infrastructure-Foundations: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207 (10cmooney) 03NEW p:05Triage→03Medium [10:42:06] (03PS1) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) [10:42:10] !log jelto@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2023.codfw.wmnet with reason: host reimage [10:42:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1061 to wikikube-worker1086 [10:42:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1059 to wikikube-worker1084 [10:42:28] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [10:43:43] !log uploaded php7.4 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u4 (backport of latest PHP security fixes to our PHP build) T378173 [10:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:44:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:44:48] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1085 [10:46:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2023.codfw.wmnet with reason: host reimage [10:46:04] !log installing libnvme bugfix updates from bookworm point release [10:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:57] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2022.codfw.wmnet with OS bookworm [10:47:25] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [10:47:28] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [10:47:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [10:47:30] !log jayme@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1085 [10:48:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1060 to wikikube-worker1085 [10:49:29] (03PS1) 10Muehlenhoff: Add library hint for libnvme [puppet] - 10https://gerrit.wikimedia.org/r/1109038 [10:50:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:50:38] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1084.eqiad.wmnet wikikube-worker1085.eqiad.wmnet wikikube-worker1086.eqiad.wmnet wikikube-worker1087.eqiad.wmnet on all recursors [10:50:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1084.eqiad.wmnet wikikube-worker1085.eqiad.wmnet wikikube-worker1086.eqiad.wmnet wikikube-worker1087.eqiad.wmnet on all recursors [10:51:15] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1084.eqiad.wmnet with OS bookworm [10:51:19] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1084 [10:51:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1084 [10:51:46] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1086.eqiad.wmnet with OS bookworm [10:51:49] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1086 [10:51:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1086 [10:51:50] (03CR) 10Elukey: [C:03+1] maps::osm_master: Inline osm class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108773 
(https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:52:05] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1087.eqiad.wmnet with OS bookworm [10:52:08] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1087 [10:52:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1087 [10:52:25] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1085.eqiad.wmnet with OS bookworm [10:52:27] (03PS1) 10Btullis: Migrate the airflow-research scheduler to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) [10:52:29] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1085 [10:52:29] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1085 [10:52:43] (03CR) 10CI reject: [V:04-1] Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [10:53:28] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10440208 (10MoritzMuehlenhoff) [10:54:47] !log installing numpy bugfix updates from bookworm point release [10:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:56:38] (03PS1) 10DDesouza: miscweb(wikiworkshop): set alternative probe path [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1109043 (https://phabricator.wikimedia.org/T382617) [10:59:54] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10440215 (10MoritzMuehlenhoff) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1100) [11:00:48] !log installing graphviz bugfix updates from bookworm point release [11:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:57] (03PS2) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) [11:06:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2023.codfw.wmnet with OS bookworm [11:06:39] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:06:42] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2022.codfw.wmnet with OS bookworm [11:07:52] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1002.eqiad.wmnet with reason: Migrating to kubernetes [11:08:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1002.eqiad.wmnet with reason: Migrating to kubernetes [11:08:51] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10440235 (10MoritzMuehlenhoff) [11:08:58] !log jayme@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1084.eqiad.wmnet with reason: host reimage [11:09:39] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1087.eqiad.wmnet with reason: host reimage [11:10:20] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1085.eqiad.wmnet with reason: host reimage [11:12:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1084.eqiad.wmnet with reason: host reimage [11:13:41] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [11:13:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [11:14:08] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for libnvme [puppet] - 10https://gerrit.wikimedia.org/r/1109038 (owner: 10Muehlenhoff) [11:15:41] (03CR) 10CI reject: [V:04-1] Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [11:17:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1085.eqiad.wmnet with reason: host reimage [11:18:31] !log sudo homer 'lsw1-c6-codfw*' commit 'T377877' [11:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:34] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [11:19:23] sudo homer 'cr*codfw*' commit 'T377877' [11:20:43] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1087.eqiad.wmnet with reason: host reimage [11:21:19] PROBLEM - BGP status on lsw1-c6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[11:24:13] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:28] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [11:25:31] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [11:25:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [11:26:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on ganeti2027.codfw.wmnet with reason: reimage pending, blocked by T383207 [11:26:22] T383207: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207 [11:26:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ganeti2027.codfw.wmnet with reason: reimage pending, blocked by T383207 [11:26:35] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10440286 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f82529ae-1943-40c5-869d-4f3786e140c0) set by jmm@cumin2002 for 3:0... 
[11:26:49] (03CR) 10Ladsgroup: "Thanks <3" [software] - 10https://gerrit.wikimedia.org/r/1109035 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [11:27:45] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10440289 (10MoritzMuehlenhoff) [11:28:01] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2027.codfw.wmnet with reason: reimage pending, blocked by T383207 [11:28:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2027.codfw.wmnet with reason: reimage pending, blocked by T383207 [11:28:13] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:28:23] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10440292 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=98d828cd-07ea-44e7-bb64-56e6ad972bde) set by jmm@cumin2002 for 7 d... 
[11:28:34] (03PS3) 10Muehlenhoff: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [11:28:59] (03PS1) 10Marostegui: db2226: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109045 [11:29:13] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:29:29] (03CR) 10Marostegui: [C:03+2] db2226: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109045 (owner: 10Marostegui) [11:29:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71852 and previous config saved to /var/cache/conftool/dbconfig/20250108-112956-root.json [11:30:56] (03CR) 10Muehlenhoff: maps::osm_master: Inline osm class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108773 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:30:57] (03CR) 10Muehlenhoff: [C:03+2] maps::osm_master: Inline osm class [puppet] - 10https://gerrit.wikimedia.org/r/1108773 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:31:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1084.eqiad.wmnet with OS bookworm [11:32:03] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:32:22] (03PS1) 10Btullis: Add postgresql import parameters for the airflow-research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109046 (https://phabricator.wikimedia.org/T380616) [11:34:26] (03CR) 10Ladsgroup: "This code path is only invoked in ParserCache sections (see ( substr( $dbctlCluster, 0, 2 ) === 'pc' ) a couple of lines above). 
So if yo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:34:50] !log installing php7.4 security updates [11:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1085.eqiad.wmnet with OS bookworm [11:35:59] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:37:03] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:37:09] (03CR) 10Marostegui: "Sorry if I wasn't clear, my question was about the ability to depool ALL pc sections. We should make sure we prevent that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:37:54] (03PS1) 10Btullis: Switch airflow-research to use the cloudnativepg cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109049 (https://phabricator.wikimedia.org/T380616) [11:38:04] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:39:44] (03CR) 10CI reject: [V:04-1] Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [11:40:27] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1086.eqiad.wmnet with OS bookworm [11:40:43] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1087.eqiad.wmnet with OS bookworm 
[11:40:43] 06SRE, 10Maps: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210 (10Gnoeee) 03NEW [11:40:57] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1086.eqiad.wmnet with OS bookworm [11:41:01] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1086 [11:41:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1086 [11:41:15] (03CR) 10Muehlenhoff: [C:03+2] Fix permissions for /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870) (owner: 10Muehlenhoff) [11:45:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71853 and previous config saved to /var/cache/conftool/dbconfig/20250108-114501-root.json [11:49:19] (03PS3) 10Zabe: snapshot: Remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/1108861 (https://phabricator.wikimedia.org/T378260) [11:49:24] (03CR) 10Ladsgroup: [C:03+2] snapshot: Remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/1108861 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [11:49:26] (03CR) 10Ladsgroup: [V:03+2 C:03+2] snapshot: Remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/1108861 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [11:49:52] 07sre-alert-triage, 06Infrastructure-Foundations, 13Patch-For-Review: Alert in need of triage: PuppetConstantChange (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T382870#10440386 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been fixed: When running wi... 
[11:51:59] (03PS1) 10Marostegui: mariadb: Productionize db2228 [puppet] - 10https://gerrit.wikimedia.org/r/1109050 (https://phabricator.wikimedia.org/T373579) [11:52:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2128 T373579', diff saved to https://phabricator.wikimedia.org/P71854 and previous config saved to /var/cache/conftool/dbconfig/20250108-115206-marostegui.json [11:52:10] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [11:52:46] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2228 [puppet] - 10https://gerrit.wikimedia.org/r/1109050 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:53:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2128,2186].codfw.wmnet with reason: cloning [11:53:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2128,2186].codfw.wmnet with reason: cloning [11:54:35] (03CR) 10Brouberol: Migrate the airflow-research scheduler to dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) (owner: 10Btullis) [11:55:36] (03PS1) 10Marostegui: instances.yaml: Add db2228 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1109052 (https://phabricator.wikimedia.org/T373579) [11:55:48] (03CR) 10Brouberol: [C:03+1] "LGTM! I checked that the password was committed to the private repo as well." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1109046 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [11:56:05] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10440417 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [11:56:06] (03PS2) 10Btullis: Migrate the airflow-research scheduler to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) [11:56:06] (03PS2) 10Btullis: Add postgresql import parameters for the airflow-research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109046 (https://phabricator.wikimedia.org/T380616) [11:56:59] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2228 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1109052 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:57:40] (03CR) 10Btullis: Migrate the airflow-research scheduler to dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) (owner: 10Btullis) [11:59:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2228 to dbctl depooled T373579', diff saved to https://phabricator.wikimedia.org/P71855 and previous config saved to /var/cache/conftool/dbconfig/20250108-115908-marostegui.json [11:59:12] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [11:59:38] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1086.eqiad.wmnet with reason: host reimage [11:59:53] !log root@cumin1002 START - Cookbook sre.mysql.clone of db2128.codfw.wmnet onto db2228.codfw.wmnet [11:59:56] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213 (10JMeybohm) 03NEW [12:00:05] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1200). [12:00:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71856 and previous config saved to /var/cache/conftool/dbconfig/20250108-120006-root.json [12:01:21] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10440445 (10MoritzMuehlenhoff) [12:02:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1086.eqiad.wmnet with reason: host reimage [12:02:42] !log root@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2128.codfw.wmnet onto db2228.codfw.wmnet [12:03:30] !log root@cumin1002 START - Cookbook sre.mysql.clone of db2128.codfw.wmnet onto db2228.codfw.wmnet [12:03:34] (03CR) 10Brouberol: [C:03+1] Migrate the airflow-research scheduler to dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) (owner: 10Btullis) [12:04:56] (03CR) 10Btullis: [C:03+2] Migrate the airflow-research scheduler to dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) (owner: 10Btullis) [12:05:11] (03CR) 10Brouberol: [C:03+1] Migrate the airflow-research scheduler to dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) (owner: 10Btullis) [12:06:11] (03Merged) 10jenkins-bot: Migrate the airflow-research scheduler to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109041 (https://phabricator.wikimedia.org/T380620) (owner: 10Btullis) [12:08:32] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [12:09:00] !log jayme@cumin1002 START - 
Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1084-1085,1087].eqiad.wmnet [12:09:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1084-1085,1087].eqiad.wmnet [12:15:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71857 and previous config saved to /var/cache/conftool/dbconfig/20250108-121512-root.json [12:15:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:36] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding mhernandez [puppet] - 10https://gerrit.wikimedia.org/r/1108034 (owner: 10Slyngshede) [12:16:13] (03CR) 10JMeybohm: "You can use the function `k8s::fetch_cluster_config($cluster-name)` to get the clusters structure from kubernetes.yaml (including intermed" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [12:19:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 10%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71858 and previous config saved to /var/cache/conftool/dbconfig/20250108-121931-root.json [12:20:21] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [12:20:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [12:20:53] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:21:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker1086.eqiad.wmnet with OS bookworm [12:30:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71859 and previous config saved to /var/cache/conftool/dbconfig/20250108-123017-root.json [12:30:55] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [12:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10440538 (10phaultfinder) [12:34:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71860 and previous config saved to /var/cache/conftool/dbconfig/20250108-123437-root.json [12:35:15] (03PS1) 10Btullis: Add missing postgresql networkpolicies to airflow-research [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109056 (https://phabricator.wikimedia.org/T380616) [12:35:25] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109057 [12:37:01] (03CR) 10Brouberol: [C:03+1] Add missing postgresql networkpolicies to airflow-research [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109056 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [12:37:07] (03CR) 10Btullis: [C:03+2] Add missing postgresql networkpolicies to airflow-research [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109056 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [12:37:14] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109057 (owner: 10PipelineBot) [12:38:18] (03Merged) 10jenkins-bot: Add missing postgresql networkpolicies to airflow-research [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109056 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [12:38:20] (03Merged) 10jenkins-bot: citoid: 
pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109057 (owner: 10PipelineBot) [12:39:20] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [12:39:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10440546 (10phaultfinder) [12:40:03] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:40:45] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:41:09] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:42:23] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:43:03] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 7.609 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:43:13] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:43:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:45:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71861 and previous config saved to /var/cache/conftool/dbconfig/20250108-124522-root.json [12:45:52] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2022.codfw.wmnet with OS bookworm [12:46:40] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2128.codfw.wmnet onto db2228.codfw.wmnet [12:47:13] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:47:50] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:48:24] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:48:56] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:49:04] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2023.codfw.wmnet [12:49:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2023.codfw.wmnet [12:49:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71862 and previous config saved to /var/cache/conftool/dbconfig/20250108-124943-root.json [12:50:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [12:50:23] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2020-2021].codfw.wmnet [12:51:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
wikikube-worker[2020-2021].codfw.wmnet [12:52:18] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2021.codfw.wmnet with OS bookworm [12:52:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2020.codfw.wmnet with OS bookworm [12:52:45] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2020 [12:52:58] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:53:08] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1086.eqiad.wmnet [12:53:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1086.eqiad.wmnet [12:54:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10440654 (10phaultfinder) [12:56:23] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2020 - jelto@cumin1002" [12:56:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2020 - jelto@cumin1002" [12:56:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:56:28] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2020.codfw.wmnet 208.32.192.10.in-addr.arpa 8.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:56:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2020.codfw.wmnet 208.32.192.10.in-addr.arpa 8.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:56:31] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2020 [12:56:58] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2020 [12:56:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2020 [12:57:40] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2021 [12:57:47] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:00:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71863 and previous config saved to /var/cache/conftool/dbconfig/20250108-130028-root.json [13:00:41] jouncebot: nowandnext [13:00:41] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [13:00:41] In 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1400) [13:00:45] AWESOME [13:01:11] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2021 - jelto@cumin1002" [13:01:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2021 - jelto@cumin1002" [13:01:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:01:16] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2021.codfw.wmnet 210.32.192.10.in-addr.arpa 0.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:01:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2021.codfw.wmnet 210.32.192.10.in-addr.arpa 0.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:01:19] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2021 [13:01:53] 
(03CR) 10Ladsgroup: [C:03+2] "We decided to do this outside of mw and in dbctl. Resolving this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [13:01:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2021 [13:01:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2021 [13:02:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [13:02:39] (03Merged) 10jenkins-bot: Fully depool ParserCache section if load of the primary is zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [13:03:34] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1108794|Fully depool ParserCache section if load of the primary is zero (T373037 T383137)]] [13:03:38] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [13:03:39] T383137: Allow full depool of pc sections in dbctl - https://phabricator.wikimedia.org/T383137 [13:03:44] (03CR) 10Filippo Giunchedi: [V:03+1] "Thank you, I'll try with that and see what I can come up with" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:04:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71864 and previous config saved to /var/cache/conftool/dbconfig/20250108-130448-root.json [13:12:07] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1108794|Fully depool ParserCache section if load of the primary is zero (T373037 T383137)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [13:12:12] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [13:12:13] T383137: Allow full depool of pc sections in dbctl - https://phabricator.wikimedia.org/T383137 [13:13:54] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [13:14:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2020.codfw.wmnet with reason: host reimage [13:15:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71865 and previous config saved to /var/cache/conftool/dbconfig/20250108-131533-root.json [13:17:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2020.codfw.wmnet with reason: host reimage [13:19:45] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2021.codfw.wmnet with reason: host reimage [13:19:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71866 and previous config saved to /var/cache/conftool/dbconfig/20250108-131953-root.json [13:21:42] (03PS1) 10Alexandros Kosiaris: Revert "mwdebug: Enable retries" [puppet] - 10https://gerrit.wikimedia.org/r/1109062 (https://phabricator.wikimedia.org/T380958) [13:22:05] (03CR) 10CI reject: [V:04-1] Revert "mwdebug: Enable retries" [puppet] - 10https://gerrit.wikimedia.org/r/1109062 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [13:22:13] (03PS2) 10Muehlenhoff: Switch to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1108716 [13:22:54] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108794|Fully depool ParserCache section if load of the primary is zero (T373037 T383137)]] (duration: 19m 19s) [13:22:58] T373037: Make ParserCache more like a ring - 
https://phabricator.wikimedia.org/T373037 [13:22:59] T383137: Allow full depool of pc sections in dbctl - https://phabricator.wikimedia.org/T383137 [13:23:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2021.codfw.wmnet with reason: host reimage [13:24:15] marostegui: I'm about to depool pc5 in eqiad (and then codfw) as test [13:24:24] FIRING: [2x] ProbeDown: Service ml-serve-ctrl1001:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:31] last chance for objection [13:24:31] Amir1: ok! [13:25:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc5 in eqiad for test (T373037)', diff saved to https://phabricator.wikimedia.org/P71867 and previous config saved to /var/cache/conftool/dbconfig/20250108-132506-ladsgroup.json [13:25:31] https://www.irccloud.com/pastebin/UwwQ3H3F/ [13:25:36] marostegui: ^ \o/ [13:27:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc5 in codfw for test (T373037)', diff saved to https://phabricator.wikimedia.org/P71868 and previous config saved to /var/cache/conftool/dbconfig/20250108-132708-ladsgroup.json [13:27:20] now codfw is done [13:27:42] I leave it for a bit to measure impact of loss of a section [13:29:24] RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl1001:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10440761 (10phaultfinder) [13:30:36] 
Amir1: Nice! [13:30:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71869 and previous config saved to /var/cache/conftool/dbconfig/20250108-133038-root.json [13:34:36] (03PS5) 10Filippo Giunchedi: prometheus: migrate ops instance to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) [13:34:36] (03PS6) 10Filippo Giunchedi: WIP: prometheus: k8s instances migration [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) [13:37:07] (03CR) 10CI reject: [V:04-1] WIP: prometheus: k8s instances migration [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:37:35] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4760/co" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:37:57] !log elukey@puppetserver1001:~$ sudo puppetserver ca clean --certname kubernetes1021.eqiad.wmnet [13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:36] (03CR) 10Filippo Giunchedi: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:39:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2020.codfw.wmnet with OS bookworm [13:41:15] (03PS1) 10Brouberol: Disable analytics when starting the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109068 [13:41:15] (03PS1) 10Brouberol: airflow-research: temporarily disable liveness/readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109069 
(https://phabricator.wikimedia.org/T380620) [13:42:36] (03PS1) 10Alexandros Kosiaris: profile::tlsproxy::envoy: Explicitly configure retries [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) [13:42:53] 06SRE, 10Observability-Metrics: Aggregate prometheus functions yielding different results in grafana vs. prometheus console - https://phabricator.wikimedia.org/T168403#10440839 (10tappof) 05Open→03Declined The dash no longer exists. Closing the task. [13:43:08] (03PS2) 10Alexandros Kosiaris: profile::tlsproxy::envoy: Explicitly configure retries [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) [13:43:09] (03Abandoned) 10Alexandros Kosiaris: Revert "mwdebug: Enable retries" [puppet] - 10https://gerrit.wikimedia.org/r/1109062 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [13:43:34] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [13:44:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2021.codfw.wmnet with OS bookworm [13:44:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 10%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71870 and previous config saved to /var/cache/conftool/dbconfig/20250108-134408-root.json [13:44:54] (03PS1) 10Marostegui: db2228: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109071 [13:45:28] (03CR) 10Marostegui: "Host green in Icinga." 
[puppet] - 10https://gerrit.wikimedia.org/r/1109071 (owner: 10Marostegui) [13:45:29] (03CR) 10Marostegui: [C:03+2] db2228: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1109071 (owner: 10Marostegui) [13:45:31] (03CR) 10CI reject: [V:04-1] profile::tlsproxy::envoy: Explicitly configure retries [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [13:45:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71871 and previous config saved to /var/cache/conftool/dbconfig/20250108-134544-root.json [13:45:44] (03CR) 10Filippo Giunchedi: [V:03+1] "PS6 has a solution that fetches the cluster config, using only the cluster name + site (i.e. same as profile::kubernetes::cluster_name). W" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:46:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71872 and previous config saved to /var/cache/conftool/dbconfig/20250108-134610-root.json [13:50:09] (03CR) 10Vgutierrez: [C:03+1] "procedure looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [13:50:59] 10ops-codfw, 06DC-Ops: hw troubleshooting: SMART errors on ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T383225#10440906 (10klausman) a:03Papaul [13:56:11] (03PS1) 10Marostegui: installserver: Do not format db2230 [puppet] - 10https://gerrit.wikimedia.org/r/1109075 (https://phabricator.wikimedia.org/T373579) [13:56:38] (03PS7) 10Filippo Giunchedi: WIP: prometheus: k8s instances migration [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) [13:58:45] (03CR) 10Marostegui: 
[C:03+2] installserver: Do not format db2230 [puppet] - 10https://gerrit.wikimedia.org/r/1109075 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [13:59:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 25%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71873 and previous config saved to /var/cache/conftool/dbconfig/20250108-135913-root.json [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:43] o/ [14:00:57] nothing to deploy right now [14:01:11] (03CR) 10Btullis: [C:03+1] Disable analytics when starting the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109068 (owner: 10Brouberol) [14:01:11] though https://phabricator.wikimedia.org/T383221 sounds like MichaelG_WMF might want to backport something later [14:01:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71874 and previous config saved to /var/cache/conftool/dbconfig/20250108-140115-root.json [14:01:34] (03CR) 10Btullis: [C:03+1] airflow-research: temporarily disable liveness/readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109069 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [14:01:56] !log installing gtk+3.0 security updates [14:02:39] @Lucas_WMDE: In principle yes, though my change probably needs a bit more detailed review. 
So, it probably has to wait for the late backport window tonight [14:03:06] ok [14:05:01] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [14:05:01] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [14:05:25] (03CR) 10Btullis: [C:03+2] Disable analytics when starting the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109068 (owner: 10Brouberol) [14:07:17] (03Merged) 10jenkins-bot: Disable analytics when starting the scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109068 (owner: 10Brouberol) [14:07:46] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2021.codfw.wmnet [14:08:38] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2021.codfw.wmnet [14:08:44] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-12-16-202347 to 2025-01-06-142521 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109076 (https://phabricator.wikimedia.org/T380828) [14:08:57] !log sudo homer 'lsw1-c6-codfw*' commit 'T377877' [14:09:31] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:09:37] !log sudo homer 'cr*codfw*' commit 'T377877' [14:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:40] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [14:11:05] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2021.codfw.wmnet [14:11:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2021.codfw.wmnet [14:11:13] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2020.codfw.wmnet [14:11:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node 
(exit_code=0) pool for host wikikube-worker2020.codfw.wmnet [14:12:08] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [14:12:11] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [14:12:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [14:13:16] Hi! I'm planning to run some maintenance scripts and add wikidata support for tigwiki as per T381382 [14:13:25] does this interfere with any deployments or plans? [14:14:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc2017.codfw.wmnet,pc[1014,1017].eqiad.wmnet with reason: Reboot [14:14:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 50%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71875 and previous config saved to /var/cache/conftool/dbconfig/20250108-141418-root.json [14:14:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2017.codfw.wmnet,pc[1014,1017].eqiad.wmnet with reason: Reboot [14:14:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:16:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71876 and previous config saved to /var/cache/conftool/dbconfig/20250108-141620-root.json [14:18:19] (03CR) 10Btullis: [C:03+2] airflow-research: temporarily disable liveness/readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109069 
(https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [14:19:18] (03CR) 10Muehlenhoff: [C:03+2] Switch to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1108716 (owner: 10Muehlenhoff) [14:19:41] (03Merged) 10jenkins-bot: airflow-research: temporarily disable liveness/readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109069 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [14:21:27] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [14:23:02] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:24:03] FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:29:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 75%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71877 and previous config saved to /var/cache/conftool/dbconfig/20250108-142923-root.json [14:31:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71878 and previous config saved to /var/cache/conftool/dbconfig/20250108-143126-root.json [14:33:39] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2022.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:33:42] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [14:35:52] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission mw[1349-1413] - 
https://phabricator.wikimedia.org/T375842#10441086 (10akosiaris) [14:36:14] (03CR) 10Alexandros Kosiaris: [C:03+1] wikikube: decommission 1 host [puppet] - 10https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842) (owner: 10Jasmine) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:46] 06SRE, 10observability, 10Observability-Logging, 10Wikimedia-Logstash: Rationalize default logrotate "rotated" file extensions - https://phabricator.wikimedia.org/T207296#10441089 (10herron) 05Open→03Declined [14:37:26] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for pc1014.eqiad.wmnet [14:38:26] !log joelyrookewmde@mwmaint2002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [14:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:39:17] 06SRE, 10observability, 10Observability-Logging, 10Wikimedia-Logstash, 13Patch-For-Review: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400#10441114 (10herron) 05Open→03Resolved a:03herron closing old task [14:40:06] (03PS3) 10Alexandros Kosiaris: profile::tlsproxy::envoy: Explicitly configure retries [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) [14:40:50] !log elukey@puppetserver1001:~$ sudo puppetserver ca clean --certname kubernetes1061.eqiad.wmnet [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for 
pc1014.eqiad.wmnet [14:43:55] (03PS1) 10Alexandros Kosiaris: Remove wgDBsqlpassword setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109081 [14:44:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2128 (re)pooling @ 100%: Repooling after cloning', diff saved to https://phabricator.wikimedia.org/P71879 and previous config saved to /var/cache/conftool/dbconfig/20250108-144429-root.json [14:45:22] (03PS11) 10Klausman: modules+hiera: Add module to do Ceph mounts and mount ml-lab /home [puppet] - 10https://gerrit.wikimedia.org/r/1109044 [14:46:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71880 and previous config saved to /var/cache/conftool/dbconfig/20250108-144631-root.json [14:47:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10441156 (10bking) 05Open→03In progress a:03bking @Andrew , you are correct. I just assigned this to myself and we'l... 
[14:48:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repool pc5 (T373037)', diff saved to https://phabricator.wikimedia.org/P71881 and previous config saved to /var/cache/conftool/dbconfig/20250108-144805-ladsgroup.json [14:48:09] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [14:48:42] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2022.codfw.wmnet with OS bookworm [14:48:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724#10441169 (10MoritzMuehlenhoff) [14:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10441183 (10phaultfinder) [14:50:48] (03CR) 10Btullis: [C:03+2] Add postgresql import parameters for the airflow-research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109046 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [14:51:35] (03CR) 10Jforrester: [C:03+1] "Ha, good find." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109081 (owner: 10Alexandros Kosiaris) [14:52:14] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109081 (owner: 10Alexandros Kosiaris) [14:52:24] (03Merged) 10jenkins-bot: Add postgresql import parameters for the airflow-research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109046 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [14:52:33] (03CR) 10Ladsgroup: [C:03+1] "Thanks <3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109081 (owner: 10Alexandros Kosiaris) [14:52:58] (03Merged) 10jenkins-bot: Remove wgDBsqlpassword setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109081 (owner: 10Alexandros Kosiaris) [14:53:16] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2022.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:53:47] (03PS12) 10Bking: modules+hiera: Add module to do Ceph mounts and mount ml-lab /home [puppet] - 10https://gerrit.wikimedia.org/r/1109044 (owner: 10Klausman) [14:53:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109044 (owner: 10Klausman) [14:54:40] (03CR) 10Btullis: modules+hiera: Add module to do Ceph mounts and mount ml-lab /home (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1109044 (owner: 10Klausman) [14:55:53] (03CR) 10Kamila Součková: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [14:56:46] (03PS1) 10Btullis: Enable the airflow-research postgresql release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109086 (https://phabricator.wikimedia.org/T380616) [14:57:32] (03CR) 10Brouberol: [C:03+1] Enable the airflow-research postgresql release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109086 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [14:57:32] (03PS13) 10Bking: modules+hiera: Add module to do Ceph mounts 
and mount ml-lab /home [puppet] - 10https://gerrit.wikimedia.org/r/1109044 (owner: 10Klausman) [14:57:42] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-12-16-202347 to 2025-01-08-143723 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109076 (https://phabricator.wikimedia.org/T313460) [14:57:43] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-12-17-184905 to 2025-01-08-142250 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109087 (https://phabricator.wikimedia.org/T381207) [14:57:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109044 (owner: 10Klausman) [14:58:35] (03CR) 10Giuseppe Lavagetto: [C:03+1] profile::tlsproxy::envoy: Explicitly configure retries [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [14:59:10] (03CR) 10Alexandros Kosiaris: [C:03+2] profile::tlsproxy::envoy: Explicitly configure retries [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [14:59:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10441211 (10phaultfinder) [15:00:04] !log Finished populateSitesTable for tigwiki (T381382) [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1500) [15:00:05] (03CR) 10Btullis: [C:03+2] Enable the airflow-research postgresql release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109086 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [15:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:07] T381382: Add Wikidata support for tigwiki - https://phabricator.wikimedia.org/T381382 [15:01:10] (03Merged) 10jenkins-bot: Enable the airflow-research postgresql release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109086 
(https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [15:01:15] (03CR) 10Muehlenhoff: profile::tlsproxy::envoy: Explicitly configure retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [15:01:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71882 and previous config saved to /var/cache/conftool/dbconfig/20250108-150136-root.json [15:03:53] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-12-17-184905 to 2025-01-08-142250 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109087 (https://phabricator.wikimedia.org/T381207) (owner: 10Jforrester) [15:04:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10441248 (10phaultfinder) [15:05:01] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-12-17-184905 to 2025-01-08-142250 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109087 (https://phabricator.wikimedia.org/T381207) (owner: 10Jforrester) [15:06:28] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [15:06:32] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [15:06:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:31] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [15:07:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-research: apply [15:07:51] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:08:23] Wikidata support has been added for tigwiki and I do not need to run any more maintenance scripts [15:08:27] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:10:09] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:10:57] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:11:16] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2022.codfw.wmnet with OS bookworm [15:11:20] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:12:12] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:14:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:14:47] 14SRE-grizzly-sprint, 10Observability-Metrics: Grizzly: onboard "popular" dashboards as static json managed dashboards - https://phabricator.wikimedia.org/T331656#10441298 (10herron) 05Open→03Declined Declining as we're moving away from Grizzly [15:14:50] (03CR) 10Btullis: modules+hiera: Add module to do Ceph mounts and mount ml-lab /home (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1109044 (owner: 10Klausman) [15:15:00] 14SRE-grizzly-sprint, 10Observability-Metrics: Grizzly: CI improvements - https://phabricator.wikimedia.org/T331659#10441300 (10herron) 05Open→03Declined Declining as we're moving away from Grizzly [15:15:07] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade 
evaluators from 2024-12-16-202347 to 2025-01-08-143723 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109076 (https://phabricator.wikimedia.org/T313460) (owner: 10Jforrester) [15:16:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71883 and previous config saved to /var/cache/conftool/dbconfig/20250108-151642-root.json [15:16:52] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-12-16-202347 to 2025-01-08-143723 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109076 (https://phabricator.wikimedia.org/T313460) (owner: 10Jforrester) [15:16:57] (03PS3) 10Btullis: Switch airflow-research to use the cloudnativepg cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109049 (https://phabricator.wikimedia.org/T380616) [15:17:26] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:17:38] (03CR) 10Brouberol: [C:03+1] Switch airflow-research to use the cloudnativepg cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109049 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [15:18:05] (03CR) 10Btullis: [C:03+2] Switch airflow-research to use the cloudnativepg cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109049 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [15:18:16] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:19:11] (03Merged) 10jenkins-bot: Switch airflow-research to use the cloudnativepg cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109049 (https://phabricator.wikimedia.org/T380616) (owner: 10Btullis) [15:19:18] 07Puppet, 06cloud-services-team, 10Cloud-VPS: Preserve formatting and comments etc. 
in ENC Hiera - https://phabricator.wikimedia.org/T250622#10441312 (10Andrew) I would very much like this to work and I also don't immediately know how to do it :( [15:19:37] (03PS1) 10Muehlenhoff: Switch magru01 to managed /var/lib/ganeti/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109092 (https://phabricator.wikimedia.org/T309724) [15:19:53] 07Puppet, 06cloud-services-team, 10Cloud-VPS: Preserve formatting and comments etc. in ENC Hiera - https://phabricator.wikimedia.org/T250622#10441315 (10joanna_borun) p:05Triage→03Medium [15:20:17] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:21:02] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:21:17] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10441317 (10joanna_borun) p:05Triage→03Medium [15:21:38] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10441319 (10Andrew) Two of these are now intended for https://phabricator.wikimedia.org/T382356 [15:21:45] (03CR) 10Muehlenhoff: [C:03+2] Add an option to pass the Presto firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1108041 (owner: 10Muehlenhoff) [15:21:59] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:22:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [15:22:26] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [15:22:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [15:22:51] (03CR) 10DCausse: Make WikibaseQualityConstraints use split-graph query service (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) 
(owner: 10Stevemunene) [15:22:59] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:23:54] (03CR) 10DCausse: [C:03+1] Make WikimediaCampaignEvents use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105878 (https://phabricator.wikimedia.org/T377956) (owner: 10Stevemunene) [15:27:08] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [15:27:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [15:28:22] 06SRE, 10observability, 10Observability-Metrics: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#10441351 (10lmata) 05Open→03Declined > * We no longer have any Dell R320s in production -- netbox reports 0 instances. https://netbox.wikimedia.org/dcim/device... [15:29:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [15:29:50] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2024.codfw.wmnet - https://phabricator.wikimedia.org/T383028#10441363 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:31:13] (03PS1) 10Muehlenhoff: Switch presto in test cluster to nftables-compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1109095 [15:31:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71884 and previous config saved to /var/cache/conftool/dbconfig/20250108-153147-root.json [15:32:58] 06SRE, 10Observability-Logging: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940#10441395 (10andrea.denisse) 05Open→03Resolved a:03andrea.denisse [15:34:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109095 (owner: 10Muehlenhoff) [15:36:11] !log installing 
jinja2 security updates [15:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:03] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10441433 (10joanna_borun) p:05Triage→03High [15:39:54] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [15:40:17] (03CR) 10Hnowlan: [C:03+1] mw-videoscaler: enable access to logging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108874 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [15:40:42] 06SRE, 10Observability-Logging, 13Patch-For-Review: Provision plaintext syslog collectors in PoPs - https://phabricator.wikimedia.org/T243065#10441451 (10fgiunchedi) [15:40:43] 06SRE, 06serviceops: Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403#10441452 (10lmata) Untagging sre-observability as we do some backlog housekeeping; please add us if we can assist. [15:41:26] 06SRE, 10observability, 10Observability-Logging, 10Wikimedia-Logstash: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#10441459 (10herron) 05Open→03Resolved a:03herron Cleaning up old task [15:43:11] 14SRE-Sprint-Week-Sustainability-March2023, 06Infrastructure-Foundations, 10Mail, 10observability, and 2 others: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171#10441474 (10herron) 05Open→03Resolved a:03herron Cleaning up old tasks [15:44:31] 06SRE, 10observability, 10Observability-Logging, 10Wikimedia-Logstash: Investigate missing WikibaseQualityConstraints logs in logstash. - https://phabricator.wikimedia.org/T214031#10441479 (10colewhite) 05Open→03Declined It's been a while and it's still not clear where these logs got lost. Closing... 
[15:46:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71885 and previous config saved to /var/cache/conftool/dbconfig/20250108-154653-root.json [15:47:02] jouncebot: nowandnext [15:47:02] For the next 0 hour(s) and 12 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1500) [15:47:02] In 2 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1800) [15:48:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [15:49:30] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [15:49:46] (03PS1) 10Giuseppe Lavagetto: ClusterConfig: add support for dumps trait [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) [15:49:48] (03PS1) 10Giuseppe Lavagetto: Use a bespoke database configuration for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) [15:49:54] (03CR) 10Scott French: "Thank you both for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1108874 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [15:50:03] (03CR) 10Scott French: [C:03+2] mw-videoscaler: enable access to logging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108874 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [15:50:31] (03CR) 10CI reject: [V:04-1] Use a bespoke database configuration for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [15:51:17] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2022.codfw.wmnet with OS bookworm [15:51:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [15:51:46] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [15:51:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [15:52:04] (03CR) 10Elukey: "Saw the code change passing by and added some comments :)" [puppet] - 10https://gerrit.wikimedia.org/r/1109044 (owner: 10Klausman) [15:52:19] (03Merged) 10jenkins-bot: mw-videoscaler: enable access to logging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108874 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [15:52:39] (03PS2) 10Muehlenhoff: Switch presto in test cluster to nftables-compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1109095 [15:53:04] 10ops-codfw, 06SRE, 06DC-Ops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10441522 (10Jhancock.wm) pinging to see if we can schedule a time for this next week. 
[15:53:55] (PS1) Btullis: Re-enable the airflow-resarch scheduler probes [deployment-charts] - https://gerrit.wikimedia.org/r/1109110 (https://phabricator.wikimedia.org/T380620)
[15:53:55] SRE, Observability-Metrics: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954#10441529 (herron)
[15:54:07] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[15:54:14] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[15:55:24] (CR) Brouberol: [C:+1] Re-enable the airflow-resarch scheduler probes [deployment-charts] - https://gerrit.wikimedia.org/r/1109110 (https://phabricator.wikimedia.org/T380620) (owner: Btullis)
[15:55:52] (CR) Btullis: [C:+2] Re-enable the airflow-resarch scheduler probes [deployment-charts] - https://gerrit.wikimedia.org/r/1109110 (https://phabricator.wikimedia.org/T380620) (owner: Btullis)
[15:57:00] (Merged) jenkins-bot: Re-enable the airflow-resarch scheduler probes [deployment-charts] - https://gerrit.wikimedia.org/r/1109110 (https://phabricator.wikimedia.org/T380620) (owner: Btullis)
[15:57:07] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2022.codfw.wmnet with OS bookworm
[15:57:23] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff)
[15:57:42] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[15:57:46] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[16:00:55] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2018-2019].codfw.wmnet
[16:01:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71886 and previous config saved to 
/var/cache/conftool/dbconfig/20250108-160158-root.json
[16:02:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2018-2019].codfw.wmnet
[16:02:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:08] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2018.codfw.wmnet with OS bookworm
[16:03:22] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2019.codfw.wmnet with OS bookworm
[16:03:27] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2018
[16:03:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2018
[16:03:48] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2019
[16:04:07] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:05:46] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[16:07:26] ops-codfw, SRE, DC-Ops: hw troubleshooting: SMART errors on ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T383225#10441585 (Jhancock.wm) Open→Resolved the two 2TB drives have been removed. the failed one has been marked and the other returned to inventory. 
[16:07:28] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:07:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2019 - jelto@cumin1002"
[16:07:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2019 - jelto@cumin1002"
[16:07:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:07:34] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2019.codfw.wmnet 117.32.192.10.in-addr.arpa 7.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:07:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2019.codfw.wmnet 117.32.192.10.in-addr.arpa 7.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:07:37] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2019
[16:07:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2019
[16:07:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2019
[16:08:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1081.eqiad.wmnet with OS bookworm
[16:08:20] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10441587 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jclark@cumin1002 for host wikikube...
[16:08:52] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[16:09:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10441593 (phaultfinder)
[16:10:42] SRE, SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241 (Kgraessle) NEW
[16:11:52] (CR) Ladsgroup: [C:+1] ClusterConfig: add support for dumps trait [mediawiki-config] - https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: Giuseppe Lavagetto)
[16:13:12] PROBLEM - MD RAID on ml-serve2001 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:13:13] ACKNOWLEDGEMENT - MD RAID on ml-serve2001 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383242 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:13:22] ops-codfw, SRE, DC-Ops: Degraded RAID on ml-serve2001 - https://phabricator.wikimedia.org/T383242 (ops-monitoring-bot) NEW
[16:15:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:18:07] (CR) Lucas Werkmeister (WMDE): "Looks good to me apart from DCausse’s comments :) should be relatively easy to try out on mwdebug – just grab some external ID statement f" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105879 
(https://phabricator.wikimedia.org/T374021) (owner: Stevemunene)
[16:20:34] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage
[16:20:36] (CR) Lucas Werkmeister (WMDE): "(I checked the URLs with `curl -d 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o. }' -d format=json` and they seem to work ^^)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: Stevemunene)
[16:20:44] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[16:24:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage
[16:25:59] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1081.eqiad.wmnet with reason: host reimage
[16:26:23] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2019.codfw.wmnet with reason: host reimage
[16:27:47] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm
[16:29:57] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[16:31:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1081.eqiad.wmnet with reason: host reimage
[16:35:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2019.codfw.wmnet with reason: host reimage
[16:37:06] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission dbproxy1020.eqiad.wmnet - https://phabricator.wikimedia.org/T383025#10441708 (Jclark-ctr)
[16:37:13] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission dbproxy1020.eqiad.wmnet - 
https://phabricator.wikimedia.org/T383025#10441709 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[16:39:56] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission dbproxy1021.eqiad.wmnet - https://phabricator.wikimedia.org/T383033#10441715 (Jclark-ctr)
[16:40:02] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission dbproxy1021.eqiad.wmnet - https://phabricator.wikimedia.org/T383033#10441716 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[16:43:05] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:43:54] ops-eqiad, SRE, DC-Ops, decommission-hardware, and 2 others: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10441736 (Jclark-ctr) @akosiaris i see you checked off all the boxes for the dcops team are these ready to be removed?
[16:44:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2018.codfw.wmnet with OS bookworm
[16:44:33] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:46:05] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:47:05] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:50:23] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:50:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host 
reimage - jclark@cumin1002"
[16:50:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1081.eqiad.wmnet with OS bookworm
[16:50:51] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10441780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-wor...
[16:50:57] (PS2) Scott French: mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519)
[16:52:06] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10441783 (Jclark-ctr) @Jelto i performed flea power drain and looks to image properly the critical status has clea...
[16:54:11] (CR) Scott French: "Thanks, all. I've updated this patch with a ~ 10% bump to mw-api-ext as discussed in https://phabricator.wikimedia.org/T376519#10439450." [deployment-charts] - https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: Scott French)
[16:55:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2019.codfw.wmnet with OS bookworm
[16:58:55] (CR) Scott French: "Thanks, Valentin!" 
[puppet] - https://gerrit.wikimedia.org/r/1101104 (https://phabricator.wikimedia.org/T377042) (owner: Scott French)
[16:58:58] (CR) Scott French: [C:+2] trafficserver: validate production config in tests [puppet] - https://gerrit.wikimedia.org/r/1101104 (https://phabricator.wikimedia.org/T377042) (owner: Scott French)
[16:59:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2018.codfw.wmnet
[16:59:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2018.codfw.wmnet
[16:59:45] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2019.codfw.wmnet
[17:00:17] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol1011.eqiad.wmnet with OS bookworm
[17:00:24] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10441804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with OS bookworm
[17:00:38] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2019.codfw.wmnet
[17:01:21] (PS1) Muehlenhoff: Make profile::presto::server::ferm_srange optional [puppet] - https://gerrit.wikimedia.org/r/1109117
[17:01:30] !log sudo homer 'lsw1-c3-codfw*' commit 'T377877'
[17:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:34] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[17:02:14] !log sudo homer 'cr*codfw*' commit 'T377877'
[17:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:24] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[17:02:25] RESOLVED: SystemdUnitFailed: 
update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:04:44] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2019.codfw.wmnet
[17:04:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2019.codfw.wmnet
[17:05:47] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[17:09:37] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10441849 (VRiley-WMF) We got a reply from them and the part has been dispatched. Awaiting arrival of part.
[17:15:08] (CR) BCornwall: [C:-1] "From the acme-chief [Wikitech page](https://wikitech.wikimedia.org/wiki/Acme-chief):" [puppet] - https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: Dzahn)
[17:19:52] (PS1) Kamila Součková: kubernetes: rename mw145[1-5] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1109119 (https://phabricator.wikimedia.org/T365571)
[17:22:01] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[17:31:15] (PS1) Ssingh: P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot [puppet] - https://gerrit.wikimedia.org/r/1109120
[17:32:06] (PS2) Ssingh: P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot [puppet] - https://gerrit.wikimedia.org/r/1109120
[17:33:07] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4762/co" [puppet] - https://gerrit.wikimedia.org/r/1109120 (owner: Ssingh) 
[17:34:19] (CR) Scott French: [C:+1] kubernetes: rename mw145[1-5] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1109119 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková)
[17:36:17] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1451-1455].eqiad.wmnet
[17:36:26] SRE, Ceph, Infrastructure-Foundations, netops, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10441947 (dcaro) @cmooney what's needed to get this rolling? I'll make time whenever you are able :)
[17:36:30] (CR) Kamila Součková: [C:+2] kubernetes: rename mw145[1-5] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1109119 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková)
[17:39:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10441961 (phaultfinder)
[17:39:43] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10441962 (Jclark-ctr)
[17:40:15] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10441967 (Jclark-ctr) ` Failed to load ldlinux.c32 Boot failed: press a key to retry, or wait for reset... .............. ` downgraded firmware on nic and lo... 
[17:41:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1451-1455].eqiad.wmnet
[17:42:44] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1451 to wikikube-worker1088
[17:43:04] !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[17:43:33] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1452 to wikikube-worker1089
[17:43:45] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1453 to wikikube-worker1090
[17:44:06] (PS4) Scott French: hieradata: switch all "migration" releases to 8.1 [puppet] - https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040)
[17:44:12] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw1453 to wikikube-worker1090
[17:46:11] (CR) Andrea Denisse: [C:+1] "LGTM, I just left a suggestion. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1109120 (owner: Ssingh)
[17:46:47] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1451 to wikikube-worker1088 - kamila@cumin1002"
[17:46:57] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1453 to wikikube-worker1090
[17:47:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1451 to wikikube-worker1088 - kamila@cumin1002"
[17:47:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:47:31] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1088
[17:47:36] !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[17:48:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1088
[17:49:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from 
mw1451 to wikikube-worker1088
[17:49:32] (CR) Effie Mouzeli: [C:+1] mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: Scott French)
[17:51:13] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1453 to wikikube-worker1090 - kamila@cumin1002"
[17:51:27] (CR) RLazarus: [C:+1] mw-(api-ext|api-int|jobrunner|parsoid|web): migration php.version to 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1101121 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[17:51:33] (CR) RLazarus: [C:+1] hieradata: switch all "migration" releases to 8.1 [puppet] - https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[17:51:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1453 to wikikube-worker1090 - kamila@cumin1002"
[17:51:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:51:34] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1090
[17:52:39] !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[17:52:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1090
[17:53:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1453 to wikikube-worker1090
[17:53:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw1454:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:54:08] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cloudcontrol1011.eqiad.wmnet with reason: host reimage
[17:55:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:55:00] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1089
[17:56:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1089
[17:56:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1452 to wikikube-worker1089
[17:57:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1011.eqiad.wmnet with reason: host reimage
[17:58:01] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1088.eqiad.wmnet wikikube-worker1089.eqiad.wmnet wikikube-worker1090.eqiad.wmnet wikikub on all recursors
[17:58:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1088.eqiad.wmnet wikikube-worker1089.eqiad.wmnet wikikube-worker1090.eqiad.wmnet wikikub on all recursors
[17:58:40] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1454 to wikikube-worker1091
[17:59:00] !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[17:59:20] (PS1) BCornwall: ncredir: Add wikimedia.ro/wikipedia.ro [puppet] - https://gerrit.wikimedia.org/r/1109123 (https://phabricator.wikimedia.org/T222080)
[17:59:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10442072 (phaultfinder)
[18:00:04] SRE, Domains, Traffic, Patch-For-Review: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10442073 (BCornwall) @Strainu we just have a few more code pushes and then we'll be set. Thanks for the patience. 
:)
[18:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1800)
[18:00:10] SRE, Domains, Traffic, Patch-For-Review: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10442074 (BCornwall) Open→In progress
[18:01:37] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1455 to wikikube-worker1092
[18:02:30] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1454 to wikikube-worker1091 - kamila@cumin1002"
[18:02:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1454 to wikikube-worker1091 - kamila@cumin1002"
[18:02:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:02:50] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1091
[18:03:13] !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[18:03:31] (CR) Pppery: [C:+1] ncredir: Add wikimedia.ro/wikipedia.ro [puppet] - https://gerrit.wikimedia.org/r/1109123 (https://phabricator.wikimedia.org/T222080) (owner: BCornwall)
[18:04:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1091
[18:04:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1454 to wikikube-worker1091
[18:05:42] jouncebot: nowandnext
[18:05:42] For the next 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1800)
[18:05:43] In 0 hour(s) and 54 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1900)
[18:07:01] (CR) Andrea Denisse: [C:+1] "LGTM, I just left a 
non-blocking question to understand a specific part of the code." [puppet] - https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[18:07:18] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1455 to wikikube-worker1092 - kamila@cumin1002"
[18:07:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1455 to wikikube-worker1092 - kamila@cumin1002"
[18:07:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:07:23] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1092
[18:08:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1092
[18:09:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1455 to wikikube-worker1092
[18:09:19] FYI, I'll be deploying some changes as part of the infra window shortly
[18:09:29] (CR) Ssingh: "Change looks good and so does the rendering. I think we will need to update text/12-rate-limiting.vtc though as it is failing." 
[puppet] - https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: BCornwall)
[18:09:48] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1088.eqiad.wmnet wikikube-worker1089.eqiad.wmnet wikikube-worker1090.eqiad.wmnet wikikube-worker1091.eqiad.wmnet wikikube-worker1092.eqiad.wmnet on all recursors
[18:09:49] (CR) Scott French: [C:+2] mw-(api-ext|api-int|jobrunner|parsoid|web): migration php.version to 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1101121 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[18:09:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1088.eqiad.wmnet wikikube-worker1089.eqiad.wmnet wikikube-worker1090.eqiad.wmnet wikikube-worker1091.eqiad.wmnet wikikube-worker1092.eqiad.wmnet on all recursors
[18:10:17] (CR) Scott French: [C:+2] hieradata: switch all "migration" releases to 8.1 [puppet] - https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[18:11:08] (PS3) Ssingh: P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot [puppet] - https://gerrit.wikimedia.org/r/1109120
[18:11:17] (CR) Ssingh: P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1109120 (owner: Ssingh)
[18:11:25] (Merged) jenkins-bot: mw-(api-ext|api-int|jobrunner|parsoid|web): migration php.version to 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1101121 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[18:11:42] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1088.eqiad.wmnet with OS bookworm
[18:11:45] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1088
[18:11:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1088
[18:12:21] !log 
kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1089.eqiad.wmnet with OS bookworm
[18:12:24] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1089
[18:12:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1089
[18:12:50] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1090.eqiad.wmnet with OS bookworm
[18:12:53] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1090
[18:12:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1090
[18:12:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 747810408 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:13:20] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1091.eqiad.wmnet with OS bookworm
[18:13:24] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1091
[18:13:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1091
[18:13:35] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1092.eqiad.wmnet with OS bookworm
[18:13:38] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1092
[18:13:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1092
[18:13:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 88984 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:14:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10442178 (phaultfinder)
[18:16:20] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[18:17:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[18:17:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1011.eqiad.wmnet with OS bookworm
[18:17:14] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10442183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with OS bookworm complete...
[18:18:00] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10442184 (Jclark-ctr)
[18:18:07] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10442195 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[18:19:28] !log swfrench@deploy2002 Started scap sync-world: Deployment to switch migration release files to 8.1 - T377040
[18:19:31] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040
[18:24:03] FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:27:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1088.eqiad.wmnet with reason: host reimage
[18:28:15] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker1089.eqiad.wmnet with reason: host reimage [18:29:10] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1090.eqiad.wmnet with reason: host reimage [18:29:16] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1091.eqiad.wmnet with reason: host reimage [18:29:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1092.eqiad.wmnet with reason: host reimage [18:29:51] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10442261 (10phaultfinder) [18:31:52] (03PS4) 10BCornwall: varnish: Hide X-Client-IP on error page by default [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) [18:32:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1088.eqiad.wmnet with reason: host reimage [18:33:13] (03PS1) 10Majavah: hieradata: Upgrade striker-toolsbeta to 2025-01-08-183102-production [puppet] - 10https://gerrit.wikimedia.org/r/1109124 [18:33:26] !log swfrench@deploy2002 Finished scap sync-world: Deployment to switch migration release files to 8.1 - T377040 (duration: 13m 57s) [18:33:29] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:34:21] (03CR) 10Majavah: [C:03+2] hieradata: Upgrade striker-toolsbeta to 2025-01-08-183102-production [puppet] - 10https://gerrit.wikimedia.org/r/1109124 (owner: 10Majavah) [18:36:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1092.eqiad.wmnet with reason: host reimage [18:38:21] (03CR) 10BCornwall: "Thanks for pointing that out. Updated!" [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall) [18:38:56] (03CR) 10Ssingh: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall) [18:40:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1090.eqiad.wmnet with reason: host reimage [18:42:24] (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Hide X-Client-IP on error page by default [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall) [18:43:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1091.eqiad.wmnet with reason: host reimage [18:44:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10442330 (10phaultfinder) [18:48:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1089.eqiad.wmnet with reason: host reimage [18:48:33] 06SRE, 06Traffic, 13Patch-For-Review: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062#10442334 (10BCornwall) 05Open→03Resolved a:03BCornwall Thanks for reporting this! This has been implemented and pushed. Some shots for posterity: {F58147807} {F58147... 
[18:49:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:45] PROBLEM - SSH on bast7001 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:50:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:51:45] RECOVERY - SSH on bast7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:52:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1088.eqiad.wmnet with OS bookworm [18:55:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1092.eqiad.wmnet with OS bookworm [18:55:29] (03PS2) 10FNegri: Revert "Block PAWS workers nodes from all UDP traffic other than DNS & NTP" [puppet] - 10https://gerrit.wikimedia.org/r/1105036 (https://phabricator.wikimedia.org/T383261) [18:59:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1090.eqiad.wmnet with OS bookworm [19:00:05] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1900). 
[19:03:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1091.eqiad.wmnet with OS bookworm [19:08:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1089.eqiad.wmnet with OS bookworm [19:11:02] (03PS1) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/1109129 (https://phabricator.wikimedia.org/T383263) [19:14:18] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109130 (https://phabricator.wikimedia.org/T382362) [19:14:21] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109130 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot) [19:15:07] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109130 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot) [19:17:16] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1001.eqiad.wmnet [19:17:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1001.eqiad.wmnet [19:21:50] (03PS1) 10Amire80: Remove Tech News feed URL from Planet [puppet] - 10https://gerrit.wikimedia.org/r/1109131 [19:23:56] jouncebot: nowandnext [19:23:56] For the next 1 hour(s) and 36 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T1900) [19:23:56] In 1 hour(s) and 36 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T2100) [19:26:53] (03CR) 10Kamila Součková: [C:03+2] wikikube: decommission 1 host [puppet] - 10https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842) (owner: 10Jasmine) [19:27:38] (03PS1) 10CDanis: group1: enable OpenTelemetry 
exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) [19:28:20] (03CR) 10CI reject: [V:04-1] group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [19:29:02] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.11 refs T382362 [19:29:06] T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362 [19:29:47] !log sfaci@deploy2002 Started deploy [airflow-dags/analytics@b2b5707]: (no justification provided) [19:32:50] !log sfaci@deploy2002 Finished deploy [airflow-dags/analytics@b2b5707]: (no justification provided) (duration: 03m 06s) [19:35:47] (03PS1) 10Ottomata: refine_eventlogging_analytics - ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/1109135 (https://phabricator.wikimedia.org/T323828) [19:37:40] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4767/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109135 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata) [19:38:22] (03PS2) 10CDanis: group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) [19:38:37] (03CR) 10Ottomata: [V:03+1 C:03+2] refine_eventlogging_analytics - ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/1109135 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata) [19:41:08] (03PS3) 10Kamila Součková: wikikube: decommission 1 host [puppet] - 10https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842) (owner: 10Jasmine) [19:42:24] (03CR) 10Kamila Součková: [C:03+2] wikikube: decommission 1 host [puppet] - 10https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842) (owner: 10Jasmine) 
[19:43:02] (03CR) 10CDanis: [C:04-2] group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [19:46:35] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker1001.eqiad.wmnet [19:49:29] (03PS3) 10CDanis: group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) [19:49:41] (03CR) 10CDanis: group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [19:49:42] (03PS1) 10Ottomata: Remove absented refine_eventlogging_analytics job [puppet] - 10https://gerrit.wikimedia.org/r/1109137 (https://phabricator.wikimedia.org/T323828) [19:50:54] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4768/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109137 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata) [19:51:57] (03CR) 10Ottomata: [V:03+1 C:03+2] Remove absented refine_eventlogging_analytics job [puppet] - 10https://gerrit.wikimedia.org/r/1109137 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata) [19:52:40] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:53:16] (03PS1) 10Andrew Bogott: lookup_table_output.json: add entries for nova-api-metadata and placement-api [puppet] - 10https://gerrit.wikimedia.org/r/1109138 (https://phabricator.wikimedia.org/T383203) [19:53:46] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109139 (https://phabricator.wikimedia.org/T382362) [19:53:48] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109139 (https://phabricator.wikimedia.org/T382362) (owner: 
10TrainBranchBot) [19:53:53] !log homer 'cr*eqiad*' commit 'T365571' [19:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:56] T365571: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571 [19:54:31] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109139 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot) [19:54:43] !log rolling back wmf.11 to group0 due to `Table 'commonswiki.file' doesn't exist` errors [19:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:53] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [19:56:46] (03PS1) 10Ottomata: Remove profile::cache::kafka::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/1109140 (https://phabricator.wikimedia.org/T238230) [19:57:28] I caught someone's homer change and I'm scared of it :D so whoever wants to run homer, feel free to run it including my wikikube-worker changes [19:57:53] and/or let me know if I can commit `- as-path NTT-VERIZON "^2914 701$";` [19:57:56] kamila_: can you share the change out of curiosity? [19:58:05] sukhe: ^ [19:58:07] PROBLEM - Disk space on an-worker1089 is CRITICAL: DISK CRITICAL - free space: / 26 MB (0% inode=95%): /tmp 26 MB (0% inode=95%): /var/tmp 26 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops [19:58:15] topranks: is that you? 
[19:58:20] going to blame topranks here :P [19:58:40] it looks like the dark art of outbound traffic engineering [19:58:52] hence I'm scared of it :D [19:59:00] you and all of us except him :) [19:59:33] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4769/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109140 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:59:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [19:59:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker1001.eqiad.wmnet [20:00:26] (03PS2) 10Andrew Bogott: lookup_table_output.json: add entries for nova-api-metadata and placement-api [puppet] - 10https://gerrit.wikimedia.org/r/1109138 (https://phabricator.wikimedia.org/T383203) [20:01:19] PROBLEM - Disk space on an-worker1106 is CRITICAL: DISK CRITICAL - free space: / 356 MB (0% inode=95%): /tmp 356 MB (0% inode=95%): /var/tmp 356 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1106&var-datasource=eqiad+prometheus/ops [20:01:19] PROBLEM - Disk space on an-worker1115 is CRITICAL: DISK CRITICAL - free space: / 856 MB (1% inode=95%): /tmp 856 MB (1% inode=95%): /var/tmp 856 MB (1% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1115&var-datasource=eqiad+prometheus/ops [20:01:30] (03CR) 10Kamila Součková: [C:03+1] group1: enable OpenTelemetry exports [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [20:02:08] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109138 (https://phabricator.wikimedia.org/T383203) (owner: 10Andrew Bogott) [20:02:15] (03CR) 10BCornwall: [C:03+1] Remove profile::cache::kafka::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/1109140 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:02:28] (03Abandoned) 10Ottomata: Remove profile::cache::kafka::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/1050017 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:04:40] (03CR) 10Ottomata: [V:03+1 C:03+2] Remove profile::cache::kafka::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/1109140 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:04:55] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: / 307 MB (0% inode=95%): /tmp 307 MB (0% inode=95%): /var/tmp 307 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [20:05:01] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 351 MB (0% inode=95%): /tmp 351 MB (0% inode=95%): /var/tmp 351 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [20:05:30] (03CR) 10CDanis: [C:04-2] "blocked until wmf.11 is rolled forward to group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [20:05:58] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.11 refs T382362 [20:06:01] T382362: 1.44.0-wmf.11 deployment blockers - 
https://phabricator.wikimedia.org/T382362 [20:08:51] (03PS1) 10Ssingh: P:dns::auth::update: log authdns-update run to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1109141 [20:09:30] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4770/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109141 (owner: 10Ssingh) [20:11:09] PROBLEM - Disk space on an-worker1124 is CRITICAL: DISK CRITICAL - free space: / 148 MB (0% inode=95%): /tmp 148 MB (0% inode=95%): /var/tmp 148 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1124&var-datasource=eqiad+prometheus/ops [20:11:37] PROBLEM - Disk space on an-worker1169 is CRITICAL: DISK CRITICAL - free space: / 512 MB (0% inode=95%): /tmp 512 MB (0% inode=95%): /var/tmp 512 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1169&var-datasource=eqiad+prometheus/ops [20:11:41] PROBLEM - Disk space on an-worker1090 is CRITICAL: DISK CRITICAL - free space: / 229 MB (0% inode=95%): /tmp 229 MB (0% inode=95%): /var/tmp 229 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1090&var-datasource=eqiad+prometheus/ops [20:11:41] PROBLEM - Disk space on an-worker1154 is CRITICAL: DISK CRITICAL - free space: / 42 MB (0% inode=95%): /tmp 42 MB (0% inode=95%): /var/tmp 42 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1154&var-datasource=eqiad+prometheus/ops [20:11:41] PROBLEM - Disk space on analytics1075 is CRITICAL: DISK CRITICAL - free space: / 647 MB (1% inode=96%): /tmp 647 MB (1% inode=96%): /var/tmp 647 MB (1% inode=96%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1075&var-datasource=eqiad+prometheus/ops [20:11:42] PROBLEM - Disk space on an-worker1157 is CRITICAL: DISK CRITICAL - free space: / 360 MB (0% inode=95%): /tmp 360 MB (0% inode=95%): /var/tmp 360 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1157&var-datasource=eqiad+prometheus/ops [20:11:43] PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: / 386 MB (0% inode=95%): /tmp 386 MB (0% inode=95%): /var/tmp 386 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops [20:11:44] PROBLEM - Disk space on an-worker1172 is CRITICAL: DISK CRITICAL - free space: / 1454 MB (2% inode=95%): /tmp 1454 MB (2% inode=95%): /var/tmp 1454 MB (2% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1172&var-datasource=eqiad+prometheus/ops [20:11:49] PROBLEM - Disk space on an-worker1156 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=95%): /tmp 1 MB (0% inode=95%): /var/tmp 1 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1156&var-datasource=eqiad+prometheus/ops [20:13:41] (03PS1) 10Ottomata: Ensure eventlogging processor backend is absent on eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1109142 [20:14:01] (03CR) 10CI reject: [V:04-1] Ensure eventlogging processor backend is absent on eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (owner: 10Ottomata) [20:14:30] (03CR) 10Dzahn: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner [puppet] - 
10https://gerrit.wikimedia.org/r/1109129 (https://phabricator.wikimedia.org/T383263) (owner: 10AOkoth) [20:14:49] PROBLEM - Disk space on an-worker1147 is CRITICAL: DISK CRITICAL - free space: / 87 MB (0% inode=95%): /tmp 87 MB (0% inode=95%): /var/tmp 87 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1147&var-datasource=eqiad+prometheus/ops [20:15:09] PROBLEM - Disk space on an-worker1119 is CRITICAL: DISK CRITICAL - free space: / 36 MB (0% inode=95%): /tmp 36 MB (0% inode=95%): /var/tmp 36 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1119&var-datasource=eqiad+prometheus/ops [20:15:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:56] (03PS2) 10Ottomata: Ensure eventlogging processor backend is absent on eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) [20:17:17] (03CR) 10CI reject: [V:04-1] Ensure eventlogging processor backend is absent on eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:17:26] (03CR) 10Cwhite: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1109138 (https://phabricator.wikimedia.org/T383203) (owner: 10Andrew Bogott) [20:17:48] 06SRE: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271 (10DSeyfert_WMF) 03NEW [20:17:53] (03CR) 10Andrew Bogott: [C:03+2] lookup_table_output.json: add entries for nova-api-metadata and placement-api [puppet] - 10https://gerrit.wikimedia.org/r/1109138 (https://phabricator.wikimedia.org/T383203) (owner: 10Andrew Bogott) [20:18:15] PROBLEM - Disk space on an-worker1143 is CRITICAL: DISK CRITICAL - free space: / 318 MB (0% inode=95%): /tmp 318 MB (0% inode=95%): /var/tmp 318 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1143&var-datasource=eqiad+prometheus/ops [20:18:38] (03CR) 10Ottomata: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:21:36] (03PS2) 10Ssingh: P:dns::auth::update: log authdns-update run to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1109141 [20:22:22] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4771/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109141 (owner: 10Ssingh) [20:23:06] (03CR) 10Ssingh: "I5e21e5741fa5a4e261b24c901011192c70b45f41 should supersede this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [20:23:38] (03CR) 10Ottomata: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:26:43] (03PS3) 10Ottomata: Ensure eventlogging processor backend is absent on eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) [20:27:30] (03CR) 10Gmodena: mw-content-history-reconcile-enrich: Enable K8 HA (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [20:27:35] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4773/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:27:55] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - 10https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott) [20:28:08] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10442697 (10Dzahn) >>! In T222080#10439707, @Strainu wrote: > What are the next steps to redirect them to relevant wikis? - pointing them to WMF NS (done by legal) - adding them to WMF DN... 
[20:28:15] PROBLEM - Disk space on an-worker1116 is CRITICAL: DISK CRITICAL - free space: / 790 MB (1% inode=95%): /tmp 790 MB (1% inode=95%): /var/tmp 790 MB (1% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops [20:28:20] (03CR) 10Ottomata: [V:03+1 C:03+2] Ensure eventlogging processor backend is absent on eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:28:31] (03CR) 10Ottomata: [V:03+2 C:03+2] Ensure eventlogging processor backend is absent on eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/1109142 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:31:35] PROBLEM - Disk space on an-worker1117 is CRITICAL: DISK CRITICAL - free space: / 44 MB (0% inode=95%): /tmp 44 MB (0% inode=95%): /var/tmp 44 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops [20:32:51] (03CR) 10Dzahn: [C:03+1] ncredir: Add wikimedia.ro/wikipedia.ro [puppet] - 10https://gerrit.wikimedia.org/r/1109123 (https://phabricator.wikimedia.org/T222080) (owner: 10BCornwall) [20:34:21] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:35:58] FYI I have failing puppet on eventlog1003 and have meetings, back to it shortly [20:38:31] (03PS2) 10Dzahn: certificates: add wiki[m|p]edia.ro to ncredir Letsencrypt cert 1 [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) [20:39:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T382535#10442741 (10phaultfinder) [20:44:21] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:46:07] (03PS1) 10CDanis: filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) [20:48:26] dduvall: mszabo and I think that ^ should unblock the train [20:50:29] cdanis, mszabo: much thanks for that [20:53:36] (03CR) 10Dduvall: [C:03+2] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: 10CDanis) [20:54:05] 20:48:20 npm error code ECONNRESET [20:54:08] heh [20:54:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dduvall@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: 10CDanis) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:00:20] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:01:05] cdanis: yeah, that's an annoying failure. 
hoping that doesn't happen again during gate-and-submit [21:01:16] dduvall: yeah, it's been happening most of the day :/ [21:01:36] (03CR) 10CI reject: [V:04-1] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: 10CDanis) [21:01:41] https://phabricator.wikimedia.org/T383237 [21:02:10] and it has. wee! [21:02:26] (03CR) 10Gmodena: [C:03+1] "Changes seem reasonable to me, but I think we'll need to try them out in practice, and eventually iterate. Might be a good moment to start" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [21:07:20] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:08:48] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [21:09:27] (03PS1) 10Andrew Bogott: Pdns recursor: Make the extrarecursorhosts file world-readable [puppet] - 10https://gerrit.wikimedia.org/r/1109156 (https://phabricator.wikimedia.org/T374129) [21:09:34] (03PS1) 10Ottomata: Remove unused eventlogging code [puppet] - 10https://gerrit.wikimedia.org/r/1109157 [21:09:40] (03CR) 10Scott French: [C:03+1] group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [21:13:28] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109156 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott) [21:14:55] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20250108 [21:14:56] (03PS2) 
10Ottomata: Remove unused eventlogging code [puppet] - 10https://gerrit.wikimedia.org/r/1109157 [21:15:37] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4775/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109157 (owner: 10Ottomata) [21:16:00] (03CR) 10Andrew Bogott: [C:03+2] Pdns recursor: Make the extrarecursorhosts file world-readable [puppet] - 10https://gerrit.wikimedia.org/r/1109156 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott) [21:16:53] (03CR) 10CDanis: [C:03+2] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: 10CDanis) [21:16:55] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [21:17:15] (03CR) 10Ottomata: [V:03+1 C:03+2] "I'm not sure if role::insetup::data_engineering is quite right, but puppet is failing due to my last patch, and I just want to remove the " [puppet] - 10https://gerrit.wikimedia.org/r/1109157 (owner: 10Ottomata) [21:19:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10442792 (10phaultfinder) [21:20:05] dzahn@cumin2002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[21:28:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release 20250108 [21:31:43] (CR) CI reject: [V:-1] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:33:39] (CR) Dduvall: [C:+2] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:46:00] (CR) Jdlrobson: [C:+1] Remove `wgVectorStickyHeader` from InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728) (owner: Kimberly Sarabia) [21:46:15] (CR) CDanis: [V:+1 C:+2] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:46:19] (CR) Jdlrobson: [C:+1] "Safe to deploy given default value matches master." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728) (owner: Kimberly Sarabia) [21:46:41] (CR) CDanis: filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:46:44] (CR) CDanis: [C:+2] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:47:32] (CR) CI reject: [V:-1] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:49:04] (CR) CDanis: filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:49:07] (CR) CDanis: [C:+2] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [21:49:46] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10442868 (phaultfinder) [21:50:22] (PS1) Andrew Bogott: Revert "cloud-vps dns recursor: remove labs-ip-aliaser" [puppet] - https://gerrit.wikimedia.org/r/1109172 [21:52:33] (CR) CI reject: [V:-1] Revert "cloud-vps dns recursor: remove labs-ip-aliaser" [puppet] - https://gerrit.wikimedia.org/r/1109172 (owner: Andrew Bogott) [21:55:13] (PS2) Andrew Bogott: Revert "cloud-vps dns recursor: remove labs-ip-aliaser" [puppet] - https://gerrit.wikimedia.org/r/1109172 [21:55:57] ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10442873 (akosiaris) [21:57:00] 
ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10442874 (akosiaris) >>! In T375842#10441736, @Jclark-ctr wrote: > @akosiaris i see you checked off all the boxes for the dcops team are these ready to be remove... [21:58:17] (CR) Andrew Bogott: [C:+2] Revert "cloud-vps dns recursor: remove labs-ip-aliaser" [puppet] - https://gerrit.wikimedia.org/r/1109172 (owner: Andrew Bogott) [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T2200) [22:05:13] (CR) CI reject: [V:-1] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [22:11:55] (CR) Xcollazo: [C:+1] mw-content-history-reconcile-enrich: Enable K8 HA (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: TChin) [22:17:02] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/1109120 (owner: Ssingh) [22:18:54] (PS1) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1109176 (https://phabricator.wikimedia.org/T374129) [22:21:46] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109176 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott) [22:24:03] FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:34:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10442989 (phaultfinder) [22:49:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443011 (phaultfinder) [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T2300) [23:06:29] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:06:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:06:39] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:39] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:11:29] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:19:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443057 (phaultfinder) [23:24:18] (CR) Máté Szabó: [C:+2] filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [23:43:06] (Merged) jenkins-bot: filerepo: Fix schema compatibility constant usage [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109149 (https://phabricator.wikimedia.org/T383269) (owner: CDanis) [23:54:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443092 (phaultfinder)