[00:01:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612075 (10Jclark-ctr) [00:01:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1201.eqiad.wmnet with OS bullseye [00:02:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612076 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS... [00:09:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612088 (10phaultfinder) [00:15:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:08] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:25:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:26:08] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:38:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125260 [00:38:48] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125260 (owner: 10TrainBranchBot) [00:40:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye [00:40:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1203.eqiad.wmnet with OS bullseye [00:40:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1204.eqiad.wmnet with OS bullseye [00:40:04] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1205.eqiad.wmnet with OS bullseye [00:40:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612140 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1203.eqiad.wmnet with OS... [00:40:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612141 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS... [00:40:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1204.eqiad.wmnet with OS... [00:40:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1205.eqiad.wmnet with OS... 
[00:41:26] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1206.eqiad.wmnet with OS bullseye [00:41:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1207.eqiad.wmnet with OS bullseye [00:41:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1208.eqiad.wmnet with OS bullseye [00:41:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612145 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1206.eqiad.wmnet with OS... [00:41:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612146 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1207.eqiad.wmnet with OS... [00:41:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1208.eqiad.wmnet with OS... [00:46:13] (03Abandoned) 10Jdlrobson: Remove init event from Search AB test and also remove ABTestEnrollment.js. 
[extensions/WikimediaEvents] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120685 (https://phabricator.wikimedia.org/T386734) (owner: 10Jdlrobson) [00:50:30] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125260 (owner: 10TrainBranchBot) [00:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:00:44] (03CR) 10Andrew Bogott: [C:03+2] Horizon/idp: access keystone on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [01:01:13] 06SRE, 10Cassandra: Eliminate use of secondary IP interfaces & DNS for Cassandra instances - https://phabricator.wikimedia.org/T388169#10612177 (10Eevans) [01:06:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1201.eqiad.wmnet with OS bullseye [01:06:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS bull... 
[01:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125261 [01:08:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125261 (owner: 10TrainBranchBot) [01:10:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612180 (10Jclark-ctr) a:05Jclark-ctr→03BTullis @BTullis little typo in preseed file |an-worker12[0-8] should be 120[0-8] [01:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:12:55] 06SRE, 10MediaWiki-extensions-OAuth, 06The-Wikipedia-Library, 07Datacenter-Switchover, 07User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10612183 (10jsn.sherman) I started a pr to remo... [01:13:16] 06SRE, 10MediaWiki-extensions-OAuth, 06The-Wikipedia-Library, 07Datacenter-Switchover, 07User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10612184 (10jsn.sherman) p:05High→03Low [01:14:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10612190 (10Papaul) @Jclark-ctr @VRiley-WMF the 2 switches are received in coupa but are missing in netbox. if there are not ready to be racked yet, can... [01:20:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10612192 (10Jclark-ctr) @VRiley-WMF I have not seen these in the data center yet but you updated ticket Jan 10 2025 almost 2 months ago? 
Receiving ti... [01:30:06] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125261 (owner: 10TrainBranchBot) [01:39:49] (03PS1) 10Andrew Bogott: cloudidp-dev.wikimedia.org: allow keystone.openstack endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125264 (https://phabricator.wikimedia.org/T388137) [01:40:31] (03PS2) 10Andrew Bogott: cloudidp-dev.wikimedia.org: allow keystone.openstack endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125264 (https://phabricator.wikimedia.org/T388137) [01:40:57] (03CR) 10Andrew Bogott: [C:03+2] cloudidp-dev.wikimedia.org: allow keystone.openstack endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125264 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [01:41:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:46:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/5e191a880c7d45debe14f7c536fcf3c9edf6c2de17bd71a45484929c12603607/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:50:08] PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL 
- Packet loss = 100% [01:51:34] RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [02:01:09] (03PS1) 10Scott French: mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) [02:03:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:06:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:10:58] RECOVERY - Host an-presto1014 is UP: PING WARNING - Packet loss = 66%, RTA = 85.75 ms [02:11:46] PROBLEM - SSH on an-presto1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:16:31] (03PS1) 10Andrew Bogott: keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) [02:16:58] (03CR) 10CI reject: [V:04-1] keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [02:17:22] PROBLEM - Host an-presto1014 is DOWN: PING CRITICAL - Packet loss = 100% [02:18:18] (03PS2) 10Andrew Bogott: keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) [02:19:03] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10612248 (10wiki_willy) a:05Marostegui→03VRiley-WMF Reassigning to Valerie to create a new Dell Support task >>! In T387673#10604114, @wiki_willy wrote: > @VRiley-WMF or @Jclark-ctr - can o... 
[02:21:21] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [02:31:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:33:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:15:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10612292 (10VRiley-WMF) Opened a new ticket with Dell. 206617456 currently speaking with them about this issue. 
[03:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:14:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612322 (10phaultfinder) [04:15:30] PROBLEM - OpenSearch unassigned shard check - 9200 on relforge1004 is CRITICAL: CRITICAL - itwiki_general[0](2025-03-03T22:16:56.527Z), .kibana_3[0](2025-03-03T22:15:56.521Z), frwiki_general[0](2025-03-03T22:43:07.514Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:20:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:50:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:50:22] PROBLEM - ElasticSearch unassigned shard check - 9200 on 
relforge1003 is CRITICAL: CRITICAL - itwiki_general[0](2025-03-03T22:16:56.527Z), .kibana_3[0](2025-03-03T22:15:56.521Z), frwiki_general[0](2025-03-03T22:43:07.514Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [05:55:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [06:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:41:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T0700) [07:26:03] (03PS1) 10Slyngshede: Remove steward from IDM account managers [puppet] - 10https://gerrit.wikimedia.org/r/1125296 [07:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:29:16] (03PS2) 10Slyngshede: Remove steward [puppet] - 
10https://gerrit.wikimedia.org/r/1125296 [07:29:52] (03CR) 10CI reject: [V:04-1] Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede) [07:31:47] !log Upgrading Jenkins on contint1002 [07:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:53] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [07:33:09] (03PS3) 10Slyngshede: Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296 [07:33:25] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10612402 (10MoritzMuehlenhoff) @Dwisehaupt Please clarify: Is simple Icinga web access needed? If so, we only need the "wmf" L... [07:35:14] !log hashar@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): Upgrade to Jenkins LTS 2.492.2 [07:36:21] !log hashar@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): Upgrade to Jenkins LTS 2.492.2 (duration: 01m 23s) [07:36:50] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for jhuneidi - https://phabricator.wikimedia.org/T388044#10612407 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @jeena I've added you to the logstash-access group. If you run into any issues accessing Logstash, please reope... 
[07:39:12] (03CR) 10Muehlenhoff: Remove steward (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede) [07:43:04] (03PS4) 10Slyngshede: Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296 [07:43:19] (03PS2) 10Hashar: Fix wgCirrusSearchSimilarityProfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 [07:43:28] (03CR) 10Hashar: Fix wgCirrusSearchSimilarityProfiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [07:43:53] (03CR) 10Slyngshede: Remove steward (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede) [07:48:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede) [07:50:16] (03PS1) 10Hashar: Remove obsolete $wgAllowMicrodataAttributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 [07:51:45] !log installing emacs security updates [07:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:11] (03CR) 10Slyngshede: [C:03+2] Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede) [07:57:42] (03PS1) 10Hashar: Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T0800) [08:02:19] (03CR) 10Hashar: "Some something about database :) I can self deploy but I could use a double check I did not make something wrong!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar) [08:04:55] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for the addition!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 (owner: 10Volans) [08:07:22] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging JJMC89 out of all services on: 2 hosts [08:12:58] !log installing Linux 5.10.234 on Bullseye hosts (just the rollout of the new kernels, no immediate reboots involved) [08:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:40] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1125209 (https://phabricator.wikimedia.org/T387837) (owner: 10JMeybohm) [08:14:50] (03PS2) 10Volans: sre.gitlab: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 [08:15:01] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [08:15:02] PROBLEM - Ensure traffic_server is running for instance backend on cp6010 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:15:07] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [08:15:13] (03CR) 10Volans: [C:03+2] sre.gitlab: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 (owner: 10Volans) [08:15:46] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [08:15:55] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [08:16:02] RECOVERY - Ensure traffic_server is running for instance backend on cp6010 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:17:26] (03CR) 10JMeybohm: [C:03+2] "The filename is generated by gbp, so I would assume a length limit" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1125209 (https://phabricator.wikimedia.org/T387837) (owner: 10JMeybohm) [08:19:09] (03CR) 10Vgutierrez: [C:03+1] 
acme_chief: add parameter for destination path [puppet] - 10https://gerrit.wikimedia.org/r/1124855 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur) [08:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612422 (10phaultfinder) [08:19:45] hmmmm [08:19:49] * vgutierrez checking cp6010 [08:21:33] (03Merged) 10jenkins-bot: sre.gitlab: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 (owner: 10Volans) [08:21:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1035.eqiad.wmnet with OS bookworm [08:21:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10612427 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1035.eqiad.wmnet with OS bookworm [08:28:04] (03CR) 10JMeybohm: [C:03+1] services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [08:30:53] (03PS3) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) [08:30:59] (03Merged) 10jenkins-bot: Don't warn if this and the needed release set installed: false [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1125209 (https://phabricator.wikimedia.org/T387837) (owner: 10JMeybohm) [08:32:39] (03CR) 10Vgutierrez: [C:04-1] haproxy/icinga: Remove RSA from auth algorithms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [08:34:00] (03PS3) 10Elukey: services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) [08:36:11] (03CR) 10Jelto: 
services: refactor helmfiles for helmfile 0.171.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [08:38:57] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10612441 (10Gehel) [08:39:03] (03CR) 10Elukey: [C:03+2] services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [08:39:45] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [08:39:55] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [08:40:25] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [08:42:37] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [08:43:19] (03CR) 10JMeybohm: [C:03+1] services: refactor helmfiles for helmfile 0.171.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [08:43:33] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[08:43:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1035.eqiad.wmnet with reason: host reimage [08:46:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1035.eqiad.wmnet with reason: host reimage [08:46:44] (03CR) 10Jelto: [C:03+2] services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [08:48:08] !log imported helmfile 0.171.0-5 to bullseye-wikimedia and bookworm-wikimedia - T387837 [08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:11] T387837: Fix installed key in dependend helmfile releases - https://phabricator.wikimedia.org/T387837 [08:48:29] !log updated helmfile to 0.171.0-5 on deploy* - T387837 [08:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:31] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [08:52:08] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [08:54:08] (03PS29) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [08:55:25] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [08:55:49] (03CR) 10Federico Ceratto: "This is tested end-to-end and ready for final review. 
It has been used for https://phabricator.wikimedia.org/T385141" [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [08:55:51] (03PS3) 10Volans: cli: log an eventual exception to stderr [software/cumin] - 10https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) (owner: 10TheAnarcat) [08:55:51] (03PS2) 10Volans: query: do not error on no match in first subquery [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158 [08:56:30] (03Merged) 10jenkins-bot: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [08:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:02:35] (03PS1) 10JMeybohm: helm: Install helm 3.11 and 3.17 in parallel [puppet] - 10https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984) [09:02:47] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [09:03:56] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:05:09] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [09:05:22] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5034/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:07:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1035.eqiad.wmnet with OS bookworm [09:07:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - 
https://phabricator.wikimedia.org/T382507#10612514 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1035.eqiad.wmnet with OS bookworm completed: - ganeti103... [09:07:37] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [09:08:59] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [09:09:45] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [09:12:33] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [09:14:58] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:17:38] (03PS30) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [09:18:16] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [09:20:09] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [09:20:25] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [09:21:00] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [09:24:00] (03CR) 10Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [09:24:08] (03CR) 10Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [09:24:13] (03CR) 10Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar) [09:24:17] (03CR) 10Hashar: "I will deploy that next week as part of a batch of other clean up changes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [09:25:03] (03CR) 10Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 (owner: 10Hashar) [09:25:09] (03CR) 10Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar) [09:25:36] https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config#Results :) [09:27:01] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [09:37:46] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=wikikube-worker1.*,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [09:38:10] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=wikikube-worker2.*,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [09:46:34] (03CR) 10Ladsgroup: [C:03+1] Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar) [09:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:54:02] (03CR) 10JMeybohm: [V:03+1 C:03+2] helm: Install helm 3.11 and 3.17 in parallel [puppet] - 10https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:58:19] (03PS1) 10JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1125384 
(https://phabricator.wikimedia.org/T341984) [09:58:54] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:58:58] (03PS6) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) [09:59:19] (03PS31) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [09:59:50] (03CR) 10Elukey: clone.py, clone_test.py: Automate cloning (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [10:06:10] (03PS5) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [10:07:58] (03PS1) 10Elukey: conftool-data: add more wikikube-workers to maps [puppet] - 10https://gerrit.wikimedia.org/r/1125387 (https://phabricator.wikimedia.org/T386926) [10:07:59] (03PS1) 10Elukey: role::maps::{master,replica}: Fix lvs pool config [puppet] - 10https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) [10:09:25] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:06] (03CR) 10Elukey: [C:03+1] docs: removed deprecated call to sphinx_rtd_theme [software/cumin] - 10https://gerrit.wikimedia.org/r/1125157 (owner: 10Volans) [10:11:30] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: add logging and confirmation when forcing puppet 5 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1117231 (owner: 10Elukey) [10:11:50] 
PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 668 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 924, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 664, delayed_unassigned_shards: 0, number_of_pending_ta [10:11:50] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 58.040201005025125 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:12:06] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 659 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 933, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 655, delayed_unassigned_shards: 0, number_of_pending_ta [10:12:06] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 58.60552763819096 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:12:12] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 654 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 938, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 650, delayed_unassigned_shards: 0, number_of_pending_ta [10:12:12] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 58.91959798994974 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:12:12] PROBLEM - ElasticSearch health check for shards on 9600 on 
cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 652 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 940, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 648, delayed_unassigned_shards: 0, number_of_pending_ta [10:12:12] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 59.04522613065326 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:12:14] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 649 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 943, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 646, delayed_unassigned_shards: [10:12:14] r_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 59.233668341708544 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:13:00] (03CR) 10Elukey: [C:03+1] cli: log an eventual exception to stderr [software/cumin] - 10https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) (owner: 10TheAnarcat) [10:13:01] !log updated pwstore key for btullis [10:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:21] (03Abandoned) 10Elukey: admin_ng: enable monitoring for knative [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123656 (owner: 10Elukey) [10:13:31] dcausse: ^ Looks like dcausse is already on the cloudelastic issue [10:13:58] (03CR) 10Elukey: [C:03+1] Update Docker images of change-prop services to ones using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124191 
(https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:14:38] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:14:50] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1331, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_f [10:14:50] tch: 0, task_max_waiting_in_queue_millis: 963, active_shards_percent_as_number: 83.60552763819096 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:15:06] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1353, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 233, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_f [10:15:06] tch: 0, task_max_waiting_in_queue_millis: 5, active_shards_percent_as_number: 84.98743718592965 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:15:12] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1356, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 231, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, 
number_of_in_f [10:15:12] tch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.17587939698493 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:15:12] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 231, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_f [10:15:12] tch: 0, task_max_waiting_in_queue_millis: 179, active_shards_percent_as_number: 85.23869346733667 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:15:14] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1359, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 229, delayed_unassigned_shards: 0, number_of_pending_ta [10:15:14] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.3643216080402 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:17:36] (03PS1) 10Muehlenhoff: Remove access for halfak [puppet] - 10https://gerrit.wikimedia.org/r/1125390 (https://phabricator.wikimedia.org/T388037) [10:18:55] (03CR) 10Muehlenhoff: [C:03+2] Remove access for halfak [puppet] - 10https://gerrit.wikimedia.org/r/1125390 (https://phabricator.wikimedia.org/T388037) (owner: 10Muehlenhoff) [10:21:48] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Halfak out of all services on: 1284 hosts [10:22:46] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout 
(exit_code=0) Logging Halfak out of all services on: 951 hosts [10:24:18] (03PS6) 10Jelto: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (owner: 10JMeybohm) [10:24:18] (03CR) 10Jelto: "@jmeybohm I've done a quick test on `kubestage2001`. I removed the `kubernetes-node` package from the node and ran puppet. Puppet instal" [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (owner: 10JMeybohm) [10:24:25] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:38] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:28:46] (03PS1) 10Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) [10:29:21] (03PS2) 10Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) [10:29:44] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [10:30:00] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [10:30:32] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:34:58] (03CR) 10Vgutierrez: [C:03+1] "kartotherian seems to be a valid systemd service on those roles:" [puppet] - 
10https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [10:35:02] (03CR) 10Muehlenhoff: [C:03+1] "Good catch" [puppet] - 10https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [10:35:05] (03PS2) 10JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) [10:35:42] (03CR) 10Elukey: "All credits to Valentin!" [puppet] - 10https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [10:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:37:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [10:37:55] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5035/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:39:10] (03PS3) 10Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) [10:39:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [10:42:53] (03PS4) 10Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) [10:44:48] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125393 
(https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [10:46:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet [10:47:22] (03PS3) 10JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) [10:48:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1035.eqiad.wmnet to cluster eqiad and group A [10:50:03] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5036/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:50:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1035.eqiad.wmnet to cluster eqiad and group A [10:50:27] (03PS1) 10Kevin Bazira: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970) [10:59:08] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:00:39] 06SRE, 06Infrastructure-Foundations, 06SRE Observability: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#10612886 (10MoritzMuehlenhoff) Status update: Access to Logstash has been split out of cn=wmf, cn=nda is next. 
[11:05:30] (03CR) 10JMeybohm: mediawiki: introduce feature flags (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [11:07:17] (03PS4) 10JMeybohm: Add pod-security.wmg.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) [11:07:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10612889 (10MoritzMuehlenhoff) [11:12:07] (03CR) 10Gkyziridis: [C:03+1] "Thnx." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [11:12:57] (03CR) 10Lucas Werkmeister: Remove $wgAllowAuthenticatedCrossOrigin again (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123741 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [11:14:59] (03CR) 10Kevin Bazira: [C:03+2] "thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [11:16:09] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [11:16:27] (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [11:16:53] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10612909 (10elukey) >>! In T384003#10609005, @MatthewVernon wrote: > It's not the same kernel, though - you've got `5.14.0-503.11.1.el9_... 
[11:24:06] (03PS4) 10JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) [11:26:56] (03PS1) 10Ilias Sarantopoulos: ml-services: increase workers for reference-quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) [11:27:02] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5037/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:27:42] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2088.codfw.wmnet [11:28:07] (03CR) 10Kamila Součková: [C:03+1] conftool-data: add more wikikube-workers to maps [puppet] - 10https://gerrit.wikimedia.org/r/1125387 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [11:28:41] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:30:58] (03CR) 10JMeybohm: [V:03+1 C:03+2] deployment_server: Select the kubectl version based on the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:33:37] RECOVERY - Host an-presto1014 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:35:22] (03PS1) 10Lucas Werkmeister (WMDE): Clean up RDF feature flags again [extensions/Wikibase] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344) [11:35:47] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [11:36:35] (03CR) 10Lucas Werkmeister (WMDE): "Optional backport so we can deploy the other cleanup, Ib999da8c03, sooner." [extensions/Wikibase] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [11:37:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221 (10MatthewVernon) 03NEW [11:43:05] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450) (owner: 10JMeybohm) [11:45:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10612961 (10elukey) All right something different happened, but I am not sure if it was the kernel or not. I rebooted the host with the... [11:46:13] (03CR) 10Hnowlan: [C:03+1] "Makes sense from a helm perspective! I assume the various removed values in the benthos config are acceptable to remove. 
One query, feel f" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [11:46:29] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [11:49:08] (03PS1) 10MVernon: Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - 10https://gerrit.wikimedia.org/r/1125410 [11:51:14] (03CR) 10CI reject: [V:04-1] Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - 10https://gerrit.wikimedia.org/r/1125410 (owner: 10MVernon) [11:52:03] (03PS2) 10MVernon: Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - 10https://gerrit.wikimedia.org/r/1125410 (https://phabricator.wikimedia.org/T388221) [11:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612995 (10phaultfinder) [11:58:03] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2088.codfw.wmnet [11:59:13] (03CR) 10Hnowlan: [C:04-1] "Some minor fixes needed, but makes sense generally (benthos stuff aside!)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T0800) [12:00:05] jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T1200). [12:06:28] (03CR) 10Kamila Součková: [C:03+1] "LGTM except see inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:08:38] (03CR) 10Ladsgroup: "I'll hopefully deploy this next week. Unless anyone objects." 
[dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [12:17:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10613054 (10BTullis) Hello. It's not a big problem in this case, but it would have been helpful to know about this prior to pulling the disks from this server. I only noticed that there was a problem... [12:18:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10613058 (10BTullis) I also only see three failed drives in the original description (slots 2,7, and 10), but it looks like the drive in slot 3 was replaced as well. Did this fail at a later time? [12:30:01] (03PS1) 10Muehlenhoff: keepalived: Install keepalived from the "main" component [puppet] - 10https://gerrit.wikimedia.org/r/1125413 (https://phabricator.wikimedia.org/T383557) [12:41:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:46:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:46:54] (03CR) 10Kamila Součková: "Yes, they are defaults. (I'm not really sure why past me put them there :D)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [12:57:34] (03CR) 10Kevin Bazira: "most of it lgtm. has this been tested on staging or you're ready to test in prod?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [12:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:01:16] (03CR) 10Ilias Sarantopoulos: "It has been tested on ml-staging in experimental ns -> here are the results https://phabricator.wikimedia.org/T387019#10612894" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:04:09] (03CR) 10Kevin Bazira: [C:03+1] "ack! I've +1'd." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:04:37] (03PS1) 10Fabfur: sslcert: minor refactoring to use consistent key path [puppet] - 10https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929) [13:06:51] (03CR) 10Kamila Součková: [C:03+2] benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [13:08:13] (03Merged) 10jenkins-bot: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [13:11:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:16:15] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: increase workers for reference-quality models 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:17:36] (03Merged) 10jenkins-bot: ml-services: increase workers for reference-quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:19:54] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:21:29] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:21:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:22:19] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur) [13:22:24] (03CR) 10Andrew Bogott: [C:03+2] keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [13:23:07] (03PS1) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): increase PHP8.1 traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125418 (https://phabricator.wikimedia.org/T383845) [13:25:16] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[13:27:20] (03PS1) 10Federico Ceratto: Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) [13:27:30] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, CI diff looks a bit scary but the actual diff with helmfile looks reasonable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:29:40] (03CR) 10Federico Ceratto: "Initial version - later on we could set a minimum number of instances per section rather than 1." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [13:32:04] (03PS1) 10Ilias Sarantopoulos: admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019) [13:32:06] (03CR) 10Jcrespo: [C:03+1] Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - 10https://gerrit.wikimedia.org/r/1125410 (https://phabricator.wikimedia.org/T388221) (owner: 10MVernon) [13:33:36] (03PS1) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all traffic to PHP8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) [13:35:45] (03PS1) 10Stevemunene: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) [13:40:56] (03CR) 10Hnowlan: [C:03+1] sre: deploy thumbor alerts to prometheus k8s [alerts] - 10https://gerrit.wikimedia.org/r/1124788 (https://phabricator.wikimedia.org/T379559) (owner: 10Filippo Giunchedi) [13:41:34] (03PS2) 10Stevemunene: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) [13:42:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host 
an-worker1198.eqiad.wmnet with OS bullseye [13:42:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613177 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1198.eqiad.wmnet with OS... [13:43:21] (03CR) 10Filippo Giunchedi: [C:03+2] sre: deploy thumbor alerts to prometheus k8s [alerts] - 10https://gerrit.wikimedia.org/r/1124788 (https://phabricator.wikimedia.org/T379559) (owner: 10Filippo Giunchedi) [13:46:07] PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [13:46:55] RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [13:47:14] (03PS1) 10Andrew Bogott: cloudweb2002-dev: fix service_id for keystone [puppet] - 10https://gerrit.wikimedia.org/r/1125428 [13:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:47:59] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1125428 (owner: 10Andrew Bogott) [13:48:12] (03CR) 10Andrew Bogott: [C:03+2] cloudweb2002-dev: fix service_id for keystone [puppet] - 10https://gerrit.wikimedia.org/r/1125428 (owner: 10Andrew Bogott) [13:51:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:53:35] PROBLEM - Host an-worker1102 
is DOWN: PING CRITICAL - Packet loss = 100% [13:55:17] (03CR) 10Marostegui: "Keep in mind that each section has a min number of replicas implementation already, so this may be good already." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [13:55:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10613202 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr rebalanced BA on l2-l3 on b7 rebalanced AA on l2-l3 on A4 [13:56:27] RECOVERY - Host an-worker1102 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [13:58:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1198.eqiad.wmnet with reason: host reimage [13:58:51] (03CR) 10MVernon: [C:03+2] Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - 10https://gerrit.wikimedia.org/r/1125410 (https://phabricator.wikimedia.org/T388221) (owner: 10MVernon) [13:59:44] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125430 [14:00:55] 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10613224 (10cmooney) p:05High→03Medium Everything still seems stable. Juniper also provided this link to their KB article on it https://supportportal.juniper.n... 
[14:01:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:02:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1198.eqiad.wmnet with reason: host reimage [14:03:07] (03PS1) 10Marostegui: installserver: Do not reimage db1256 [puppet] - 10https://gerrit.wikimedia.org/r/1125432 [14:04:56] (03PS2) 10Kamila Součková: benthos-mw-accesslog-metrics: create deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 [14:05:19] (03CR) 10Kamila Součková: benthos-mw-accesslog-metrics: create deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková) [14:05:43] (03CR) 10Kamila Součková: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková) [14:07:00] (03CR) 10Elukey: [C:03+2] conftool-data: add more wikikube-workers to maps [puppet] - 10https://gerrit.wikimedia.org/r/1125387 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [14:07:49] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage db1256 [puppet] - 10https://gerrit.wikimedia.org/r/1125432 (owner: 10Marostegui) [14:08:20] (03PS1) 10Btullis: Fix the preseed matching for the new an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125433 (https://phabricator.wikimedia.org/T386390) [14:10:51] (03CR) 10Btullis: [C:03+2] Fix the preseed matching for the new an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125433 (https://phabricator.wikimedia.org/T386390) (owner: 10Btullis) [14:12:08] !log elukey@puppetserver1001 conftool action : set/weight=10:pooled=yes; selector: name=wikikube-worker2.*,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:12:25] !log elukey@puppetserver1001 conftool action : set/weight=10:pooled=yes; selector: name=wikikube-worker1.*,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl 
[14:12:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613276 (10BTullis) a:05BTullis→03Jclark-ctr >>! In T386390#10612180, @Jclark-ctr wrote: > @BTullis little typo in p... [14:23:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:29] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:25:42] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236 (10phaultfinder) 03NEW [14:26:42] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:28:37] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [14:28:59] gerrit is unresponsive for me [14:29:35] same here [14:29:51] see alert [14:30:18] there are some errors in the log, I'd be inclined to restart [14:30:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1068 - https://phabricator.wikimedia.org/T387732#10613324 (10BTullis) 05Open→03Resolved a:03BTullis I can't see any problem with the disks on this server, so I think we can just close the ticket. 14 physical disks, all online. ` Physical Driv... 
[14:30:36] down since 13:39 [14:30:38] !log restart gerrit, unresponsive, errors in the log [14:30:42] +1 [14:31:13] started [14:31:42] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:31:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:33:30] it died on both dcs [14:35:00] !log the previous restart of gerrit was on gerrit1003 [14:35:11] something is going on: https://grafana.wikimedia.org/goto/SpS8NutHR?orgId=1 [14:35:31] gerrit is seeing quite some traffic, I'm looking at logstash at the moment, potentially something to discuss in _security [14:35:58] yeah [14:36:42] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:03] o/ [14:38:03] I swear I haven't touched anything on Gerrit [14:38:15] hashar: we are discussing it in #security [14:40:11] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2088.codfw.wmnet [14:40:46] (03PS1) 10Papaul: Remove bfs for link between cr1-codfw and cr2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773) [14:40:59] (03PS1) 10Jelto: gerrit: throttle alibaba IPs [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) 
[14:41:35] (03PS2) 10Papaul: Remove bfd for link between cr1-codfw and cr2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773) [14:41:42] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:42] (03CR) 10Klausman: [C:03+1] admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:44:14] (03CR) 10Federico Ceratto: "This should be ready for review, ideally focusing on bugs that affect safety of the operation. Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [14:44:18] (03CR) 10Hashar: "I'd rather ban them entirely much like we did for two other cases that badly abused the service:" [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto) [14:44:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:44:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1198.eqiad.wmnet with OS bullseye [14:44:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613425 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1198.eqiad.wmnet with OS bull... [14:45:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613427 (10Jclark-ctr) [14:46:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1200.eqiad.wmnet with OS bullseye [14:46:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613429 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1200.eqiad.wmnet with OS... 
[14:47:11] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fbeebe62280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wiki [14:47:11] imedia.org/wiki/Search%23Administration [14:47:34] ^ expected [14:47:49] (03CR) 10Jelto: "this does not scale and overwhelms apache" [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto) [14:50:11] (03PS2) 10Jelto: gerrit: throttle alibaba IPs [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) [14:51:05] (03CR) 10Jcrespo: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto) [14:51:10] (03CR) 10Jelto: [C:03+2] gerrit: throttle alibaba IPs [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto) [14:52:02] (03CR) 10Cathal Mooney: "Overall LGTM. I think the 'rack' probably needs to be upper-case (see comment)." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [14:53:41] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773) (owner: 10Papaul) [14:54:25] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:12] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10613449 (10Jclark-ctr) [14:55:39] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:55:42] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10613450 (10Jclark-ctr) [14:56:21] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10613453 (10Jclark-ctr) [14:57:01] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10613456 (10Jclark-ctr) [14:57:47] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10613458 (10Jclark-ctr) [14:57:56] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10613459 (10Jclark-ctr) [14:58:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10613473 (10Jclark-ctr) [14:59:05] (03CR) 10Klausman: [C:03+2] admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:59:06] 10ops-eqiad, 06SRE, 06DC-Ops, 
06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10613477 (10cmooney) Thanks guys. Please ping me when these are in Netbox and I will add the links, IPs, vlans etc. and begin the process of commissioni... [14:59:17] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f7a5b38e280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wiki [14:59:17] imedia.org/wiki/Search%23Administration [14:59:19] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fcf0209f280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wiki [14:59:19] imedia.org/wiki/Search%23Administration [14:59:36] cloudelastic1009 issues can be ignored [14:59:39] (03PS1) 10JMeybohm: Rename TILLER_NAMESPACE to K8S_NAMESPACE [puppet] - 10https://gerrit.wikimedia.org/r/1125453 [15:00:27] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10613482 (10Jclark-ctr) [15:00:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613483 (10phaultfinder) [15:01:06] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10613485 (10Jclark-ctr) [15:01:25] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
an-worker1200.eqiad.wmnet with reason: host reimage [15:03:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10613501 (10elukey) @Jhancock.wm Hi! I apologize in advance for keep requesting the same thing, but could you do another pull/push of a... [15:03:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [15:04:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1201.eqiad.wmnet with OS bullseye [15:04:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye [15:04:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS... [15:04:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS... 
[15:04:14] (03Merged) 10jenkins-bot: admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:04:25] FIRING: [2x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1200.eqiad.wmnet with reason: host reimage [15:05:10] (03CR) 10Federico Ceratto: "Hello Jaime, I am not familiar with the internals of the backup process. Please let me know how I can help you with the CR and what type o" [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [15:05:21] (03PS5) 10JMeybohm: Add pod-security.wmf.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) [15:05:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1204.eqiad.wmnet with OS bullseye [15:05:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1203.eqiad.wmnet with OS bullseye [15:05:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1204.eqiad.wmnet with OS... 
[15:05:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613522 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1203.eqiad.wmnet with OS... [15:06:37] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:39] (03CR) 10JMeybohm: Add pod-security.wmf.org labels to wikikube mediawiki namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:56] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:06:59] (03PS2) 10Scott French: mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) [15:08:09] (03CR) 10JMeybohm: [C:03+2] staging-codfw: Unset image.tag for coredns to apply the default version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450) (owner: 10JMeybohm) [15:08:23] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[15:09:25] FIRING: [4x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:34] (03CR) 10JMeybohm: "Oh yeah, that's wild indeed. I'll make sure all diffs are clean after merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:11:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10613542 (10VRiley-WMF) Looks like we'll be replacing the motherboard soon. Will update when we know a time it should be arriving [15:12:46] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:13:19] (03Merged) 10jenkins-bot: staging-codfw: Unset image.tag for coredns to apply the default version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450) (owner: 10JMeybohm) [15:13:31] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[15:14:15] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:15:29] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:18:57] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1201.eqiad.wmnet with reason: host reimage [15:20:28] (03CR) 10Jcrespo: "I would like you to be aware of the changes and give me the ok to proceed with the m1 database changes at least, as technically you (datab" [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [15:20:30] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1204.eqiad.wmnet with reason: host reimage [15:20:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1203.eqiad.wmnet with reason: host reimage [15:21:22] FYI, I'm going to be running a helmfile-only scap deployment in a few minutes to tune some settings that should make deployment timeouts like we saw this week less frequent [15:22:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1201.eqiad.wmnet with reason: host reimage [15:23:06] (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:23:08] (03CR) 10Scott French: [C:03+2] mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:23:24] (03CR) 10Federico Ceratto: "I replied to the questions and left a question around netbox_server. Is anybody seeing any showstopping bug or safety issue? 
If not I thin" [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [15:24:44] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:24:46] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:24:47] (03Merged) 10jenkins-bot: mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:25:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613593 (10phaultfinder) [15:25:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1203.eqiad.wmnet with reason: host reimage [15:26:19] (03CR) 10Xcollazo: [C:03+1] "Still a +1 from my side. Thanks Amir. CC @btullis@wikimedia.org." [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [15:26:31] PROBLEM - Host an-presto1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:27:34] starting said deployment momentarily [15:27:36] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:27:41] RECOVERY - SSH on an-presto1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:27:43] RECOVERY - Host an-presto1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:27:45] swfrench-wmf: that is good to know! 
The Tuesday automatic deploy had helm time out at 10 minutes and I kind of forgot to ask for it to be raised [15:27:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:27:58] :) [15:27:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1200.eqiad.wmnet with OS bullseye [15:28:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1200.eqiad.wmnet with OS bull... [15:28:33] hashar: this is why I want to get this in today, since the first deploy on Monday may also generally be slow due to the weekly production image rebuild [15:28:52] (03CR) 10Federico Ceratto: "(Note: gerrit is flagging the CR as "XL" sized but it's due to the addition of a JSON file used for testing)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [15:29:08] weekly rebuild? 
[15:29:46] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:29:54] the production images (i.e., the php base images used by the multiversion image) are rebuilt every weekend [15:30:30] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:30:37] * hashar jaw drops [15:31:03] I have on my todo list to investigate why the first Monday backport is slow [15:31:17] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1205.eqiad.wmnet with OS bullseye [15:31:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1205.eqiad.wmnet with OS... [15:31:33] hashar: that'll do it :) [15:31:33] and I guess if scap image building does a pull of the parent image, that invalidates the image, causes a full image to be built and an 8.5G image to deploy [15:31:42] which solves a mystery I had to investigate [15:31:44] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10613635 (10cmooney) [15:31:46] !log swfrench@deploy2002 Started scap sync-world: helmfile-only deploy to reduce likelihood of deployment timeouts - T383845 [15:31:49] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [15:32:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1204.eqiad.wmnet with reason: host reimage [15:32:33] (03PS1) 10Hnowlan: trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1125461 (https://phabricator.wikimedia.org/T361576) [15:32:47] RECOVERY - Dell PowerEdge RAID Controller on an-presto1014 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:32:50] 
(03Abandoned) 10Hnowlan: trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1113182 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [15:33:41] !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deploy to reduce likelihood of deployment timeouts - T383845 (duration: 04m 33s) [15:34:00] (03CR) 10Btullis: [C:03+1] "Nice. Thanks for doing this." [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [15:35:08] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change dns names for eqiad rack e8 endpoints - cmooney@cumin1002" [15:35:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change dns names for eqiad rack e8 endpoints - cmooney@cumin1002" [15:35:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:25] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10613653 (10cmooney) I've tidied up netbox for these now. I left the ports enabled on the ssw side with the IPs present, as we can't disable them there and keep the IPs attached.... [15:35:38] alright, that should hopefully do the trick. I'm (ideally) done touching prod on Friday :) [15:36:08] when it is to prevent an outage on monday morning, it is fine :) [15:36:12] or worse, over the week-end! [15:36:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10613663 (10cmooney) Lastly please call these new switches //lsw1-e8-eqiad// and //lsw1-f8-eqiad// in Netbox. We'll need to either have deleted the Dell... 
[15:37:06] hashar: exactly, yeah [15:40:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10613682 (10BTullis) Hi @VRiley-WMF - I'm not sure if you saw my comment on the HDD upgrade ticket here: T385485#10609902 Basically, I was wonde... [15:43:59] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1206.eqiad.wmnet with OS bullseye [15:44:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1208.eqiad.wmnet with OS bullseye [15:44:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1207.eqiad.wmnet with OS bullseye [15:44:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613697 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1206.eqiad.wmnet with OS... [15:44:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613698 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1208.eqiad.wmnet with OS... [15:44:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1207.eqiad.wmnet with OS... 
[15:44:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613700 (10phaultfinder) [15:45:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10613705 (10VRiley-WMF) Will be adding these into netbox shortly [15:46:23] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:46:47] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1205.eqiad.wmnet with reason: host reimage [15:46:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:46:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1201.eqiad.wmnet with OS bullseye [15:47:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS bull... 
[15:48:30] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [15:49:36] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "updating for renamed dell switches in eqiad - cmooney@cumin1002" [15:49:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "updating for renamed dell switches in eqiad - cmooney@cumin1002" [15:50:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1205.eqiad.wmnet with reason: host reimage [15:50:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:52:42] (03PS1) 10JMeybohm: admin_ng: Create cert-manager leases in cert-manager namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125462 (https://phabricator.wikimedia.org/T383553) [15:53:24] (03PS1) 10Ilias Sarantopoulos: ml-services: revert asyncio thread usage in reference quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125463 [15:53:38] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: revert asyncio thread usage in reference quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125463 (owner: 10Ilias Sarantopoulos) [15:55:02] (03Merged) 10jenkins-bot: ml-services: revert asyncio thread usage in reference quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125463 (owner: 10Ilias Sarantopoulos) [15:55:51] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:58:05] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[15:58:27] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:58:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1206.eqiad.wmnet with reason: host reimage [15:58:39] hashar: The first backport of the week shouldn' [15:58:47] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1208.eqiad.wmnet with reason: host reimage [15:58:59] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:59:06] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1207.eqiad.wmnet with reason: host reimage [16:00:31] hashar: A Monday backport _shouldn't_ be any slower than a Friday one, but next time you see this happening, save a copy of your scap-image-build-and-push-log file. [16:02:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1206.eqiad.wmnet with reason: host reimage [16:04:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:04:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1204.eqiad.wmnet with OS bullseye [16:04:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:04:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613788 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1204.eqiad.wmnet with OS bull... 
[16:04:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1203.eqiad.wmnet with OS bullseye [16:05:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1203.eqiad.wmnet with OS bull... [16:05:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613790 (10phaultfinder) [16:05:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1207.eqiad.wmnet with reason: host reimage [16:06:16] Hey all - would like to do a security deployment for T387691. We have an accidentally-disclosed patch so I think it warrants a Friday deploy. (cc: hashar thcipriani Lucas_WMDE) [16:06:24] * Lucas_WMDE around [16:06:49] sounds good [16:09:03] I'm around too. 
[16:09:14] (03PS1) 10Muehlenhoff: Bump changelog for 1.0.1 release [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1125466 [16:09:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1208.eqiad.wmnet with reason: host reimage [16:11:01] Deploying… [16:14:05] (03PS1) 10Clément Goubert: periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) [16:14:06] (03PS1) 10Clément Goubert: periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) [16:14:15] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert) [16:15:42] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:16:31] sbassett: let me know when I can test on WikimediaDebug [16:16:32] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [16:16:56] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:17:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:17:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1205.eqiad.wmnet with OS bullseye [16:17:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1205.eqiad.wmnet with OS bull... [16:17:31] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog for 1.0.1 release [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1125466 (owner: 10Muehlenhoff) [16:17:36] Lucas_WMDE: oh I just went ahead with the prod deploy - should be done soon (60% k8s deployed) [16:17:51] ah [16:17:55] * Lucas_WMDE tests anyway [16:19:39] Looks like I might have o/s on commons via staff rights. I def have the "change visibility" UI for revisions. [16:19:43] ok I think it’s behaving as expected [16:20:16] ok, good.
Let me know if you want to test the XSS piece as I’m pretty sure I can o/s… [16:20:28] !log Deployed security patch for T387691 [16:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:21:22] sbassett: and I think we can see the sharp uptick in wbformatvalue API requests here https://grafana.wikimedia.org/d/000000559/mediawiki-action-api-breakdown?orgId=1&var-metric=p50&var-module=wbformatvalue&from=now-1h&to=now [16:21:34] but as long as the site can handle the load that should hopefully be fine [16:21:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:21:50] Ok, I assume that’s expected. [16:21:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1202.eqiad.wmnet with OS bullseye [16:22:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS bull... [16:22:11] yeah [16:22:32] I expected it to go up and I hope it won’t take the site down 😅 [16:22:34] so far both are looking good [16:22:40] Ok :) [16:22:57] Let me know if you’d like to test anything else. I’ll keep an eye on grafana and logstash for a bit. 
[16:23:13] at https://grafana.wikimedia.org/d/000000002/mediawiki-action-api-summary?orgId=1&refresh=5m it still looks like wbformatvalue is a negligible amount [16:24:11] (03PS2) 10Clément Goubert: periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) [16:24:11] (03PS2) 10Clément Goubert: periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) [16:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613918 (10phaultfinder) [16:25:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:25:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:25:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1206.eqiad.wmnet with OS bullseye [16:25:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:25:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1206.eqiad.wmnet with OS bull... 
[16:26:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye [16:26:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613924 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS... [16:28:12] (03PS1) 10Cathal Mooney: Enable BGP Multipath for PyBal group [homer/public] - 10https://gerrit.wikimedia.org/r/1125471 (https://phabricator.wikimedia.org/T332027) [16:28:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:29:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:29:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1207.eqiad.wmnet with OS bullseye [16:29:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1207.eqiad.wmnet with OS bull... 
[16:29:25] (03PS3) 10Clément Goubert: periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) [16:29:26] (03PS3) 10Clément Goubert: periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) [16:29:43] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert) [16:32:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:32:51] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs6003 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) [16:33:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:33:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1208.eqiad.wmnet with OS bullseye [16:33:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1208.eqiad.wmnet with OS bull... 
[16:33:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:33:23] (03PS2) 10Cathal Mooney: Enable BGP Multipath for PyBal group [homer/public] - 10https://gerrit.wikimedia.org/r/1125471 (https://phabricator.wikimedia.org/T332027) [16:35:17] (03PS2) 10JHathaway: puppetserver: add option to manage git permissions with an acl [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) [16:36:20] (03CR) 10DCausse: [C:03+1] "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [16:37:36] (03CR) 10Vgutierrez: [C:04-2] "do not merge before 2025-03-10" [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:40:51] (03CR) 10Elukey: [C:03+1] "I think that we are ready to proceed, no blocker stands out from my point of view." [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:41:16] (03CR) 10Ladsgroup: [C:03+1] periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert) [16:41:44] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1202.eqiad.wmnet with reason: host reimage [16:42:15] (03CR) 10Ladsgroup: [C:03+1] periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert) [16:43:47] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [16:44:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614009 (10phaultfinder) [16:45:32] !log jclark@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1202.eqiad.wmnet with reason: host reimage [16:46:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1246', diff saved to https://phabricator.wikimedia.org/P74156 and previous config saved to /var/cache/conftool/dbconfig/20250307-164605-root.json [16:46:27] (03PS1) 10Marostegui: Revert "db1246: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1125476 [16:46:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614019 (10Jclark-ctr) [16:46:56] (03CR) 10Marostegui: [C:03+2] Revert "db1246: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1125476 (owner: 10Marostegui) [16:47:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10614020 (10Marostegui) @VRiley-WMF the host is depooled and notifications are disabled. So you can change the mainboard anytime you want, whenever it arrives. I will leave it depooled. [16:48:53] (03CR) 10Ladsgroup: "I think this patch is for groups? I don't know whether we have min replicas for groups." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [16:50:12] (03PS1) 10Muehlenhoff: Record LDAP access for hswan [puppet] - 10https://gerrit.wikimedia.org/r/1125477 (https://phabricator.wikimedia.org/T387522) [16:50:18] (03PS1) 10DCausse: opensearch: drop minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478 [16:51:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:51:59] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for hswan [puppet] - 10https://gerrit.wikimedia.org/r/1125477 (https://phabricator.wikimedia.org/T387522) (owner: 10Muehlenhoff) [16:52:53] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for hswan - https://phabricator.wikimedia.org/T387522#10614031 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access to cn=wmf and cn=logstash-access has been enabled via Wikimedia IDM. [16:53:52] (03PS2) 10DCausse: opensearch: drop minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478 [16:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614036 (10phaultfinder) [16:57:09] (03CR) 10Marostegui: "yeah, my comment was more answering Federico's comment." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [17:04:58] (03CR) 10Ladsgroup: "ah sorry for confusion." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [17:06:44] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:08:51] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:10:49] (03CR) 10Cwhite: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1125478 (owner: 10DCausse) [17:14:04] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002" [17:18:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002" [17:18:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:48] Hm, I have an override set for s.ukhe but it's showing him as on call still in the topic [17:19:32] (03CR) 10DCausse: "quick heads up that cloudelastic1010 is currently master eligible and might need special consideration before the re-image (see https://etherpa" [puppet] - 10https://gerrit.wikimedia.org/r/1125227 (owner: 10Bking) [17:22:29] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10614141 (10wiki_willy) ++ @Jhancock.wm & @Papaul - per our conversation the other day, this will be the R760xd2 seed server th...
[17:26:05] (03PS1) 10Btullis: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) [17:26:26] (03CR) 10CI reject: [V:04-1] Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:26:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:26:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1202.eqiad.wmnet with OS bullseye [17:27:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614153 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS bull... 
[17:28:04] (03PS2) 10Btullis: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) [17:28:25] (03CR) 10CI reject: [V:04-1] Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:30:08] (03PS3) 10Btullis: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) [17:30:42] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:32:23] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614189 (10phaultfinder) [17:35:04] (03PS1) 10Cathal Mooney: Add new Juniper leaf switches eqiad E8/F8 to IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1125488 (https://phabricator.wikimedia.org/T382017) [17:38:01] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002" [17:38:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002" [17:38:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:16] 06SRE, 10MediaWiki-extensions-OAuth, 06The-Wikipedia-Library, 07Datacenter-Switchover, 07User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - 
https://phabricator.wikimedia.org/T332650#10614202 (10matmarex) 05Open→03Resolved... [17:41:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10614209 (10cmooney) >>! In T382017#10613705, @VRiley-WMF wrote: > Will be adding these into netbox shortly Cool I can see them there. FWIW I added t... [17:42:59] (03PS3) 10Fabfur: Fix previous commit [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) [17:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:48:57] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10614236 (10cmooney) >>! In T380050#10613653, @cmooney wrote: > Please delete the cables/connections from Netbox to match what is done on site. For the record I deleted the cables... [17:54:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614250 (10phaultfinder) [17:57:13] (03PS1) 10D3r1ck01: Set `$wgCentralAuthLoginWiki` to correct default as documented [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) [18:04:39] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10614269 (10Dwisehaupt) @MoritzMuehlenhoff Thanks for the info. I'll have @AStein-WMF step through the IDM bits. As far as the... 
[18:21:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614292 (10phaultfinder) [18:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614307 (10phaultfinder) [18:35:41] PROBLEM - OSPF status on ssw1-f1-eqiad.mgmt is CRITICAL: OSPFv2: 12/14 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:36:18] (03PS1) 10Amdrel: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) [18:36:41] RECOVERY - OSPF status on ssw1-f1-eqiad.mgmt is OK: OSPFv2: 12/12 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:41:33] (03PS1) 10Andrew Bogott: keystone.conf: update oidc comment section to reflect changes in ID mapping [puppet] - 10https://gerrit.wikimedia.org/r/1125499 (https://phabricator.wikimedia.org/T388137) [18:42:37] (03CR) 10Papaul: [C:03+2] Remove bfd for link between cr1-codfw and cr2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773) (owner: 10Papaul) [18:54:18] (03CR) 10Dreamy Jazz: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) (owner: 10Amdrel) [19:01:39] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:01:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:02:58] (03CR) 10Xcollazo: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [19:03:03] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:05:03] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:07:39] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:09:25] FIRING: [4x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:40] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10614420 (10Ladsgroup) Third wave of deletions in codfw just started (from `20` to `2f`). I will start eqiad on Monday. 
[19:20:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:20:19] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:23:50] (CR) Scott French: [C:+1] "Thanks, Effie!" [deployment-charts] - https://gerrit.wikimedia.org/r/1125418 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[19:25:14] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10614432 (KFrancis) The NDA is complete. Thanks!
[19:26:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614438 (phaultfinder)
[19:32:26] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:34:06] (CR) Gergő Tisza: [C:+1] Set `$wgCentralAuthLoginWiki` to correct default as documented [mediawiki-config] - https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) (owner: D3r1ck01)
[19:35:20] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:35:49] SRE, Cassandra: Eliminate use of secondary IP interfaces & DNS for Cassandra instances - https://phabricator.wikimedia.org/T388169#10614471 (Eevans)
[19:40:54] (CR) Scott French: "Thanks, Effie!" [deployment-charts] - https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[19:43:26] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:46:06] (PS1) Scott French: mw-(api-ext|web): serve 25% of residual traffic on 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1125503 (https://phabricator.wikimedia.org/T383845)
[19:53:06] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:53:47] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:55:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614494 (phaultfinder)
[20:00:20] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614499 (Jclark-ctr)
[20:00:50] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614501 (Jclark-ctr) Open→Resolved @BTullis thanks for your assistance today. these are all finished.
[20:06:37] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[20:07:19] SRE, Infrastructure-Foundations, Mail: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343#10614521 (jhathaway) Open→Resolved Postfix has replaced Exim for our inbound and outbound mail servers in production for some time now. Though th...
[20:08:23] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:16:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:16:35] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:17:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:17:00] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:17:12] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:17:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:21:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:24:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614568 (phaultfinder)
[20:25:00] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:32:22] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:32:24] !log bking@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:32:36] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:34:47] (PS2) Amdrel: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527)
[20:35:11] (CR) Amdrel: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) (owner: Amdrel)
[20:42:07] (PS1) Bking: cloudelastic: use EFI boot for cloudelastic1009,1010 [puppet] - https://gerrit.wikimedia.org/r/1125520 (https://phabricator.wikimedia.org/T387904)
[20:42:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:43:15] (PS2) Bking: cloudelastic: use EFI boot for cloudelastic1009,1010 [puppet] - https://gerrit.wikimedia.org/r/1125520 (https://phabricator.wikimedia.org/T387904)
[20:44:04] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:49:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:51:30] (CR) Bking: [C:+2] "Self-merging in the interest of time" [puppet] - https://gerrit.wikimedia.org/r/1125520 (https://phabricator.wikimedia.org/T387904) (owner: Bking)
[20:51:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:58:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[20:59:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614651 (phaultfinder)
[21:12:36] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[21:13:13] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[21:24:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614719 (phaultfinder)
[21:32:36] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[21:36:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[21:37:06] (CR) JHathaway: "Thanks @dwisehaupt@wikimedia.org for pulling me in, just a few initial questions." [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[21:40:46] (CR) JHathaway: "Thanks @dwisehaupt@wikimedia.org for pulling me in. Our preference would be for you to use our Postfix profile. I am happy to help adapt i" [puppet] - https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[21:44:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614784 (phaultfinder)
[21:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:49:54] SRE, Cassandra: Eliminate use of secondary IP interfaces & DNS for Cassandra instances - https://phabricator.wikimedia.org/T388169#10614819 (Eevans)
[21:53:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614840 (phaultfinder)
[22:01:04] (PS3) BCornwall: haproxy/icinga: Remove RSA from auth algorithms [puppet] - https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837)
[22:01:05] (PS3) BCornwall: haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837)
[22:01:20] (CR) CI reject: [V:-1] haproxy/icinga: Remove RSA from auth algorithms [puppet] - https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[22:01:22] (CR) BCornwall: haproxy/icinga: Remove RSA from auth algorithms (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[22:01:28] (CR) CI reject: [V:-1] haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[22:01:58] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic
[22:02:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic
[22:02:28] (PS2) Scott French: aptrepo: update pcre2 backport from apt-staging [puppet] - https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006)
[22:02:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[22:04:04] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.03.01 - 2025.03.21): Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10614901 (ttaylor) FYI, @Seddon is the engineering manager who is actively working with this user, according to the records I get...
[22:05:15] (PS4) BCornwall: haproxy/icinga: Remove RSA from auth algorithms [puppet] - https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837)
[22:05:15] (PS4) BCornwall: haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837)
[22:06:31] (CR) Scott French: "Thanks you, Matthew!" [puppet] - https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: Scott French)
[22:07:47] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5038/co" [puppet] - https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[22:09:22] (PS3) Dwisehaupt: community_civicrm: dovecot module for serving up local mail [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715)
[22:10:46] (CR) BCornwall: [C:-1] geo-maps: update South America DCs (part 1/2) (1 comment) [dns] - https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[22:13:00] (CR) BCornwall: [C:-1] "Not sure why this is a separate commit: Shouldn't it be merged in with 1124192?" [dns] - https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[22:14:51] (CR) Dwisehaupt: community_civicrm: dovecot module for serving up local mail (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[22:17:51] !log [Cloudelastic] Doing a `/_cluster/reroute?retry_failed=true` of all 3 elastic/opensearch clusters
[22:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:00] (CR) BCornwall: [C:-1] "[nit] This and 1124178 should be reformatted to email/plain text standards, meaning no markdown." [dns] - https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[22:25:28] (PS1) Ebernhardson: Add sudachi analyzer for japanese [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868)
[22:27:20] (PS2) Ebernhardson: Add sudachi analyzer for japanese [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868)
[22:30:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614940 (phaultfinder)
[22:32:01] (CR) Dzahn: community_civicrm: dovecot module for serving up local mail (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[22:40:06] (PS3) Huji: New alias for Project namespace on Persian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185)
[22:40:14] (CR) Huji: New alias for Project namespace on Persian Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: Huji)
[22:41:03] SRE, SRE-Access-Requests, Data-Platform-SRE (2025.03.01 - 2025.03.21): Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10614967 (Seddon) Hey, yes @aude is currently reporting to myself and is on an active contract. Can I got the time being request a...
[22:42:02] !log bking@cloudelastic1009 exclude `cloudelastic1010` from master voting T387904
[22:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:05] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904
[22:44:46] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614969 (phaultfinder)
[22:59:04] (CR) JHathaway: community_civicrm: dovecot module for serving up local mail (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[23:10:22] (CR) Dwisehaupt: community_civicrm: dovecot module for serving up local mail (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[23:12:16] (PS1) Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147)
[23:13:15] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:29:54] (PS1) Cwhite: beta-logs: restore service functionality [puppet] - https://gerrit.wikimedia.org/r/1125543
[23:32:39] (CR) Cwhite: [C:+2] beta-logs: restore service functionality [puppet] - https://gerrit.wikimedia.org/r/1125543 (owner: Cwhite)
[23:56:17] (PS5) Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147)