[00:01:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612075 (10Jclark-ctr)
[00:01:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1201.eqiad.wmnet with OS bullseye
[00:02:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612076 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS...
[00:09:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612088 (10phaultfinder)
[00:15:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:08] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:25:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:08] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:38:47] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125260
[00:38:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125260 (owner: 10TrainBranchBot)
[00:40:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye
[00:40:02] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1203.eqiad.wmnet with OS bullseye
[00:40:03] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1204.eqiad.wmnet with OS bullseye
[00:40:04] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1205.eqiad.wmnet with OS bullseye
[00:40:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612140 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1203.eqiad.wmnet with OS...
[00:40:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612141 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS...
[00:40:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1204.eqiad.wmnet with OS...
[00:40:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1205.eqiad.wmnet with OS...
[00:41:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1206.eqiad.wmnet with OS bullseye
[00:41:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1207.eqiad.wmnet with OS bullseye
[00:41:28] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1208.eqiad.wmnet with OS bullseye
[00:41:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612145 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1206.eqiad.wmnet with OS...
[00:41:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612146 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1207.eqiad.wmnet with OS...
[00:41:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1208.eqiad.wmnet with OS...
[00:46:13] <wikibugs>	 (03Abandoned) 10Jdlrobson: Remove init event from Search AB test and also remove ABTestEnrollment.js. [extensions/WikimediaEvents] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120685 (https://phabricator.wikimedia.org/T386734) (owner: 10Jdlrobson)
[00:50:30] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125260 (owner: 10TrainBranchBot)
[00:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:00:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Horizon/idp: access keystone on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott)
[01:01:13] <wikibugs>	 06SRE, 10Cassandra: Eliminate use of secondary IP interfaces & DNS for Cassandra instances - https://phabricator.wikimedia.org/T388169#10612177 (10Eevans)
[01:06:50] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1201.eqiad.wmnet with OS bullseye
[01:06:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS bull...
[01:08:45] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125261
[01:08:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125261 (owner: 10TrainBranchBot)
[01:10:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10612180 (10Jclark-ctr) a:05Jclark-ctr→03BTullis @BTullis  little typo in preseed file |an-worker12[0-8]   should be 120[0-8]
[01:11:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:12:55] <wikibugs>	 06SRE, 10MediaWiki-extensions-OAuth, 06The-Wikipedia-Library, 07Datacenter-Switchover, 07User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10612183 (10jsn.sherman) I started a pr to remo...
[01:13:16] <wikibugs>	 06SRE, 10MediaWiki-extensions-OAuth, 06The-Wikipedia-Library, 07Datacenter-Switchover, 07User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10612184 (10jsn.sherman) p:05High→03Low
[01:14:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10612190 (10Papaul) @Jclark-ctr @VRiley-WMF the 2 switches are received in coupa but are missing in netbox. if they are not ready to be racked yet, can...
[01:20:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10612192 (10Jclark-ctr) @VRiley-WMF  I have not seen these in the data center yet but you updated  ticket Jan 10 2025 almost 2 months ago?   Receiving ti...
[01:30:06] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125261 (owner: 10TrainBranchBot)
[01:39:49] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudidp-dev.wikimedia.org: allow keystone.openstack endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125264 (https://phabricator.wikimedia.org/T388137)
[01:40:31] <wikibugs>	 (03PS2) 10Andrew Bogott: cloudidp-dev.wikimedia.org: allow keystone.openstack endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125264 (https://phabricator.wikimedia.org/T388137)
[01:40:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudidp-dev.wikimedia.org: allow keystone.openstack endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125264 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott)
[01:41:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:46:28] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/5e191a880c7d45debe14f7c536fcf3c9edf6c2de17bd71a45484929c12603607/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:50:08] <icinga-wm>	 PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:34] <icinga-wm>	 RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[02:01:09] <wikibugs>	 (03PS1) 10Scott French: mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845)
[02:03:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:06:28] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:10:58] <icinga-wm>	 RECOVERY - Host an-presto1014 is UP: PING WARNING - Packet loss = 66%, RTA = 85.75 ms
[02:11:46] <icinga-wm>	 PROBLEM - SSH on an-presto1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:16:31] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137)
[02:16:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott)
[02:17:22] <icinga-wm>	 PROBLEM - Host an-presto1014 is DOWN: PING CRITICAL - Packet loss = 100%
[02:18:18] <wikibugs>	 (03PS2) 10Andrew Bogott: keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137)
[02:19:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10612248 (10wiki_willy) a:05Marostegui→03VRiley-WMF Reassigning to Valerie to create a new Dell Support task  >>! In T387673#10604114, @wiki_willy wrote: > @VRiley-WMF or @Jclark-ctr - can o...
[02:21:21] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott)
[02:31:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:33:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:15:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10612292 (10VRiley-WMF) Opened a new ticket with Dell.    206617456  currently speaking with them about this issue.
[03:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:14:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612322 (10phaultfinder)
[04:15:30] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9200 on relforge1004 is CRITICAL: CRITICAL - itwiki_general[0](2025-03-03T22:16:56.527Z), .kibana_3[0](2025-03-03T22:15:56.521Z), frwiki_general[0](2025-03-03T22:43:07.514Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:20:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:50:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:50:22] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9200 on relforge1003 is CRITICAL: CRITICAL - itwiki_general[0](2025-03-03T22:16:56.527Z), .kibana_3[0](2025-03-03T22:15:56.521Z), frwiki_general[0](2025-03-03T22:43:07.514Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:55:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:11:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:41:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T0700)
[07:26:03] <wikibugs>	 (03PS1) 10Slyngshede: Remove steward from IDM account managers [puppet] - 10https://gerrit.wikimedia.org/r/1125296
[07:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:29:16] <wikibugs>	 (03PS2) 10Slyngshede: Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296
[07:29:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede)
[07:31:47] <hashar>	 !log Upgrading Jenkins on contint1002
[07:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[07:33:09] <wikibugs>	 (03PS3) 10Slyngshede: Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296
[07:33:25] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10612402 (10MoritzMuehlenhoff) @Dwisehaupt Please clarify: Is simple Icinga web access needed? If so, we only need the "wmf" L...
[07:35:14] <logmsgbot>	 !log hashar@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): Upgrade to Jenkins LTS 2.492.2
[07:36:21] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): Upgrade to Jenkins LTS 2.492.2 (duration: 01m 23s)
[07:36:50] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for jhuneidi - https://phabricator.wikimedia.org/T388044#10612407 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @jeena I've added you to the logstash-access group. If you run into any issues accessing Logstash, please reope...
[07:39:12] <wikibugs>	 (03CR) 10Muehlenhoff: Remove steward (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede)
[07:43:04] <wikibugs>	 (03PS4) 10Slyngshede: Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296
[07:43:19] <wikibugs>	 (03PS2) 10Hashar: Fix wgCirrusSearchSimilarityProfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092
[07:43:28] <wikibugs>	 (03CR) 10Hashar: Fix wgCirrusSearchSimilarityProfiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar)
[07:43:53] <wikibugs>	 (03CR) 10Slyngshede: Remove steward (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede)
[07:48:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede)
[07:50:16] <wikibugs>	 (03PS1) 10Hashar: Remove obsolete $wgAllowMicrodataAttributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372
[07:51:45] <moritzm>	 !log installing emacs security updates
[07:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:11] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Remove steward [puppet] - 10https://gerrit.wikimedia.org/r/1125296 (owner: 10Slyngshede)
[07:57:42] <wikibugs>	 (03PS1) 10Hashar: Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407)
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T0800)
[08:02:19] <wikibugs>	 (03CR) 10Hashar: "Some something about database :)   I can self deploy but I could use a double check I did not make something wrong!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar)
[08:04:55] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, thanks for the addition!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 (owner: 10Volans)
[08:07:22] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging JJMC89 out of all services on: 2 hosts
[08:12:58] <moritzm>	 !log installing Linux 5.10.234 on Bullseye hosts (just the rollout of the new kernels, no immediate reboots involved)
[08:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:40] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1125209 (https://phabricator.wikimedia.org/T387837) (owner: 10JMeybohm)
[08:14:50] <wikibugs>	 (03PS2) 10Volans: sre.gitlab: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397
[08:15:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[08:15:02] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp6010 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:15:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[08:15:13] <wikibugs>	 (03CR) 10Volans: [C:03+2] sre.gitlab: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 (owner: 10Volans)
[08:15:46] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[08:15:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:16:02] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp6010 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:17:26] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] "The filename is generated by gbp, so I would assume a length limit" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1125209 (https://phabricator.wikimedia.org/T387837) (owner: 10JMeybohm)
[08:19:09] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] acme_chief: add parameter for destination path [puppet] - 10https://gerrit.wikimedia.org/r/1124855 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur)
[08:19:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612422 (10phaultfinder)
[08:19:45] <vgutierrez>	 hmmmm
[08:19:49] * vgutierrez checking cp6010
[08:21:33] <wikibugs>	 (03Merged) 10jenkins-bot: sre.gitlab: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124397 (owner: 10Volans)
[08:21:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1035.eqiad.wmnet with OS bookworm
[08:21:48] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10612427 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1035.eqiad.wmnet with OS bookworm
[08:28:04] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto)
[08:30:53] <wikibugs>	 (03PS3) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231)
[08:30:59] <wikibugs>	 (03Merged) 10jenkins-bot: Don't warn if this and the needed release set installed: false [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1125209 (https://phabricator.wikimedia.org/T387837) (owner: 10JMeybohm)
[08:32:39] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] haproxy/icinga: Remove RSA from auth algorithms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[08:34:00] <wikibugs>	 (03PS3) 10Elukey: services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926)
[08:36:11] <wikibugs>	 (03CR) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto)
[08:38:57] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10612441 (10Gehel)
[08:39:03] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey)
[08:39:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[08:39:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[08:40:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[08:42:37] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[08:43:19] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] services: refactor helmfiles for helmfile 0.171.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto)
[08:43:33] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[08:43:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1035.eqiad.wmnet with reason: host reimage
[08:46:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1035.eqiad.wmnet with reason: host reimage
[08:46:44] <wikibugs>	 (03CR) 10Jelto: [C:03+2] services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto)
[08:48:08] <jayme>	 !log imported helmfile 0.171.0-5 to bullseye-wikimedia and bookworm-wikimedia - T387837
[08:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:11] <stashbot>	 T387837: Fix installed key in dependend helmfile releases - https://phabricator.wikimedia.org/T387837
[08:48:29] <jayme>	 !log updated helmfile to 0.171.0-5 on deploy* - T387837
[08:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:31] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[08:52:08] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[08:54:08] <wikibugs>	 (03PS29) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[08:55:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[08:55:49] <wikibugs>	 (03CR) 10Federico Ceratto: "This is tested end-to-end and ready for final review. It has been used for https://phabricator.wikimedia.org/T385141" [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[08:55:51] <wikibugs>	 (03PS3) 10Volans: cli: log an eventual exception to stderr [software/cumin] - 10https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) (owner: 10TheAnarcat)
[08:55:51] <wikibugs>	 (03PS2) 10Volans: query: do not error on no match in first subquery [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158
[08:56:30] <wikibugs>	 (03Merged) 10jenkins-bot: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto)
[08:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:02:35] <wikibugs>	 (03PS1) 10JMeybohm: helm: Install helm 3.11 and 3.17 in parallel [puppet] - 10https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984)
[09:02:47] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[09:03:56] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[09:05:09] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[09:05:22] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5034/co" [puppet] - https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:07:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1035.eqiad.wmnet with OS bookworm
[09:07:12] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10612514 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1035.eqiad.wmnet with OS bookworm completed: - ganeti103...
[09:07:37] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[09:08:59] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[09:09:45] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[09:12:33] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[09:14:58] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:17:38] <wikibugs>	 (PS30) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[09:18:16] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[09:20:09] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'.
[09:20:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[09:21:00] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[09:24:00] <wikibugs>	 (CR) Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - https://gerrit.wikimedia.org/r/1125091 (owner: Hashar)
[09:24:08] <wikibugs>	 (CR) Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: Hashar)
[09:24:13] <wikibugs>	 (CR) Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - https://gerrit.wikimedia.org/r/1125124 (owner: Hashar)
[09:24:17] <wikibugs>	 (CR) Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - https://gerrit.wikimedia.org/r/1125092 (owner: Hashar)
[09:25:03] <wikibugs>	 (CR) Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - https://gerrit.wikimedia.org/r/1125372 (owner: Hashar)
[09:25:09] <wikibugs>	 (CR) Hashar: "I will deploy that next week as part of a batch of other clean up changes." [mediawiki-config] - https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: Hashar)
[09:25:36] <hashar>	 https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config#Results :)
[09:27:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[09:37:46] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=wikikube-worker1.*,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[09:38:10] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=wikikube-worker2.*,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[09:46:34] <wikibugs>	 (CR) Ladsgroup: [C:+1] Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: Hashar)
[09:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:54:02] <wikibugs>	 (CR) JMeybohm: [V:+1 C:+2] helm: Install helm 3.11 and 3.17 in parallel [puppet] - https://gerrit.wikimedia.org/r/1125377 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:58:19] <wikibugs>	 (PS1) JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984)
[09:58:54] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:58:58] <wikibugs>	 (PS6) Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214)
[09:59:19] <wikibugs>	 (PS31) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[09:59:50] <wikibugs>	 (CR) Elukey: clone.py, clone_test.py: Automate cloning (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[10:06:10] <wikibugs>	 (PS5) Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209)
[10:07:58] <wikibugs>	 (PS1) Elukey: conftool-data: add more wikikube-workers to maps [puppet] - https://gerrit.wikimedia.org/r/1125387 (https://phabricator.wikimedia.org/T386926)
[10:07:59] <wikibugs>	 (PS1) Elukey: role::maps::{master,replica}: Fix lvs pool config [puppet] - https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926)
[10:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:11:06] <wikibugs>	 (CR) Elukey: [C:+1] docs: removed deprecated call to sphinx_rtd_theme [software/cumin] - https://gerrit.wikimedia.org/r/1125157 (owner: Volans)
[10:11:30] <wikibugs>	 (CR) Elukey: [C:+2] sre.hosts.reimage: add logging and confirmation when forcing puppet 5 (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1117231 (owner: Elukey)
[10:11:50] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 668 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 924, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 664, delayed_unassigned_shards: 0, number_of_pending_ta
[10:11:50] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 58.040201005025125 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:12:06] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 659 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 933, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 655, delayed_unassigned_shards: 0, number_of_pending_ta
[10:12:06] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 58.60552763819096 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:12:12] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 654 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 938, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 650, delayed_unassigned_shards: 0, number_of_pending_ta
[10:12:12] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 58.91959798994974 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:12:12] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 652 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 940, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 648, delayed_unassigned_shards: 0, number_of_pending_ta
[10:12:12] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 59.04522613065326 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:12:14] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 649 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 943, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 646, delayed_unassigned_shards: 
[10:12:14] <icinga-wm>	 r_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 59.233668341708544 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:13:00] <wikibugs>	 (CR) Elukey: [C:+1] cli: log an eventual exception to stderr [software/cumin] - https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) (owner: TheAnarcat)
[10:13:01] <moritzm>	 !log updated pwstore key for btullis
[10:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:21] <wikibugs>	 (Abandoned) Elukey: admin_ng: enable monitoring for knative [deployment-charts] - https://gerrit.wikimedia.org/r/1123656 (owner: Elukey)
[10:13:31] <gehel>	 dcausse: ^  Looks like dcausse is already on the cloudelastic issue
[10:13:58] <wikibugs>	 (CR) Elukey: [C:+1] Update Docker images of change-prop services to ones using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1124191 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[10:14:38] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:14:50] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1331, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_f
[10:14:50] <icinga-wm>	 tch: 0, task_max_waiting_in_queue_millis: 963, active_shards_percent_as_number: 83.60552763819096 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:15:06] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1353, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 233, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_f
[10:15:06] <icinga-wm>	 tch: 0, task_max_waiting_in_queue_millis: 5, active_shards_percent_as_number: 84.98743718592965 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:15:12] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1356, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 231, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_f
[10:15:12] <icinga-wm>	 tch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.17587939698493 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:15:12] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 795, active_shards: 1357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 231, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_f
[10:15:12] <icinga-wm>	 tch: 0, task_max_waiting_in_queue_millis: 179, active_shards_percent_as_number: 85.23869346733667 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:15:14] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1359, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 229, delayed_unassigned_shards: 0, number_of_pending_ta
[10:15:14] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.3643216080402 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:17:36] <wikibugs>	 (PS1) Muehlenhoff: Remove access for halfak [puppet] - https://gerrit.wikimedia.org/r/1125390 (https://phabricator.wikimedia.org/T388037)
[10:18:55] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for halfak [puppet] - https://gerrit.wikimedia.org/r/1125390 (https://phabricator.wikimedia.org/T388037) (owner: Muehlenhoff)
[10:21:48] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Halfak out of all services on: 1284 hosts
[10:22:46] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Halfak out of all services on: 951 hosts
[10:24:18] <wikibugs>	 (PS6) Jelto: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - https://gerrit.wikimedia.org/r/1115803 (owner: JMeybohm)
[10:24:18] <wikibugs>	 (CR) Jelto: "@jmeybohm I've done a quick test on `kubestage2001`. I removed ` the `kubernetes-node` package from the node and run puppet. Puppet instal" [puppet] - https://gerrit.wikimedia.org/r/1115803 (owner: JMeybohm)
[10:24:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:38] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:28:46] <wikibugs>	 (PS1) Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796)
[10:29:21] <wikibugs>	 (PS2) Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796)
[10:29:44] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[10:30:00] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[10:30:32] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[10:34:58] <wikibugs>	 (CR) Vgutierrez: [C:+1] "karthotherian seems to be a valid systemd service on those roles:" [puppet] - https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[10:35:02] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Good catch" [puppet] - https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[10:35:05] <wikibugs>	 (PS2) JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984)
[10:35:42] <wikibugs>	 (CR) Elukey: "All credits to Valentin!" [puppet] - https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[10:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:37:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet
[10:37:55] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5035/co" [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[10:39:10] <wikibugs>	 (PS3) Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796)
[10:39:24] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[10:42:53] <wikibugs>	 (PS4) Vgutierrez: haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796)
[10:44:48] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[10:46:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet
[10:47:22] <wikibugs>	 (PS3) JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984)
[10:48:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1035.eqiad.wmnet to cluster eqiad and group A
[10:50:03] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5036/co" [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[10:50:05] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1035.eqiad.wmnet to cluster eqiad and group A
[10:50:27] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970)
[10:59:08] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[11:00:39] <wikibugs>	 SRE, Infrastructure-Foundations, SRE Observability: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#10612886 (MoritzMuehlenhoff) Status update: Access to Logstash has been split out of cn=wmf, cn=nda is next.
[11:05:30] <wikibugs>	 (CR) JMeybohm: mediawiki: introduce feature flags (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1116639 (owner: Giuseppe Lavagetto)
[11:07:17] <wikibugs>	 (PS4) JMeybohm: Add pod-security.wmg.org labels to wikikube mediawiki namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507)
[11:07:56] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10612889 (MoritzMuehlenhoff)
[11:12:07] <wikibugs>	 (CR) Gkyziridis: [C:+1] "Thnx." [deployment-charts] - https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[11:12:57] <wikibugs>	 (CR) Lucas Werkmeister: Remove $wgAllowAuthenticatedCrossOrigin again (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123741 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[11:14:59] <wikibugs>	 (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[11:16:09] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet
[11:16:27] <wikibugs>	 (Merged) jenkins-bot: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1125398 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[11:16:53] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10612909 (elukey) >>! In T384003#10609005, @MatthewVernon wrote: > It's not the same kernel, though - you've got `5.14.0-503.11.1.el9_...
[11:24:06] <wikibugs>	 (PS4) JMeybohm: deployment_server: Select the kubectl version based on the cluster [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984)
[11:26:56] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: increase workers for reference-quality models [deployment-charts] - https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019)
[11:27:02] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5037/co" [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[11:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:27:42] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2088.codfw.wmnet
[11:28:07] <wikibugs>	 (CR) Kamila Součková: [C:+1] conftool-data: add more wikikube-workers to maps [puppet] - https://gerrit.wikimedia.org/r/1125387 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[11:28:41] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[11:30:58] <wikibugs>	 (CR) JMeybohm: [V:+1 C:+2] deployment_server: Select the kubectl version based on the cluster [puppet] - https://gerrit.wikimedia.org/r/1125384 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[11:33:37] <icinga-wm>	 RECOVERY - Host an-presto1014 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[11:35:22] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Clean up RDF feature flags again [extensions/Wikibase] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344)
[11:35:47] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344) (owner: Lucas Werkmeister (WMDE))
[11:36:35] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Optional backport so we can deploy the other cleanup, Ib999da8c03, sooner." [extensions/Wikibase] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344) (owner: Lucas Werkmeister (WMDE))
[11:37:05] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221 (MatthewVernon) NEW
[11:43:05] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450) (owner: JMeybohm)
[11:45:40] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10612961 (elukey) All right something different happened, but I am not sure if it was the kernel or not.  I rebooted the host with the...
[11:46:13] <wikibugs>	 (CR) Hnowlan: [C:+1] "Makes sense from a helm perspective! I assume the various removed values in the benthos config are acceptable to remove. One query, feel f" [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: Kamila Součková)
[11:46:29] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet
[11:49:08] <wikibugs>	 (PS1) MVernon: Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - https://gerrit.wikimedia.org/r/1125410
[11:51:14] <wikibugs>	 (CR) CI reject: [V:-1] Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - https://gerrit.wikimedia.org/r/1125410 (owner: MVernon)
[11:52:03] <wikibugs>	 (PS2) MVernon: Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - https://gerrit.wikimedia.org/r/1125410 (https://phabricator.wikimedia.org/T388221)
[11:54:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612995 (phaultfinder)
[11:58:03] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2088.codfw.wmnet
[11:59:13] <wikibugs>	 (CR) Hnowlan: [C:-1] "Some minor fixes needed, but makes sense generally (benthos stuff aside!)" [deployment-charts] - https://gerrit.wikimedia.org/r/1123010 (owner: Kamila Součková)
[12:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T0800)
[12:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250307T1200).
[12:06:28] <wikibugs>	 (CR) Kamila Součková: [C:+1] "LGTM except see inline" [deployment-charts] - https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[12:08:38] <wikibugs>	 (CR) Ladsgroup: "I'll hopefully deploy this next week. Unless anyone objects." [dumps] - https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: Ladsgroup)
[12:17:06] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10613054 (BTullis) Hello. It's not a big problem in this case, but it would have been helpful to know about this prior to pulling the disks from this server. I only noticed that there was a problem...
[12:18:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10613058 (BTullis) I also only see three failed drives in the original description (slots 2,7, and 10), but it looks like the drive in slot 3 was replaced as well. Did this fail at a later time?
[12:30:01] <wikibugs>	 (PS1) Muehlenhoff: keepalived: Install keepalived from the "main" component [puppet] - https://gerrit.wikimedia.org/r/1125413 (https://phabricator.wikimedia.org/T383557)
[12:41:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:46:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:46:54] <wikibugs>	 (CR) Kamila Součková: "Yes, they are defaults. (I'm not really sure why past me put them there :D)" [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: Kamila Součková)
[12:57:34] <wikibugs>	 (CR) Kevin Bazira: "most of it lgtm. has this been tested on staging or you're ready to test in prod?" [deployment-charts] - https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[12:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:01:16] <wikibugs>	 (CR) Ilias Sarantopoulos: "It has been tested on ml-staging in experimental ns -> here are the results https://phabricator.wikimedia.org/T387019#10612894" [deployment-charts] - https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[13:04:09] <wikibugs>	 (CR) Kevin Bazira: [C:+1] "ack! I've +1'd." [deployment-charts] - https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[13:04:37] <wikibugs>	 (PS1) Fabfur: sslcert: minor refactoring to use consistent key path [puppet] - https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929)
[13:06:51] <wikibugs>	 (CR) Kamila Součková: [C:+2] benthos: add input/output config to chart [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: Kamila Součková)
[13:08:13] <wikibugs>	 (Merged) jenkins-bot: benthos: add input/output config to chart [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: Kamila Součková)
[13:11:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:16:15] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: increase workers for reference-quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[13:17:36] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: increase workers for reference-quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125405 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[13:19:54] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[13:21:29] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[13:21:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:22:19] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur)
[13:22:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] keystone oidc: use keystone.openstack hostname for redirect [puppet] - 10https://gerrit.wikimedia.org/r/1125266 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott)
[13:23:07] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): increase PHP8.1 traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125418 (https://phabricator.wikimedia.org/T383845)
[13:25:16] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[13:27:20] <wikibugs>	 (03PS1) 10Federico Ceratto: Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442)
[13:27:30] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, CI diff looks a bit scary but the actual diff with helmfile looks reasonable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[13:29:40] <wikibugs>	 (03CR) 10Federico Ceratto: "Initial version - later on we could set a minimum number of instances per section rather than 1." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto)
[13:32:04] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019)
[13:32:06] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - 10https://gerrit.wikimedia.org/r/1125410 (https://phabricator.wikimedia.org/T388221) (owner: 10MVernon)
[13:33:36] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all traffic to PHP8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845)
[13:35:45] <wikibugs>	 (03PS1) 10Stevemunene: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624)
[13:40:56] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] sre: deploy thumbor alerts to prometheus k8s [alerts] - 10https://gerrit.wikimedia.org/r/1124788 (https://phabricator.wikimedia.org/T379559) (owner: 10Filippo Giunchedi)
[13:41:34] <wikibugs>	 (03PS2) 10Stevemunene: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624)
[13:42:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1198.eqiad.wmnet with OS bullseye
[13:42:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613177 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1198.eqiad.wmnet with OS...
[13:43:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] sre: deploy thumbor alerts to prometheus k8s [alerts] - 10https://gerrit.wikimedia.org/r/1124788 (https://phabricator.wikimedia.org/T379559) (owner: 10Filippo Giunchedi)
[13:46:07] <icinga-wm>	 PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:55] <icinga-wm>	 RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms
[13:47:14] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudweb2002-dev: fix service_id for keystone [puppet] - 10https://gerrit.wikimedia.org/r/1125428
[13:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:47:59] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1125428 (owner: 10Andrew Bogott)
[13:48:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudweb2002-dev: fix service_id for keystone [puppet] - 10https://gerrit.wikimedia.org/r/1125428 (owner: 10Andrew Bogott)
[13:51:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:53:35] <icinga-wm>	 PROBLEM - Host an-worker1102 is DOWN: PING CRITICAL - Packet loss = 100%
[13:55:17] <wikibugs>	 (03CR) 10Marostegui: "Keep in mind that each section has a min number of replicas implementation already, so this may be good already." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto)
[13:55:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10613202 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr rebalanced BA  on l2-l3 on b7 rebalanced AA  on l2-l3 on A4
[13:56:27] <icinga-wm>	 RECOVERY - Host an-worker1102 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[13:58:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1198.eqiad.wmnet with reason: host reimage
[13:58:51] <wikibugs>	 (03CR) 10MVernon: [C:03+2] Preseed: use ms-be_simple.cfg for ms-be2089 [puppet] - 10https://gerrit.wikimedia.org/r/1125410 (https://phabricator.wikimedia.org/T388221) (owner: 10MVernon)
[13:59:44] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125430
[14:00:55] <wikibugs>	 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10613224 (10cmooney) p:05High→03Medium Everything still seems stable.  Juniper also provided this link to their KB article on it  https://supportportal.juniper.n...
[14:01:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:02:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1198.eqiad.wmnet with reason: host reimage
[14:03:07] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not reimage db1256 [puppet] - 10https://gerrit.wikimedia.org/r/1125432
[14:04:56] <wikibugs>	 (03PS2) 10Kamila Součková: benthos-mw-accesslog-metrics: create deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010
[14:05:19] <wikibugs>	 (03CR) 10Kamila Součková: benthos-mw-accesslog-metrics: create deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková)
[14:05:43] <wikibugs>	 (03CR) 10Kamila Součková: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková)
[14:07:00] <wikibugs>	 (03CR) 10Elukey: [C:03+2] conftool-data: add more wikikube-workers to maps [puppet] - 10https://gerrit.wikimedia.org/r/1125387 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey)
[14:07:49] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage db1256 [puppet] - 10https://gerrit.wikimedia.org/r/1125432 (owner: 10Marostegui)
[14:08:20] <wikibugs>	 (03PS1) 10Btullis: Fix the preseed matching for the new an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125433 (https://phabricator.wikimedia.org/T386390)
[14:10:51] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Fix the preseed matching for the new an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125433 (https://phabricator.wikimedia.org/T386390) (owner: 10Btullis)
[14:12:08] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10:pooled=yes; selector: name=wikikube-worker2.*,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[14:12:25] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10:pooled=yes; selector: name=wikikube-worker1.*,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[14:12:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613276 (10BTullis) a:05BTullis→03Jclark-ctr >>! In T386390#10612180, @Jclark-ctr wrote: > @BTullis  little typo in p...
[14:23:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:25:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:25:42] <wikibugs>	 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236 (10phaultfinder) 03NEW
[14:26:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:28:37] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet
[14:28:59] <elukey>	 gerrit is unresponsive for me
[14:29:35] <federico3>	 same here
[14:29:51] <jynus>	 see alert
[14:30:18] <elukey>	 there are some errors in the log, I'd be inclined to restart 
[14:30:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1068 - https://phabricator.wikimedia.org/T387732#10613324 (10BTullis) 05Open→03Resolved a:03BTullis I can't see any problem with the disks on this server, so I think we can just close the ticket.  14 physical disks, all online. ` Physical Driv...
[14:30:36] <jynus>	 down since 13:39
[14:30:38] <elukey>	 !log restart gerrit, unresponsive, errors in the log
[14:30:42] <jynus>	 +1
[14:31:13] <elukey>	 started
[14:31:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:31:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:33:30] <jynus>	 it died on both dcs
[14:35:00] <elukey>	 !log the previous restart of gerrit was on gerrit1003
[14:35:11] <jynus>	 something is going on: https://grafana.wikimedia.org/goto/SpS8NutHR?orgId=1
[14:35:31] <jelto>	 gerrit is seeing quite some traffic, I'm looking at logstash at the moment, potentially something to discuss in _security
[14:35:58] <elukey>	 yeah
[14:36:42] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:03] <hashar>	 o/
[14:38:03] <hashar>	 I swear I haven't touched anything on Gerrit
[14:38:15] <elukey>	 hashar: we are discussing it in #security
[14:40:11] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2088.codfw.wmnet
[14:40:46] <wikibugs>	 (03PS1) 10Papaul: Remove bfs for link between cr1-codfw and cr2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773)
[14:40:59] <wikibugs>	 (03PS1) 10Jelto: gerrit: throttle alibaba IPs [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235)
[14:41:35] <wikibugs>	 (03PS2) 10Papaul: Remove bfd for link between cr1-codfw and cr2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773)
[14:41:42] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:43:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:43:42] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[14:44:14] <wikibugs>	 (03CR) 10Federico Ceratto: "This should be ready for review, ideally focusing on bugs that affect safety of the operation. Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto)
[14:44:18] <wikibugs>	 (03CR) 10Hashar: "I'd rather ban them entirely much like we did for two other cases that badly abused the service:" [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto)
[14:44:46] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:44:47] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1198.eqiad.wmnet with OS bullseye
[14:44:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613425 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1198.eqiad.wmnet with OS bull...
[14:45:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613427 (10Jclark-ctr)
[14:46:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1200.eqiad.wmnet with OS bullseye
[14:46:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613429 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1200.eqiad.wmnet with OS...
[14:47:11] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fbeebe62280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:47:34] <dcausse>	 ^ expected
[14:47:49] <wikibugs>	 (03CR) 10Jelto: "this does not scale and overwhelms apache" [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto)
[14:50:11] <wikibugs>	 (03PS2) 10Jelto: gerrit: throttle alibaba IPs [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235)
[14:51:05] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto)
[14:51:10] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: throttle alibaba IPs [puppet] - 10https://gerrit.wikimedia.org/r/1125449 (https://phabricator.wikimedia.org/T388235) (owner: 10Jelto)
[14:52:02] <wikibugs>	 (03CR) 10Cathal Mooney: "Overall LGTM.  I think the 'rack' probably needs to be upper-case (see comment)." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[14:53:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773) (owner: 10Papaul)
[14:54:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:55:12] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10613449 (10Jclark-ctr)
[14:55:39] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:55:42] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10613450 (10Jclark-ctr)
[14:56:21] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10613453 (10Jclark-ctr)
[14:57:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10613456 (10Jclark-ctr)
[14:57:47] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10613458 (10Jclark-ctr)
[14:57:56] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10613459 (10Jclark-ctr)
[14:58:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10613473 (10Jclark-ctr)
[14:59:05] <wikibugs>	 (03CR) 10Klausman: [C:03+2] admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[14:59:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10613477 (10cmooney) Thanks guys.  Please ping me when these are in Netbox and I will add the links, IPs, vlans etc. and begin the process of commissioni...
[14:59:17] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f7a5b38e280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:59:19] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fcf0209f280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:59:36] <dcausse>	 cloudelastic1009 issues can be ignored
[14:59:39] <wikibugs>	 (03PS1) 10JMeybohm: Rename TILLER_NAMESPACE to K8S_NAMESPACE [puppet] - 10https://gerrit.wikimedia.org/r/1125453
[15:00:27] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10613482 (10Jclark-ctr)
[15:00:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613483 (10phaultfinder)
[15:01:06] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10613485 (10Jclark-ctr)
[15:01:25] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1200.eqiad.wmnet with reason: host reimage
[15:03:23] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10613501 (10elukey) @Jhancock.wm Hi! I apologize in advance for keep requesting the same thing, but could you do another pull/push of a...
[15:03:30] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[15:04:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1201.eqiad.wmnet with OS bullseye
[15:04:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye
[15:04:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS...
[15:04:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS...
[15:04:14] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: increase resource quota for revision models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125422 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[15:04:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:04:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1200.eqiad.wmnet with reason: host reimage
[15:05:10] <wikibugs>	 (03CR) 10Federico Ceratto: "Hello Jaime, I am not familiar with the internals of the backup process. Please let me know how I can help you with the CR and what type o" [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[15:05:21] <wikibugs>	 (03PS5) 10JMeybohm: Add pod-security.wmf.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507)
[15:05:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1204.eqiad.wmnet with OS bullseye
[15:05:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1203.eqiad.wmnet with OS bullseye
[15:05:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1204.eqiad.wmnet with OS...
[15:05:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613522 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1203.eqiad.wmnet with OS...
[15:06:37] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:06:39] <wikibugs>	 (03CR) 10JMeybohm: Add pod-security.wmf.org labels to wikikube mediawiki namespaces (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:56] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:06:59] <wikibugs>	 (03PS2) 10Scott French: mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845)
[15:08:09] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] staging-codfw: Unset image.tag for coredns to apply the default version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450) (owner: 10JMeybohm)
[15:08:23] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:09:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:34] <wikibugs>	 (03CR) 10JMeybohm: "Oh yeah, that's wild indeed. I'll make sure all diffs are clean after merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[15:11:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10613542 (10VRiley-WMF) Looks like we'll be replacing the motherboard soon. Will update when we know a time it should be arriving
[15:12:46] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:13:19] <wikibugs>	 (03Merged) 10jenkins-bot: staging-codfw: Unset image.tag for coredns to apply the default version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450) (owner: 10JMeybohm)
[15:13:31] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:14:15] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:15:29] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:18:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1201.eqiad.wmnet with reason: host reimage
[15:20:28] <wikibugs>	 (03CR) 10Jcrespo: "I would like you to be aware of the changes and give me the ok to proceed with the m1 database changes at least, as technically you (datab" [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[15:20:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1204.eqiad.wmnet with reason: host reimage
[15:20:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1203.eqiad.wmnet with reason: host reimage
[15:21:22] <swfrench-wmf>	 FYI, I'm going to be running a helmfile-only scap deployment in a few minutes to tune some settings that should make deployment timeouts like we saw this week less frequent
[15:22:33] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1201.eqiad.wmnet with reason: host reimage
[15:23:06] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:23:08] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:23:24] <wikibugs>	 (03CR) 10Federico Ceratto: "I replied to the questions and left a question around netbox_server. Is anybody seeing any showstopping bug or safety issue? If not I thin" [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[15:24:44] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:24:46] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:24:47] <wikibugs>	 (03Merged) 10jenkins-bot: mw-*: Tune 8.1 releases to avoid deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125265 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:25:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613593 (10phaultfinder)
[15:25:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1203.eqiad.wmnet with reason: host reimage
[15:26:19] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] "Still a +1 from my side. Thanks Amir. CC @btullis@wikimedia.org." [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[15:26:31] <icinga-wm>	 PROBLEM - Host an-presto1014 is DOWN: PING CRITICAL - Packet loss = 100%
[15:27:34] <swfrench-wmf>	 starting said deployment momentarily
[15:27:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:27:41] <icinga-wm>	 RECOVERY - SSH on an-presto1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:27:43] <icinga-wm>	 RECOVERY - Host an-presto1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[15:27:45] <hashar>	 swfrench-wmf: that is good to know!   The Tuesday automatic deploy had helm time out at 10 minutes and I kind of forgot to ask for it to be raised
[15:27:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:27:58] <swfrench-wmf>	 :)
[15:27:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1200.eqiad.wmnet with OS bullseye
[15:28:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1200.eqiad.wmnet with OS bull...
[15:28:33] <swfrench-wmf>	 hashar: this is why I want to get this in today, since the first deploy on Monday may also generally be slow due to the weekly production image rebuild
[15:28:52] <wikibugs>	 (03CR) 10Federico Ceratto: "(Note: gerrit is flagging the CR as "XL" sized but it's due to the addition of a JSON file used for testing)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto)
[15:29:08] <hashar>	 weekly rebuild?
[15:29:46] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:29:54] <swfrench-wmf>	 the production images (i.e., the php base images used by the multiversion image) are rebuilt every weekend
[15:30:30] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:30:37] * hashar jaw drops
[15:31:03] <hashar>	 I have on my todo list to investigate why the first monday backport is slow
[15:31:17] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1205.eqiad.wmnet with OS bullseye
[15:31:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1205.eqiad.wmnet with OS...
[15:31:33] <swfrench-wmf>	 hashar: that'll do it :)
[15:31:33] <hashar>	 and I guess if scap image building does a pull of the parent image, that invalidates the image, causes a full image to be built and an 8.5G image to deploy
[15:31:42] <hashar>	 which solves a mystery I had to investigate
[15:31:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10613635 (10cmooney)
[15:31:46] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: helmfile-only deploy to reduce likelihood of deployment timeouts - T383845
[15:31:49] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[15:32:17] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1204.eqiad.wmnet with reason: host reimage
[15:32:33] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1125461 (https://phabricator.wikimedia.org/T361576)
[15:32:47] <icinga-wm>	 RECOVERY - Dell PowerEdge RAID Controller on an-presto1014 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[15:32:50] <wikibugs>	 (03Abandoned) 10Hnowlan: trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1113182 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[15:33:41] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deploy to reduce likelihood of deployment timeouts - T383845 (duration: 04m 33s)
[15:34:00] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice. Thanks for doing this." [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[15:35:08] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change dns names for eqiad rack e8 endpoints - cmooney@cumin1002"
[15:35:13] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change dns names for eqiad rack e8 endpoints - cmooney@cumin1002"
[15:35:13] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:35:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10613653 (10cmooney) I've tidied up netbox for these now.  I left the ports enabled on the ssw side with the IPs present, as we can't disable them there and keep the IPs attached....
[15:35:38] <swfrench-wmf>	 alright, that should hopefully do the trick. I'm (ideally) done touching prod on Friday :)
[15:36:08] <hashar>	 when it is to prevent an outage on monday morning, it is fine :)
[15:36:12] <hashar>	 or worse, over the week-end!
[15:36:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10613663 (10cmooney) Lastly please call these new switches //lsw1-e8-eqiad// and //lsw1-f8-eqiad// in Netbox.  We'll need to either have deleted the Dell...
[15:37:06] <swfrench-wmf>	 hashar: exactly, yeah
[15:40:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10613682 (10BTullis) Hi @VRiley-WMF - I'm not sure if you saw my comment on the HDD upgrade ticket here: T385485#10609902  Basically, I was wonde...
[15:43:59] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1206.eqiad.wmnet with OS bullseye
[15:44:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1208.eqiad.wmnet with OS bullseye
[15:44:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1207.eqiad.wmnet with OS bullseye
[15:44:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613697 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1206.eqiad.wmnet with OS...
[15:44:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613698 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1208.eqiad.wmnet with OS...
[15:44:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1207.eqiad.wmnet with OS...
[15:44:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613700 (10phaultfinder)
[15:45:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10613705 (10VRiley-WMF) Will be adding these into netbox shortly
[15:46:23] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:46:47] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1205.eqiad.wmnet with reason: host reimage
[15:46:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:46:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1201.eqiad.wmnet with OS bullseye
[15:47:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1201.eqiad.wmnet with OS bull...
[15:48:30] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[15:49:36] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "updating for renamed dell switches in eqiad - cmooney@cumin1002"
[15:49:42] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "updating for renamed dell switches in eqiad - cmooney@cumin1002"
[15:50:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1205.eqiad.wmnet with reason: host reimage
[15:50:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:52:42] <wikibugs>	 (03PS1) 10JMeybohm: admin_ng: Create cert-manager leases in cert-manager namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125462 (https://phabricator.wikimedia.org/T383553)
[15:53:24] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: revert asyncio thread usage in reference quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125463
[15:53:38] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: revert asyncio thread usage in reference quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125463 (owner: 10Ilias Sarantopoulos)
[15:55:02] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: revert asyncio thread usage in reference quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125463 (owner: 10Ilias Sarantopoulos)
[15:55:51] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:58:05] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[15:58:27] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[15:58:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1206.eqiad.wmnet with reason: host reimage
[15:58:39] <dancy>	 hashar: The first backport of the week shouldn'
[15:58:47] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1208.eqiad.wmnet with reason: host reimage
[15:58:59] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[15:59:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1207.eqiad.wmnet with reason: host reimage
[16:00:31] <dancy>	 hashar: A Monday backport _shouldn't_ be any slower than a Friday one, but next time you see this happening, save a copy of your scap-image-build-and-push-log file.
[16:02:20] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1206.eqiad.wmnet with reason: host reimage
[16:04:53] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:04:54] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1204.eqiad.wmnet with OS bullseye
[16:04:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:04:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613788 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1204.eqiad.wmnet with OS bull...
[16:04:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1203.eqiad.wmnet with OS bullseye
[16:05:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1203.eqiad.wmnet with OS bull...
[16:05:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613790 (10phaultfinder)
[16:05:49] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1207.eqiad.wmnet with reason: host reimage
[16:06:16] <sbassett>	 Hey all - would like to do a security deployment for T387691.  We have an accidentally-disclosed patch so I think it warrants a Friday deploy.  (cc: hashar thcipriani Lucas_WMDE)
[16:06:24] * Lucas_WMDE around
[16:06:49] <thcipriani>	 sounds good
[16:09:03] <dancy>	 I'm around too.
[16:09:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump changelog for 1.0.1 release [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1125466
[16:09:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1208.eqiad.wmnet with reason: host reimage
[16:11:01] <sbassett>	 Deploying…
[16:14:05] <wikibugs>	 (03PS1) 10Clément Goubert: periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249)
[16:14:06] <wikibugs>	 (03PS1) 10Clément Goubert: periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249)
[16:14:15] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert)
[16:15:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:16:31] <Lucas_WMDE>	 sbassett: let me know when I can test on WikimediaDebug
[16:16:32] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene)
[16:16:56] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:17:15] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:17:16] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1205.eqiad.wmnet with OS bullseye
[16:17:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1205.eqiad.wmnet with OS bull...
[16:17:31] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog for 1.0.1 release [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1125466 (owner: 10Muehlenhoff)
[16:17:36] <sbassett>	 Lucas_WMDE: oh I just went ahead with the prod deploy - should be done soon (60% k8s deployed)
[16:17:51] <Lucas_WMDE>	 ah
[16:17:55] * Lucas_WMDE tests anyway
[16:19:39] <sbassett>	 Looks like I might have o/s on commons via staff rights.  I def have the "change visibility" UI for revisions.
[16:19:43] <Lucas_WMDE>	 ok I think it’s behaving as expected
[16:20:16] <sbassett>	 ok, good.  Let me know if you want to test the XSS piece as I’m pretty sure I can o/s…
[16:20:28] <sbassett>	 !log Deployed security patch for T387691
[16:20:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:21:22] <Lucas_WMDE>	 sbassett: and I think we can see the sharp uptick in wbformatvalue API requests here https://grafana.wikimedia.org/d/000000559/mediawiki-action-api-breakdown?orgId=1&var-metric=p50&var-module=wbformatvalue&from=now-1h&to=now
[16:21:34] <Lucas_WMDE>	 but as long as the site can handle the load that should hopefully be fine
[16:21:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:21:50] <sbassett>	 Ok, I assume that’s expected.
[16:21:56] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1202.eqiad.wmnet with OS bullseye
[16:22:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS bull...
[16:22:11] <Lucas_WMDE>	 yeah
[16:22:32] <Lucas_WMDE>	 I expected it to go up and I hope it won’t take the site down 😅
[16:22:34] <Lucas_WMDE>	 so far both are looking good
[16:22:40] <sbassett>	 Ok :)
[16:22:57] <sbassett>	 Let me know if you’d like to test anything else.  I’ll keep an eye on grafana and logstash for a bit.
[16:23:13] <Lucas_WMDE>	 at https://grafana.wikimedia.org/d/000000002/mediawiki-action-api-summary?orgId=1&refresh=5m it still looks like wbformatvalue is a negligible amount
[16:24:11] <wikibugs>	 (03PS2) 10Clément Goubert: periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249)
[16:24:11] <wikibugs>	 (03PS2) 10Clément Goubert: periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249)
[16:24:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10613918 (10phaultfinder)
[16:25:13] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:25:39] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:25:39] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1206.eqiad.wmnet with OS bullseye
[16:25:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:25:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1206.eqiad.wmnet with OS bull...
[16:26:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye
[16:26:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613924 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS...
[16:28:12] <wikibugs>	 (03PS1) 10Cathal Mooney: Enable BGP Multipath for PyBal group [homer/public] - 10https://gerrit.wikimedia.org/r/1125471 (https://phabricator.wikimedia.org/T332027)
[16:28:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:29:02] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:29:03] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1207.eqiad.wmnet with OS bullseye
[16:29:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1207.eqiad.wmnet with OS bull...
[16:29:25] <wikibugs>	 (03PS3) 10Clément Goubert: periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249)
[16:29:26] <wikibugs>	 (03PS3) 10Clément Goubert: periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249)
[16:29:43] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert)
[16:32:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:32:51] <wikibugs>	 (03PS1) 10Vgutierrez: site,hiera: Reimage lvs6003 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477)
[16:33:08] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:33:09] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1208.eqiad.wmnet with OS bullseye
[16:33:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10613969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1208.eqiad.wmnet with OS bull...
[16:33:22] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:33:23] <wikibugs>	 (03PS2) 10Cathal Mooney: Enable BGP Multipath for PyBal group [homer/public] - 10https://gerrit.wikimedia.org/r/1125471 (https://phabricator.wikimedia.org/T332027)
[16:35:17] <wikibugs>	 (03PS2) 10JHathaway: puppetserver: add option to manage git permissions with an acl [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995)
[16:36:20] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar)
[16:37:36] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-2] "do not merge before 2025-03-10" [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[16:40:51] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "I think that we are ready to proceed, no blocker stands out from my point of view." [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[16:41:16] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert)
[16:41:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1202.eqiad.wmnet with reason: host reimage
[16:42:15] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert)
[16:43:47] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway)
[16:44:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614009 (10phaultfinder)
[16:45:32] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1202.eqiad.wmnet with reason: host reimage
[16:46:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1246', diff saved to https://phabricator.wikimedia.org/P74156 and previous config saved to /var/cache/conftool/dbconfig/20250307-164605-root.json
[16:46:27] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1246: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1125476
[16:46:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614019 (10Jclark-ctr)
[16:46:56] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1246: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1125476 (owner: 10Marostegui)
[16:47:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10614020 (10Marostegui) @VRiley-WMF the host is depooled and notifications are disabled. So you can change the mainboard anytime you want, whenever it arrives. I will leave it depooled.
[16:48:53] <wikibugs>	 (03CR) 10Ladsgroup: "I think this patch is for groups? I don't know whether we have min replicas for groups." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto)
[16:50:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for hswan [puppet] - 10https://gerrit.wikimedia.org/r/1125477 (https://phabricator.wikimedia.org/T387522)
[16:50:18] <wikibugs>	 (03PS1) 10DCausse: opensearch: drop minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478
[16:51:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:51:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for hswan [puppet] - 10https://gerrit.wikimedia.org/r/1125477 (https://phabricator.wikimedia.org/T387522) (owner: 10Muehlenhoff)
[16:52:53] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for hswan - https://phabricator.wikimedia.org/T387522#10614031 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access to cn=wmf and cn=logstash-access has been enabled via Wikimedia IDM.
[16:53:52] <wikibugs>	 (03PS2) 10DCausse: opensearch: drop minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478
[16:54:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614036 (10phaultfinder)
[16:57:09] <wikibugs>	 (03CR) 10Marostegui: "yeah, my comment was more answering Federico's comment." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto)
[17:04:58] <wikibugs>	 (03CR) 10Ladsgroup: "ah sorry for confusion." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto)
[17:06:44] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[17:08:51] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:10:49] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1125478 (owner: 10DCausse)
[17:14:04] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002"
[17:18:44] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002"
[17:18:44] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:18:48] <brett>	 Hm, I have an override set for s.ukhe but it's showing him as on call still in the topic
[17:19:32] <wikibugs>	 (03CR) 10DCausse: "quick heads up that cloudelastic1010 is currently master eligible and might need special consideration before the re-image (see https://etherpa" [puppet] - 10https://gerrit.wikimedia.org/r/1125227 (owner: 10Bking)
[17:22:29] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10614141 (10wiki_willy) ++ @Jhancock.wm & @Papaul - per our conversation the other day, this will be the R760xd2 seed server th...
[17:26:05] <wikibugs>	 (03PS1) 10Btullis: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854)
[17:26:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[17:26:49] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:26:50] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1202.eqiad.wmnet with OS bullseye
[17:27:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614153 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1202.eqiad.wmnet with OS bull...
[17:28:04] <wikibugs>	 (03PS2) 10Btullis: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854)
[17:28:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[17:30:08] <wikibugs>	 (03PS3) 10Btullis: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854)
[17:30:42] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[17:32:23] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[17:34:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614189 (10phaultfinder)
[17:35:04] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new Juniper leaf switches eqiad E8/F8 to IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1125488 (https://phabricator.wikimedia.org/T382017)
[17:38:01] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002"
[17:38:06] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns additions for eqiad E8/F8 links to new switches - cmooney@cumin1002"
[17:38:07] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:41:16] <wikibugs>	 06SRE, 10MediaWiki-extensions-OAuth, 06The-Wikipedia-Library, 07Datacenter-Switchover, 07User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10614202 (10matmarex) 05Open→03Resolved...
[17:41:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10614209 (10cmooney) >>! In T382017#10613705, @VRiley-WMF wrote: > Will be adding these into netbox shortly  Cool I can see them there.  FWIW I added t...
[17:42:59] <wikibugs>	 (03PS3) 10Fabfur: Fix previous commit [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098)
[17:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:48:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10614236 (10cmooney) >>! In T380050#10613653, @cmooney wrote: > Please delete the cables/connections from Netbox to match what is done on site.  For the record I deleted the cables...
[17:54:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614250 (10phaultfinder)
[17:57:13] <wikibugs>	 (03PS1) 10D3r1ck01: Set `$wgCentralAuthLoginWiki` to correct default as documented [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218)
[18:04:39] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10614269 (10Dwisehaupt) @MoritzMuehlenhoff Thanks for the info. I'll have @AStein-WMF step through the IDM bits. As far as the...
[18:21:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:24:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614292 (10phaultfinder)
[18:29:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614307 (10phaultfinder)
[18:35:41] <icinga-wm>	 PROBLEM - OSPF status on ssw1-f1-eqiad.mgmt is CRITICAL: OSPFv2: 12/14 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:36:18] <wikibugs>	 (03PS1) 10Amdrel: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527)
[18:36:41] <icinga-wm>	 RECOVERY - OSPF status on ssw1-f1-eqiad.mgmt is OK: OSPFv2: 12/12 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:41:33] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone.conf: update oidc comment section to reflect changes in ID mapping [puppet] - 10https://gerrit.wikimedia.org/r/1125499 (https://phabricator.wikimedia.org/T388137)
[18:42:37] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Remove bfd for link between cr1-codfw and cr2-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1125448 (https://phabricator.wikimedia.org/T387773) (owner: 10Papaul)
[18:54:18] <wikibugs>	 (03CR) 10Dreamy Jazz: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) (owner: 10Amdrel)
[19:01:39] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:01:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:02:58] <wikibugs>	 (03CR) 10Xcollazo: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[19:03:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:05:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:07:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:09:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:13:40] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10614420 (10Ladsgroup) Third wave of deletions in codfw just started (from `20` to `2f`). I will start eqiad on Monday.
[19:20:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:20:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:23:50] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125418 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[19:25:14] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10614432 (10KFrancis) The NDA is complete. Thanks!
[19:26:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614438 (10phaultfinder)
[19:32:26] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:34:06] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] Set `$wgCentralAuthLoginWiki` to correct default as documented [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) (owner: 10D3r1ck01)
[19:35:20] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:35:49] <wikibugs>	 06SRE, 10Cassandra: Eliminate use of secondary IP interfaces & DNS for Cassandra instances - https://phabricator.wikimedia.org/T388169#10614471 (10Eevans)
[19:40:54] <wikibugs>	 (03CR) 10Scott French: "Thanks, Effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[19:43:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:46:06] <wikibugs>	 (03PS1) 10Scott French: mw-(api-ext|web): serve 25% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125503 (https://phabricator.wikimedia.org/T383845)
[19:53:06] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:53:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:55:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614494 (10phaultfinder)
[20:00:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614499 (10Jclark-ctr)
[20:00:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10614501 (10Jclark-ctr) 05Open→03Resolved @BTullis  thanks for your assistance today. these are all finished.
[20:06:37] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[20:07:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343#10614521 (10jhathaway) 05Open→03Resolved Postfix has replaced Exim for our inbound and outbound mail servers in production for some time now. Though th...
[20:08:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:16:06] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:16:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:17:00] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:17:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:17:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:17:46] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:21:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:24:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614568 (10phaultfinder)
[20:25:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:32:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:32:24] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:32:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:34:47] <wikibugs>	 (03PS2) 10Amdrel: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527)
[20:35:11] <wikibugs>	 (03CR) 10Amdrel: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) (owner: 10Amdrel)
[20:42:07] <wikibugs>	 (03PS1) 10Bking: cloudelastic: use EFI boot for cloudelastic1009,1010 [puppet] - 10https://gerrit.wikimedia.org/r/1125520 (https://phabricator.wikimedia.org/T387904)
[20:42:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:43:15] <wikibugs>	 (03PS2) 10Bking: cloudelastic: use EFI boot for cloudelastic1009,1010 [puppet] - 10https://gerrit.wikimedia.org/r/1125520 (https://phabricator.wikimedia.org/T387904)
[20:44:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:49:30] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudelastic1009.eqiad.wmnet']
[20:51:30] <wikibugs>	 (03CR) 10Bking: [C:03+2] "Self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1125520 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking)
[20:51:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:58:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[20:59:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614651 (10phaultfinder)
[21:12:36] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[21:13:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[21:24:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614719 (10phaultfinder)
[21:32:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[21:36:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[21:37:06] <wikibugs>	 (03CR) 10JHathaway: "Thanks @dwisehaupt@wikimedia.org for pulling me in, just a few initial questions." [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[21:40:46] <wikibugs>	 (03CR) 10JHathaway: "Thanks @dwisehaupt@wikimedia.org for pulling me in. Our preference would be for you to use our Postfix profile. I am happy to help adapt i" [puppet] - 10https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[21:44:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614784 (10phaultfinder)
[21:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:49:54] <wikibugs>	 06SRE, 10Cassandra: Eliminate use of secondary IP interfaces & DNS for Cassandra instances - https://phabricator.wikimedia.org/T388169#10614819 (10Eevans)
[21:53:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614840 (10phaultfinder)
[22:01:04] <wikibugs>	 (03PS3) 10BCornwall: haproxy/icinga: Remove RSA from auth algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837)
[22:01:05] <wikibugs>	 (03PS3) 10BCornwall: haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837)
[22:01:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxy/icinga: Remove RSA from auth algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[22:01:22] <wikibugs>	 (03CR) 10BCornwall: haproxy/icinga: Remove RSA from auth algorithms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[22:01:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[22:01:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic
[22:02:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic
[22:02:28] <wikibugs>	 (03PS2) 10Scott French: aptrepo: update pcre2 backport from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006)
[22:02:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[22:04:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10614901 (10ttaylor) FYI, @Seddon is the engineering manager who is actively working with this user, according to the records I get...
[22:05:15] <wikibugs>	 (03PS4) 10BCornwall: haproxy/icinga: Remove RSA from auth algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837)
[22:05:15] <wikibugs>	 (03PS4) 10BCornwall: haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837)
[22:06:31] <wikibugs>	 (03CR) 10Scott French: "Thank you, Matthew!" [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French)
[22:07:47] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5038/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[22:09:22] <wikibugs>	 (03PS3) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715)
[22:10:46] <wikibugs>	 (03CR) 10BCornwall: [C:04-1] geo-maps: update South America DCs (part 1/2) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins)
[22:13:00] <wikibugs>	 (03CR) 10BCornwall: [C:04-1] "Not sure why this is a separate commit: Shouldn't it be merged in with 1124192?" [dns] - 10https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins)
[22:14:51] <wikibugs>	 (03CR) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[22:17:51] <ryankemper>	 !log [Cloudelastic] Doing a `/_cluster/reroute?retry_failed=true` of all 3 elastic/opensearch clusters
[22:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:00] <wikibugs>	 (03CR) 10BCornwall: [C:04-1] "[nit] This and 1124178 should be reformatted to email/plain text standards, meaning no markdown." [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins)
[22:25:28] <wikibugs>	 (03PS1) 10Ebernhardson: Add sudachi analyzer for japanese [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868)
[22:27:20] <wikibugs>	 (03PS2) 10Ebernhardson: Add sudachi analyzer for japanese [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868)
[22:30:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614940 (10phaultfinder)
[22:32:01] <wikibugs>	 (03CR) 10Dzahn: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[22:40:06] <wikibugs>	 (03PS3) 10Huji: New alias for Project namespace on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185)
[22:40:14] <wikibugs>	 (03CR) 10Huji: New alias for Project namespace on Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: 10Huji)
[22:41:03] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10614967 (10Seddon) Hey, yes @aude is currently reporting to myself and is on an active contract. Can I for the time being request a...
[22:42:02] <inflatador>	 !log bking@cloudelastic1009 exclude `cloudelastic1010` from master voting T387904
[22:42:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:05] <stashbot>	 T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904
[22:44:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10614969 (10phaultfinder)
[22:59:04] <wikibugs>	 (03CR) 10JHathaway: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[23:10:22] <wikibugs>	 (03CR) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[23:12:16] <wikibugs>	 (03PS1) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147)
[23:13:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:29:54] <wikibugs>	 (03PS1) 10Cwhite: beta-logs: restore service functionality [puppet] - 10https://gerrit.wikimedia.org/r/1125543
[23:32:39] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] beta-logs: restore service functionality [puppet] - 10https://gerrit.wikimedia.org/r/1125543 (owner: 10Cwhite)
[23:56:17] <wikibugs>	 (03PS5) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147)