[00:10:57] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS bullseye
[00:11:05] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye executed with errors: - cp4047 (**FAIL**)   - Downtimed on Ic...
[00:11:31] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye
[00:11:41] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye
[00:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:15:53] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS bullseye
[00:16:02] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye executed with errors: - cp4047 (**FAIL**)   - Removed from Pu...
[00:16:05] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye
[00:16:15] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye
[00:24:20] <wikibugs>	 (PS1) Zabe: Stop setting cul_actor migration var [mediawiki-config] - https://gerrit.wikimedia.org/r/884137 (https://phabricator.wikimedia.org/T233004)
[00:24:25] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS bullseye
[00:24:34] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye executed with errors: - cp4047 (**FAIL**)   - Removed from Pu...
[00:24:38] <wikibugs>	 (CR) Zabe: [C: +2] Stop setting cul_actor migration var [mediawiki-config] - https://gerrit.wikimedia.org/r/884137 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[00:25:00] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye
[00:25:09] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye
[00:25:38] <wikibugs>	 (Merged) jenkins-bot: Stop setting cul_actor migration var [mediawiki-config] - https://gerrit.wikimedia.org/r/884137 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[00:25:58] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:884137|Stop setting cul_actor migration var (T233004)]]
[00:26:02] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[00:27:35] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:884137|Stop setting cul_actor migration var (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[00:28:55] <wikibugs>	 (PS1) Arlolra: Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884138
[00:33:35] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:884137|Stop setting cul_actor migration var (T233004)]] (duration: 07m 36s)
[00:33:39] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[00:40:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:45:50] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage
[00:47:44] <wikibugs>	 (PS1) Eevans: cassandra-dev: treat client encryption as optional [puppet] - https://gerrit.wikimedia.org/r/884140
[00:49:00] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage
[00:49:27] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/884140 (owner: Eevans)
[00:51:41] <wikibugs>	 (CR) Eevans: [C: +2] cassandra-dev: treat client encryption as optional [puppet] - https://gerrit.wikimedia.org/r/884140 (owner: Eevans)
[00:52:02] <wikibugs>	 (PS1) Brian Wolff: Restrict flow-edit-title to autoconfirmed on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/884142 (https://phabricator.wikimedia.org/T328097)
[00:56:44] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2*: Applying configuration change to cassandra-dev cluster - eevans@cumin1001
[01:10:55] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4047.ulsfo.wmnet with OS bullseye
[01:11:04] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye completed: - cp4047 (**PASS**)   - Removed from Puppet and Pu...
[01:11:38] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet
[01:15:59] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2*: Applying configuration change to cassandra-dev cluster - eevans@cumin1001
[01:19:48] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[01:20:58] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:52:46] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (ssingh) >>! In T327812#8563361, @BCornwall wrote: > This is happening the first time I run the cookbooks on any of the newer servers. I've now adapte...
[02:02:36] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (ssingh) I guess my theory about the incorrect/outdated firmwares was incorrect. On `cp4047`:  ` Integrated Dell Remote Access Controller  5.10.30.00...
[02:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:25:38] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps pdns recursor: drastically shorten max_negative_ttl [puppet] - https://gerrit.wikimedia.org/r/884145
[02:28:56] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloud-vps pdns recursor: drastically shorten max_negative_ttl [puppet] - https://gerrit.wikimedia.org/r/884145 (owner: Andrew Bogott)
[03:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:15:36] <icinga-wm>	 PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:44:16] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:45:09] <wikibugs>	 (CR) Subramanya Sastry: Try to determine what's adding to Parsoid init times (1 comment) [core] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884138 (owner: Arlolra)
[03:47:18] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[03:47:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2257 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:47:40] <jinxer-wm>	 (Outbound discards) firing: (2) Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[03:47:50] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:49:06] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:49:42] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.303 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:49:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:50:28] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[03:54:24] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[03:54:56] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:56:55] <wikibugs>	 (CR) BBlack: [C: +1] esitest: remove deprecated nbproc config option [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[03:57:40] <jinxer-wm>	 (Outbound discards) resolved: (2) Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[03:57:54] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:58:30] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:59:18] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:00:38] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:01:14] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:02:16] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:02:54] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:03:02] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:06:06] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:06:26] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:07:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:09:42] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:12:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:13:02] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:13:34] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:13:58] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:15:40] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:17:30] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:18:02] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:21:04] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:21:38] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:08] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:27:42] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:28:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:38] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:28:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 173 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:28:56] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:29:46] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:31:38] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:34:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:34:46] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:35:08] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:38:18] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:39:24] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 152 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:42:16] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:43:42] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[04:44:34] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:44:42] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 165 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:45:02] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:45:18] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:46:42] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:48:36] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:50:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 161 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:52:54] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:30] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 123 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:58:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:58:18] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:00:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:02:24] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:04:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:04:28] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:05:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:06:40] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:14] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:08:26] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:08:36] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[05:09:48] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:12:08] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[05:15:44] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[05:16:36] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:20:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:20:28] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:21:18] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:27:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:28:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:30:24] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:32:38] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:36:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:41:30] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:42:42] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:46:26] <icinga-wm>	 PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:47:48] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[05:48:30] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.3333 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[05:48:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:48:40] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:52:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 106 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:54:00] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:55:36] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.3182 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[06:01:16] <marostegui>	 pc2014 looks down
[06:02:14] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:05:42] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:06:20] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: pc1 on pc2014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:06:36] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc1 on pc2014 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:07:18] <icinga-wm>	 RECOVERY - MariaDB Event Scheduler pc1 on pc2014 is OK: Version 10.6.10-MariaDB-log, Uptime 83s, read_only: False, event_scheduler: True, 2799.10 QPS, connection latency: 0.004107s, query latency: 0.000537s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[06:07:28] <icinga-wm>	 RECOVERY - MariaDB Replica IO: pc1 on pc2014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:07:42] <icinga-wm>	 RECOVERY - MariaDB read only pc1 on pc2014 is OK: Version 10.6.10-MariaDB-log, Uptime 108s, read_only: False, event_scheduler: True, 2749.20 QPS, connection latency: 0.003582s, query latency: 0.000359s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[06:07:54] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[06:08:02] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:09:16] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:17:11] <wikibugs>	 (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/884151
[06:17:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/884151 (owner: 10Marostegui)
[06:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:27:23] <wikibugs>	 (03PS1) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153
[06:32:52] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230127T0700)
[07:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:25:22] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:34:57] <wikibugs>	 (03PS1) 10Elukey: wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768)
[07:41:29] <elukey>	 !log restart kube-apiserver on ml-serve-ctrl2* nodes as attempt to mitigate some 504 API response errors
[07:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:51:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:54:12] <elukey>	 it should subside at some point in theory
[07:54:17] <elukey>	 metrics have recovered
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230127T0800)
[08:06:44] <elukey>	 !log restart kube-apiserver on ml-staging-ctrl2* nodes as attempt to mitigate some LIST API high latency
[08:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:00] * elukey looks forward to k8s 1.23 and up-to-date knative/istio layers
[08:11:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificates) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:20:26] <wikibugs>	 (03PS1) 10Marostegui: drop_cul_user_text_T328086.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/884221 (https://phabricator.wikimedia.org/T328086)
[08:22:53] <marostegui>	 !log Apply schema change on db1106 (s1 enwiki) T328086
[08:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:57] <stashbot>	 T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086
[08:23:13] <marostegui>	 !log Apply schema change on labtestwiki (clouddb2002-dev) T328086
[08:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:40] <icinga-wm>	 RECOVERY - puppet last run on idm-test1001 is OK: OK: Puppet is currently disabled (test OIDC - slyngshede), not alerting. Last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:30:02] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39294/console" [puppet] - 10https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949) (owner: 10Jelto)
[08:30:38] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:31:46] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:14] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add separate ensure for docker::network [puppet] - 10https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949) (owner: 10Jelto)
[08:51:26] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:51:54] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[08:52:28] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[09:01:16] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:07:53] <wikibugs>	 (03PS1) 10Slyngshede: D:apereo_cas::service: Missing s on groups [puppet] - 10https://gerrit.wikimedia.org/r/884224
[09:09:21] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] D:apereo_cas::service: Missing s on groups [puppet] - 10https://gerrit.wikimedia.org/r/884224 (owner: 10Slyngshede)
[09:10:06] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey)
[09:10:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Disable old bastions [puppet] - 10https://gerrit.wikimedia.org/r/884225 (https://phabricator.wikimedia.org/T324974)
[09:14:27] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10I) While examining this vulnerability, I...
[09:17:25] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looks good, but I don't get why cp3050 (according to debmonitor) got haproxy2.6 as 'profile::cache::haproxy::version' defaults to 'haproxy" [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[09:29:11] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: 10Elukey)
[09:33:50] <wikibugs>	 (03PS1) 10JMeybohm: Update openjdk to 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884267
[09:34:48] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:35:30] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:36:24] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:37:06] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:38:18] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Update openjdk to 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884267 (owner: 10JMeybohm)
[09:40:35] <moritzm>	 !log disabling old bastions bast3005/bast4003/bast5002/bast6001, use bast3006/bast4004/bast5003/bast6002 instead
[09:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Disable old bastions [puppet] - 10https://gerrit.wikimedia.org/r/884225 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff)
[09:48:54] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Jelto)
[09:54:38] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto)
[09:58:07] <wikibugs>	 (03PS1) 10JMeybohm: openjdk: Fix postinst error [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884269
[10:00:06] <hashar>	 jayme: java install failing due to lack of `/usr/share/man/man1/` is an issue in the Debian package and tracked at https://phabricator.wikimedia.org/T289694
[10:00:23] <hashar>	 probably could use a patch upstream
[10:00:35] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] openjdk: Fix postinst error [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884269 (owner: 10JMeybohm)
[10:02:02] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10fgiunchedi)
[10:04:19] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:04:54] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MatthewVernon)
[10:07:09] <wikibugs>	 (03PS1) 10JMeybohm: flink-kubernetes-operator: Fix changelog format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884271
[10:07:11] <wikibugs>	 (03PS1) 10JMeybohm: openjdk: Fix Dockerfile syntax [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884272
[10:08:20] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] openjdk: Fix Dockerfile syntax [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884272 (owner: 10JMeybohm)
[10:08:26] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] flink-kubernetes-operator: Fix changelog format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884271 (owner: 10JMeybohm)
[10:10:52] <jayme>	 hashar: yeah, I figured...
[10:13:13] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10elukey)
[10:15:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-corp2001.wikimedia.org
[10:16:21] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10elukey)
[10:17:24] <wikibugs>	 10SRE-Access-Requests, 10Data-Engineering: Create kerberos principal for user matmarex - https://phabricator.wikimedia.org/T328116 (10BTullis)
[10:18:48] <jayme>	 hashar: is it possible to re-trigger a blubber pipeline without a code commit? To rebuild images based on the most recent version of the base image?
[10:18:51] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10fgiunchedi)
[10:19:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:20:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove the misc-ops Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/884275
[10:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:21:22] <wikibugs>	 10SRE-Access-Requests, 10Data-Engineering: Create kerberos principal for user matmarex - https://phabricator.wikimedia.org/T328116 (10BTullis) 05Open→03Resolved I have created the principal. ` btullis@krb1001:~$ sudo manage_principals.py get matmarex get_principal: Principal does not exist while retrieving...
[10:21:36] <jayme>	 or btullis - could you please bump datahub so it gets rebuilt on top of the new java images?
[10:22:05] <jayme>	 (openjdk 11.0.16)
[10:22:09] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix duplicate host.name property in log [deployment-charts] - 10https://gerrit.wikimedia.org/r/884273 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert)
[10:22:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove the misc-ops Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/884275 (owner: 10Muehlenhoff)
[10:23:18] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:23:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-corp2001.wikimedia.org
[10:23:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10LDAP: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-corp2001.wikimedia.org` - ldap-corp2001.wikimedia.org (**PASS**)   - Downtimed host on Icinga/Al...
[10:24:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ldap-corp-related CNAMES [dns] - 10https://gerrit.wikimedia.org/r/884276 (https://phabricator.wikimedia.org/T323820)
[10:24:57] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove ldap-corp-related CNAMES [dns] - 10https://gerrit.wikimedia.org/r/884276 (https://phabricator.wikimedia.org/T323820)
[10:26:43] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[10:26:49] <btullis>	 jayme: Yes, I can trigger a new build of datahub. I'll do it now.
[10:27:23] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Fix duplicate host.name property in log [deployment-charts] - 10https://gerrit.wikimedia.org/r/884273 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert)
[10:30:02] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[10:30:46] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[10:33:10] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10MatthewVernon)
[10:37:32] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:37:34] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main
[10:38:01] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[10:38:33] <jayme>	 btullis: sweet, thanks. I'll be back in a bit
[10:38:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Decom flowspec1001 - https://phabricator.wikimedia.org/T328009 (10ayounsi) 05Open→03Resolved
[10:39:26] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ayounsi)
[10:39:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi)
[10:40:02] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ayounsi)
[10:42:21] <wikibugs>	 (03PS1) 10EoghanGaffney: Add icinga access for eoghan [puppet] - 10https://gerrit.wikimedia.org/r/884279
[10:43:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove ldap-corp-related CNAMES [dns] - 10https://gerrit.wikimedia.org/r/884276 (https://phabricator.wikimedia.org/T323820) (owner: 10Muehlenhoff)
[10:43:46] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:45:07] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[10:45:08] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:45:32] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:52:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-corp1001.wikimedia.org
[10:53:33] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ldap-corp1001.wikimedia.org
[10:55:15] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/884279 (owner: 10EoghanGaffney)
[10:56:04] <wikibugs>	 (03PS1) 10Muehlenhoff: exim: Remove leftovers of ldap-corp setup [puppet] - 10https://gerrit.wikimedia.org/r/884282 (https://phabricator.wikimedia.org/T323820)
[10:56:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! You can deploy/merge at any time, no further action is needed" [puppet] - 10https://gerrit.wikimedia.org/r/884279 (owner: 10EoghanGaffney)
[11:01:24] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[11:01:27] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: apply on main
[11:02:23] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey)
[11:03:20] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[11:04:13] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[11:04:15] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: apply on main
[11:05:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-corp1001.wikimedia.org
[11:06:52] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:08:00] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:08:02] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[11:09:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:10:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:10:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-corp1001.wikimedia.org
[11:10:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-corp1001.wikimedia.org` - ldap-corp1001.wikimedia.org (**PASS**)   - Downt...
[11:11:08] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[11:12:01] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Enable oidc env vars for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene)
[11:12:21] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[11:12:21] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:12:25] <wikibugs>	 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10BTullis)
[11:12:35] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:13:12] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:13:31] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10elukey)
[11:14:51] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on an-worker1087.eqiad.wmnet with reason: Shutting down an-worker1087 to allow for RAID BBU replacement
[11:15:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on an-worker1087.eqiad.wmnet with reason: Shutting down an-worker1087 to allow for RAID BBU replacement
[11:15:17] <wikibugs>	 (03PS1) 10Elukey: parse: allow integers in form_body [software/httpbb] - 10https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120)
[11:15:18] <wikibugs>	 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b9faed98-4069-454d-bfb5-c0193a85ce5f) set by btullis@cumin1001 for 30 days, 0:00:00 on 1 host(s) and t...
[11:15:22] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudlb2001-dev.mgmt.codfw.wmnet on all recursors
[11:15:25] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb2001-dev.mgmt.codfw.wmnet on all recursors
[11:15:42] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:16:58] <wikibugs>	 (03Merged) 10jenkins-bot: Enable oidc env vars for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene)
[11:18:06] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:24:11] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:24:50] <logmsgbot>	 !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:25:22] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[11:25:57] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:26:18] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 56s)
[11:26:56] <wikibugs>	 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10BTullis)
[11:27:22] <wikibugs>	 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10BTullis) I've added 30 days of downtime and shut down the host.
[11:27:25] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:27:57] <wikibugs>	 (03PS1) 10Btullis: Deploy the new datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799)
[11:31:41] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM 🎉" [software/httpbb] - 10https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120) (owner: 10Elukey)
[11:33:20] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:35:06] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:36:59] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[11:38:28] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[11:39:08] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Add icinga access for eoghan [puppet] - 10https://gerrit.wikimedia.org/r/884279 (owner: 10EoghanGaffney)
[11:39:26] <icinga-wm>	 RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 185, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:40:08] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[11:41:14] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[11:41:58] <XioNoX>	 !log restart keyholder on deploy1002
[11:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Puppet references for ldap-corp1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/884290 (https://phabricator.wikimedia.org/T323820)
[11:46:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:47:57] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62537
[11:48:16] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62537
[11:48:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12033
[11:48:35] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Deploy the new datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis)
[11:48:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12033
[11:49:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 34309
[11:49:19] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Deploy the new datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis)
[11:49:40] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 34309
[11:50:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8560
[11:50:17] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8560
[11:50:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8368
[11:51:03] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8368
[11:52:10] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 56898
[11:52:54] <wikibugs>	 (CR) JMeybohm: [C: +1] sre.k8s.pool-depool-cluster: handle active/passive services (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:53:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56898
[11:53:41] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 14593
[11:54:12] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 14593
[11:54:49] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2001-dev.codfw.wmnet with reason: host reimage
[11:55:10] <wikibugs>	 (Merged) jenkins-bot: Deploy the new datahub image [deployment-charts] - https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799) (owner: Btullis)
[11:55:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 50266
[11:56:04] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 50266
[11:56:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Puppet references for ldap-corp1001/2001 [puppet] - https://gerrit.wikimedia.org/r/884290 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[11:57:27] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 26077
[11:57:47] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26077
[11:57:54] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2001-dev.codfw.wmnet with reason: host reimage
[11:57:58] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 398143
[11:58:12] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398143
[11:58:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 55821
[11:58:41] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 55821
[11:59:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9318
[11:59:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9318
[11:59:57] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 138915
[12:00:19] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[12:00:53] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138915
[12:01:38] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[12:03:25] <logmsgbot>	 !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@9690bf9]: (no justification provided)
[12:03:41] <logmsgbot>	 !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@9690bf9]: (no justification provided) (duration: 00m 15s)
[12:10:38] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm ping me to deploy" [puppet] - https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[12:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:23:00] <wikibugs>	 (PS1) Muehlenhoff: Remove role::openldap_corp and related profiles/templates [puppet] - https://gerrit.wikimedia.org/r/884295 (https://phabricator.wikimedia.org/T323820)
[12:23:18] <wikibugs>	 SRE, conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (jbond)
[12:23:39] <wikibugs>	 SRE, conftool: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (jbond) Open→Resolved a:jbond
[12:23:55] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[12:25:09] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[12:25:10] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[12:25:59] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove role::openldap_corp and related profiles/templates [puppet] - https://gerrit.wikimedia.org/r/884295 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[12:28:34] <wikibugs>	 (CR) Jbond: "https://puppet-compiler.wmflabs.org/output/884040/39295/" [puppet] - https://gerrit.wikimedia.org/r/884040 (owner: Jbond)
[12:29:31] <wikibugs>	 (PS2) Hashar: gerrit: listen on all ports, DROP requests to host [puppet] - https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125)
[12:29:46] <wikibugs>	 (CR) Hashar: "I have fixed a couple typos in the commit message :D" [puppet] - https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[12:30:08] <wikibugs>	 (CR) Clément Goubert: [C: +1] "LGTM, but I'd like Filippo's opinion." [alerts] - https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: Giuseppe Lavagetto)
[12:36:27] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (MoritzMuehlenhoff) Open→Resolved The two VMs have been decommissioned and the Puppet code/certs/secrets removed. I've also sent ITS a headsup that this has been s...
[12:37:30] <wikibugs>	 (CR) Jbond: [C: +2] gerrit: listen on all ports, DROP requests to host [puppet] - https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[12:38:45] <hashar>	 !log Stopped Puppet on gerrit1001 to prevent auto deployment of https://gerrit.wikimedia.org/r/883965
[12:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:06] <hashar>	 !log Rebooting gerrit2002
[12:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:01] <icinga-wm>	 ACKNOWLEDGEMENT - Host gerrit2002 is DOWN: PING CRITICAL - Packet loss = 100% amusso hardware reboot
[12:47:02] <hashar>	 !log gerrit1001 running Puppet to deploy https://gerrit.wikimedia.org/r/883965 and restarting Apache 2 to change the `Listen` statements # T326125
[12:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:07] <stashbot>	 T326125: apache2 fails to start after gerrit hosts are rebooted - https://phabricator.wikimedia.org/T326125
[13:08:26] <moritzm>	 !log installing install6002 T327867
[13:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:31] <stashbot>	 T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867
[13:13:04] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 88.69 ms
[13:14:11] <wikibugs>	 SRE-swift-storage, Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (hnowlan) Weird numbers that may be completely irrelevant- ms-fe1009 got a lot less errors despite being pooled at the same weight:  ` hnowlan@cumin1001:~$ sudo cumin ms...
[13:34:05] <wikibugs>	 (PS1) JMeybohm: kubernetes: Incease inotify limits [puppet] - https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943)
[13:34:45] <wikibugs>	 (CR) Clément Goubert: [C: +1] kubernetes: Incease inotify limits [puppet] - https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[13:35:33] <wikibugs>	 (PS2) JMeybohm: kubernetes: Increase inotify limits [puppet] - https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943)
[13:36:22] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39296/console" [puppet] - https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[13:46:35] <moritzm>	 !log installing install5002 T327867
[13:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:40] <stashbot>	 T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867
[13:54:05] <wikibugs>	 (PS4) Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - https://gerrit.wikimedia.org/r/883847
[13:54:33] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - https://gerrit.wikimedia.org/r/883847 (owner: Jbond)
[13:56:13] <wikibugs>	 (Merged) jenkins-bot: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - https://gerrit.wikimedia.org/r/883847 (owner: Jbond)
[14:00:26] <wikibugs>	 (PS2) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[14:01:59] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond)
[14:02:39] <wikibugs>	 (PS1) Muehlenhoff: openstack::nova::common: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/884307
[14:03:49] <wikibugs>	 (PS3) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[14:04:18] <wikibugs>	 (PS1) Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569)
[14:06:00] <wikibugs>	 (CR) CI reject: [V: -1] sre.gitlab.upgrade: check current and target version [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[14:06:04] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond)
[14:08:55] <wikibugs>	 (PS2) Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569)
[14:10:06] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts clouddb2001-dev.codfw.wmnet
[14:10:43] <wikibugs>	 (CR) CI reject: [V: -1] sre.gitlab.upgrade: check current and target version [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[14:13:34] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Jclark-ctr) Drive pulled again
[14:13:44] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:15:40] <wikibugs>	 (PS3) Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569)
[14:15:52] <wikibugs>	 (PS1) Muehlenhoff: Remove service toggle for TFTP [puppet] - https://gerrit.wikimedia.org/r/884310 (https://phabricator.wikimedia.org/T327867)
[14:17:20] <wikibugs>	 (CR) CI reject: [V: -1] sre.gitlab.upgrade: check current and target version [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[14:17:25] <wikibugs>	 (CR) Elukey: [C: +2] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[14:17:31] <wikibugs>	 (PS21) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[14:17:38] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[14:17:51] <wikibugs>	 (PS4) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[14:18:18] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/884310 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[14:18:30] <wikibugs>	 (PS4) Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569)
[14:19:43] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond)
[14:20:15] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[14:20:15] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:20:16] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts clouddb2001-dev.codfw.wmnet
[14:20:19] <wikibugs>	 (PS1) Elukey: services: update liftwing's test database pattern for changeprop staging [deployment-charts] - https://gerrit.wikimedia.org/r/884311 (https://phabricator.wikimedia.org/T327302)
[14:20:36] <wikibugs>	 (PS5) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[14:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:22:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts clouddb2001-dev.codfw.wmnet
[14:22:29] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond)
[14:26:19] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:27:28] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:27:29] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts clouddb2001-dev.codfw.wmnet
[14:27:36] <wikibugs>	 SRE, ops-drmrs, Infrastructure-Foundations, netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (RobH) Open→In progress Neglected to do this earlier this week, I have the photos so I'll work on this today.
[14:30:52] <wikibugs>	 (CR) Elukey: [C: +2] services: update liftwing's test database pattern for changeprop staging [deployment-charts] - https://gerrit.wikimedia.org/r/884311 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[14:31:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove puppet refs to clouddb2001-dev [puppet] - https://gerrit.wikimedia.org/r/884087 (https://phabricator.wikimedia.org/T328079) (owner: Andrew Bogott)
[14:31:54] <wikibugs>	 (PS2) Andrew Bogott: Remove puppet refs to clouddb2001-dev [puppet] - https://gerrit.wikimedia.org/r/884087 (https://phabricator.wikimedia.org/T328079)
[14:32:20] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/884307 (owner: Muehlenhoff)
[14:32:22] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (jbond) >>! In T323717#8560745, @ssingh wrote:  > ` >     iDrac shouldn't upgrade to 6.00.00.00 (b...
[14:34:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[14:34:30] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (ssingh) >>! In T323717#8564660, @jbond wrote: >>>! In T323717#8560745, @ssingh wrote: >  >> ` >>...
[14:34:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[14:34:49] <wikibugs>	 (CR) Ssingh: [V: +1] esitest: remove deprecated nbproc config option (1 comment) [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[14:35:28] <wikibugs>	 ops-codfw, cloud-services-team, decommission-hardware, Patch-For-Review: decommission clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T328079 (Andrew) a:Andrew→Papaul
[14:35:46] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] esitest: remove deprecated nbproc config option [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[14:35:50] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (jbond) Hi ssingh ,  i have just tested this by trying to upgrade the bios and the nic with only o...
[14:36:46] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (jbond) >>! In T323717#8564666, @ssingh wrote: >>>! In T323717#8564660, @jbond wrote: >>>>! In T32...
[14:38:13] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (ssingh) >>! In T323717#8564670, @jbond wrote: > Hi ssingh , >  > i have just tested this by tryin...
[14:39:57] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[14:40:25] <moritzm>	 !log installing install3002 T327867
[14:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:28] <stashbot>	 T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867
[14:40:44] <wikibugs>	 (PS1) Hashar: scap: remove plugins/.eslintrc.json on finalize stage [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134)
[14:40:57] <wikibugs>	 (CR) Hashar: [C: -1] scap: remove plugins/.eslintrc.json on finalize stage [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) (owner: Hashar)
[14:41:02] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[14:41:18] <wikibugs>	 (CR) Hashar: [C: -1] "I will try to refactor the eslint config instead and use that in last resort." [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) (owner: Hashar)
[14:42:30] <icinga-wm>	 PROBLEM - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[14:42:33] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T328135 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[14:42:38] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (ops-monitoring-bot)
[14:43:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[14:43:31] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[14:45:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:45:27] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:45:57] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[14:46:35] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.232 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:46:41] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:46:51] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[14:49:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[14:53:19] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS bullseye
[14:53:28] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS bullseye
[14:53:57] <wikibugs>	 (PS1) Eevans: cassandra-dev: install docker.io package for local testing [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954)
[14:54:18] <wikibugs>	 (CR) CI reject: [V: -1] cassandra-dev: install docker.io package for local testing [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[14:54:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[14:55:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[14:55:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[14:56:18] <wikibugs>	 (CR) Muehlenhoff: cassandra-dev: install docker.io package for local testing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[14:58:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage
[14:59:49] <wikibugs>	 (PS6) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[15:02:40] <wikibugs>	 (CR) CDanis: cassandra-dev: install docker.io package for local testing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[15:02:45] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage
[15:02:49] <wikibugs>	 (CR) Eevans: cassandra-dev: install docker.io package for local testing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[15:03:08] <wikibugs>	 (PS2) Eevans: cassandra-dev: install docker.io package for local testing [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954)
[15:04:30] <wikibugs>	 (CR) CDanis: [C: +1] cassandra-dev: install docker.io package for local testing [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[15:06:00] <wikibugs>	 (PS7) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[15:08:39] <wikibugs>	 (PS8) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[15:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:12:29] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[15:12:49] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Jclark-ctr) Open→Resolved Netbox is updated
[15:12:56] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (Jclark-ctr)
[15:13:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage
[15:16:58] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage
[15:17:26] <wikibugs>	 (CR) Eevans: [C: +2] cassandra-dev: install docker.io package for local testing [puppet] - https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: Eevans)
[15:19:39] <wikibugs>	 (PS9) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[15:20:55] <wikibugs>	 (PS10) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863
[15:22:12] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2027.codfw.wmnet with OS bullseye
[15:22:25] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye completed: - cp2027 (**PASS**)   - Downtimed on Icinga/Alertm...
[15:24:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[15:29:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[15:31:03] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2027.codfw.wmnet,service=cdn
[15:31:04] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2027.codfw.wmnet,service=ats-be
[15:31:46] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[15:36:37] <wikibugs>	 (PS1) Hashar: phabricator: ensure phd uid/gid can not be changed [puppet] - https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146)
[15:37:10] <wikibugs>	 (CR) Hashar: phabricator: dedupe phd user creation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[15:38:56] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good to me, thanks" [puppet] - https://gerrit.wikimedia.org/r/884282 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[15:39:27] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[15:39:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS bullseye
[15:39:55] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS bullseye completed: - cp4040 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[15:41:06] <wikibugs>	 (CR) Hashar: "I have extracted the code from another pending change https://gerrit.wikimedia.org/r/c/operations/puppet/+/875266/4..5/modules/phabricator" [puppet] - https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[15:41:34] <wikibugs>	 (CR) Hashar: phabricator: change phd home dir to /var/lib/phd (3 comments) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[15:41:55] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[15:42:18] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4040.ulsfo.wmnet,service=cdn
[15:42:18] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4040.ulsfo.wmnet,service=ats-be
[15:42:36] <wikibugs>	 (03PS1) 10Jaime Nuche: jenkins: add secrets for releasing instance [labs/private] - 10https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909)
[15:48:23] <wikibugs>	 (03PS8) 10Hashar: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146)
[15:49:59] <wikibugs>	 (03CR) 10Hashar: "I have moved the part which hardcodes the uid/gid to a standalone change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/884324/  an" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[15:50:33] <logmsgbot>	 !log dancy@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[15:50:38] <logmsgbot>	 !log dancy@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 04s)
[15:51:16] <wikibugs>	 (03CR) 10Hashar: "PCC result: https://puppet-compiler.wmflabs.org/output/884324/1594/" [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[15:53:23] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (10Marostegui) This is testing
[15:53:37] <wikibugs>	 (03PS2) 10Phedenskog: prometheus: remove recording rule for CPU benchmark. [puppet] - 10https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398)
[15:55:27] <wikibugs>	 (03CR) 10Phedenskog: prometheus: remove recording rule for CPU benchmark. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog)
[15:56:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) Thank you!
[15:57:40] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (10Marostegui) 05Open→03Invalid
[16:04:08] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::services: adjust toolsdb pinning [puppet] - 10https://gerrit.wikimedia.org/r/884348
[16:07:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Jclark-ctr) cloudcephosd1035   E3  U33     cableid. 20220009 port. 0    cableid. 20220007 port.  1 cloudcephosd1036   E3  U34     cableid. 20220008...
[16:07:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:wmcs::services: adjust toolsdb pinning [puppet] - 10https://gerrit.wikimedia.org/r/884348 (owner: 10Majavah)
[16:07:46] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] P:wmcs::services: adjust toolsdb pinning [puppet] - 10https://gerrit.wikimedia.org/r/884348 (owner: 10Majavah)
[16:07:54] <wikibugs>	 (03PS1) 10Herron: logstash: remove rate of ingestion percent change compared to yesterday alert [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307)
[16:10:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Jclark-ctr)
[16:12:00] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+1] Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884138 (owner: 10Arlolra)
[16:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:18:02] <icinga-wm>	 RECOVERY - Check systemd state on mw1411 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Jclark-ctr) host  target row aqs1013  d3 u24 port33  aqs1014  e1 u38  port 41 1g aqs1015  f1 u38  port 41 1g
[16:20:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Jclark-ctr) Drive has been Reinserted
[16:28:29] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (10Ekalkst)
[16:32:42] <wikibugs>	 (03PS1) 10JMeybohm: flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351
[16:34:26] <wikibugs>	 (03CR) 10JMeybohm: "Feel free to merge and build anytime" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 10JMeybohm)
[16:44:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:52:53] <wikibugs>	 10SRE-swift-storage, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) In server.log on the swift frontends there is a significant uptick in ERROR messages with timeouts for images during the period Thumbor-k8s was pooled:  ` Jan...
[17:01:31] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) I've done a bunch of investigating, and I don't think I'm much nearer a useful answer.  First, though, it's clear that while DELETE on these "ghost" objects returns 404, it d...
[17:10:48] <icinga-wm>	 RECOVERY - Dell PowerEdge RAID Controller on db1206 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[17:15:01] <wikibugs>	 (03CR) 10Krinkle: "I guess the fact that these haven't fired despite being dead, is suspect. We know the others do fire when metrics regress, but I guess the" [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle)
[17:15:56] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS bullseye
[17:16:02] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye
[17:18:20] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Ammarpad) These files are missing on dis...
[17:20:27] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AFlying+Seagull.jpg I think explains the observations - there was a previous object by the same name de...
[17:28:25] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10jcrespo) I've mentioned to Emperor some things that help explain *some* of the comments. E.g. https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AFlying+Seagull.jpg show...
[17:28:32] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4048.ulsfo.wmnet with OS bullseye
[17:28:37] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye executed with errors: - cp4048 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[17:28:45] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS bullseye
[17:28:51] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye
[17:30:35] <wikibugs>	 (03PS1) 10Cwhite: logstash: only perform dns lookup if parse was successful [puppet] - 10https://gerrit.wikimedia.org/r/884109
[17:35:12] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: only perform dns lookup if parse was successful [puppet] - 10https://gerrit.wikimedia.org/r/884109 (owner: 10Cwhite)
[17:38:05] <logmsgbot>	 !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@907fe2a]: (no justification provided)
[17:38:20] <logmsgbot>	 !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@907fe2a]: (no justification provided) (duration: 00m 14s)
[17:40:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) >>! In T327938#8560262, @ayounsi wrote: > * public1-a/b-codfw host might be better grouped in a single rack per row, providing still redundancy (4 racks per sit...
[17:48:22] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "Looks good, thanks for the patch! I'll deploy a new version today." [software/httpbb] - 10https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120) (owner: 10Elukey)
[17:48:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8560080, @ayounsi wrote: > @Papaul could you rename (Netbox, label, console, et...
[17:48:42] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM, let me know if you need a +2 from o11y" [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle)
[17:49:22] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage
[17:49:54] <wikibugs>	 (03Merged) 10jenkins-bot: parse: allow integers in form_body [software/httpbb] - 10https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120) (owner: 10Elukey)
[17:52:43] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage
[17:53:20] <wikibugs>	 (03CR) 10Krinkle: "What will happen and not happen when I +2 this? Does it auto-deploy?" [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle)
[17:58:31] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] team-perf: Remove firstinputtiming alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle)
[18:06:25] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/884110
[18:07:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360
[18:12:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) > I've no particular preference, if I were doing it myself probably a slight one for the CWDM4/LC links, but happy to go with whatever the consensus/cheapest is...
[18:14:48] <wikibugs>	 (03PS1) 10Catrope: Add VueTest to extension-list, add config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621)
[18:14:59] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4048.ulsfo.wmnet with OS bullseye
[18:15:05] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye completed: - cp4048 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[18:18:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:19:29] <wikibugs>	 (03CR) 10Catrope: [C: 04-2] "Not ready yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope)
[18:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:23:00] <wikibugs>	 (03PS2) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612)
[18:23:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:24:16] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4048.ulsfo.wmnet
[18:24:55] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[18:25:04] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS bullseye
[18:25:12] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye
[18:31:26] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] "Thanks." [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle)
[18:32:38] <wikibugs>	 (03Merged) 10jenkins-bot: team-perf: Remove firstinputtiming alerts [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle)
[18:35:45] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] P:toolforge::grid: install python3-mwparserfromhell [puppet] - 10https://gerrit.wikimedia.org/r/882220 (https://phabricator.wikimedia.org/T327600) (owner: 10Majavah)
[18:37:04] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4041.ulsfo.wmnet with OS bullseye
[18:37:10] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye executed with errors: - cp4041 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[18:37:22] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS bullseye
[18:37:28] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye
[18:42:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney this looks good to me just one question. Is it possible to use xe-0/0/[46-47] for the...
[18:57:28] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage
[19:00:49] <wikibugs>	 (03PS1) 10BBlack: Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380)
[19:01:52] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) (owner: 10BBlack)
[19:02:00] <wikibugs>	 (03PS2) 10BBlack: Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380)
[19:02:11] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage
[19:02:23] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) (owner: 10BBlack)
[19:04:00] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) (owner: 10BBlack)
[19:05:24] <wikibugs>	 (03PS1) 10Catrope: Enable VueTest in labs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621)
[19:05:52] <wikibugs>	 (03CR) 10Catrope: [C: 04-2] "Not ready yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope)
[19:10:06] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall)
[19:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:11:55] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup2002), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[19:15:35] <jynus>	 checking
[19:18:08] <jynus>	 it should fix itself in a few minutes - backups are just running slower than usual this week
[19:19:02] <wikibugs>	 (03PS1) 10Eevans: cassandra-dev: install siege for testing [puppet] - 10https://gerrit.wikimedia.org/r/884368 (https://phabricator.wikimedia.org/T327954)
[19:26:03] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/884368 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans)
[19:28:10] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4041.ulsfo.wmnet with OS bullseye
[19:28:17] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye completed: - cp4041 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[19:29:36] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] cassandra-dev: install siege for testing [puppet] - 10https://gerrit.wikimedia.org/r/884368 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans)
[19:31:12] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp404.ulsfo.wmnet
[19:31:23] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4041.ulsfo.wmnet
[19:32:44] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[19:32:57] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS bullseye
[19:33:03] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye
[19:33:59] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus)
[19:34:20] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) 05Open→03In progress p:05Triage→03Medium
[19:34:50] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus)
[19:34:58] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10RLazarus)
[19:38:18] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4049.ulsfo.wmnet with OS bullseye
[19:38:23] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye executed with errors: - cp4049 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[19:38:59] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS bullseye
[19:39:05] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye
[19:54:23] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: 10Herron)
[19:57:05] <wikibugs>	 (03PS1) 10RLazarus: Release v0.0.2 [software/httpbb] - 10https://gerrit.wikimedia.org/r/884373 (https://phabricator.wikimedia.org/T328162)
[19:59:30] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Release v0.0.2 [software/httpbb] - 10https://gerrit.wikimedia.org/r/884373 (https://phabricator.wikimedia.org/T328162) (owner: 10RLazarus)
[20:01:00] <wikibugs>	 (03Merged) 10jenkins-bot: Release v0.0.2 [software/httpbb] - 10https://gerrit.wikimedia.org/r/884373 (https://phabricator.wikimedia.org/T328162) (owner: 10RLazarus)
[20:02:56] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4049.ulsfo.wmnet with OS bullseye
[20:03:01] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye executed with errors: - cp4049 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[20:05:58] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS bullseye
[20:06:04] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye
[20:18:47] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:23:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:25:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gerrit: listen on all ports, DROP requests to host [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[20:26:26] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage
[20:29:32] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage
[20:32:35] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] varnish: Reword misc-frontend vcl_switch comment [puppet] - 10https://gerrit.wikimedia.org/r/882716 (https://phabricator.wikimedia.org/T205988) (owner: 10BCornwall)
[20:38:05] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (10Dzahn) Hi @Ekalkst, are you an employee of the Wikimedia Foundation, a contractor, or a volunteer, please?
[20:40:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "yea, I like this since meanwhile we do the type assertion also when it's not a class parameter. Originally it was moved to parameters only" [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[20:42:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: ensure phd uid/gid can not be changed [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[20:44:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:44:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "We might have to check on what UID/GID we get on the cloud instance." [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[20:45:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[20:50:30] <wikibugs>	 (03CR) 10Dzahn: "merged the outsourced part" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[20:56:20] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4049.ulsfo.wmnet with OS bullseye
[20:56:25] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye completed: - cp4049 (**WARN**)   - Removed from Puppet and PuppetDB if present   -...
[20:56:55] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: dedupe phd user creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[21:29:57] <wikibugs>	 (03PS1) 10Stang: lmowiktionary: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884378 (https://phabricator.wikimedia.org/T327340)
[21:44:25] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:44:33] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:49:14] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet
[21:49:35] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] varnish: Reword misc-frontend vcl_switch comment [puppet] - 10https://gerrit.wikimedia.org/r/882716 (https://phabricator.wikimedia.org/T205988) (owner: 10BCornwall)
[21:50:04] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Simplify comment misc-frontend.inc.vcl.erb - https://phabricator.wikimedia.org/T205988 (10BCornwall) 05In progress→03Resolved
[21:51:04] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[21:51:21] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS bullseye
[21:51:26] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye
[21:52:07] <wikibugs>	 (03PS2) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650)
[21:59:17] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4042.ulsfo.wmnet with OS bullseye
[21:59:24] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye executed with errors: - cp4042 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[22:00:40] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS bullseye
[22:00:46] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye
[22:11:30] <rzl>	 !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/httpbb/buster/httpbb_0.0.2-1_amd64.changes  # T328162
[22:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:34] <stashbot>	 T328162: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162
[22:11:56] <rzl>	 !log rzl@apt1001:~$ sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/bullseye/httpbb_0.0.2-1+deb11u1_amd64.changes  # T328162
[22:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:20:48] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage
[22:24:08] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage
[22:35:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.467 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:35:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49421 bytes in 2.504 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:46:50] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS bullseye
[22:46:56] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye completed: - cp4042 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[22:48:25] <wikibugs>	 (PS1) Cwhite: logstash: extract tcp flags from ulogd logs [puppet] - https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806)
[22:59:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:59:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:11:12] <wikibugs>	 SRE-tools, Infrastructure-Foundations, serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (RLazarus) In progress→Resolved
[23:11:20] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (RLazarus)
[23:15:57] <wikibugs>	 (PS1) RLazarus: httpbb: Enable --retry_on_timeout so intermittent latency doesn't alert [puppet] - https://gerrit.wikimedia.org/r/884388 (https://phabricator.wikimedia.org/T323707)
[23:20:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:20:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:21:13] <wikibugs>	 (PS1) Cwhite: logstash: apply lowercase on fields that require it [puppet] - https://gerrit.wikimedia.org/r/884112
[23:21:27] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4042.ulsfo.wmnet
[23:22:19] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[23:22:52] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS bullseye
[23:23:09] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4050.ulsfo.wmnet with OS bullseye
[23:31:19] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4050.ulsfo.wmnet with OS bullseye
[23:31:25] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4050.ulsfo.wmnet with OS bullseye executed with errors: - cp4050 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[23:31:34] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS bullseye
[23:31:40] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4050.ulsfo.wmnet with OS bullseye
[23:33:41] <wikibugs>	 (PS1) Dzahn: planet: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977)
[23:37:21] <wikibugs>	 (PS1) Dzahn: releases: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884392 (https://phabricator.wikimedia.org/T327975)
[23:39:38] <wikibugs>	 (PS1) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973)
[23:40:18] <wikibugs>	 (PS2) Dzahn: planet: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977)
[23:41:16] <wikibugs>	 (PS2) Dzahn: releases: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884392 (https://phabricator.wikimedia.org/T327975)
[23:41:30] <wikibugs>	 (CR) CI reject: [V: -1] doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973) (owner: Dzahn)
[23:43:35] <wikibugs>	 (PS1) Dzahn: integration: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884395 (https://phabricator.wikimedia.org/T327972)
[23:45:10] <wikibugs>	 (PS2) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973)
[23:45:41] <wikibugs>	 (PS4) Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850)
[23:46:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:46:54] <wikibugs>	 (PS1) Dzahn: etherpad: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884396 (https://phabricator.wikimedia.org/T327974)
[23:46:57] <wikibugs>	 (CR) CI reject: [V: -1] doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973) (owner: Dzahn)
[23:51:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:52:03] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage
[23:55:23] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage