[00:10:57] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS bullseye
[00:11:05] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye executed with errors: - cp4047 (**FAIL**) - Downtimed on Ic...
[00:11:31] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye
[00:11:41] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye
[00:14:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:15:53] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS bullseye
[00:16:02] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye executed with errors: - cp4047 (**FAIL**) - Removed from Pu...
[00:16:05] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye
[00:16:15] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye
[00:24:20] (PS1) Zabe: Stop setting cul_actor migration var [mediawiki-config] - https://gerrit.wikimedia.org/r/884137 (https://phabricator.wikimedia.org/T233004)
[00:24:25] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS bullseye
[00:24:34] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye executed with errors: - cp4047 (**FAIL**) - Removed from Pu...
[00:24:38] (CR) Zabe: [C: +2] Stop setting cul_actor migration var [mediawiki-config] - https://gerrit.wikimedia.org/r/884137 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[00:25:00] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye
[00:25:09] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye
[00:25:38] (Merged) jenkins-bot: Stop setting cul_actor migration var [mediawiki-config] - https://gerrit.wikimedia.org/r/884137 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[00:25:58] !log zabe@deploy1002 Started scap: Backport for [[gerrit:884137|Stop setting cul_actor migration var (T233004)]]
[00:26:02] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[00:27:35] !log zabe@deploy1002 zabe: Backport for [[gerrit:884137|Stop setting cul_actor migration var (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[00:28:55] (PS1) Arlolra: Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884138
[00:33:35] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:884137|Stop setting cul_actor migration var (T233004)]] (duration: 07m 36s)
[00:33:39] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[00:40:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:45:50] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage
[00:47:44] (PS1) Eevans: cassandra-dev: treat client encryption as optional [puppet] - https://gerrit.wikimedia.org/r/884140
[00:49:00] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage
[00:49:27] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/884140 (owner: Eevans)
[00:51:41] (CR) Eevans: [C: +2] cassandra-dev: treat client encryption as optional [puppet] - https://gerrit.wikimedia.org/r/884140 (owner: Eevans)
[00:52:02] (PS1) Brian Wolff: Restrict flow-edit-title to autoconfirmed on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/884142 (https://phabricator.wikimedia.org/T328097)
[00:56:44] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2*: Applying configuration change to cassandra-dev cluster - eevans@cumin1001
[01:10:55] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4047.ulsfo.wmnet with OS bullseye
[01:11:04] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye completed: - cp4047 (**PASS**) - Removed from Puppet and Pu...
[01:11:38] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet
[01:15:59] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2*: Applying configuration change to cassandra-dev cluster - eevans@cumin1001
[01:19:48] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[01:20:58] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:52:46] SRE, ops-eqsin, DC-Ops, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (ssingh) >>! In T327812#8563361, @BCornwall wrote: > This is happening the first time I run the cookbooks on any of the newer servers. I've now adapte...
[02:02:36] SRE, ops-eqsin, DC-Ops, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (ssingh) I guess my theory about the incorrect/outdated firmwares was incorrect. On `cp4047`: ` Integrated Dell Remote Access Controller 5.10.30.00...
[02:06:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:10:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:26] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:25:38] (03PS1) 10Andrew Bogott: cloud-vps pdns recursor: drastically shorten max_negative_ttl [puppet] - 10https://gerrit.wikimedia.org/r/884145 [02:28:56] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps pdns recursor: drastically shorten max_negative_ttl [puppet] - 10https://gerrit.wikimedia.org/r/884145 (owner: 10Andrew Bogott) [03:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:15:36] PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:44:16] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:45:09] (03CR) 10Subramanya Sastry: Try to determine what's adding to Parsoid init times (031 comment) [core] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884138 (owner: 10Arlolra) [03:47:18] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [03:47:34] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2257 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:47:40] (Outbound discards) firing: (2) Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [03:47:50] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:49:06] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:49:42] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.303 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [03:49:54] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:50:28] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [03:54:24] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [03:54:56] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:56:55] (03CR) 10BBlack: [C: 03+1] esitest: remove deprecated nbproc config option [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [03:57:40] (Outbound discards) resolved: (2) Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [03:57:54] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:58:30] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:59:18] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:00:38] PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:01:14] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:02:16] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:02:54] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:03:02] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:06:06] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:06:26] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:07:44] PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:09:42] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:12:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:13:02] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:13:34] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:13:58] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:14:03] 
(ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:15:40] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:17:30] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:18:02] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:21:04] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:21:38] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:25:08] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:27:42] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:28:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:38] PROBLEM - MariaDB Replica 
SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:28:46] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 173 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:28:56] PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:29:46] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:31:38] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:34:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:34:46] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:35:08] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:38:18] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:39:24] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is 
CRITICAL: 152 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:42:16] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:43:42] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:44:34] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:44:42] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 165 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:45:02] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:45:18] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:46:42] PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:48:36] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:50:02] PROBLEM - MediaWiki exceptions and fatals per minute 
for api_appserver on alert1001 is CRITICAL: 161 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:52:54] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:53:30] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:53:36] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 123 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:58:10] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:58:18] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:00:40] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:02:24] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:04:16] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:04:28] PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:05:22] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:06:40] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:07:14] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:08:26] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:08:36] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:09:48] PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:12:08] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:15:44] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:16:36] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:20:10] PROBLEM - MediaWiki exceptions and fatals per minute for 
api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:20:28] PROBLEM - MariaDB Replica Lag: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:21:18] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:27:18] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:28:20] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:30:24] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:32:38] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:36:12] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:41:30] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:42:42] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:46:26] PROBLEM - MariaDB read only pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:47:48] PROBLEM - MariaDB Event Scheduler pc1 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [05:48:30] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.3333 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:48:36] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:48:40] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:52:12] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 106 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:54:00] PROBLEM - MariaDB Replica SQL: pc1 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:55:36] PROBLEM - Some MediaWiki servers are 
running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.3182 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [06:01:16] pc2104 looks down [06:02:14] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:05:42] PROBLEM - MariaDB Replica IO: pc1 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:06:20] RECOVERY - MariaDB Replica SQL: pc1 on pc2014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:06:36] RECOVERY - MariaDB Replica Lag: pc1 on pc2014 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:06:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:07:18] RECOVERY - MariaDB Event Scheduler pc1 on pc2014 is OK: Version 10.6.10-MariaDB-log, Uptime 83s, read_only: False, event_scheduler: True, 2799.10 QPS, connection latency: 0.004107s, query latency: 0.000537s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [06:07:28] RECOVERY - MariaDB Replica IO: pc1 on pc2014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:07:42] RECOVERY - MariaDB read only pc1 on pc2014 is OK: Version 10.6.10-MariaDB-log, Uptime 108s, read_only: False, event_scheduler: True, 2749.20 QPS, connection latency: 0.003582s, query latency: 0.000359s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [06:07:54] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [06:08:02] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:09:16] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:17:11] (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/884151 [06:17:34] (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/884151 (owner: 10Marostegui) [06:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:27:23] (03PS1) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 [06:32:52] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230127T0700)
[07:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:25:22] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:34:57] (PS1) Elukey: wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768)
[07:41:29] !log restart kube-apiserver on ml-serve-ctrl2* nodes as attempt to mitigate some 504 API response errors
[07:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:51:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:54:12] it should subside at some point in theory
[07:54:17] metrics have recovered
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230127T0800)
[08:06:44] !log restart kube-apiserver on ml-staging-ctrl2* nodes as attempt to mitigate some LIST API high latency
[08:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:00] * elukey looks forward to k8s 1.23 and up-to-date knative/istio layers
[08:11:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificates) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:14:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:20:26] (PS1) Marostegui: drop_cul_user_text_T328086.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/884221 (https://phabricator.wikimedia.org/T328086)
[08:22:53] !log Apply schema change on db1106 (s1 enwiki) T328086
[08:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:57] T328086: Drop cul_user and cul_user_text from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328086
[08:23:13] !log Apply schema change on labtestwiki (clouddb2002-dev)T328086
[08:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:40] RECOVERY - puppet last run on idm-test1001 is OK: OK: Puppet is currently disabled (test OIDC - slyngshede), not alerting.
Last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:30:02] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39294/console" [puppet] - https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949) (owner: Jelto)
[08:30:38] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:31:46] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:14] (CR) Jelto: [V: +1 C: +2] gitlab_runner: add separate ensure for docker::network [puppet] - https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949) (owner: Jelto)
[08:51:26] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:51:54] SRE, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[08:52:28] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:01:16] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:07:53] (PS1) Slyngshede: D:apereo_cas::service: Missing s on groups [puppet] - https://gerrit.wikimedia.org/r/884224
[09:09:21] (CR) Slyngshede: [C: +2] D:apereo_cas::service: Missing s on groups [puppet] -
https://gerrit.wikimedia.org/r/884224 (owner: Slyngshede)
[09:10:06] (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey)
[09:10:49] (PS1) Muehlenhoff: Disable old bastions [puppet] - https://gerrit.wikimedia.org/r/884225 (https://phabricator.wikimedia.org/T324974)
[09:14:27] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (I) While examining this vulnerability, I...
[09:17:25] (CR) Vgutierrez: [C: +1] "looks good, but I don't get why cp3050 (according to debmonitor) got haproxy2.6 as 'profile::cache::haproxy::version' defaults to 'haproxy" [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[09:29:11] (CR) DCausse: [C: +1] wmf-config: add new revision-score streams for EventGate main [mediawiki-config] - https://gerrit.wikimedia.org/r/884155 (https://phabricator.wikimedia.org/T317768) (owner: Elukey)
[09:33:50] (PS1) JMeybohm: Update openjdk to 11.0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/884267
[09:34:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:35:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:36:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:37:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response
time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:38:18] (CR) JMeybohm: [V: +2 C: +2] Update openjdk to 11.0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/884267 (owner: JMeybohm)
[09:40:35] !log disabling old bastions bast3005/bast4003/bast5002/bast6001, use bast3006/bast4004/bast5003/bast6002 instead
[09:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:39] (CR) Muehlenhoff: [C: +2] Disable old bastions [puppet] - https://gerrit.wikimedia.org/r/884225 (https://phabricator.wikimedia.org/T324974) (owner: Muehlenhoff)
[09:48:54] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Jelto)
[09:54:38] SRE, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[09:58:07] (PS1) JMeybohm: openjdk: Fix postinst error [docker-images/production-images] - https://gerrit.wikimedia.org/r/884269
[10:00:06] jayme: java install failing due to lack of `/usr/share/man/man1/` is an issue in the Debian package and tracked at https://phabricator.wikimedia.org/T289694
[10:00:23] probably could use a patch upstream
[10:00:35] (CR) JMeybohm: [V: +2 C: +2] openjdk: Fix postinst error [docker-images/production-images] - https://gerrit.wikimedia.org/r/884269 (owner: JMeybohm)
[10:02:02] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (fgiunchedi)
[10:04:19] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:04:54] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MatthewVernon)
[10:07:09] (PS1)
JMeybohm: flink-kubernetes-operator: Fix changelog format [docker-images/production-images] - https://gerrit.wikimedia.org/r/884271
[10:07:11] (PS1) JMeybohm: openjdk: Fix Dockerfile syntax [docker-images/production-images] - https://gerrit.wikimedia.org/r/884272
[10:08:20] (CR) JMeybohm: [V: +2 C: +2] openjdk: Fix Dockerfile syntax [docker-images/production-images] - https://gerrit.wikimedia.org/r/884272 (owner: JMeybohm)
[10:08:26] (CR) JMeybohm: [V: +2 C: +2] flink-kubernetes-operator: Fix changelog format [docker-images/production-images] - https://gerrit.wikimedia.org/r/884271 (owner: JMeybohm)
[10:10:52] hashar: yeah, I figured...
[10:13:13] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (elukey)
[10:15:49] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-corp2001.wikimedia.org
[10:16:21] SRE, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (elukey)
[10:17:24] SRE-Access-Requests, Data-Engineering: Create kerberos principal for user matmarex - https://phabricator.wikimedia.org/T328116 (BTullis)
[10:18:48] hashar: is it possible to re-trigger a blubber pipeline without a code commit? To rebuild images based on the most recent version of the base image?
[10:18:51] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (fgiunchedi)
[10:19:56] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:20:21] (PS1) Muehlenhoff: Remove the misc-ops Cumin alias [puppet] - https://gerrit.wikimedia.org/r/884275
[10:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:21:22] SRE-Access-Requests, Data-Engineering: Create kerberos principal for user matmarex - https://phabricator.wikimedia.org/T328116 (BTullis) Open→Resolved I have created the principal. ` btullis@krb1001:~$ sudo manage_principals.py get matmarex get_principal: Principal does not exist while retrieving...
[10:21:36] or btullis - could you please bump datahub so it gets rebuilt on top of the new java images?
[10:22:05] (openjdk 11.0.16)
[10:22:09] (CR) Clément Goubert: [C: +2] mediawiki: Fix duplicate host.name property in log [deployment-charts] - https://gerrit.wikimedia.org/r/884273 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[10:22:32] (CR) Muehlenhoff: [C: +2] Remove the misc-ops Cumin alias [puppet] - https://gerrit.wikimedia.org/r/884275 (owner: Muehlenhoff)
[10:23:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:23:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-corp2001.wikimedia.org
[10:23:25] SRE, Infrastructure-Foundations, LDAP: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-corp2001.wikimedia.org` - ldap-corp2001.wikimedia.org (**PASS**) - Downtimed host on Icinga/Al...
[10:24:41] (PS1) Muehlenhoff: Remove ldap-corp-related CNAMES [dns] - https://gerrit.wikimedia.org/r/884276 (https://phabricator.wikimedia.org/T323820)
[10:24:57] (PS2) Muehlenhoff: Remove ldap-corp-related CNAMES [dns] - https://gerrit.wikimedia.org/r/884276 (https://phabricator.wikimedia.org/T323820)
[10:26:43] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[10:26:49] jayme: Yes, I can trigger a new build of datahub. I'll do it now.
[10:27:23] (Merged) jenkins-bot: mediawiki: Fix duplicate host.name property in log [deployment-charts] - https://gerrit.wikimedia.org/r/884273 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[10:30:02] SRE, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[10:30:46] SRE, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[10:33:10] SRE, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MatthewVernon)
[10:37:32] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:37:34] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main
[10:38:01] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[10:38:33] btullis: sweet, thanks.
I'll be back in a bit
[10:38:54] SRE, Infrastructure-Foundations, netops: Decom flowspec1001 - https://phabricator.wikimedia.org/T328009 (ayounsi) Open→Resolved
[10:39:26] SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (ayounsi)
[10:39:32] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (ayounsi)
[10:40:02] SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (ayounsi)
[10:42:21] (PS1) EoghanGaffney: Add icinga access for eoghan [puppet] - https://gerrit.wikimedia.org/r/884279
[10:43:16] (CR) Muehlenhoff: [C: +2] Remove ldap-corp-related CNAMES [dns] - https://gerrit.wikimedia.org/r/884276 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[10:43:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:45:07] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[10:45:08] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:45:32] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:52:57] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts
ldap-corp1001.wikimedia.org
[10:53:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ldap-corp1001.wikimedia.org
[10:55:15] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/884279 (owner: EoghanGaffney)
[10:56:04] (PS1) Muehlenhoff: exim: Remove leftovers of ldap-corp setup [puppet] - https://gerrit.wikimedia.org/r/884282 (https://phabricator.wikimedia.org/T323820)
[10:56:15] (CR) Filippo Giunchedi: [C: +1] "LGTM! You can deploy/merge at any time, no further action is needed" [puppet] - https://gerrit.wikimedia.org/r/884279 (owner: EoghanGaffney)
[11:01:24] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[11:01:27] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: apply on main
[11:02:23] (CR) Jgiannelos: [C: +1] Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey)
[11:03:20] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[11:04:13] !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[11:04:15] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: apply on main
[11:05:16] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ldap-corp1001.wikimedia.org
[11:06:52] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:08:00] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:08:02] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[11:09:38] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API:
executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:10:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:10:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ldap-corp1001.wikimedia.org
[11:10:57] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ldap-corp1001.wikimedia.org` - ldap-corp1001.wikimedia.org (**PASS**) - Downt...
[11:11:08] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[11:12:01] (CR) Stevemunene: [C: +2] Enable oidc env vars for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: Stevemunene)
[11:12:21] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Sync for cloudlb2001-dev - aborrero@cumin2002"
[11:12:21] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:12:25] ops-eqiad, DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (BTullis)
[11:12:35] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:13:12] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:13:31] SRE-tools, Infrastructure-Foundations, Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (elukey)
[11:14:51] !log
btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on an-worker1087.eqiad.wmnet with reason: Shutting down an-worker1087 to allow for RAID BBU replacement
[11:15:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on an-worker1087.eqiad.wmnet with reason: Shutting down an-worker1087 to allow for RAID BBU replacement
[11:15:17] (PS1) Elukey: parse: allow integers in form_body [software/httpbb] - https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120)
[11:15:18] ops-eqiad, DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b9faed98-4069-454d-bfb5-c0193a85ce5f) set by btullis@cumin1001 for 30 days, 0:00:00 on 1 host(s) and t...
[11:15:22] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudlb2001-dev.mgmt.codfw.wmnet on all recursors
[11:15:25] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb2001-dev.mgmt.codfw.wmnet on all recursors
[11:15:42] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:16:58] (Merged) jenkins-bot: Enable oidc env vars for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: Stevemunene)
[11:18:06] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:24:11] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:24:50] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:25:22] !log ayounsi@deploy1002
Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[11:25:57] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[11:26:18] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 56s)
[11:26:56] ops-eqiad, DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (BTullis)
[11:27:22] ops-eqiad, DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (BTullis) I've added 30 days of downtime and shut down the host.
[11:27:25] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:27:57] (PS1) Btullis: Deploy the new datahub image [deployment-charts] - https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799)
[11:31:41] (CR) Ilias Sarantopoulos: [C: +1] "LGTM 🎉" [software/httpbb] - https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120) (owner: Elukey)
[11:33:20] (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[11:35:06] (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[11:36:59] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[11:38:28] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[11:39:08] (CR) EoghanGaffney: [C: +2] Add icinga access for eoghan [puppet] - https://gerrit.wikimedia.org/r/884279 (owner: EoghanGaffney)
[11:39:26] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 185, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:40:08]
!log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[11:41:14] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[11:41:58] !log restart keyholder on deploy1002
[11:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:50] (PS1) Muehlenhoff: Remove Puppet references for ldap-corp1001/2001 [puppet] - https://gerrit.wikimedia.org/r/884290 (https://phabricator.wikimedia.org/T323820)
[11:46:18] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:47:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62537
[11:48:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62537
[11:48:28] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12033
[11:48:35] (CR) JMeybohm: [C: +1] Deploy the new datahub image [deployment-charts] - https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799) (owner: Btullis)
[11:48:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12033
[11:49:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 34309
[11:49:19] (CR) Btullis: [C: +2] Deploy the new datahub image [deployment-charts] - https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799) (owner: Btullis)
[11:49:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 34309
[11:50:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8560
[11:50:17] !log ayounsi@cumin1001 END (PASS) - Cookbook
sre.network.peering (exit_code=0) with action 'email' for AS: 8560
[11:50:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8368
[11:51:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8368
[11:52:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 56898
[11:52:54] (CR) JMeybohm: [C: +1] sre.k8s.pool-depool-cluster: handle active/passive services (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:53:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56898
[11:53:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 14593
[11:54:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 14593
[11:54:49] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2001-dev.codfw.wmnet with reason: host reimage
[11:55:10] (Merged) jenkins-bot: Deploy the new datahub image [deployment-charts] - https://gerrit.wikimedia.org/r/884287 (https://phabricator.wikimedia.org/T327799) (owner: Btullis)
[11:55:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 50266
[11:56:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 50266
[11:56:29] (CR) Muehlenhoff: [C: +2] Remove Puppet references for ldap-corp1001/2001 [puppet] - https://gerrit.wikimedia.org/r/884290 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[11:57:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 26077
[11:57:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26077
[11:57:54] !log
aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2001-dev.codfw.wmnet with reason: host reimage
[11:57:58] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 398143
[11:58:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398143
[11:58:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 55821
[11:58:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 55821
[11:59:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9318
[11:59:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9318
[11:59:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 138915
[12:00:19] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[12:00:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138915
[12:01:38] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[12:03:25] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@9690bf9]: (no justification provided)
[12:03:41] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@9690bf9]: (no justification provided) (duration: 00m 15s)
[12:10:38] (CR) Jbond: [C: +1] "lgtm ping me to deploy" [puppet] - https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[12:14:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:23:00] (PS1) Muehlenhoff: Remove role::openldap_corp and related profiles/templates [puppet] - https://gerrit.wikimedia.org/r/884295 (https://phabricator.wikimedia.org/T323820)
[12:23:18] SRE, conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (jbond)
[12:23:39] SRE, conftool: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (jbond) Open→Resolved a:jbond
[12:23:55] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[12:25:09] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[12:25:10] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2001-dev.codfw.wmnet with OS bullseye
[12:25:59] (CR) Muehlenhoff: [C: +2] Remove role::openldap_corp and related profiles/templates [puppet] - https://gerrit.wikimedia.org/r/884295 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[12:28:34] (CR) Jbond: "https://puppet-compiler.wmflabs.org/output/884040/39295/" [puppet] - https://gerrit.wikimedia.org/r/884040 (owner: Jbond)
[12:29:31] (PS2) Hashar: gerrit: listen on all ports, DROP requests to host [puppet] - https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125)
[12:29:46] (CR) Hashar: "I have fixed a couple typos in the commit message :D" [puppet] - https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[12:30:08] (CR) Clément Goubert: [C: +1] "LGTM, but I'd like Filippo's opinion."
[alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: 10Giuseppe Lavagetto) [12:36:27] 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Patch-For-Review: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (10MoritzMuehlenhoff) 05Open→03Resolved The two VMs have been decommissioned and the Puppet code/certs/secrets removed. I've also sent ITS a headsup that this has been s... [12:37:30] (03CR) 10Jbond: [C: 03+2] gerrit: listen on all ports, DROP requests to host [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [12:38:45] !log Stopped Puppet on gerrit1001 to prevent auto deployment of https://gerrit.wikimedia.org/r/883965 [12:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:06] !log Rebooting gerrit2002 [12:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:01] ACKNOWLEDGEMENT - Host gerrit2002 is DOWN: PING CRITICAL - Packet loss = 100% amusso hardware reboot [12:47:02] !log gerrit1001 running Puppet to deploy https://gerrit.wikimedia.org/r/883965 and restarting Apache 2 to change the `Listen` statements # T326125 [12:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:07] T326125: apache2 fails to start after gerrit hosts are rebooted - https://phabricator.wikimedia.org/T326125 [13:08:26] !log installing install6002 T327867 [13:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:31] T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 [13:13:04] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 88.69 ms [13:14:11] 10SRE-swift-storage, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) Weird numbers that may be completely irrelevant- ms-fe1009 got 
a lot less errors despite being pooled at the same weight: ` hnowlan@cumin1001:~$ sudo cumin ms... [13:34:05] (03PS1) 10JMeybohm: kubernetes: Incease inotify limits [puppet] - 10https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) [13:34:45] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: Incease inotify limits [puppet] - 10https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:35:33] (03PS2) 10JMeybohm: kubernetes: Increase inotify limits [puppet] - 10https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) [13:36:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39296/console" [puppet] - 10https://gerrit.wikimedia.org/r/884305 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:46:35] !log installing install5002 T327867 [13:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:40] T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 [13:54:05] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 [13:54:33] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 (owner: 10Jbond) [13:56:13] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 (owner: 10Jbond) [14:00:26] (03PS2) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [14:01:59] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 (owner: 10Jbond) [14:02:39] (03PS1) 10Muehlenhoff: openstack::nova::common: Switch to systemd::sysuser [puppet] - 
10https://gerrit.wikimedia.org/r/884307 [14:03:49] (03PS3) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [14:04:18] (03PS1) 10Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) [14:06:00] (03CR) 10CI reject: [V: 04-1] sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:06:04] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 (owner: 10Jbond) [14:08:55] (03PS2) 10Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) [14:10:06] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts clouddb2001-dev.codfw.wmnet [14:10:43] (03CR) 10CI reject: [V: 04-1] sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:13:34] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Jclark-ctr) Drive pulled again [14:13:44] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:15:40] (03PS3) 10Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) [14:15:52] (03PS1) 10Muehlenhoff: Remove service toggle for TFTP [puppet] - 10https://gerrit.wikimedia.org/r/884310 (https://phabricator.wikimedia.org/T327867) [14:17:20] (03CR) 10CI reject: [V: 04-1] sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 
(https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:17:25] (03CR) 10Elukey: [C: 03+2] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [14:17:31] (03PS21) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [14:17:38] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [14:17:51] (03PS4) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [14:18:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/884310 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [14:18:30] (03PS4) 10Jelto: sre.gitlab.upgrade: check current and target version [cookbooks] - 10https://gerrit.wikimedia.org/r/884308 (https://phabricator.wikimedia.org/T323569) [14:19:43] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 (owner: 10Jbond) [14:20:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [14:20:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:20:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts clouddb2001-dev.codfw.wmnet [14:20:19] (03PS1) 10Elukey: services: update liftwing's test database pattern for changeprop staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/884311 (https://phabricator.wikimedia.org/T327302) [14:20:36] (03PS5) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [14:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:09] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts clouddb2001-dev.codfw.wmnet [14:22:29] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 (owner: 10Jbond) [14:26:19] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:27:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:27:29] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts clouddb2001-dev.codfw.wmnet [14:27:36] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) 05Open→03In progress Neglected to do this earlier this week, I have the photos so I'll work on this today. 
[14:30:52] (03CR) 10Elukey: [C: 03+2] services: update liftwing's test database pattern for changeprop staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/884311 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [14:31:47] (03CR) 10Andrew Bogott: [C: 03+2] Remove puppet refs to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/884087 (https://phabricator.wikimedia.org/T328079) (owner: 10Andrew Bogott) [14:31:54] (03PS2) 10Andrew Bogott: Remove puppet refs to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/884087 (https://phabricator.wikimedia.org/T328079) [14:32:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/884307 (owner: 10Muehlenhoff) [14:32:22] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10jbond) >>! In T323717#8560745, @ssingh wrote: > ` > iDrac shouldn't upgrade to 6.00.00.00 (b... [14:34:26] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [14:34:30] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) >>! In T323717#8564660, @jbond wrote: >>>! In T323717#8560745, @ssingh wrote: > >> ` >>... 
[14:34:37] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [14:34:49] (03CR) 10Ssingh: [V: 03+1] esitest: remove deprecated nbproc config option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:35:28] 10ops-codfw, 10cloud-services-team, 10decommission-hardware, 10Patch-For-Review: decommission clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T328079 (10Andrew) a:05Andrew→03Papaul [14:35:46] (03CR) 10Ssingh: [V: 03+1 C: 03+2] esitest: remove deprecated nbproc config option [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:35:50] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10jbond) Hi ssingh , i have just tested this by trying to upgrade the bios and the nic with only o... [14:36:46] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10jbond) >>! In T323717#8564666, @ssingh wrote: >>>! In T323717#8564660, @jbond wrote: >>>>! In T32... [14:38:13] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) >>! In T323717#8564670, @jbond wrote: > Hi ssingh , > > i have just tested this by tryin... 
[14:39:57] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [14:40:25] !log installing install3002 T327867 [14:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:28] T327867: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 [14:40:44] (03PS1) 10Hashar: scap: remove plugins/.eslintrc.json on finalize stage [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) [14:40:57] (03CR) 10Hashar: [C: 04-1] scap: remove plugins/.eslintrc.json on finalize stage [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) (owner: 10Hashar) [14:41:02] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [14:41:18] (03CR) 10Hashar: [C: 04-1] "I will try to refactor the eslint config instead and use that in last resort." 
[software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) (owner: 10Hashar) [14:42:30] PROBLEM - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:42:33] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db1206 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T328135 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:42:38] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (10ops-monitoring-bot) [14:43:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [14:43:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [14:45:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:57] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [14:46:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.232 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:46:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:46:51] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [14:49:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent 
change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:53:19] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS bullseye [14:53:28] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS bullseye [14:53:57] (03PS1) 10Eevans: cassandra-dev: install docker.io package for local testing [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) [14:54:18] (03CR) 10CI reject: [V: 04-1] cassandra-dev: install docker.io package for local testing [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [14:54:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:55:07] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [14:55:20] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [14:56:18] (03CR) 10Muehlenhoff: cassandra-dev: install docker.io package for local testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [14:58:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage [14:59:49] (03PS6) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 
10https://gerrit.wikimedia.org/r/883863 [15:02:40] (03CR) 10CDanis: cassandra-dev: install docker.io package for local testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [15:02:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage [15:02:49] (03CR) 10Eevans: cassandra-dev: install docker.io package for local testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [15:03:08] (03PS2) 10Eevans: cassandra-dev: install docker.io package for local testing [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) [15:04:30] (03CR) 10CDanis: [C: 03+1] cassandra-dev: install docker.io package for local testing [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [15:06:00] (03PS7) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [15:08:39] (03PS8) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [15:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:12:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [15:12:49] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Jclark-ctr) 05Open→03Resolved Netbox is updated [15:12:56] 10SRE, 10ops-eqiad, 
10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Jclark-ctr) [15:13:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [15:16:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [15:17:26] (03CR) 10Eevans: [C: 03+2] cassandra-dev: install docker.io package for local testing [puppet] - 10https://gerrit.wikimedia.org/r/884322 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [15:19:39] (03PS9) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [15:20:55] (03PS10) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [15:22:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2027.codfw.wmnet with OS bullseye [15:22:25] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye completed: - cp2027 (**PASS**) - Downtimed on Icinga/Alertm... 
[15:24:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:29:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:31:03] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2027.codfw.wmnet,service=cdn [15:31:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2027.codfw.wmnet,service=ats-be [15:31:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:36:37] (03PS1) 10Hashar: phabricator: ensure phd uid/gid can not be changed [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) [15:37:10] (03CR) 10Hashar: phabricator: dedupe phd user creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [15:38:56] (03CR) 10JHathaway: [C: 03+1] "looks good to me, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/884282 (https://phabricator.wikimedia.org/T323820) (owner: 10Muehlenhoff) [15:39:27] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [15:39:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS bullseye [15:39:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS bullseye completed: - cp4040 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [15:41:06] (03CR) 10Hashar: "I have extracted the code from another pending change https://gerrit.wikimedia.org/r/c/operations/puppet/+/875266/4..5/modules/phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [15:41:34] (03CR) 10Hashar: phabricator: change phd home dir to /var/lib/phd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [15:41:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:42:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4040.ulsfo.wmnet,service=cdn [15:42:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4040.ulsfo.wmnet,service=ats-be [15:42:36] (03PS1) 10Jaime Nuche: jenkins: add secrets for releasing instance [labs/private] - 10https://gerrit.wikimedia.org/r/884325 (https://phabricator.wikimedia.org/T323909) [15:48:23] (03PS8) 10Hashar: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) [15:49:59] (03CR) 10Hashar: "I have moved the part which hardcodes the uid/gid to a standalone change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/884324/ an" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [15:50:33] !log dancy@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 [15:50:38] !log dancy@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 04s) [15:51:16] (03CR) 10Hashar: "PCC result: https://puppet-compiler.wmflabs.org/output/884324/1594/" [puppet] - 
10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [15:53:23] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (10Marostegui) This is testing [15:53:37] (03PS2) 10Phedenskog: prometheus: remove recording rule for CPU benchmark. [puppet] - 10https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) [15:55:27] (03CR) 10Phedenskog: prometheus: remove recording rule for CPU benchmark. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [15:56:11] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) Thank you! [15:57:40] 10SRE, 10ops-eqiad: Degraded RAID on db1206 - https://phabricator.wikimedia.org/T328135 (10Marostegui) 05Open→03Invalid [16:04:08] (03PS1) 10Majavah: P:wmcs::services: adjust toolsdb pinning [puppet] - 10https://gerrit.wikimedia.org/r/884348 [16:07:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Jclark-ctr) cloudcephosd1035 E3 U33 cableid. 20220009 port. 0 cableid. 20220007 port. 1 cloudcephosd1036 E3 U34 cableid. 20220008... 
[16:07:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:wmcs::services: adjust toolsdb pinning [puppet] - 10https://gerrit.wikimedia.org/r/884348 (owner: 10Majavah) [16:07:46] (03CR) 10FNegri: [C: 03+2] P:wmcs::services: adjust toolsdb pinning [puppet] - 10https://gerrit.wikimedia.org/r/884348 (owner: 10Majavah) [16:07:54] (03PS1) 10Herron: logstash: remove rate of ingestion percent change compared to yesterday alert [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) [16:10:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Jclark-ctr) [16:12:00] (03CR) 10Daniel Kinzler: [C: 03+1] Try to determine what's adding to Parsoid init times [core] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884138 (owner: 10Arlolra) [16:14:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:02] RECOVERY - Check systemd state on mw1411 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:17] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Jclark-ctr) host target row aqs1013 d3 u24 port33 aqs1014 e1 u38 port 41 1g aqs1015 f1 u38 port 41 1g [16:20:17] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Jclark-ctr) Drive has been Reinserted [16:28:29] 10SRE, 10LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (10Ekalkst) 
[16:32:42] (03PS1) 10JMeybohm: flink(-operator): Update to JRE 11.0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 [16:34:26] (03CR) 10JMeybohm: "Feel free to merge and build anytime" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/884351 (owner: 10JMeybohm) [16:44:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:53] 10SRE-swift-storage, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) In server.log on the swift frontends there is a significant uptick in ERROR messages with timeouts for images during the period Thumbor-k8s was pooled: ` Jan... [17:01:31] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) I've done a bunch of investigating, and I don't think I'm much nearer a useful answer. First, though, it's clear that while DELETE on these "ghost" objects returns 404, it d... [17:10:48] RECOVERY - Dell PowerEdge RAID Controller on db1206 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [17:15:01] (03CR) 10Krinkle: "I guess the fact that these haven't fired despite being dead, is suspect. 
We know the others do fire when metrics regress, but I guess the" [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle) [17:15:56] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS bullseye [17:16:02] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye [17:18:20] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Ammarpad) These files are missing on dis... [17:20:27] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AFlying+Seagull.jpg I think explains the observations - there was a previous object by the same name de... [17:28:25] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10jcrespo) I've mentioned to Emperor some things that help explain *some* of the comments. E.g. https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AFlying+Seagull.jpg show... [17:28:32] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4048.ulsfo.wmnet with OS bullseye [17:28:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye executed with errors: - cp4048 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[17:28:45] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS bullseye [17:28:51] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye [17:30:35] (03PS1) 10Cwhite: logstash: only perform dns lookup if parse was successful [puppet] - 10https://gerrit.wikimedia.org/r/884109 [17:35:12] (03CR) 10Cwhite: [C: 03+2] logstash: only perform dns lookup if parse was successful [puppet] - 10https://gerrit.wikimedia.org/r/884109 (owner: 10Cwhite) [17:38:05] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@907fe2a]: (no justification provided) [17:38:20] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@907fe2a]: (no justification provided) (duration: 00m 14s) [17:40:44] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) >>! In T327938#8560262, @ayounsi wrote: > * public1-a/b-codfw host might be better grouped in a single rack per row, providing still redundancy (4 racks per sit... [17:48:22] (03CR) 10RLazarus: [C: 03+2] "Looks good, thanks for the patch! I'll deploy a new version today." [software/httpbb] - 10https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120) (owner: 10Elukey) [17:48:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8560080, @ayounsi wrote: > @Papaul could you rename (Netbox, label, console, et... 
[17:48:42] (03CR) 10Cwhite: [C: 03+1] "LGTM, let me know if you need a +2 from o11y" [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle) [17:49:22] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [17:49:54] (03Merged) 10jenkins-bot: parse: allow integers in form_body [software/httpbb] - 10https://gerrit.wikimedia.org/r/884285 (https://phabricator.wikimedia.org/T328120) (owner: 10Elukey) [17:52:43] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [17:53:20] (03CR) 10Krinkle: "What will happen and not happen when I +2 this? Does it auto-deploy?" [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle) [17:58:31] (03CR) 10Cwhite: [C: 03+1] team-perf: Remove firstinputtiming alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle) [18:06:25] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/884110 [18:07:12] (03PS1) 10Giuseppe Lavagetto: mediawiki: adapt rsyslog parsing of slowlog to ecs 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884360 [18:12:27] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) > I've no particular preference, if I were doing it myself probably a slight one for the CWDM4/LC links, but happy to go with whatever the consensus/cheapest is... 
[18:14:48] (03PS1) 10Catrope: Add VueTest to extension-list, add config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) [18:14:59] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4048.ulsfo.wmnet with OS bullseye [18:15:05] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4048.ulsfo.wmnet with OS bullseye completed: - cp4048 (**PASS**) - Removed from Puppet and PuppetDB if present -... [18:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:19:29] (03CR) 10Catrope: [C: 04-2] "Not ready yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [18:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:00] (03PS2) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) [18:23:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:24:16] !log brett@cumin1001 conftool action : 
set/pooled=yes; selector: name=cp4048.ulsfo.wmnet [18:24:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:25:04] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS bullseye [18:25:12] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye [18:31:26] (03CR) 10Krinkle: [C: 03+2] "Thanks." [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle) [18:32:38] (03Merged) 10jenkins-bot: team-perf: Remove firstinputtiming alerts [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623) (owner: 10Krinkle) [18:35:45] (03CR) 10BryanDavis: [C: 03+1] P:toolforge::grid: install python3-mwparserfromhell [puppet] - 10https://gerrit.wikimedia.org/r/882220 (https://phabricator.wikimedia.org/T327600) (owner: 10Majavah) [18:37:04] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4041.ulsfo.wmnet with OS bullseye [18:37:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye executed with errors: - cp4041 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[18:37:22] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS bullseye [18:37:28] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye [18:42:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney this looks good to me just one question. Is it possible to use xe-0/0/[46-47] for the... [18:57:28] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [19:00:49] (03PS1) 10BBlack: Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) [19:01:52] (03CR) 10BCornwall: [C: 03+1] Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) (owner: 10BBlack) [19:02:00] (03PS2) 10BBlack: Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) [19:02:11] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [19:02:23] (03CR) 10BCornwall: [C: 03+1] Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) (owner: 10BBlack) [19:04:00] (03CR) 10BBlack: [C: 03+2] Commentary re: image timestamps in URL query part [puppet] - 10https://gerrit.wikimedia.org/r/884363 (https://phabricator.wikimedia.org/T38380) (owner: 10BBlack) [19:05:24] (03PS1) 10Catrope: Enable VueTest in labs only [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) [19:05:52] (03CR) 10Catrope: [C: 04-2] "Not ready yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) (owner: 10Catrope) [19:10:06] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) [19:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:11:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup2002), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:15:35] checking [19:18:08] it should fix itself in a few minutes - backups are just running slower than usual this week [19:19:02] (03PS1) 10Eevans: cassandra-dev: install siege for testing [puppet] - 10https://gerrit.wikimedia.org/r/884368 (https://phabricator.wikimedia.org/T327954) [19:26:03] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/884368 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [19:28:10] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4041.ulsfo.wmnet with OS bullseye [19:28:17] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4041.ulsfo.wmnet with OS bullseye completed: - cp4041 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[19:29:36] (03CR) 10Eevans: [C: 03+2] cassandra-dev: install siege for testing [puppet] - 10https://gerrit.wikimedia.org/r/884368 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [19:31:12] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp404.ulsfo.wmnet [19:31:23] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4041.ulsfo.wmnet [19:32:44] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:32:57] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS bullseye [19:33:03] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye [19:33:59] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) [19:34:20] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) 05Open→03In progress p:05Triage→03Medium [19:34:50] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) [19:34:58] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10RLazarus) [19:38:18] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4049.ulsfo.wmnet with OS bullseye [19:38:23] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye executed with errors: - cp4049 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[19:38:59] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS bullseye [19:39:05] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye [19:54:23] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/884349 (https://phabricator.wikimedia.org/T202307) (owner: 10Herron) [19:57:05] (03PS1) 10RLazarus: Release v0.0.2 [software/httpbb] - 10https://gerrit.wikimedia.org/r/884373 (https://phabricator.wikimedia.org/T328162) [19:59:30] (03CR) 10RLazarus: [C: 03+2] Release v0.0.2 [software/httpbb] - 10https://gerrit.wikimedia.org/r/884373 (https://phabricator.wikimedia.org/T328162) (owner: 10RLazarus) [20:01:00] (03Merged) 10jenkins-bot: Release v0.0.2 [software/httpbb] - 10https://gerrit.wikimedia.org/r/884373 (https://phabricator.wikimedia.org/T328162) (owner: 10RLazarus) [20:02:56] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4049.ulsfo.wmnet with OS bullseye [20:03:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye executed with errors: - cp4049 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[20:05:58] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS bullseye [20:06:04] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye [20:18:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:23:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:25:19] (03CR) 10Dzahn: [C: 03+1] gerrit: listen on all ports, DROP requests to host [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:26:26] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [20:29:32] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [20:32:35] (03CR) 10BBlack: [C: 03+1] varnish: Reword misc-frontend vcl_switch comment [puppet] - 10https://gerrit.wikimedia.org/r/882716 (https://phabricator.wikimedia.org/T205988) (owner: 10BCornwall) [20:38:05] 10SRE, 10LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (10Dzahn) Hi @Ekalkst are you an employee of Wikimedia foundation, a contractor or a volunteer, please? [20:40:13] (03CR) 10Dzahn: [C: 03+1] "yea, I like this since meanwhile we do the type assertion also when it's not a class parameter. 
Originally it was moved to parameters only" [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [20:42:06] (03CR) 10Dzahn: [C: 03+2] phabricator: ensure phd uid/gid can not be changed [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [20:44:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:44:21] (03CR) 10Dzahn: [C: 03+2] "We might have to check on what UID/GID we get on the cloud instance." [puppet] - 10https://gerrit.wikimedia.org/r/884324 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [20:45:59] (03CR) 10Dzahn: [C: 03+2] Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [20:50:30] (03CR) 10Dzahn: "merged the outsourced part" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [20:56:20] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4049.ulsfo.wmnet with OS bullseye [20:56:25] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4049.ulsfo.wmnet with OS bullseye completed: - cp4049 (**WARN**) - Removed from Puppet and PuppetDB if present -... 
[20:56:55] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: dedupe phd user creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [21:29:57] (03PS1) 10Stang: lmowiktionary: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884378 (https://phabricator.wikimedia.org/T327340) [21:44:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:44:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:49:14] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet [21:49:35] (03CR) 10BCornwall: [C: 03+2] varnish: Reword misc-frontend vcl_switch comment [puppet] - 10https://gerrit.wikimedia.org/r/882716 (https://phabricator.wikimedia.org/T205988) (owner: 10BCornwall) [21:50:04] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Simplify comment misc-frontend.inc.vcl.erb - https://phabricator.wikimedia.org/T205988 (10BCornwall) 05In progress→03Resolved [21:51:04] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:51:21] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS bullseye [21:51:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye [21:52:07] (03PS2) 10Gergő Tisza: GrowthExperiments: Update campaign configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884153 (https://phabricator.wikimedia.org/T790650) [21:59:17] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4042.ulsfo.wmnet with OS bullseye 
[21:59:24] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye executed with errors: - cp4042 (**FAIL**) - Downtimed on Icinga/Alertmanager -... [22:00:40] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS bullseye [22:00:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye [22:11:30] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/httpbb/buster/httpbb_0.0.2-1_amd64.changes # T328162 [22:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:34] T328162: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 [22:11:56] !log rzl@apt1001:~$ sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/bullseye/httpbb_0.0.2-1+deb11u1_amd64.changes # T328162 [22:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:20:48] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [22:24:08] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [22:35:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.467 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:35:09] 
RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49421 bytes in 2.504 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:46:50] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS bullseye [22:46:56] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4042.ulsfo.wmnet with OS bullseye completed: - cp4042 (**PASS**) - Removed from Puppet and PuppetDB if present -... [22:48:25] (03PS1) 10Cwhite: logstash: extract tcp flags from ulogd logs [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) [22:59:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:59:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:11:12] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Release httpbb 0.0.2 - https://phabricator.wikimedia.org/T328162 (10RLazarus) 05In progress→03Resolved [23:11:20] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb doesn't support integers in the POST's body - https://phabricator.wikimedia.org/T328120 (10RLazarus) [23:15:57] (03PS1) 10RLazarus: httpbb: Enable --retry_on_timeout so intermittent latency 
doesn't alert [puppet] - 10https://gerrit.wikimedia.org/r/884388 (https://phabricator.wikimedia.org/T323707) [23:20:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:20:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:21:13] (03PS1) 10Cwhite: logstash: apply lowercase on fields that require it [puppet] - 10https://gerrit.wikimedia.org/r/884112 [23:21:27] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4042.ulsfo.wmnet [23:22:19] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [23:22:52] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS bullseye [23:23:09] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4050.ulsfo.wmnet with OS bullseye [23:31:19] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4050.ulsfo.wmnet with OS bullseye [23:31:25] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4050.ulsfo.wmnet with OS bullseye executed with errors: - cp4050 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[23:31:34] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS bullseye [23:31:40] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4050.ulsfo.wmnet with OS bullseye [23:33:41] (03PS1) 10Dzahn: planet: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977) [23:37:21] (03PS1) 10Dzahn: releases: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884392 (https://phabricator.wikimedia.org/T327975) [23:39:38] (03PS1) 10Dzahn: doc: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973) [23:40:18] (03PS2) 10Dzahn: planet: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977) [23:41:16] (03PS2) 10Dzahn: releases: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884392 (https://phabricator.wikimedia.org/T327975) [23:41:30] (03CR) 10CI reject: [V: 04-1] doc: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973) (owner: 10Dzahn) [23:43:35] (03PS1) 10Dzahn: integration: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884395 (https://phabricator.wikimedia.org/T327972) [23:45:10] (03PS2) 10Dzahn: doc: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973) [23:45:41] (03PS4) 10Superpes15: Create additional namespaces on shn.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883620 (https://phabricator.wikimedia.org/T327850) [23:46:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 
226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:46:54] (03PS1) 10Dzahn: etherpad: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884396 (https://phabricator.wikimedia.org/T327974) [23:46:57] (03CR) 10CI reject: [V: 04-1] doc: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T327973) (owner: 10Dzahn) [23:51:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:52:03] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [23:55:23] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage