[00:09:21] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:09:21] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:12:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1076.mgmt.eqiad.wmnet with reboot policy FORCED
[00:13:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1077.mgmt.eqiad.wmnet with reboot policy FORCED
[00:13:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1078.mgmt.eqiad.wmnet with reboot policy FORCED
[00:13:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1079.mgmt.eqiad.wmnet with reboot policy FORCED
[00:13:57] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[00:13:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[00:13:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[00:14:01] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[00:14:15] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-be1076']
[00:14:21] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-be1076']
[00:14:22] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-be1076']
[00:14:24] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-be1076']
[00:14:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[00:14:46] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1076']
[00:14:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[00:14:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1076']
[00:15:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[00:15:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1076']
[00:30:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye
[00:30:09] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye
[00:30:16] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye
[00:30:22] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye
[00:30:31] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1079.eqiad.wmnet with OS bullseye
[00:30:36] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye
[00:30:50] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye
[00:30:56] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye
[00:37:22] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (Jclark-ctr)
[00:38:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979446
[00:38:57] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979446 (owner: TrainBranchBot)
[00:43:11] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:25] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:07] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979446 (owner: TrainBranchBot)
[00:59:29] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (odimitrijevic) A few questions: * While we ought to consider an upgrade for all 4 clusters, from what I understand Jumbo can be upgraded independently....
[01:09:55] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:50:20] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1076.eqiad.wmnet with OS bullseye
[01:50:25] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye executed with errors: - ms-be1076 (**FAIL...
[01:50:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1077.eqiad.wmnet with OS bullseye
[01:50:38] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye executed with errors: - ms-be1077 (**FAIL...
[01:50:48] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1079.eqiad.wmnet with OS bullseye
[01:50:52] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye executed with errors: - ms-be1079 (**FAIL...
[01:51:07] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1078.eqiad.wmnet with OS bullseye
[01:51:11] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye executed with errors: - ms-be1078 (**FAIL...
[02:39:02] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:00:40] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[03:09:03] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:54:57] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:00:41] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:10:02] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:29:23] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:59] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:43:13] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:06:07] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:21:51] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:19] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:01] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:43] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:40:51] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:07] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:59] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:17:01] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:21:15] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:00:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:00:41] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[11:10:02] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:02] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:48:33] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:02] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:00:42] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[17:44:09] (PS1) JHathaway: apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604)
[17:48:16] (CR) CI reject: [V: -1] apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[19:00:03] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:00:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:00:42] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[19:50:29] (PS1) JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604)
[19:52:55] (CR) CI reject: [V: -1] apt_repo: move hiera data into module, to allow for validation [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[19:54:15] (CR) JHathaway: Explicitly link the apt_repo.yaml hiera file to the modules/profile specs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/979119 (owner: Brouberol)
[19:59:23] (PS2) JHathaway: apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604)
[20:03:29] (CR) CI reject: [V: -1] apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[20:09:15] (PS3) JHathaway: apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604)
[20:14:35] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[20:35:26] (PS4) JHathaway: apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604)
[20:56:43] (PS5) JHathaway: apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604)
[21:00:48] (CR) CI reject: [V: -1] apt_repo: validate preseed data with a JSON Schema [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[21:19:31] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[21:40:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:00:11] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:00:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:00:42] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[23:17:21] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10