[00:20:15] SRE, Infrastructure-Foundations, Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (RLazarus) Hi @bcampbell and @eliza, thanks for the heads up. Based on your notification, SRE investigated and found a firewall issue (potentially related to a kernel bug) that...
[00:22:39] SRE, Infrastructure-Foundations, Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (RLazarus)
[00:24:27] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:55] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[00:26:27] (PS1) Herron: Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/743526
[00:26:53] (PS1) Herron: Prefer mx1001 over mx2001 for smart hosts / wiki mail [puppet] - https://gerrit.wikimedia.org/r/743527
[00:27:10] (PS2) Herron: Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/743526
[00:30:49] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:17] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[00:31:37] !log manually restarting clamav on otrs1001 after being killed
[00:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:35] RECOVERY - exim queue on mx2001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim
[00:45:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:47:21] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[00:49:57] (CR) JHathaway: [C: +2] Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/743526 (owner: Herron)
[00:51:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:56] SRE, Infrastructure-Foundations, Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (RLazarus) Update: The mail queue length on mx2001 is back to normal, so we're substantially caught up on the delayed emails. We'll continue to keep an eye on things and you can...
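The 00:26 and 00:49 changes above shift mail delivery away from the broken mx2001 by adjusting MX preference values, where a lower number is tried first. A minimal sketch of the resulting lookup, with the preference numbers as illustrative placeholders rather than the actual wikimedia.org zone data:

  $ dig +short MX wikimedia.org
  10 mx1001.wikimedia.org.
  50 mx2001.wikimedia.org.
  # Senders try the lowest preference first, so new mail lands on mx1001
  # while mx2001 is repaired; the 10/50 values here are examples only.

The companion puppet change (743527) applies the same preference to the smart-host and wiki-mail configuration, so hosts relaying outbound mail also favor mx1001.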
[00:54:42] !log rebooting mx2001
[00:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:27] PROBLEM - Host mx2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:09:55] RECOVERY - Host mx2001 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[01:13:53] PROBLEM - spamassassin on mx2001 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[01:14:18] !log mx2001 - did not come back from reboot, did not get IP on interface, could not start ferm, logged in via console with root password, in /etc/network/interfaces replaced all "ens5" with "ens13", rebooted again, selected previous kernel version
[01:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:55] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Tue 04 Jan 2022 11:55:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[01:15:13] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:57] RECOVERY - spamassassin on mx2001 is OK: PROCS OK: 3 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[01:23:25] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[01:24:17] (CR) Herron: [C: +2] "proceeding with this due to T297017" [puppet] - https://gerrit.wikimedia.org/r/743527 (owner: Herron)
[01:25:51] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:11] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:35] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:49] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:48] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:17] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:41] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:21:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2040-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[06:59:43] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:09:43] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/743528 (owner: Juan90264)
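The 01:14:18 admin-log entry above records the console-side fix after mx2001 came back from reboot with its NIC renamed, and therefore with no IP address and no working ferm firewall. A minimal sketch of that repair, assuming only the interface rename from ens5 to ens13 (commands are generic; per the log a second reboot was used rather than bringing things up by hand):

  # Point the legacy ifupdown config at the new interface name, then
  # either reboot (as was done here) or bring the stack up manually.
  sed -i 's/ens5/ens13/g' /etc/network/interfaces
  ifup ens13              # restores the address (208.80.153.45 per the SMTP alert above)
  systemctl start ferm    # ferm could not start earlier because the interface had no IP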
[07:10:35] (PS3) Juan90264: Enable groups autopatrolled and patroller for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/743528 (https://phabricator.wikimedia.org/T296637)
[07:24:31] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/743529 (owner: Juan90264)
[07:27:10] (PS3) Juan90264: Enable SandboxLink extension for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/743529 (https://phabricator.wikimedia.org/T296637)
[09:13:35] (CR) Legoktm: "My test plan is to build and push this image with a :testing tag. Then I'll manually adjust one of my k8s deployments to use the :testing " [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm)
[10:21:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2040-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[10:36:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic2040-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[11:21:07] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (Majavah)
[12:23:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:51] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:44] (PS1) Majavah: toolforge: provision delete-crashing-pods values [puppet] - https://gerrit.wikimedia.org/r/743574 (https://phabricator.wikimedia.org/T292925)
[13:01:05] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:03:13] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:02:17] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:53] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:22] (PS1) Urbanecm: Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920)
[21:37:39] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:19] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
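The 09:13:35 review comment above outlines a test plan for the toollabs-images change: publish the image under a :testing tag and point a single Kubernetes deployment at it. A rough, generic sketch of that workflow, with the registry path, image name, deployment name and container name all hypothetical (the real Toolforge build tooling and registry layout may differ):

  # Build and publish the candidate image under a non-default tag.
  docker build -t registry.example.org/toollabs-images/example:testing .
  docker push registry.example.org/toollabs-images/example:testing

  # Point one test deployment at the :testing tag and watch it roll out.
  kubectl set image deployment/example-tool example=registry.example.org/toollabs-images/example:testing
  kubectl rollout status deployment/example-tool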
[23:14:55] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
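deploy_to_mwdebug.service on deploy1002 flaps between failed and recovered throughout the day above, which is what keeps toggling the host between "degraded" and "running". A minimal sketch of how such a degraded-state alert is typically inspected on the host, using standard systemd tooling only (the unit name is taken from the alerts; everything else is generic):

  systemctl is-system-running                        # reports "degraded" while any unit is failed
  systemctl --failed                                 # lists the failed units, e.g. deploy_to_mwdebug.service
  journalctl -u deploy_to_mwdebug.service -n 50      # recent output from the failing unit
  systemctl reset-failed deploy_to_mwdebug.service   # clears the failed state once the cause is understood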