[00:20:15] SRE, Infrastructure-Foundations, Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (RLazarus) Hi @bcampbell and @eliza, thanks for the heads up. Based on your notification, SRE investigated and found a firewall issue (potentially related to a kernel bug) that...
[00:22:39] SRE, Infrastructure-Foundations, Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (RLazarus)
[00:24:27] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:55] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[00:26:27] (PS1) Herron: Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/743526
[00:26:53] (PS1) Herron: Prefer mx1001 over mx2001 for smart hosts / wiki mail [puppet] - https://gerrit.wikimedia.org/r/743527
[00:27:10] (PS2) Herron: Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/743526
[00:30:49] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:17] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[00:31:37] !log manually restarting clamav on otrs1001 after being killed
[00:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:35] RECOVERY - exim queue on mx2001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim
[00:45:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:47:21] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[00:49:57] (CR) JHathaway: [C: +2] Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/743526 (owner: Herron)
[00:51:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:56] SRE, Infrastructure-Foundations, Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (RLazarus) Update: The mail queue length on mx2001 is back to normal, so we're substantially caught up on the delayed emails. We'll continue to keep an eye on things and you can...
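The 00:26 and 00:49 changes above shift mail delivery away from the broken mx2001 by adjusting MX preference values, where a lower number is tried first. A minimal sketch of the resulting lookup, with the preference numbers as illustrative placeholders rather than the actual wikimedia.org zone data:

  $ dig +short MX wikimedia.org
  10 mx1001.wikimedia.org.
  50 mx2001.wikimedia.org.
  # Senders try the lowest preference first, so new mail lands on mx1001
  # while mx2001 is repaired; the 10/50 values here are examples only.

The companion puppet change (743527) applies the same preference to the smart-host and wiki-mail configuration, so hosts relaying outbound mail also favor mx1001.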
[00:54:42] !log rebooting mx2001
[00:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:27] PROBLEM - Host mx2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:09:55] RECOVERY - Host mx2001 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[01:13:53] PROBLEM - spamassassin on mx2001 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[01:14:18] !log mx2001 - did not come back from reboot, did not get IP on interface, could not start ferm, logged in via console with root password, in /etc/network/interfaces replaced all "ens5" with "ens13", rebooted again, selected previous kernel version
[01:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:55] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Tue 04 Jan 2022 11:55:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[01:15:13] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:57] RECOVERY - spamassassin on mx2001 is OK: PROCS OK: 3 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[01:23:25] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[01:24:17] (CR) Herron: [C: +2] "proceeding with this due to T297017" [puppet] - https://gerrit.wikimedia.org/r/743527 (owner: Herron)
[01:25:51] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:11] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:35] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:49] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:48] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:17] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:41] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:21:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2040-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[06:59:43] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:09:43] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/743528 (owner: Juan90264)
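The 01:14:18 admin-log entry above records the console-side fix after mx2001 came back from reboot with its NIC renamed, and therefore with no IP address and no working ferm firewall. A minimal sketch of that repair, assuming only the interface rename from ens5 to ens13 (commands are generic; per the log a second reboot was used rather than bringing things up by hand):

  # Point the legacy ifupdown config at the new interface name, then
  # either reboot (as was done here) or bring the stack up manually.
  sed -i 's/ens5/ens13/g' /etc/network/interfaces
  ifup ens13              # restores the address (208.80.153.45 per the SMTP alert above)
  systemctl start ferm    # ferm could not start earlier because the interface had no IP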
[07:10:35] (PS3) Juan90264: Enable groups autopatrolled and patroller for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/743528 (https://phabricator.wikimedia.org/T296637)
[07:24:31] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/743529 (owner: Juan90264)
[07:27:10] (PS3) Juan90264: Enable SandboxLink extension for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/743529 (https://phabricator.wikimedia.org/T296637)
[09:13:35] (CR) Legoktm: "My test plan is to build and push this image with a :testing tag. Then I'll manually adjust one of my k8s deployments to use the :testing " [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm)
[10:21:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2040-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[10:36:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic2040-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[11:21:07] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (Majavah)
[12:23:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:51] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:44] (PS1) Majavah: toolforge: provision delete-crashing-pods values [puppet] - https://gerrit.wikimedia.org/r/743574 (https://phabricator.wikimedia.org/T292925)
[13:01:05] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:03:13] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:02:17] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:53] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:22] (PS1) Urbanecm: Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920)
[21:37:39] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:19] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
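The 09:13:35 review comment above outlines a test plan for the toollabs-images change: publish the image under a :testing tag and point a single Kubernetes deployment at it. A rough, generic sketch of that workflow, with the registry path, image name, deployment name and container name all hypothetical (the real Toolforge build tooling and registry layout may differ):

  # Build and publish the candidate image under a non-default tag.
  docker build -t registry.example.org/toollabs-images/example:testing .
  docker push registry.example.org/toollabs-images/example:testing

  # Point one test deployment at the :testing tag and watch it roll out.
  kubectl set image deployment/example-tool example=registry.example.org/toollabs-images/example:testing
  kubectl rollout status deployment/example-tool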
[23:14:55] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
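deploy_to_mwdebug.service on deploy1002 flaps between failed and recovered throughout the day above, which is what keeps toggling the host between "degraded" and "running". A minimal sketch of how such a degraded-state alert is typically inspected on the host, using standard systemd tooling only (the unit name is taken from the alerts; everything else is generic):

  systemctl is-system-running                        # reports "degraded" while any unit is failed
  systemctl --failed                                 # lists the failed units, e.g. deploy_to_mwdebug.service
  journalctl -u deploy_to_mwdebug.service -n 50      # recent output from the failing unit
  systemctl reset-failed deploy_to_mwdebug.service   # clears the failed state once the cause is understood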