[00:00:34] <brennen>	 i should have remembered this e-mail yesterday afternoon, but seems like it would fit, i think?
[00:05:20] <brennen>	 i reopened T303010.
[00:05:21] <stashbot>	 T303010: Wikimedia\Rdbms\DBQueryError: Error 1969: Query execution was interrupted (max_statement_time exceeded) (db1096:3316) Function: [function] - https://phabricator.wikimedia.org/T303010
[00:08:13] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:17] <icinga-wm>	 PROBLEM - SSH on wtp1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:17:25] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: wdqs2002, wdqs1004, cloudcontrol1004, wdqs2003, cloudcontrol1005, cloudcontrol1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[00:23:21] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: wdqs1004, cloudcontrol1003, cloudcontrol1005, cloudcontrol1004, wdqs2003, wdqs2002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[00:35:19] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: wdqs1004, cloudcontrol1004, cloudcontrol1005, wdqs2002, cloudcontrol1003, wdqs2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[01:15:57] <icinga-wm>	 RECOVERY - SSH on wtp1041.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:30] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:20:05] <icinga-wm>	 PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:23:35] <icinga-wm>	 RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:39:55] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[07:05:15] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:33] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:01:59] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:53] <icinga-wm>	 PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:17] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:25:35] <icinga-wm>	 RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.78 ms
[09:35:35] <icinga-wm>	 PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:49:33] <_joe_>	 wfm ^^
[10:18:07] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:50:09] <icinga-wm>	 PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:51:17] <icinga-wm>	 PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[11:09:43] <icinga-wm>	 RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:53] <icinga-wm>	 RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[11:19:45] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:50:31] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:53:19] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:09:05] <icinga-wm>	 RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[12:09:11] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:25:53] <icinga-wm>	 PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:10:53] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:21:41] <icinga-wm>	 RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.49 ms
[13:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[14:25:07] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:26:47] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:16:11] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[18:00:25] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:02:07] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:23:07] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:16:26] <wikibugs>	 SRE, Commons, Data-Persistence (Consultation), MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Umherirrender)
[21:38:33] <wikibugs>	 (Abandoned) Zabe: Fix HD logo size at slwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/734614 (https://phabricator.wikimedia.org/T250731) (owner: Zabe)
[21:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[21:48:14] <wikibugs>	 (PS1) Zabe: Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956)
[21:56:11] <wikibugs>	 (PS1) Zabe: Stop writing to wmf* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956)
[21:57:01] <icinga-wm>	 PROBLEM - snapshot of s4 in eqiad on alert1001 is CRITICAL: snapshot for s4 at eqiad taken more than 3 days ago: Most recent backup 2022-03-02 21:24:43 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:58:17] <wikibugs>	 (PS1) Zabe: Migrate wmfDbconfigFromEtcd to wmgDbconfigFromEtcd [mediawiki-config] - https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956)
[21:58:43] <wikibugs>	 (PS2) Zabe: Stop writing to wmf* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956)
[22:50:34] <wikibugs>	 (PS1) Zabe: Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956)
[22:53:27] <wikibugs>	 (PS1) Zabe: wikitech_private: write to wmg* constants [puppet] - https://gerrit.wikimedia.org/r/768260 (https://phabricator.wikimedia.org/T45956)