[00:34:48] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[01:42:38] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1127.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:58:02] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:50] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:24:12] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210821T0700)
[07:02:53] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85%   - https://alerts.wikimedia.org
[07:32:53] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85%   - https://alerts.wikimedia.org
[07:56:53] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85%   - https://alerts.wikimedia.org
[08:06:53] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85%   - https://alerts.wikimedia.org
[08:21:46] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85%   - https://alerts.wikimedia.org
[08:31:46] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85%   - https://alerts.wikimedia.org
[09:03:42] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:04:34] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:12:09] <wikibugs>	 (PS1) Majavah: prometheus_local_crontabs: use a systemd timer [puppet] - https://gerrit.wikimedia.org/r/714173 (https://phabricator.wikimedia.org/T273673)
[10:59:47] <wikibugs>	 SRE, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona)
[11:02:42] <wikibugs>	 SRE, wikidiff2, Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona) @jijiki Hi, CC'ing you since you deployed the previous versions linked in the task description.
[13:16:00] <wikibugs>	 (CR) Ladsgroup: "Ping" [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[13:30:20] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 96 probes of 621 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:36:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 621 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:50:30] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:52:26] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[15:03:38] <wikibugs>	 (PS2) Urbanecm: [labs] enwiki: Enable mentorship for 10% of users only [mediawiki-config] - https://gerrit.wikimedia.org/r/714144 (https://phabricator.wikimedia.org/T287903)
[15:03:53] <wikibugs>	 (CR) Urbanecm: [C: +2] "beta-only, no-op for prod" [mediawiki-config] - https://gerrit.wikimedia.org/r/714144 (https://phabricator.wikimedia.org/T287903) (owner: Urbanecm)
[15:04:35] <wikibugs>	 (Merged) jenkins-bot: [labs] enwiki: Enable mentorship for 10% of users only [mediawiki-config] - https://gerrit.wikimedia.org/r/714144 (https://phabricator.wikimedia.org/T287903) (owner: Urbanecm)
[15:11:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:35] <wikibugs>	 (PS1) Majavah: toolforge: remove portgrabber [puppet] - https://gerrit.wikimedia.org/r/714187
[18:43:58] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:55:18] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 23653 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[21:41:10] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:32] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:46] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:38] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:25:54] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state