[00:34:48] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [01:42:38] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1127.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:58:02] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:58:50] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:24:12] RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210821T0700) [07:02:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [07:32:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [07:56:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [08:06:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [08:21:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [08:31:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [09:03:42] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:34] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:12:09] (03PS1) 10Majavah: prometheus_local_crontabs: use a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/714173 (https://phabricator.wikimedia.org/T273673) [10:59:47] 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) [11:02:42] 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) @jijiki Hi, CC'ing you since you deployed the previous versions linked in the task description. [13:16:00] (03CR) 10Ladsgroup: "Ping" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [13:30:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 96 probes of 621 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:36:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 621 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:50:30] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [13:52:26] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [15:03:38] (03PS2) 10Urbanecm: [labs] enwiki: Enable mentorship for 10% of users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714144 (https://phabricator.wikimedia.org/T287903) [15:03:53] (03CR) 10Urbanecm: [C: 03+2] "beta-only, no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714144 (https://phabricator.wikimedia.org/T287903) (owner: 10Urbanecm) [15:04:35] (03Merged) 10jenkins-bot: [labs] enwiki: Enable mentorship for 10% of users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714144 (https://phabricator.wikimedia.org/T287903) (owner: 10Urbanecm) [15:11:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:35] (03PS1) 10Majavah: toolforge: remove portgrabber [puppet] - 10https://gerrit.wikimedia.org/r/714187 [18:43:58] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:55:18] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 23653 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:41:10] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:32] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:46] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:38] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:54] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state