[00:50:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:35:44] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[02:48:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:37:50] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:08:26] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:52] PROBLEM - snapshot of s3 in eqiad on alert1001 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2021-11-25 03:57:50 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:25:28] PROBLEM - snapshot of s3 in codfw on alert1001 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2021-11-25 05:04:43 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[07:45:14] (CR) Jayprakash12345: [C: +1] Make use of the ?? operator in some more situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740305 (owner: Thiemo Kreuz (WMDE))
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211128T0800)
[08:00:40] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:41:50] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:50] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:14:04] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms
[09:39:06] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[10:01:56] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:44:06] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:56:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:58:52] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:02:56] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:16:13] SRE: config-master fingerprints should include information about user groups - https://phabricator.wikimedia.org/T296588 (Majavah)
[11:32:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:38:24] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:42:31] (PS1) Majavah: extdist: Use packaged composer on bullseye [puppet] - https://gerrit.wikimedia.org/r/742236 (https://phabricator.wikimedia.org/T293055)
[12:59:11] (PS1) Majavah: trafficserver: Enable tls on integration.wm.o [puppet] - https://gerrit.wikimedia.org/r/742240 (https://phabricator.wikimedia.org/T263830)
[13:01:14] (PS2) Majavah: set up tls termination on cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829)
[13:01:16] (PS3) Majavah: P::trafficserver: use https for cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829)
[14:05:52] SRE: config-master fingerprints should include information about user groups - https://phabricator.wikimedia.org/T296588 (Urbanecm) Risking I'm saying the obvious here: The group information is available in hiera as `profile::admin::groups`. It should be possible to build something that combines hiera data a...
[14:32:44] SRE, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (ops-monitoring-bot)
[14:43:55] SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (Majavah) This is one of the localdisk hypervisors we use for Toolforge/Toolsbeta etcd, thankfully not a ToolsDB server
[14:58:36] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:04:50] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.87 ms
[15:10:16] (PS1) Majavah: dynamicproxy: Validate route project [puppet] - https://gerrit.wikimedia.org/r/742267 (https://phabricator.wikimedia.org/T129800)
[15:48:28] (PS1) Urbanecm: Disable Growth IP research survey [mediawiki-config] - https://gerrit.wikimedia.org/r/742268 (https://phabricator.wikimedia.org/T294568)
[17:12:10] !log elukey@deploy1002 Started deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563
[17:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:22] !log elukey@deploy1002 Finished deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 (duration: 02m 11s)
[17:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:52] RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:22] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:48] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:23] (PS1) PipelineBot: image-suggestion-api: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/742271
[22:03:11] (PS1) Andrew Bogott: cinder.conf: Tune settings for the backup agent. [puppet] - https://gerrit.wikimedia.org/r/742273 (https://phabricator.wikimedia.org/T292546)
[22:05:26] (CR) Andrew Bogott: [C: +2] cinder.conf: Tune settings for the backup agent. [puppet] - https://gerrit.wikimedia.org/r/742273 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[22:18:09] (PS1) Andrew Bogott: cinder-backup: subscribe to config file changes [puppet] - https://gerrit.wikimedia.org/r/742274
[22:20:20] (CR) Andrew Bogott: [C: +2] cinder-backup: subscribe to config file changes [puppet] - https://gerrit.wikimedia.org/r/742274 (owner: Andrew Bogott)