[00:50:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:35:44] <icinga-wm>	 RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[02:48:50] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:37:50] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:08:26] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:52] <icinga-wm>	 PROBLEM - snapshot of s3 in eqiad on alert1001 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2021-11-25 03:57:50 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:25:28] <icinga-wm>	 PROBLEM - snapshot of s3 in codfw on alert1001 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2021-11-25 05:04:43 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[07:45:14] <wikibugs>	 (CR) Jayprakash12345: [C: +1] Make use of the ?? operator in some more situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740305 (owner: Thiemo Kreuz (WMDE))
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211128T0800)
[08:00:40] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:41:50] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:50] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:14:04] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms
[09:39:06] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[10:01:56] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:44:06] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:56:53] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:58:52] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:02:56] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:16:13] <wikibugs>	 SRE: config-master fingerprints should include information about user groups - https://phabricator.wikimedia.org/T296588 (Majavah)
[11:32:00] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:38:24] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:42:31] <wikibugs>	 (PS1) Majavah: extdist: Use packaged composer on bullseye [puppet] - https://gerrit.wikimedia.org/r/742236 (https://phabricator.wikimedia.org/T293055)
[12:59:11] <wikibugs>	 (PS1) Majavah: trafficserver: Enable tls on integration.wm.o [puppet] - https://gerrit.wikimedia.org/r/742240 (https://phabricator.wikimedia.org/T263830)
[13:01:14] <wikibugs>	 (PS2) Majavah: set up tls termination on cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829)
[13:01:16] <wikibugs>	 (PS3) Majavah: P::trafficserver: use https for cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829)
[14:05:52] <wikibugs>	 SRE: config-master fingerprints should include information about user groups - https://phabricator.wikimedia.org/T296588 (Urbanecm) Risking I'm saying the obvious here: The group information is available in hiera as `profile::admin::groups`. It should be possible to build something that combines hiera data a...
[14:32:44] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (ops-monitoring-bot)
[14:43:55] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (Majavah) This is one of the localdisk hypervisors we use for Toolforge/Toolsbeta etcd, thankfully not a ToolsDB server
[14:58:36] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:04:50] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.87 ms
[15:10:16] <wikibugs>	 (PS1) Majavah: dynamicproxy: Validate route project [puppet] - https://gerrit.wikimedia.org/r/742267 (https://phabricator.wikimedia.org/T129800)
[15:48:28] <wikibugs>	 (PS1) Urbanecm: Disable Growth IP research survey [mediawiki-config] - https://gerrit.wikimedia.org/r/742268 (https://phabricator.wikimedia.org/T294568)
[17:12:10] <logmsgbot>	 !log elukey@deploy1002 Started deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563
[17:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:22] <logmsgbot>	 !log elukey@deploy1002 Finished deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 (duration: 02m 11s)
[17:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:52] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:22] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:48] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:23] <wikibugs>	 (PS1) PipelineBot: image-suggestion-api: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/742271
[22:03:11] <wikibugs>	 (PS1) Andrew Bogott: cinder.conf: Tune settings for the backup agent. [puppet] - https://gerrit.wikimedia.org/r/742273 (https://phabricator.wikimedia.org/T292546)
[22:05:26] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cinder.conf: Tune settings for the backup agent. [puppet] - https://gerrit.wikimedia.org/r/742273 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[22:18:09] <wikibugs>	 (PS1) Andrew Bogott: cinder-backup: subscribe to config file changes [puppet] - https://gerrit.wikimedia.org/r/742274
[22:20:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cinder-backup: subscribe to config file changes [puppet] - https://gerrit.wikimedia.org/r/742274 (owner: Andrew Bogott)