[00:04:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:17] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] wdqs-internal: lower depool threshold to .3 [puppet] - https://gerrit.wikimedia.org/r/698069 (https://phabricator.wikimedia.org/T284264) (owner: Ryan Kemper)
[01:11:32] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:31] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (Aklapper)
[02:12:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:14:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:35:38] <icinga-wm>	 PROBLEM - snapshot of x1 in codfw on alert1001 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2021-06-03 02:23:55 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:18:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:20:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:49:55] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Aklapper) * https://lists.wikimedia.org/postorius/lists/wikiid-l.lists.wikimedia.org/ - https://lists.wikimedia.org/hyperkitty/list/wikiid-l@lists.wikimedia.org/latest shows last posts 4, 9,...
[05:11:24] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:32:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:36:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:00:26] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:12:02] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210606T0700)
[07:25:42] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:32] <wikibugs>	 SRE, netops: routinator: create gabage collection job - https://phabricator.wikimedia.org/T282469 (cmooney) Loop me in on that Arzhel be interested to see the process.
[11:27:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:29:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:36:52] <wikibugs>	 SRE, netops: routinator: create garbage collection job - https://phabricator.wikimedia.org/T282469 (Aklapper)
[12:05:06] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:50:44] <icinga-wm>	 PROBLEM - snapshot of s7 in codfw on alert1001 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2021-06-03 12:39:53 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[13:22:26] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:32] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:04] <wikibugs>	 (PS1) H.krishna123: [T284399] Perform first commit on operations/bernard repository, add .gitignore and README.md [software/bernard] - https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399)
[15:09:00] <wikibugs>	 (PS2) H.krishna123: [T284399] Perform first commit on operations/bernard repository, add .gitignore and README.md [software/bernard] - https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399)
[15:10:17] <wikibugs>	 (CR) H.krishna123: "Testing my first commit to the Bernard repository 😊" [software/bernard] - https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: H.krishna123)
[15:38:04] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 68 probes of 617 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:43:52] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 39 probes of 617 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:46:16] <icinga-wm>	 PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 120 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[18:47:43] <legoktm>	 looks temporary, it's already dropping
[18:49:42] <icinga-wm>	 RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[19:01:36] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:27:30] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:03] <wikibugs>	 SRE, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Aklapper)
[21:59:00] <icinga-wm>	 PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:24:26] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:32:50] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 6.959e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:59:44] <icinga-wm>	 RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook