[00:10:10] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:10:28] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:24:16] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:37:22] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 6.648e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:00:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:37:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:41:24] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:54:54] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.069e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:55:16] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:27:30] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:34:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[04:00:42] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2022-02-24 03:29:30 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[04:00:42] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:32] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:08] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:24:04] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:41:58] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (install6001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:16:30] PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 11186 MB (5% inode=64%): /tmp 11186 MB (5% inode=64%): /var/tmp 11186 MB (5% inode=64%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[06:19:58] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1566 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:41:20] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:55:20] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:58:42] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:41:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[07:55:06] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220227T0800)
[08:32:39] (NodeTextfileStale) firing: Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[09:23:11] 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10Ammarpad) Thank you @Dzahn.
[09:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:49:26] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:56:16] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:00:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:46] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[10:04:00] (03PS1) 10Majavah: openstack: use tls for horizon->api connections [puppet] - 10https://gerrit.wikimedia.org/r/766281
[10:07:01] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33997/console" [puppet] - 10https://gerrit.wikimedia.org/r/766281 (owner: 10Majavah)
[10:08:22] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:05] (03PS2) 10Majavah: openstack: use tls for horizon->api connections [puppet] - 10https://gerrit.wikimedia.org/r/766281
[10:27:50] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:41:02] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[10:50:48] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:09:50] (03PS1) 10Majavah: Extract ssh fingerprint publishing to an independent class [puppet] - 10https://gerrit.wikimedia.org/r/766291
[11:09:52] (03PS1) 10Majavah: P:toolforge::static: publish SSH fingerprints under /admin [puppet] - 10https://gerrit.wikimedia.org/r/766292
[11:11:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33998/console" [puppet] - 10https://gerrit.wikimedia.org/r/766291 (owner: 10Majavah)
[11:11:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33999/console" [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah)
[11:29:28] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:58] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[11:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[11:56:12] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:10:10] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:32:54] (NodeTextfileStale) firing: Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[12:50:42] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:56:12] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:10:20] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:10:46] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[13:20:08] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:22] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.34 ms
[13:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[13:56:18] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:03:48] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 113 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:10:18] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:16:48] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:54:02] PROBLEM - Host durum6001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:08] PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:18] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:10:24] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:48:35] (03PS1) 10Zabe: Add centralauth-suppress to steward at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675)
[15:48:37] (03PS1) 10Zabe: Remove centralauth-oversight from steward at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766307 (https://phabricator.wikimedia.org/T302675)
[15:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[15:55:28] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:56:26] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:09:34] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:10:32] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:21:10] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:32:54] (NodeTextfileStale) firing: Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[17:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[19:02:24] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:46] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:44:18] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:47:39] (NodeTextfileStale) firing: (2) Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[19:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[20:42:13] !log configure OSPF between cr2-drmrs and cr2-eqdfw
[20:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:33] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[20:44:33] (ProbeHttpFailed) firing: (13) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[20:44:37] RECOVERY - Host cr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 147.26 ms
[20:44:38] RECOVERY - Host cr2-drmrs is UP: PING OK - Packet loss = 0%, RTA = 149.14 ms
[20:44:40] RECOVERY - Host ncredir-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 149.65 ms
[20:44:40] RECOVERY - Host bast6001 is UP: PING OK - Packet loss = 0%, RTA = 147.24 ms
[20:44:40] PROBLEM - Host lvs6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:40] PROBLEM - Host cp6015 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:40] PROBLEM - Host ganeti6002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:40] PROBLEM - Host lvs6001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:42] PROBLEM - Host cp6011 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:42] PROBLEM - Host cp6002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:42] PROBLEM - Host cp6003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6008 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6013 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6004 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6009 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:45] PROBLEM - Host lvs6002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:45] PROBLEM - Host cp6012 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:46] PROBLEM - Host cp6016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:46] PROBLEM - Host cp6007 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:47] PROBLEM - Host cp6010 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:47] PROBLEM - Host cp6005 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:48] PROBLEM - Host ganeti6001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:48] PROBLEM - Host ganeti6003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:46:03] RECOVERY - Host text-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 147.27 ms
[20:46:19] RECOVERY - LVS upload drmrs port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 478 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:47:32] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 151.45 ms
[20:49:14] RECOVERY - Host cr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 147.18 ms
[20:50:34] RECOVERY - Host cr2-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 150.32 ms
[20:51:25] (JobUnavailable) firing: (4) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[20:52:09] (JobUnavailable) firing: (4) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[20:52:29] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[20:52:34] (ProbeHttpFailed) firing: (22) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[21:00:18] (03PS2) 10Zabe: Add centralauth-suppress to steward and wmf-supportsafety at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675)
[21:07:09] (03PS3) 10Zabe: Add centralauth-suppress to steward and wmf-supportsafety at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675)
[21:33:16] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:40:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 116 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:53:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:02:13] (03PS2) 10Zabe: Remove centralauth-oversight from the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766307 (https://phabricator.wikimedia.org/T302675)
[22:30:38] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:34:42] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:24:08] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 114 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:30:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 59 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:32:16] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:41:34] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:47:54] (NodeTextfileStale) firing: (2) Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[23:50:00] (03PS7) 10JHathaway: [WIP] team-sre: add hardware-related checks [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans)
[23:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[23:55:40] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:57:28] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook