[00:10:10] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:10:28] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:24:16] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:37:22] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 6.648e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:00:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:37:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:41:24] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:54:54] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.069e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:55:16] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:27:30] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:34:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[04:00:42] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2022-02-24 03:29:30 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[04:00:42] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:32] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:08] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:24:04] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:41:58] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (install6001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:16:30] PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 11186 MB (5% inode=64%): /tmp 11186 MB (5% inode=64%): /var/tmp 11186 MB (5% inode=64%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[06:19:58] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1566 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:41:20] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:55:20] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:58:42] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:41:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[07:55:06] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220227T0800)
[08:32:39] (NodeTextfileStale) firing: Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[09:23:11] 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10Ammarpad) Thank you @Dzahn.
[09:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:49:26] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:56:16] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:00:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:46] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[10:04:00] (03PS1) 10Majavah: openstack: use tls for horizon->api connections [puppet] - 10https://gerrit.wikimedia.org/r/766281
[10:07:01] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33997/console" [puppet] - 10https://gerrit.wikimedia.org/r/766281 (owner: 10Majavah)
[10:08:22] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:05] (03PS2) 10Majavah: openstack: use tls for horizon->api connections [puppet] - 10https://gerrit.wikimedia.org/r/766281
[10:27:50] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:41:02] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[10:50:48] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:09:50] (03PS1) 10Majavah: Extract ssh fingerprint publishing to an independent class [puppet] - 10https://gerrit.wikimedia.org/r/766291
[11:09:52] (03PS1) 10Majavah: P:toolforge::static: publish SSH fingerprints under /admin [puppet] - 10https://gerrit.wikimedia.org/r/766292
[11:11:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33998/console" [puppet] - 10https://gerrit.wikimedia.org/r/766291 (owner: 10Majavah)
[11:11:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33999/console" [puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah)
[11:29:28] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:58] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[11:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[11:56:12] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:10:10] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:32:54] (NodeTextfileStale) firing: Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[12:50:42] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:56:12] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:10:20] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:10:46] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[13:20:08] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:22] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.34 ms
[13:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[13:56:18] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:03:48] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 113 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:10:18] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:16:48] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:54:02] PROBLEM - Host durum6001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:08] PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:18] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:10:24] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:48:35] (03PS1) 10Zabe: Add centralauth-suppress to steward at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675)
[15:48:37] (03PS1) 10Zabe: Remove centralauth-oversight from steward at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766307 (https://phabricator.wikimedia.org/T302675)
[15:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[15:55:28] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:56:26] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:09:34] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:10:32] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:21:10] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:32:54] (NodeTextfileStale) firing: Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[17:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[19:02:24] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:46] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:44:18] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:47:39] (NodeTextfileStale) firing: (2) Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[19:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[20:42:13] !log configure OSPF between cr2-drmrs and cr2-eqdfw
[20:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:33] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[20:44:33] (ProbeHttpFailed) firing: (13) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[20:44:37] RECOVERY - Host cr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 147.26 ms
[20:44:38] RECOVERY - Host cr2-drmrs is UP: PING OK - Packet loss = 0%, RTA = 149.14 ms
[20:44:40] RECOVERY - Host ncredir-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 149.65 ms
[20:44:40] RECOVERY - Host bast6001 is UP: PING OK - Packet loss = 0%, RTA = 147.24 ms
[20:44:40] PROBLEM - Host lvs6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:40] PROBLEM - Host cp6015 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:40] PROBLEM - Host ganeti6002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:40] PROBLEM - Host lvs6001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:42] PROBLEM - Host cp6011 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:42] PROBLEM - Host cp6002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:42] PROBLEM - Host cp6003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6008 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6013 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6004 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:44] PROBLEM - Host cp6009 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:45] PROBLEM - Host lvs6002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:45] PROBLEM - Host cp6012 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:46] PROBLEM - Host cp6016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:46] PROBLEM - Host cp6007 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:47] PROBLEM - Host cp6010 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:47] PROBLEM - Host cp6005 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:48] PROBLEM - Host ganeti6001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:48] PROBLEM - Host ganeti6003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:46:03] RECOVERY - Host text-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 147.27 ms
[20:46:19] RECOVERY - LVS upload drmrs port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 478 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:47:32] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 151.45 ms
[20:49:14] RECOVERY - Host cr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 147.18 ms
[20:50:34] RECOVERY - Host cr2-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 150.32 ms
[20:51:25] (JobUnavailable) firing: (4) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[20:52:09] (JobUnavailable) firing: (4) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[20:52:29] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[20:52:34] (ProbeHttpFailed) firing: (22) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[21:00:18] (03PS2) 10Zabe: Add centralauth-suppress to steward and wmf-supportsafety at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675)
[21:07:09] (03PS3) 10Zabe: Add centralauth-suppress to steward and wmf-supportsafety at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675)
[21:33:16] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:40:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 116 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:53:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 663 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:02:13] (03PS2) 10Zabe: Remove centralauth-oversight from the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766307 (https://phabricator.wikimedia.org/T302675)
[22:30:38] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:34:42] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:24:08] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 114 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:30:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 59 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:32:16] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:41:34] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:47:54] (NodeTextfileStale) firing: (2) Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[23:50:00] (03PS7) 10JHathaway: [WIP] team-sre: add hardware-related checks [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans)
[23:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[23:55:40] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:57:28] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook