[00:03:11] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 91.39% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:08:33] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:23:25] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:34:03] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:43:41] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:49:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:54:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:15:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:24:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:07] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:49:31] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 4.675e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:09:13] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:16:05] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.031e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [05:56:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:56:03] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:53:39] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikibase.Cloud users mailing list - https://phabricator.wikimedia.org/T298350 (10Legoktm) Would "wikibase-cloud@" be OK? [07:04:27] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Wikibase.Cloud users mailing list - https://phabricator.wikimedia.org/T298350 (10Masssly) >>! In T298350#7590997, @Legoktm wrote: > Would "wikibase-cloud@" be OK? Yes. [08:43:47] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:10:31] PROBLEM - Disk space on kubestagemaster1001 is CRITICAL: DISK CRITICAL - free space: / 714 MB (3% inode=94%): /tmp 714 MB (3% inode=94%): /var/tmp 714 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubestagemaster1001&var-datasource=eqiad+prometheus/ops [09:27:04] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Request for creation: Wikibase.Cloud users mailing list - https://phabricator.wikimedia.org/T298350 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done: https://lists.wikimedia.org/postorius/lists/wikibase-cloud.lists.wikimedia.org Note that I created... [09:44:49] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:30:39] !log kill tcpdump process on kubestagemaster1001 (kept a big pcap file opened that kept growing) [10:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:11] RECOVERY - Disk space on kubestagemaster1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubestagemaster1001&var-datasource=eqiad+prometheus/ops [10:38:23] PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:40] 10SRE, 10Foundational Technology Requests, 10Traffic, 10Wikimedia Enterprise, 10Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10AnnaMikla) [11:04:41] RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:55] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:06:43] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:13:20] hi xSavitar - got a minute? [12:22:57] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:58:01] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:24:03] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:10:11] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:43:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:51] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:19] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:27] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:52:31] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:52:39] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:32:09] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:34:13] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:40:47] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:42:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:00:16] (03CR) 10RhinosF1: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) (owner: 10RhinosF1) [20:00:26] (03CR) 10jerkins-bot: [V: 04-1] Update static html dump to mention enterprise [puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) (owner: 10RhinosF1) [20:01:42] (03PS5) 10RhinosF1: Update static html dump to mention enterprise [puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) [20:02:18] (03CR) 10jerkins-bot: [V: 04-1] Update static html dump to mention enterprise [puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) (owner: 10RhinosF1) [20:03:45] (03PS6) 10RhinosF1: Update static html dump to mention enterprise [puppet] - 10https://gerrit.wikimedia.org/r/749875 (https://phabricator.wikimedia.org/T273585) [20:09:39] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:11:55] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:17:21] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:29] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook