[00:03:57] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:11:13] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:13:21] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:31:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:47] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:38:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:19] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[00:50:19] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:56:41] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.59 ms
[00:56:41] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 85.61 ms
[01:35:53] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:44:01] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:06:41] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:13:57] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:25] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:39] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:36:31] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:07:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:18:37] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:10:47] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:17:55] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:07] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:06:37] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:09:16] (CR) Urbanecm: [C: -1] "maybe I'm missing something, but AFAICS, once train is deployed, this will use the legacy/cloud service (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:13:09] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[06:58:27] RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220625T0700)
[07:21:53] SRE, Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (hashar) Thank you @Eevans and @Dzahn . It is good to see cleanup happening :]
[07:28:43] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:55] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:13:25] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:43] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:20:27] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:platform.service,swift-account-stats_wdqs:flink.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:49] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:24:05] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:36:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:41:41] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:05] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:06:51] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:25:19] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:32:21] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[09:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:26] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 04s)
[09:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:33] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[09:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:57] !log restart rsyslog on ml-serve-ctrl200[1,2] - broken connections to centrallog
[09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:53] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:59:55] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:23:32] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 49m 58s)
[10:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:40] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:44] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 04s)
[10:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:01] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:19] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 18s)
[10:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:29] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:42:55] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:29] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 01m 33s)
[10:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:00] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:27] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 08m 27s)
[10:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:49] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:35:19] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:36:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:03:37] (CR) David Caro: [C: +1] "If nobody gets at it by tuesday I'll merge then, thanks!" [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[12:06:57] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/806553 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:07:51] (CR) David Caro: [C: +1] P:wmcs::metricsinfra::prometheus: enable thanos sidecar [puppet] - https://gerrit.wikimedia.org/r/806551 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:07:57] (CR) David Caro: [C: +1] P:metricsinfra: add thanos query [puppet] - https://gerrit.wikimedia.org/r/806552 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:08:35] (CR) David Caro: [C: +1] "Will merge on tuesday if nobody gets to it sooner 👍" [puppet] - https://gerrit.wikimedia.org/r/806553 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:13:29] (CR) David Caro: [C: +2] "I think this is useful and quite safe, merging and deploying, thanks!" [puppet] - https://gerrit.wikimedia.org/r/806459 (owner: Majavah)
[12:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:20:38] (CR) David Caro: [C: +2] "Awesome, this was really good, reduced in around 100 (of 800, 12.5%) the number of connections on the active galera node. Thanks a lot!!" [puppet] - https://gerrit.wikimedia.org/r/806459 (owner: Majavah)
[12:36:27] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:03:49] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:06:45] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:16:40] !log restart rsyslog on ml-staging-ctrl200[1,2] - broken connections to centrallog
[13:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:54] (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:23:59] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:37:27] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:52:56] (PS1) Stang: enwikiquote: Create rollbacker user group [mediawiki-config] - https://gerrit.wikimedia.org/r/808370 (https://phabricator.wikimedia.org/T310950)
[14:05:51] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:37:59] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:06:23] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:25:27] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:34:45] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:35:21] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:35:57] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:40:41] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:12:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[18:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[18:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[18:17:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[18:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:19] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:25] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:42:31] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:03:49] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_swift:dispersion.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:07:22] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) The array parameter notation is straightforward to deal with as long as it uses a literal '[', since we just treat it as a stop cha...
[22:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency