[00:03:57] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:11:13] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:13:21] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:15:02] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:31:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:47] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:38:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:19] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[00:50:19] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:55:02] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:56:41] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.59 ms
[00:56:41] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 85.61 ms
[01:35:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:44:01] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:06:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:10:02] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:13:57] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:25] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:39] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:36:31] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:07:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:15:02] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:18:37] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:55:02] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:10:47] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:17:55] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:07] <icinga-wm>	 PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:06:37] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:09:16] <wikibugs>	 (CR) Urbanecm: [C: -1] "maybe I'm missing something, but AFAICS, once train is deployed, this will use the legacy/cloud service <https://image-suggestion-api.wmcl" [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[06:10:02] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:13:09] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[06:58:27] <icinga-wm>	 RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220625T0700)
[07:21:53] <wikibugs>	 SRE, Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (hashar) Thank you @Eevans and @Dzahn . It is good to see cleanup happening :]
[07:28:43] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:55] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:39] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:13:25] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:43] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:02] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:20:27] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:platform.service,swift-account-stats_wdqs:flink.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:49] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:24:05] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:36:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:41:41] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:05] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:02] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:06:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:25:19] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:32:21] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[09:32:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:26] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 04s)
[09:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:33] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[09:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:57] <elukey>	 !log restart rsyslog on ml-serve-ctrl200[1,2] - broken connections to centrallog
[09:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:53] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:59:55] <icinga-wm>	 PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:10:02] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:23:32] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 49m 58s)
[10:23:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:40] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:25:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:44] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 04s)
[10:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:01] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:28:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:19] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 18s)
[10:28:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:29] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:42:55] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:42:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:29] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 01m 33s)
[10:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:00] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[10:45:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:27] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 08m 27s)
[10:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:49] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:35:19] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:36:25] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:03:37] <wikibugs>	 (CR) David Caro: [C: +1] "If nobody gets at it by tuesday I'll merge then, thanks!" [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[12:06:57] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/806553 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:07:51] <wikibugs>	 (CR) David Caro: [C: +1] P:wmcs::metricsinfra::prometheus: enable thanos sidecar [puppet] - https://gerrit.wikimedia.org/r/806551 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:07:57] <wikibugs>	 (CR) David Caro: [C: +1] P:metricsinfra: add thanos query [puppet] - https://gerrit.wikimedia.org/r/806552 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:08:35] <wikibugs>	 (CR) David Caro: [C: +1] "Will merge on tuesday if nobody gets to it sooner  👍" [puppet] - https://gerrit.wikimedia.org/r/806553 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[12:13:29] <wikibugs>	 (CR) David Caro: [C: +2] "I think this is useful and quite safe, merging and deploying, thanks!" [puppet] - https://gerrit.wikimedia.org/r/806459 (owner: Majavah)
[12:15:02] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:20:38] <wikibugs>	 (CR) David Caro: [C: +2] "Awesome, this was really good, reduced in around 100 (of 800, 12.5%) the number of connections on the active galera node. Thanks a lot!!" [puppet] - https://gerrit.wikimedia.org/r/806459 (owner: Majavah)
[12:36:27] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:03:49] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:06:45] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:16:40] <elukey>	 !log restart rsyslog on ml-staging-ctrl200[1,2] - broken connections to centrallog
[13:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:54] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:23:59] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:37:27] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:52:56] <wikibugs>	 (PS1) Stang: enwikiquote: Create rollbacker user group [mediawiki-config] - https://gerrit.wikimedia.org/r/808370 (https://phabricator.wikimedia.org/T310950)
[14:05:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:10:02] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:37:59] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:06:23] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:25:27] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:34:45] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:35:21] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:35:57] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:40:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:10:02] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:12:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[18:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[18:12:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[18:17:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[18:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:19] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:25] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:42:31] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:03:49] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_swift:dispersion.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:07:22] <wikibugs>	 SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) The array parameter notation is straightforward to deal with as long as it uses a literal '[', since we just treat it as a stop cha...
[22:10:02] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency