[00:04:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:45:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[00:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[00:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:29] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson)
[01:05:43] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson) Open→Resolved
[01:32:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[01:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:50] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[02:49:03] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[02:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:13] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s)
[02:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:50] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:24] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:56:26] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[03:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:34] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[03:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:25] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[03:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:34] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[03:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:59:09] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[03:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:59:18] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration:
00m 08s)
[03:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:16:04] (CR) ArielGlenn: [C: +1] "Thanks for the cleanup." [puppet] - https://gerrit.wikimedia.org/r/810319 (owner: Muehlenhoff)
[04:32:20] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:47:06] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:48:17] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[04:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:25] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[04:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:44] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[04:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:53] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[04:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:02] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[05:11:03]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:11] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s)
[05:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:59] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[05:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:06] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[05:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:55] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[05:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:04] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s)
[05:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:02] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[05:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:11] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s)
[05:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:04] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[05:59:50] (PS1) Majavah: toolsdb: disable replication for s54518__mw
[puppet] - https://gerrit.wikimedia.org/r/810420
[06:30:08] (PS3) Majavah: wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/801784
[06:30:10] (PS3) Majavah: wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525)
[06:34:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:35:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:35:44] (PS1) Majavah: Remove systemd241 Stretch backport [puppet] - https://gerrit.wikimedia.org/r/810421
[06:36:17] (CR) CI reject: [V: -1] Remove systemd241 Stretch backport [puppet] - https://gerrit.wikimedia.org/r/810421 (owner: Majavah)
[06:36:49] (PS2) Majavah: Remove systemd241 Stretch backport [puppet] - https://gerrit.wikimedia.org/r/810421
[06:37:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:38:06] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36170/console" [puppet] - https://gerrit.wikimedia.org/r/810421 (owner: Majavah)
[06:39:22] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:39:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:41:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:47:24] (PS1) Majavah: P:wmcs::nfsclient: remove ref to secondary_nfs_servers [puppet] - https://gerrit.wikimedia.org/r/810423
[06:53:30] (PS1) Majavah: prometheus: openstack stale certs: ignore non-host certs [puppet] - https://gerrit.wikimedia.org/r/810425
[06:54:50] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:56:03] (PS2) Majavah: prometheus: openstack stale certs: ignore non-host certs [puppet] - https://gerrit.wikimedia.org/r/810425
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220702T0700)
[07:56:12] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:11:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:50:05] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:52:42] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:49:06] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:00:21] (PS6) Aklapper: Redirect svn.wikimedia.org/doc properly [puppet] - https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: Dereckson)
[10:02:24] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:03:13] (CR) CI reject: [V: -1] Redirect svn.wikimedia.org/doc properly [puppet] - https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: Dereckson)
[10:50:32] RECOVERY - SSH on wtp1036.mgmt is OK: SSH
OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:00:40] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:03:30] (PS2) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368
[11:03:32] (CR) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets (3 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368 (owner: David Caro)
[11:03:34] (PS1) David Caro: wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810451
[11:03:46] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:04:10] (PS3) David Caro: wmcs.openstack: Add runbook to increase the quotas [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/806429 (https://phabricator.wikimedia.org/T297606)
[11:05:11] (PS3) Urbanecm: [beta] Growth: Enable structured mentor list at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905)
[11:08:56] (CR) CI reject: [V: -1] cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368 (owner: David Caro)
[11:10:18] (PS1) Urbanecm: [beta] Temporarily allow everyone to enroll as mentor [mediawiki-config] - https://gerrit.wikimedia.org/r/810452 (https://phabricator.wikimedia.org/T310905)
[11:10:55] (PS4) Urbanecm: [beta] Growth: Enable structured mentor list at enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905)
[11:11:09] (CR) CI reject: [V: -1] [beta] Temporarily allow everyone to enroll as mentor
[mediawiki-config] - https://gerrit.wikimedia.org/r/810452 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[11:14:28] (PS2) Urbanecm: [beta] Temporarily allow everyone to enroll as mentor [mediawiki-config] - https://gerrit.wikimedia.org/r/810452 (https://phabricator.wikimedia.org/T310905)
[11:15:57] (CR) Urbanecm: [C: +1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[11:17:38] (PS5) Urbanecm: [beta] Growth: Enable structured mentor list at enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905)
[11:20:20] (PS2) Urbanecm: [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905)
[11:30:56] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:01:56] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:04:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:32:16] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:55:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:37:06] PROBLEM - k8s API server requests latencies on
kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:01:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:03:26] (PS4) Majavah: kubeadm: drop support for 1.20 [puppet] - https://gerrit.wikimedia.org/r/802143
[14:03:28] (PS4) Majavah: aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856)
[14:03:30] (PS1) Majavah: aptrepo: drop kubeadm components from stretch [puppet] - https://gerrit.wikimedia.org/r/810459
[14:11:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:16] not sure if an emergency deployment is possible for T311916. The patch to fix it is two lines. The immediate issue is mitigated by an off switch for the feature, but a long-running experiment associated with the feature will be disrupted if it is switched off for several days.
[15:36:17] T311916: "Add an image" structured edits add a blank line instead of an image - https://phabricator.wikimedia.org/T311916
[15:37:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:04:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:08:33] SRE, MediaWiki-General, Traffic-Icebox: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) vmod migrated to Gerrit: https://gerrit.wikimedia.org/g/operations/software/varnish/libvmod-querysort Next step, packaging.
[16:19:04] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:21:48] (CR) Andrew Bogott: [C: +2] striker: Open firewall for Docker-managed service [puppet] - https://gerrit.wikimedia.org/r/810413 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis)
[16:22:04] (CR) Andrew Bogott: [C: +2] striker: Bump container version to 2022-07-01-210101-production [puppet] - https://gerrit.wikimedia.org/r/810414 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis)
[16:25:56] (CR) Majavah: "-1 but I was too late: since the Docker container itself doesn't do TLS termination IIRC, you need to use profile::tlsproxy::envoy or simi" [puppet] - https://gerrit.wikimedia.org/r/810413 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis)
[16:34:18] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:05:09] (PS3) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368
[17:07:11] SRE, Icinga, Observability-Alerting, wikitech.wikimedia.org: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (jcrespo)
[17:08:27] SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (jcrespo)
[17:08:30] SRE, Icinga, Observability-Alerting, wikitech.wikimedia.org: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (jcrespo)
[17:20:26] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:33:25] (PS1) Urbanecm: AddImageArticleTarget: Update to new mediaClass/mediaTag format [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/810509 (https://phabricator.wikimedia.org/T311916)
[19:39:24] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:04:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:27:36] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE}
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:43:33] (PS4) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368
[20:52:14] (PS5) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368
[20:54:00] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:10:08] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook