[00:00:45] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:20:01] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:47] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:35] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet last ran 11 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:58:41] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:15:57] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:09] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:10:21] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:57] SRE, Infrastructure-Foundations, netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (wiki_willy) Thanks for checking @ayounsi. My personal opinion on the contacts list is to restrict it if possible. I don't see any issues sharing the generic ven... [02:26:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:47] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:58:21] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:12:31] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:23:15] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:29] RECOVERY - WDQS SPARQL on 
wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.088 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:07:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:45:27] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) I took a closer look at varnish's `std.querysort()` and I think it's a relatively straightforward change to get it to sort the way... [04:47:27] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:59:19] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:27] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:57] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:15:57] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1172 db1174 db1175 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30046 and previous config saved to /var/cache/conftool/dbconfig/20220624-052137-root.json [05:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 db1169 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30047 and previous config saved to /var/cache/conftool/dbconfig/20220624-052758-root.json [05:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30048 and previous config saved to /var/cache/conftool/dbconfig/20220624-053128-root.json [05:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30049 and previous config saved to /var/cache/conftool/dbconfig/20220624-053134-root.json [05:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30050 and previous config saved to /var/cache/conftool/dbconfig/20220624-053138-root.json [05:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:19] PROBLEM - k8s API server requests latencies on kubestagemaster1001 
is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:34:55] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:35:31] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:35:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30051 and previous config saved to /var/cache/conftool/dbconfig/20220624-053531-root.json [05:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30052 and previous config saved to /var/cache/conftool/dbconfig/20220624-053538-root.json [05:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1170 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30053 and previous config saved to /var/cache/conftool/dbconfig/20220624-053652-root.json [05:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30054 and previous config saved to /var/cache/conftool/dbconfig/20220624-054139-root.json [05:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:43:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1170 after kernel reboots', diff saved to https://phabricator.wikimedia.org/P30055 and previous config saved to /var/cache/conftool/dbconfig/20220624-054259-root.json [05:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30056 and previous config saved to /var/cache/conftool/dbconfig/20220624-054632-root.json [05:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30057 and previous config saved to /var/cache/conftool/dbconfig/20220624-054637-root.json [05:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30058 and previous config saved to /var/cache/conftool/dbconfig/20220624-054642-root.json [05:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30059 and previous config saved to /var/cache/conftool/dbconfig/20220624-055035-root.json [05:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30060 and previous config saved to /var/cache/conftool/dbconfig/20220624-055042-root.json [05:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:53:55] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:56:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30061 and previous config saved to /var/cache/conftool/dbconfig/20220624-055643-root.json [05:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30062 and previous config saved to /var/cache/conftool/dbconfig/20220624-060136-root.json [06:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30063 and previous config saved to /var/cache/conftool/dbconfig/20220624-060141-root.json [06:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30064 and previous config saved to /var/cache/conftool/dbconfig/20220624-060146-root.json [06:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30065 and previous config saved to /var/cache/conftool/dbconfig/20220624-060539-root.json [06:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 
10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30066 and previous config saved to /var/cache/conftool/dbconfig/20220624-060545-root.json [06:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:34] (PS1) Marostegui: db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/808102 [06:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:12:09] (CR) Marostegui: [C: +2] db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/808102 (owner: Marostegui) [06:15:23] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:platform.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30067 and previous config saved to /var/cache/conftool/dbconfig/20220624-061640-root.json [06:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30068 and previous config saved to /var/cache/conftool/dbconfig/20220624-061645-root.json [06:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30069 and previous config saved to 
/var/cache/conftool/dbconfig/20220624-061650-root.json [06:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30070 and previous config saved to /var/cache/conftool/dbconfig/20220624-062043-root.json [06:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30071 and previous config saved to /var/cache/conftool/dbconfig/20220624-062049-root.json [06:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:43] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:31:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30072 and previous config saved to /var/cache/conftool/dbconfig/20220624-063143-root.json [06:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30073 and previous config saved to /var/cache/conftool/dbconfig/20220624-063149-root.json [06:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30074 and previous config saved to /var/cache/conftool/dbconfig/20220624-063154-root.json [06:31:57] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30075 and previous config saved to /var/cache/conftool/dbconfig/20220624-063547-root.json [06:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30076 and previous config saved to /var/cache/conftool/dbconfig/20220624-063553-root.json [06:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:59] (PS1) Marostegui: Revert "db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/808111 [06:36:36] (CR) Marostegui: [C: +2] Revert "db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/808111 (owner: Marostegui) [06:42:51] (Memory over 85%) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [06:46:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30077 and previous config saved to /var/cache/conftool/dbconfig/20220624-064647-root.json [06:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30078 and previous config saved to /var/cache/conftool/dbconfig/20220624-064653-root.json [06:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: After kernel upgrade', diff saved to 
https://phabricator.wikimedia.org/P30079 and previous config saved to /var/cache/conftool/dbconfig/20220624-064657-root.json [06:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30080 and previous config saved to /var/cache/conftool/dbconfig/20220624-065051-root.json [06:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30081 and previous config saved to /var/cache/conftool/dbconfig/20220624-065057-root.json [06:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:32] !log restarting bacula director @ backup1001 [06:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220624T0700) [07:01:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30082 and previous config saved to /var/cache/conftool/dbconfig/20220624-070151-root.json [07:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30083 and previous config saved to /var/cache/conftool/dbconfig/20220624-070157-root.json [07:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30084 and previous config saved to /var/cache/conftool/dbconfig/20220624-070201-root.json [07:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:29] !log Reboot db1117 for kernel upgrade (expect haproxy irc alerts) [07:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30085 and previous config saved to /var/cache/conftool/dbconfig/20220624-070555-root.json [07:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30086 and previous config saved to /var/cache/conftool/dbconfig/20220624-070601-root.json [07:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:09] (CR) Ayounsi: [C: +1] "Reviewed PCC and it looks good to 
me!" [puppet] - https://gerrit.wikimedia.org/r/808043 (https://phabricator.wikimedia.org/T310574) (owner: Ssingh) [07:07:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 es1025 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30087 and previous config saved to /var/cache/conftool/dbconfig/20220624-070700-root.json [07:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:06] (PS1) Muehlenhoff: Record extended access for jmads [puppet] - https://gerrit.wikimedia.org/r/808126 [07:10:18] (PS1) Muehlenhoff: Extend access for edtadros [puppet] - https://gerrit.wikimedia.org/r/808127 [07:11:03] (CR) Muehlenhoff: [C: +2] Record extended access for jmads [puppet] - https://gerrit.wikimedia.org/r/808126 (owner: Muehlenhoff) [07:12:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:49] (CR) Muehlenhoff: [C: +2] Extend access for edtadros [puppet] - https://gerrit.wikimedia.org/r/808127 (owner: Muehlenhoff) [07:14:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30088 and previous config saved to /var/cache/conftool/dbconfig/20220624-071439-root.json [07:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30089 and previous config saved to /var/cache/conftool/dbconfig/20220624-071551-root.json [07:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:51] (Memory over 85%) resolved: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Memory over 85% got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [07:19:18] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet [07:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30090 and previous config saved to /var/cache/conftool/dbconfig/20220624-071940-root.json [07:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30091 and previous config saved to /var/cache/conftool/dbconfig/20220624-072054-root.json [07:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30092 and previous config saved to /var/cache/conftool/dbconfig/20220624-072106-root.json [07:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet [07:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30093 and previous config saved to /var/cache/conftool/dbconfig/20220624-072147-root.json [07:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30094 and previous config saved to /var/cache/conftool/dbconfig/20220624-072153-root.json [07:21:57] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30095 and previous config saved to /var/cache/conftool/dbconfig/20220624-072240-root.json [07:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:23] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:26:38] SRE, SRE-tools, DNS, Infrastructure-Foundations, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (ayounsi) [07:28:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30096 and previous config saved to /var/cache/conftool/dbconfig/20220624-072841-root.json [07:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:21] (PS3) Slyngshede: class role::apt_repo switch apt-repo to Apache2, from nginx. 
[puppet] - https://gerrit.wikimedia.org/r/807983 [07:31:38] (CR) DCausse: elastic: configure keystore values for restore (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [07:32:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-cache[2001-2003].codfw.wmnet with reason: reboots [07:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-cache[2001-2003].codfw.wmnet with reason: reboots [07:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:47] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 43.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [07:34:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30097 and previous config saved to /var/cache/conftool/dbconfig/20220624-073444-root.json [07:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30098 and previous config saved to /var/cache/conftool/dbconfig/20220624-073543-root.json [07:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30099 and previous config saved to /var/cache/conftool/dbconfig/20220624-073558-root.json [07:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:09] RECOVERY 
- Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [07:36:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30100 and previous config saved to /var/cache/conftool/dbconfig/20220624-073610-root.json [07:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30101 and previous config saved to /var/cache/conftool/dbconfig/20220624-073651-root.json [07:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30102 and previous config saved to /var/cache/conftool/dbconfig/20220624-073657-root.json [07:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30103 and previous config saved to /var/cache/conftool/dbconfig/20220624-074204-root.json [07:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:46] (PS1) Muehlenhoff: Remove old Arclamp buster VMs [puppet] - https://gerrit.wikimedia.org/r/808192 (https://phabricator.wikimedia.org/T305460) [07:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30104 and previous config saved to 
/var/cache/conftool/dbconfig/20220624-074345-root.json [07:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:57] 10SRE, 10SRE-tools, 10DNS, 10Infrastructure-Foundations, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10ayounsi) 05Stalled→03Open [07:49:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30105 and previous config saved to /var/cache/conftool/dbconfig/20220624-074947-root.json [07:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30106 and previous config saved to /var/cache/conftool/dbconfig/20220624-075102-root.json [07:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30107 and previous config saved to /var/cache/conftool/dbconfig/20220624-075114-root.json [07:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30108 and previous config saved to /var/cache/conftool/dbconfig/20220624-075154-root.json [07:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30109 and previous config saved to /var/cache/conftool/dbconfig/20220624-075201-root.json [07:52:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:54] (03CR) 10Muehlenhoff: [C: 04-1] "Actually, we'll have to keep U2F disabled until webauthn is enabled, otherwise we e.g can't reuse the database storing the tokens." [puppet] - 10https://gerrit.wikimedia.org/r/805836 (owner: 10Muehlenhoff) [07:53:10] (03CR) 10Muehlenhoff: [C: 03+2] Remove old Arclamp buster VMs [puppet] - 10https://gerrit.wikimedia.org/r/808192 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [07:53:12] 10SRE, 10ops-eqiad, 10DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (10Marostegui) @Cmjohnson did the part arrive? I don't know if you want to place that new part on this host or if you prefer to leave the DIMM that you already placed? Thank you! [07:56:39] (03PS4) 10Slyngshede: class role::apt_repo switch apt-repo to Apache2, from nginx. [puppet] - 10https://gerrit.wikimedia.org/r/807983 [07:56:54] (03PS1) 10Zabe: dumps: remove absented update-dump-statusfiles cron [puppet] - 10https://gerrit.wikimedia.org/r/808193 (https://phabricator.wikimedia.org/T273673) [07:57:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30110 and previous config saved to /var/cache/conftool/dbconfig/20220624-075707-root.json [07:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30111 and previous config saved to /var/cache/conftool/dbconfig/20220624-075849-root.json [07:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30112 and previous config saved to 
/var/cache/conftool/dbconfig/20220624-080451-root.json [08:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30113 and previous config saved to /var/cache/conftool/dbconfig/20220624-080618-root.json [08:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30114 and previous config saved to /var/cache/conftool/dbconfig/20220624-080658-root.json [08:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30115 and previous config saved to /var/cache/conftool/dbconfig/20220624-080705-root.json [08:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:13] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30116 and previous config saved to /var/cache/conftool/dbconfig/20220624-081211-root.json [08:13:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30117 and previous config saved to /var/cache/conftool/dbconfig/20220624-081353-root.json [08:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing 
kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:15:41] (03CR) 10Gehel: [C: 04-1] "minor comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [08:17:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10taavi) 05Open→03Resolved a:03Andrew https://gerrit.wikimedia.org/r/c/operations/puppet/+/655762 [08:19:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30118 and previous config saved to /var/cache/conftool/dbconfig/20220624-081955-root.json [08:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30119 and previous config saved to /var/cache/conftool/dbconfig/20220624-082122-root.json [08:22:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30120 and previous config saved to /var/cache/conftool/dbconfig/20220624-082202-root.json [08:22:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30121 and previous config saved to /var/cache/conftool/dbconfig/20220624-082209-root.json [08:27:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30122 and previous config saved to /var/cache/conftool/dbconfig/20220624-082715-root.json [08:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: After kernel upgrade', diff saved to 
https://phabricator.wikimedia.org/P30123 and previous config saved to /var/cache/conftool/dbconfig/20220624-082857-root.json [08:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:41] (03CR) 10ArielGlenn: [C: 03+1] "Fine by me, can be merged whenever," [puppet] - 10https://gerrit.wikimedia.org/r/808193 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:35:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30124 and previous config saved to /var/cache/conftool/dbconfig/20220624-083459-root.json [08:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:28] (03CR) 10Jbond: [C: 03+2] security: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/808072 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:36:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30125 and previous config saved to /var/cache/conftool/dbconfig/20220624-083625-root.json [08:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30126 and previous config saved to /var/cache/conftool/dbconfig/20220624-083706-root.json [08:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30127 and previous config saved to /var/cache/conftool/dbconfig/20220624-083713-root.json [08:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool 
db1098 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30128 and previous config saved to /var/cache/conftool/dbconfig/20220624-083806-root.json [08:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:33] 10SRE, 10Infrastructure-Foundations: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10SLyngshede-WMF) [08:41:45] 10SRE, 10Infrastructure-Foundations: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Low [08:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30129 and previous config saved to /var/cache/conftool/dbconfig/20220624-084219-root.json [08:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:18] (03PS7) 10Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) [08:44:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30130 and previous config saved to /var/cache/conftool/dbconfig/20220624-084401-root.json [08:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:15] (03CR) 10Slyngshede: Ganeti Prometheus exporter, initial checkin (031 comment) [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [08:44:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30131 and previous config saved to /var/cache/conftool/dbconfig/20220624-084415-root.json [08:44:19] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30132 and previous config saved to /var/cache/conftool/dbconfig/20220624-084426-root.json [08:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30133 and previous config saved to /var/cache/conftool/dbconfig/20220624-085003-root.json [08:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30134 and previous config saved to /var/cache/conftool/dbconfig/20220624-085129-root.json [08:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30135 and previous config saved to /var/cache/conftool/dbconfig/20220624-085210-root.json [08:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30136 and previous config saved to /var/cache/conftool/dbconfig/20220624-085217-root.json [08:52:19] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts webperf2002.codfw.wmnet [08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:57] (03PS5) 10Slyngshede: class 
role::apt_repo switch apt-repo to Apache2, from nginx. [puppet] - 10https://gerrit.wikimedia.org/r/807983 [08:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:55:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36028/console" [puppet] - 10https://gerrit.wikimedia.org/r/807983 (owner: 10Slyngshede) [08:55:46] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30137 and previous config saved to /var/cache/conftool/dbconfig/20220624-085723-root.json [08:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:58:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts webperf2002.codfw.wmnet [08:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:03] (03PS1) 10Jbond: netbox: update netbox service definition so it pages [puppet] - 10https://gerrit.wikimedia.org/r/808197 (https://phabricator.wikimedia.org/T296452) [08:59:04] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `webperf2002.codfw.wmnet` - webperf2002.codfw.wmnet (**FAIL**) - Downtimed host on Icinga... 
[08:59:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30139 and previous config saved to /var/cache/conftool/dbconfig/20220624-085904-root.json [08:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30140 and previous config saved to /var/cache/conftool/dbconfig/20220624-085919-root.json [08:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30141 and previous config saved to /var/cache/conftool/dbconfig/20220624-085930-root.json [08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:26] (03PS1) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) [09:02:42] (03PS1) 10Jbond: netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) [09:03:23] (03CR) 10CI reject: [V: 04-1] netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [09:08:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137,db1138 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30142 and previous config saved to /var/cache/conftool/dbconfig/20220624-090810-root.json [09:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: After kernel upgrade', diff 
saved to https://phabricator.wikimedia.org/P30143 and previous config saved to /var/cache/conftool/dbconfig/20220624-091227-root.json [09:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10LSobanski) @Papaul Matthew is back on Monday, we'll get back to you then. [09:13:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30144 and previous config saved to /var/cache/conftool/dbconfig/20220624-091334-root.json [09:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30145 and previous config saved to /var/cache/conftool/dbconfig/20220624-091344-root.json [09:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30146 and previous config saved to /var/cache/conftool/dbconfig/20220624-091423-root.json [09:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30147 and previous config saved to /var/cache/conftool/dbconfig/20220624-091434-root.json [09:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:55] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10noc.wikimedia.org, and 2 others: noc.wikimedia.org with X-Wikimedia-Debug routes to mwdebug but host is not served there - 
https://phabricator.wikimedia.org/T245552 (10Nintendofan885) [09:19:21] 10SRE, 10Internet-Archive, 10noc.wikimedia.org: noc.wikimedia.org is a 404 when X-Wikimedia-Debug is enabled - https://phabricator.wikimedia.org/T274342 (10Nintendofan885) [09:20:19] 10SRE, 10noc.wikimedia.org: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10Nintendofan885) [09:20:33] 10SRE, 10Traffic, 10noc.wikimedia.org: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10Nintendofan885) [09:20:38] (03CR) 10Elukey: [C: 03+1] "<3 I wanted to propose a similar thing, tracked in https://phabricator.wikimedia.org/T310073" [puppet] - 10https://gerrit.wikimedia.org/r/808012 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [09:24:55] !log installing publicsuffix updates from last buster point release [09:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:40] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:36] (03CR) 10Slyngshede: [C: 03+2] dumps: remove absented dumps-exception-checker cron [puppet] - 10https://gerrit.wikimedia.org/r/807101 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:28:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30148 and previous config saved to /var/cache/conftool/dbconfig/20220624-092838-root.json [09:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30149 and previous config saved to /var/cache/conftool/dbconfig/20220624-092848-root.json [09:28:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:57] (03CR) 10Slyngshede: [C: 03+2] dumps: remove absented update-dump-statusfiles cron [puppet] - 10https://gerrit.wikimedia.org/r/808193 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30150 and previous config saved to /var/cache/conftool/dbconfig/20220624-092927-root.json [09:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30151 and previous config saved to /var/cache/conftool/dbconfig/20220624-092938-root.json [09:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:13] (03CR) 10Jbond: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [09:31:14] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:54] (03PS4) 10Jbond: bsection: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:35:21] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:35:21] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [09:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:29] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:40:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:22] (03CR) 10Jbond: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [09:42:12] (03PS2) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) [09:42:47] (03CR) 10Jbond: [C: 03+2] bsection: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:43:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30152 and previous config saved to /var/cache/conftool/dbconfig/20220624-094342-root.json [09:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30153 and previous config saved to /var/cache/conftool/dbconfig/20220624-094352-root.json [09:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30154 and previous config saved to /var/cache/conftool/dbconfig/20220624-094431-root.json [09:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30155 and previous config saved to /var/cache/conftool/dbconfig/20220624-094442-root.json [09:44:44] 10SRE, 10DNS, 10Traffic: DNS CI is broken - https://phabricator.wikimedia.org/T311290 (10ayounsi) 05Open→03Resolved a:03ayounsi Error message can be a bit 
cryptic but if deployed to DNS this would have meant that: `lvs4005.ulsfo.wmnet` (and similar) pointed to both AAAA 2620:0:863:1:f6e9:d4ff:feba:f46... [09:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:45] (03CR) 10Ayounsi: [C: 03+1] "+1 to make it active active, but I can't vouch for the actual implementation." [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [09:48:09] (03PS1) 10Muehlenhoff: Remove absented diamond collector for puppet [puppet] - 10https://gerrit.wikimedia.org/r/808206 [09:49:19] 10SRE, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) [09:50:01] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:50:11] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [09:50:13] (03PS1) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) [09:50:15] (03PS1) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [09:50:32] 10SRE, 10observability, 10Patch-For-Review, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10MoritzMuehlenhoff) 05Stalled→03Resolved a:03MoritzMuehlenhoff This is complete, all puppet-managed Diamond collectors are gone by now. 
[09:51:07] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is completed [09:51:10] (03PS2) 10Kosta Harlan: [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) [09:51:20] (03PS1) 10Volans: sre.hosts.decommission: unblock decoms [cookbooks] - 10https://gerrit.wikimedia.org/r/808209 [09:51:40] (03PS2) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [09:52:44] (03PS1) 10Kosta Harlan: GrowthExperiments: Switch GEImageRecommendationApiHandler to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808211 (https://phabricator.wikimedia.org/T306032) [09:54:20] (03CR) 10Kosta Harlan: "On second thought, I suppose we could just change this to 'production' from the start, and it will be used once the relevant code arrives " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [09:54:28] (03Abandoned) 10Kosta Harlan: GrowthExperiments: Switch GEImageRecommendationApiHandler to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808211 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [09:54:39] (03PS3) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [09:55:07] (03PS4) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [09:55:57] (03PS5) 10Jbond: mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 
(https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:55:59] (03PS1) 10Jbond: mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 [09:56:01] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.decommission: unblock decoms [cookbooks] - 10https://gerrit.wikimedia.org/r/808209 (owner: 10Volans) [09:56:46] (03CR) 10CI reject: [V: 04-1] mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 (owner: 10Jbond) [09:57:07] (03CR) 10CI reject: [V: 04-1] mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:57:44] (03PS4) 10WMDE-Fisch: Drop deprecated feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [09:58:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30156 and previous config saved to /var/cache/conftool/dbconfig/20220624-095845-root.json [09:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30157 and previous config saved to /var/cache/conftool/dbconfig/20220624-095856-root.json [09:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30158 and previous config saved to /var/cache/conftool/dbconfig/20220624-095935-root.json [09:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: After kernel upgrade', diff saved to 
https://phabricator.wikimedia.org/P30159 and previous config saved to /var/cache/conftool/dbconfig/20220624-095946-root.json [09:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:41] (03PS2) 10Jbond: mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 [10:00:43] (03PS6) 10Jbond: mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [10:01:25] (03CR) 10CI reject: [V: 04-1] mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 (owner: 10Jbond) [10:02:59] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10MoritzMuehlenhoff) [10:03:28] 10SRE, 10Ganeti, 10Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724 (10MoritzMuehlenhoff) [10:03:41] 10SRE, 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, and 2 others: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (10MoritzMuehlenhoff) [10:04:58] 10SRE, 10Ganeti: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10MoritzMuehlenhoff) [10:05:23] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10observability: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560 (10MoritzMuehlenhoff) [10:05:49] 10SRE, 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: sre.ganeti.makevm: Allow passing a secondary disk - https://phabricator.wikimedia.org/T300046 (10MoritzMuehlenhoff) [10:05:55] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Write a cookbook to align the "master-capable" state of Ganeti nodes - https://phabricator.wikimedia.org/T299034 (10MoritzMuehlenhoff) [10:06:07] 10SRE, 10Ganeti, 10Traffic, 
10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10MoritzMuehlenhoff) [10:06:22] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: sre.ganeti.makevm cook book only allows specifying RAM size in full gigabytes - https://phabricator.wikimedia.org/T230712 (10MoritzMuehlenhoff) [10:06:32] 10SRE, 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Cookbook to failover the Ganeti master - https://phabricator.wikimedia.org/T283320 (10MoritzMuehlenhoff) [10:06:46] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Make Spicerack cookbook to resize ganeti VM - https://phabricator.wikimedia.org/T219454 (10MoritzMuehlenhoff) [10:07:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30160 and previous config saved to /var/cache/conftool/dbconfig/20220624-100752-root.json [10:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:19] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: sre.ganeti.makevm NXDOMAIN race condition - https://phabricator.wikimedia.org/T309505 (10jbond) 05Open→03Resolved a:03jbond This is fixed now [10:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:12:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30161 and previous config saved to /var/cache/conftool/dbconfig/20220624-101349-root.json 
[10:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30162 and previous config saved to /var/cache/conftool/dbconfig/20220624-101400-root.json [10:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30163 and previous config saved to /var/cache/conftool/dbconfig/20220624-101753-root.json [10:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:39] (03CR) 10Muehlenhoff: "Welcome back, great news! Can you please open a Phabricator task and tag it SRE-Access-Request, then Tyler can approve there and your acce" [puppet] - 10https://gerrit.wikimedia.org/r/808079 (owner: 10Chad) [10:26:18] (03PS3) 10Jbond: mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 [10:26:20] (03PS1) 10Jbond: spdx: Add csr files to the list of files to ignore. 
[puppet] - 10https://gerrit.wikimedia.org/r/808219 [10:27:33] (03CR) 10CI reject: [V: 04-1] mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 (owner: 10Jbond) [10:28:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1100 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30164 and previous config saved to /var/cache/conftool/dbconfig/20220624-102856-root.json [10:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30165 and previous config saved to /var/cache/conftool/dbconfig/20220624-102859-root.json [10:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30166 and previous config saved to /var/cache/conftool/dbconfig/20220624-102904-root.json [10:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:55] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:01] (03PS2) 10Jbond: spdx: Add csr files to the list of files to ignore. 
[puppet] - 10https://gerrit.wikimedia.org/r/808219 [10:31:19] (03PS4) 10Jbond: mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 [10:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30168 and previous config saved to /var/cache/conftool/dbconfig/20220624-103257-root.json [10:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30169 and previous config saved to /var/cache/conftool/dbconfig/20220624-103342-root.json [10:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:50] (03PS3) 10Jbond: spdx: Add csr files to the list of files to ignore. [puppet] - 10https://gerrit.wikimedia.org/r/808219 [10:35:42] (03CR) 10Jbond: spdx: Add csr files to the list of files to ignore. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond) [10:35:51] (03PS5) 10Jbond: mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 [10:35:58] (03PS7) 10Jbond: mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [10:36:13] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:42:45] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30170 and previous config saved to /var/cache/conftool/dbconfig/20220624-104403-root.json [10:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30171 and previous config saved to /var/cache/conftool/dbconfig/20220624-104407-root.json [10:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:40] (03CR) 10Muehlenhoff: "A few followup comments" [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [10:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30172 and previous config saved to /var/cache/conftool/dbconfig/20220624-104801-root.json [10:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:48:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30173 and previous config saved to /var/cache/conftool/dbconfig/20220624-104849-root.json [10:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30174 and previous config saved to /var/cache/conftool/dbconfig/20220624-104852-root.json [10:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:36] (03Abandoned) 10Itamar Givon: Turn Wikbase termbox SSR off for beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [10:55:56] (03PS2) 10Itamar Givon: Unconfigure wmgWikibaseSSRTermboxServerUrl on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803498 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [10:56:09] (03CR) 10CI reject: [V: 04-1] Unconfigure wmgWikibaseSSRTermboxServerUrl on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803498 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [10:56:48] (03PS3) 10Itamar Givon: Unconfigure wmgWikibaseSSRTermboxServerUrl on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803498 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [10:57:03] (03CR) 10CI reject: [V: 04-1] Unconfigure wmgWikibaseSSRTermboxServerUrl on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803498 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [10:57:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30175 and previous config saved to 
/var/cache/conftool/dbconfig/20220624-105705-root.json [10:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30176 and previous config saved to /var/cache/conftool/dbconfig/20220624-110305-root.json [11:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30177 and previous config saved to /var/cache/conftool/dbconfig/20220624-110356-root.json [11:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:17] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:06] 10SRE, 10DBA, 10Infrastructure-Foundations, 10CAS-SSO: Repurpose the "cas" database for webauthn tokens - https://phabricator.wikimedia.org/T311300 (10MoritzMuehlenhoff) [11:12:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30178 and previous config saved to /var/cache/conftool/dbconfig/20220624-111209-root.json [11:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30179 and previous config saved to /var/cache/conftool/dbconfig/20220624-111808-root.json [11:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 
(re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30180 and previous config saved to /var/cache/conftool/dbconfig/20220624-111859-root.json [11:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:35] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) Apologies, I should have mentioned that I'm happy to deploy these machines myself... [11:26:17] 10SRE, 10DBA, 10Infrastructure-Foundations, 10CAS-SSO: Repurpose the "cas" database for webauthn tokens - https://phabricator.wikimedia.org/T311300 (10Marostegui) p:05Triage→03Medium a:03Marostegui @MoritzMuehlenhoff so you want me to drop or truncate this table?: ` root@db1164.eqiad.wmnet[cas]> show... [11:27:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30181 and previous config saved to /var/cache/conftool/dbconfig/20220624-112713-root.json [11:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:18] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10BTullis) Ah, if I'm reading this correctly it looks like the error might have been fixed now: {T309505}... [11:29:36] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10MoritzMuehlenhoff) These look fine. Our Ganeti cookbook doesn't allow to create disks in p... 
[11:30:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30182 and previous config saved to /var/cache/conftool/dbconfig/20220624-113020-root.json [11:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30183 and previous config saved to /var/cache/conftool/dbconfig/20220624-113312-root.json [11:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30184 and previous config saved to /var/cache/conftool/dbconfig/20220624-113403-root.json [11:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:10] (03CR) 10Jbond: [C: 03+2] utils: Add small script to set up bundler [puppet] - 10https://gerrit.wikimedia.org/r/803341 (owner: 10Jbond) [11:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30185 and previous config saved to /var/cache/conftool/dbconfig/20220624-113841-root.json [11:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30186 and previous config saved to /var/cache/conftool/dbconfig/20220624-113914-root.json [11:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30187 and previous 
config saved to /var/cache/conftool/dbconfig/20220624-114217-root.json [11:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30188 and previous config saved to /var/cache/conftool/dbconfig/20220624-114816-root.json [11:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30189 and previous config saved to /var/cache/conftool/dbconfig/20220624-114907-root.json [11:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:00] (03PS1) 10Hnowlan: image-suggestion: new container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/808228 (https://phabricator.wikimedia.org/T311220) [11:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30190 and previous config saved to /var/cache/conftool/dbconfig/20220624-115345-root.json [11:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30191 and previous config saved to /var/cache/conftool/dbconfig/20220624-115418-root.json [11:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:42] (03PS1) 10Slyngshede: show_runtimes.py allow script to send email. [dumps] - 10https://gerrit.wikimedia.org/r/808229 [11:57:09] (03CR) 10CI reject: [V: 04-1] show_runtimes.py allow script to send email. 
[dumps] - 10https://gerrit.wikimedia.org/r/808229 (owner: 10Slyngshede) [11:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30192 and previous config saved to /var/cache/conftool/dbconfig/20220624-115720-root.json [11:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:53] (03CR) 10Hnowlan: [C: 03+1] "lgtm - I can deploy this on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [11:58:29] (03CR) 10Ayounsi: Initial support for servers switch interfaces (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [12:03:19] (03PS2) 10Slyngshede: show_runtimes.py allow script to send email. [dumps] - 10https://gerrit.wikimedia.org/r/808229 [12:04:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30193 and previous config saved to /var/cache/conftool/dbconfig/20220624-120411-root.json [12:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:34] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@18182aa]: (no justification provided) [12:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:06:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30194 and previous config saved to /var/cache/conftool/dbconfig/20220624-120632-root.json [12:06:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:44] (03CR) 10CI reject: [V: 04-1] show_runtimes.py allow script to send email. [dumps] - 10https://gerrit.wikimedia.org/r/808229 (owner: 10Slyngshede) [12:08:22] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@18182aa]: (no justification provided) (duration: 03m 47s) [12:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30195 and previous config saved to /var/cache/conftool/dbconfig/20220624-120849-root.json [12:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30196 and previous config saved to /var/cache/conftool/dbconfig/20220624-120922-root.json [12:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30197 and previous config saved to /var/cache/conftool/dbconfig/20220624-121224-root.json [12:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:45] (03PS3) 10Slyngshede: show_runtimes.py allow script to send email. 
[dumps] - 10https://gerrit.wikimedia.org/r/808229 [12:14:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30198 and previous config saved to /var/cache/conftool/dbconfig/20220624-121359-root.json [12:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:31] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:59] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 28s) [12:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:23] (03PS4) 10Slyngshede: show_runtimes.py allow script to send email. 
[dumps] - 10https://gerrit.wikimedia.org/r/808229 [12:22:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30199 and previous config saved to /var/cache/conftool/dbconfig/20220624-122256-root.json [12:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30200 and previous config saved to /var/cache/conftool/dbconfig/20220624-122353-root.json [12:23:55] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30201 and previous config saved to /var/cache/conftool/dbconfig/20220624-122425-root.json [12:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30202 and previous config saved to /var/cache/conftool/dbconfig/20220624-122728-root.json [12:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30203 and previous config saved to /var/cache/conftool/dbconfig/20220624-122903-root.json [12:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:17] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30204 and previous config saved to /var/cache/conftool/dbconfig/20220624-122916-root.json [12:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:51] 10SRE, 10DBA, 10Infrastructure-Foundations, 10CAS-SSO: Repurpose the "cas" database for webauthn tokens - https://phabricator.wikimedia.org/T311300 (10MoritzMuehlenhoff) >>! In T311300#8025518, @Marostegui wrote: > @MoritzMuehlenhoff so you want me to drop or truncate this table?: > ` Please drop the enti... [12:34:41] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:44] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 03s) [12:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:34] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (10Marostegui) [12:36:02] 10SRE, 10DBA, 10Infrastructure-Foundations, 10CAS-SSO: Repurpose the "cas" database for webauthn tokens - https://phabricator.wikimedia.org/T311300 (10Marostegui) 05Open→03Resolved I have taken a quick backup from those tables: ` root@cumin1001:/home/marostegui/T311300# ls -lh total 56K -rw-r--r-- 1 ro... 
[12:37:57] 10SRE, 10Data-Engineering, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10zeljkofilipin) [12:38:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30205 and previous config saved to /var/cache/conftool/dbconfig/20220624-123857-root.json [12:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30206 and previous config saved to /var/cache/conftool/dbconfig/20220624-123929-root.json [12:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:54] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 03s) [12:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:48] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: document puppet/netbox/hiera interaction - https://phabricator.wikimedia.org/T311304 (10jbond) [12:44:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30207 and previous config saved to /var/cache/conftool/dbconfig/20220624-124407-root.json [12:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30208 and previous config 
saved to /var/cache/conftool/dbconfig/20220624-124420-root.json [12:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:02] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [12:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:25] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:34] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [12:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:47] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:55] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [12:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [12:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:36] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [12:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:45] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:53] !log 
bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [12:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster [12:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30209 and previous config saved to /var/cache/conftool/dbconfig/20220624-125401-root.json [12:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster [12:54:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30210 and previous config saved to /var/cache/conftool/dbconfig/20220624-125433-root.json [12:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:13] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [12:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:21] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 07s) [12:58:24] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:28] 10SRE, 10DBA, 10Infrastructure-Foundations, 10CAS-SSO: Repurpose the "cas" database for webauthn tokens - https://phabricator.wikimedia.org/T311300 (10MoritzMuehlenhoff) Thanks! [12:58:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30211 and previous config saved to /var/cache/conftool/dbconfig/20220624-125834-root.json [12:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30212 and previous config saved to /var/cache/conftool/dbconfig/20220624-125911-root.json [12:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30213 and previous config saved to /var/cache/conftool/dbconfig/20220624-125924-root.json [12:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P30214 and previous config saved to /var/cache/conftool/dbconfig/20220624-130055-root.json [13:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:51] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:59] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:14] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30215 and previous config saved to /var/cache/conftool/dbconfig/20220624-130514-root.json [13:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30216 and previous config saved to /var/cache/conftool/dbconfig/20220624-130519-root.json [13:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:50] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:58] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [13:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30217 and previous config saved to /var/cache/conftool/dbconfig/20220624-130743-root.json [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30218 and previous config saved to /var/cache/conftool/dbconfig/20220624-130937-root.json [13:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:50] !log 
cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1016.eqiad.wmnet with reason: host reimage [13:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:54] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:02] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30219 and previous config saved to /var/cache/conftool/dbconfig/20220624-131415-root.json [13:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30220 and previous config saved to /var/cache/conftool/dbconfig/20220624-131428-root.json [13:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:01] (PS1) Jbond: wmflib: Add check for storconfig to puppetdb functions [puppet] - https://gerrit.wikimedia.org/r/808236 (https://phabricator.wikimedia.org/T311240) [13:16:56] (CR) CI reject: [V: -1] wmflib: Add check for storconfig to puppetdb functions [puppet] - https://gerrit.wikimedia.org/r/808236 (https://phabricator.wikimedia.org/T311240) (owner: Jbond) [13:18:35] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36029/console" [puppet] - https://gerrit.wikimedia.org/r/808236 (https://phabricator.wikimedia.org/T311240) (owner: Jbond) [13:20:18] !log
marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30221 and previous config saved to /var/cache/conftool/dbconfig/20220624-132017-root.json [13:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30222 and previous config saved to /var/cache/conftool/dbconfig/20220624-132024-root.json [13:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:10] (CR) ArielGlenn: "There are at least two other scripts that also would need to be modified. Maybe it's better to allow the systemd timer wrapper to send mai" [dumps] - https://gerrit.wikimedia.org/r/808229 (owner: Slyngshede) [13:21:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1016.eqiad.wmnet with OS buster [13:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:59] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster completed:...
[13:22:37] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (Cmjohnson) [13:22:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30223 and previous config saved to /var/cache/conftool/dbconfig/20220624-132247-root.json [13:22:50] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (Cmjohnson) Open→Resolved [13:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:46] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (Cmjohnson) @Jgreen I have mgmt ip's for them, they need manual setup. I will not get to them today but should be able to turn them over to you next week. [13:24:46] (CR) Jbond: [V: +1 C: -2] "im not sure this works as planned further (especially with CI) i don't think its a good idea" [puppet] - https://gerrit.wikimedia.org/r/808236 (https://phabricator.wikimedia.org/T311240) (owner: Jbond) [13:25:00] (Abandoned) Jbond: wmflib: Add check for storconfig to puppetdb functions [puppet] - https://gerrit.wikimedia.org/r/808236 (https://phabricator.wikimedia.org/T311240) (owner: Jbond) [13:25:22] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200?
- https://phabricator.wikimedia.org/T310923 (Papaul) @LSobanski thanks [13:29:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30224 and previous config saved to /var/cache/conftool/dbconfig/20220624-132919-root.json [13:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30225 and previous config saved to /var/cache/conftool/dbconfig/20220624-132932-root.json [13:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:42] (PS7) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) [13:35:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30226 and previous config saved to /var/cache/conftool/dbconfig/20220624-133521-root.json [13:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30227 and previous config saved to /var/cache/conftool/dbconfig/20220624-133528-root.json [13:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30228 and previous config saved to /var/cache/conftool/dbconfig/20220624-133751-root.json [13:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:28] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet]
- https://gerrit.wikimedia.org/r/790670 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [13:38:38] (PS3) Daimona Eaytoy: Remove references to $wgEnableLocalTimedText [mediawiki-config] - https://gerrit.wikimedia.org/r/802894 [13:42:57] SRE, SRE-OnFire, Observability-Alerting, I18n: Internationalization (i18n) & localization (l10n) of www.wikimediastatus.net - https://phabricator.wikimedia.org/T305896 (lmata) [13:43:42] SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q4), Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (lmata) [13:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30229 and previous config saved to /var/cache/conftool/dbconfig/20220624-134423-root.json [13:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30230 and previous config saved to /var/cache/conftool/dbconfig/20220624-134436-root.json [13:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:19] (Abandoned) Slyngshede: show_runtimes.py allow script to send email.
[dumps] - https://gerrit.wikimedia.org/r/808229 (owner: Slyngshede) [13:50:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30231 and previous config saved to /var/cache/conftool/dbconfig/20220624-135025-root.json [13:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30232 and previous config saved to /var/cache/conftool/dbconfig/20220624-135032-root.json [13:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30233 and previous config saved to /var/cache/conftool/dbconfig/20220624-135255-root.json [13:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:48] (PS1) Ssingh: test_dns: remove redundant comments about ECS [software/knead-wikidough] - https://gerrit.wikimedia.org/r/808243 [13:55:18] (CR) Ssingh: [C: +2] test_dns: remove redundant comments about ECS [software/knead-wikidough] - https://gerrit.wikimedia.org/r/808243 (owner: Ssingh) [13:59:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30234 and previous config saved to /var/cache/conftool/dbconfig/20220624-135940-root.json [13:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:31] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:40] !log bmansurov@deploy1002 Finished deploy
[airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:17] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:25] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30235 and previous config saved to /var/cache/conftool/dbconfig/20220624-140529-root.json [14:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30236 and previous config saved to /var/cache/conftool/dbconfig/20220624-140536-root.json [14:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] (CR) Ayounsi: [C: +1] Add sukhe to super-user for router configuration (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/807145 (owner: Ssingh) [14:07:52] (CR) Ssingh: [C: +2] Add sukhe to super-user for router configuration [homer/public] - https://gerrit.wikimedia.org/r/807145 (owner: Ssingh) [14:07:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30237 and previous config saved to /var/cache/conftool/dbconfig/20220624-140759-root.json [14:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:07] !log bmansurov@deploy1002 Started deploy
[airflow-dags/research@b3fe77c]: (no justification provided) [14:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:15] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:54] (Merged) jenkins-bot: Add sukhe to super-user for router configuration [homer/public] - https://gerrit.wikimedia.org/r/807145 (owner: Ssingh) [14:09:50] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:58] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:10:55] SRE, ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (Jclark-ctr) device has never had eno2 connected @aborrero looks like port was enabled after buster upgrade.
[14:10:58] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:07] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:16] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:24] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:17] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:25] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 07s) [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30238 and previous config saved to /var/cache/conftool/dbconfig/20220624-142033-root.json [14:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30239 and previous config saved to /var/cache/conftool/dbconfig/20220624-142040-root.json [14:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:55] (CR) Volans: [C: -1] "Unless something has
changed with the upgrade to Netbox 3.2 it's using Redis on localhost (as it was migrated away of the central redis cl" [dns] - https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [14:21:53] (CR) Volans: [C: -1] "Same for the other CR. Unless something has changed it's using Redis on localhost." [puppet] - https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [14:23:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30240 and previous config saved to /var/cache/conftool/dbconfig/20220624-142303-root.json [14:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:33] (PS7) DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) [14:31:27] !log running homer * commit "adding sukhe" CR: 807145 [14:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30241 and previous config saved to /var/cache/conftool/dbconfig/20220624-143537-root.json [14:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30242 and previous config saved to /var/cache/conftool/dbconfig/20220624-143544-root.json [14:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:28] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:39:32] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:36] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 07s) [14:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:41] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:49] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 07s) [14:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:13] (CR) Jgiannelos: [C: +1] "I also did a couple of manual runs with real data and looks like its working fine. The only 2 things to check after deployment are:" [puppet] - https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: Isabelle Hurbain-Palatin) [14:48:34] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:42] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:57] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:55] (PS1) Muehlenhoff: Extend access for aniketasrs [puppet] - https://gerrit.wikimedia.org/r/808252 [14:53:34] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 02m 37s) [14:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:56] !log bmansurov@deploy1002 Started
deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:00] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 04s) [14:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:41] (CR) Muehlenhoff: [C: +2] Extend access for aniketasrs [puppet] - https://gerrit.wikimedia.org/r/808252 (owner: Muehlenhoff) [14:56:25] jbond: I'll merge your util/bundler patch along [14:57:32] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:25] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:00:27] (PS1) Muehlenhoff: Add thirdparty/hwraid component for wikimedia-private repo [puppet] - https://gerrit.wikimedia.org/r/808253 (https://phabricator.wikimedia.org/T308027) [15:01:11] SRE, ops-eqsin, Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (ssingh) >>! In T311264#8024288, @RobH wrote: > So if the idrac is accessible, the firmware update isn't OS impacting. However, I cannot login to this idrac interface via HTTPS or SSH, so i...
[15:04:25] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:07:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:33] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (Cmjohnson) [15:10:27] (CR) Hnowlan: [C: +1] "nice! thanks." [puppet] - https://gerrit.wikimedia.org/r/807553 (https://phabricator.wikimedia.org/T311156) (owner: Jbond) [15:17:52] !log dancy@deploy1002 Started deploy [integration/docroot@ea9b8fa]: (no justification provided) [15:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:01] !log dancy@deploy1002 Finished deploy [integration/docroot@ea9b8fa]: (no justification provided) (duration: 00m 08s) [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:00] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul) [15:22:41] SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul) [15:24:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:44] ^ the systemd status on thanos-fe1001 has been flapping for some time [15:43:03] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:09]
(PS2) Jdlrobson: Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) [15:47:13] (PS2) Jdlrobson: Enable title above tabs on all opt-in wikis (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808057 (https://phabricator.wikimedia.org/T310054) [15:48:31] (CR) CI reject: [V: -1] Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson) [15:49:43] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (hnowlan) I've added some basic service configuration docs from an ops perspective [[ https://wikitech.wikimedia... [15:53:58] (PS1) Urbanecm: Remove wgGEMentorDashboardBetaMode [mediawiki-config] - https://gerrit.wikimedia.org/r/808263 [15:54:00] (PS1) Urbanecm: [beta] Remove wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/808264 [15:56:24] (PS1) Muehlenhoff: Extend access for mnz [puppet] - https://gerrit.wikimedia.org/r/808266 [15:58:26] (CR) Muehlenhoff: [C: +2] Extend access for mnz [puppet] - https://gerrit.wikimedia.org/r/808266 (owner: Muehlenhoff) [15:58:31] (PS2) Muehlenhoff: Extend access for mnz [puppet] - https://gerrit.wikimedia.org/r/808266 [16:00:41] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:01:09] (PS1) Urbanecm: Add GEMentorProvider to configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/808267 (https://phabricator.wikimedia.org/T310905) [16:01:11] (PS1) Urbanecm: [beta] Growth: Enable structured mentor list at cswiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905) [16:03:13] (PS1) Urbanecm: [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) [16:03:24] (CR) Urbanecm: [C: -2] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm) [16:03:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:08:45] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:08:59] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:45] (CR) Urbanecm: [C: -1] Enable title above tabs on group 1 and group 0 wikis (1/2) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson) [16:10:17] SRE-Access-Requests: Shell access request for @demon - https://phabricator.wikimedia.org/T311314 (demon) [16:10:25] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza) [16:12:19] (PS2) Chad: Reinstate my shell account, grab all the roles for RelEng [puppet] - https://gerrit.wikimedia.org/r/808079 [16:12:58] (CR) Urbanecm: [C: +1] QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza) [16:13:34]
(PS3) Chad: Reinstate my shell account, grab all the roles for RelEng [puppet] - https://gerrit.wikimedia.org/r/808079 (https://phabricator.wikimedia.org/T311314) [16:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:15:15] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:16:03] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:31:05] !log finished running homer * commit "adding sukhe" CR: 807145 [16:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:21] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:55:02] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:06:55] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:09:59] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:43] SRE, Observability-Alerting: Reminders for unhandled/unacked alerts - https://phabricator.wikimedia.org/T307958 (lmata) [17:28:59] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:03] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:29] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:02] (CR) Ahmon Dancy: P:mediawiki::scap_client: add paremeter to indicate scap master (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: Jbond) [17:55:16] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/808253 (https://phabricator.wikimedia.org/T308027) (owner: Muehlenhoff) [18:04:18] (PS8) Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) [18:04:39] (CR) Slyngshede: Ganeti Prometheus exporter, initial checkin (10 comments) [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) (owner:
Slyngshede) [18:06:45] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:18:33] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:02] SRE, SRE Observability: systemd state on thanos-fe1001 is flapping - https://phabricator.wikimedia.org/T311322 (ssingh) [18:23:30] SRE, SRE Observability: systemd state on thanos-fe1001 is flapping - https://phabricator.wikimedia.org/T311322 (ssingh) p:Triage→Low [18:34:59] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:17] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [18:46:51] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [18:51:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:36] xx [18:56:34] SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (Dzahn) T311290 has been named as the reason for that issue with the cookbook. Should be fixed already. [18:57:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:58:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:03] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:01:45] SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron) [19:01:55] 
(CR) Dzahn: "can be done anytime. before cookbook run is best." [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn) [19:03:17] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:56] SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron) >>! In T309033#8024731, @herron wrote: > Still running into some problems after rebuilding and upgrading. Primarily, incidents created are missi... [19:14:31] ACKNOWLEDGEMENT - DPKG on durum1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn wait until https://gerrit.wikimedia.org/r/c/operations/puppet/+/808043/ is merged https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:14:48] thanks mutante ^ [19:35:20] (PS8) DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) [19:35:41] !log dancy@deploy1002 backport aborted: (duration: 00m 12s) [19:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:09] (CR) DDesouza: [C: +1] "Added a new parameter to the survey." [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza) [19:40:37] SRE, DNS, Traffic, WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) My meeting will happen before they shut things off - so there will likely be a slight delay into July - and I should ha... 
[19:42:20] SRE, SRE-Access-Requests, Patch-For-Review: Shell access request for @demon - https://phabricator.wikimedia.org/T311314 (thcipriani) > Tyler CC'd for approval Approved! I can confirm: he's back 😂 [19:45:53] (CR) Majavah: Reinstate my shell account, grab all the roles for RelEng (1 comment) [puppet] - https://gerrit.wikimedia.org/r/808079 (https://phabricator.wikimedia.org/T311314) (owner: Chad) [20:15:02] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:23:24] (PS1) Dzahn: switch policy.wikimedia.org back from Wordpress to WMF DNS [dns] - https://gerrit.wikimedia.org/r/808309 (https://phabricator.wikimedia.org/T310738) [20:24:21] (CR) Dzahn: [C: -2] "should not be merged before rewrite rule are in place. but to raise awareness of the request in the first place" [dns] - https://gerrit.wikimedia.org/r/808309 (https://phabricator.wikimedia.org/T310738) (owner: Dzahn) [20:35:30] (PS1) Jcrespo: delete-media-file: Add failsafe to file deletion [software/mediabackups] - https://gerrit.wikimedia.org/r/808314 (https://phabricator.wikimedia.org/T311215) [20:36:06] (CR) CI reject: [V: -1] delete-media-file: Add failsafe to file deletion [software/mediabackups] - https://gerrit.wikimedia.org/r/808314 (https://phabricator.wikimedia.org/T311215) (owner: Jcrespo) [20:37:01] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:45:17] (PS1) Matthias Mullie: Echo tables can live in a different db [extensions/ImageSuggestions] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/808120 [20:55:02] (KubernetesRsyslogDown) firing: rsyslog on 
ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:06:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:15:53] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) >>! In T138093#8024884, @ori wrote: > It'd be straightforward to carve this out into separate vmod. Done: https://github.com/atdt/... [21:28:01] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:35:55] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:38:50] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (matmarex) I wanted to add a note regarding the PHP tricks of passing arrays in query parameters that were mentioned in T138093#2586982.... 
[22:01:25] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:14:17] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:16:33] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:36:35] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:37:57] (CR) Cwhite: [C: +2] "Approved by Tyler on linked task." [puppet] - https://gerrit.wikimedia.org/r/808079 (https://phabricator.wikimedia.org/T311314) (owner: Chad) [22:39:03] (CR) Cwhite: [C: +2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/808079 (https://phabricator.wikimedia.org/T311314) (owner: Chad) [22:43:38] SRE, SRE-Access-Requests, Patch-For-Review: Shell access request for @demon - https://phabricator.wikimedia.org/T311314 (colewhite) Open→Resolved a:colewhite Group membership and ssh key deployed. 
[22:50:55] SRE, SRE-Access-Requests, Patch-For-Review: Shell access request for @demon - https://phabricator.wikimedia.org/T311314 (Dzahn) Welcome back! Wondering ..what about LDAP groups. Is dn: uid=demon,ou=people,dc=wikimedia,dc=org supposed to be added to "wmf"? Was it previously in there? Or was that a... [23:35:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27