[00:12:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T297189)', diff saved to https://phabricator.wikimedia.org/P24191 and previous config saved to /var/cache/conftool/dbconfig/20220407-001254-marostegui.json [00:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:58] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [00:15:12] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:25] (PS1) RLazarus: external_clouds_vendors: Support entity types besides "cloud" [puppet] - https://gerrit.wikimedia.org/r/777899 (https://phabricator.wikimedia.org/T305581) [00:26:52] (PS1) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) [00:26:54] (PS1) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) [00:27:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P24192 and previous config saved to /var/cache/conftool/dbconfig/20220407-002759-marostegui.json [00:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:07] (CR) jerkins-bot: [V: -1] static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) (owner: Krinkle) [00:28:09] (CR) jerkins-bot: [V: -1] static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: 
Krinkle) [00:32:01] (CR) Krinkle: [C: +1] "Good to go. Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/776255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [00:32:09] (PS2) Krinkle: Stop writing to $wmfUsingKubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [00:32:46] (CR) Krinkle: [C: +1] "This must be staged and synced separately from the parent - Good to go!" [mediawiki-config] - https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [00:35:55] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:39:13] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P24193 and previous config saved to /var/cache/conftool/dbconfig/20220407-004304-marostegui.json [00:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:15] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:30] SRE, Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (ssingh) >>! In T305589#7836863, @Dzahn wrote: > My 2 cents: Thanks for the feedback! > cookbook not worth it in this case, likely more work to create and debug it than the actual time savings with i... 
[00:58:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T297189)', diff saved to https://phabricator.wikimedia.org/P24194 and previous config saved to /var/cache/conftool/dbconfig/20220407-005809-marostegui.json [00:58:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [00:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [00:58:14] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [00:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24195 and previous config saved to /var/cache/conftool/dbconfig/20220407-005817-marostegui.json [00:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:52] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:18:48] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:28:40] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:08] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:38:40] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:38:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:26] (PS2) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) [01:39:28] (PS2) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) [01:40:20] (CR) jerkins-bot: [V: -1] static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle) [01:40:23] (CR) jerkins-bot: [V: -1] static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) (owner: Krinkle) [01:41:07] (CR) Krinkle: static.php: Restore short cache for temporary 'mismatch' response (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) (owner: Krinkle) [01:41:51] (PS3) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) [01:41:53] (PS3) Krinkle: static.php: Restore short cache for temporary 
'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) [01:42:41] (PS4) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) [01:42:43] (PS4) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) [01:43:01] (CR) Krinkle: "@dancy These next two are a bit less trivial. Could use a second pair of eyes." [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle) [01:43:44] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:27] (PS1) Krinkle: varnish: Expand static.php optimisation regarless of query string [puppet] - https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) [01:58:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24196 and previous config saved to /var/cache/conftool/dbconfig/20220407-015832-marostegui.json [01:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:37] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [01:59:59] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:02:49] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:13:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24197 and previous config saved to /var/cache/conftool/dbconfig/20220407-021337-marostegui.json [02:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:13] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:26:55] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24198 and previous config saved to /var/cache/conftool/dbconfig/20220407-022842-marostegui.json [02:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [02:43:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24199 and previous config saved to /var/cache/conftool/dbconfig/20220407-024347-marostegui.json [02:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:52] T297189: Schema change for dropping ft_title and 
ft_namesapce - https://phabricator.wikimedia.org/T297189 [02:46:47] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:59:48] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:00:16] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:26] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:14:32] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:19:54] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:25:08] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:04] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:33:58] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:36:16] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:36:30] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:38:16] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:41:42] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:44:36] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:21] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:06:07] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:09:11] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (paramita_das) Hi @Aklapper @Ottomata, I am trying to open a SSH tunnel to connect to analytics clients using the command mentioned https://wikitech.wikimedia.or... [04:09:21] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:13:08] !log [Elastic] Beginning rolling reboot of codfw elastic to apply kernel security updates: `ryankemper@cumin1001:~$ sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster reboot" --reboot --nodes-per-run 3 --start-datetime 2022-04-07T04:09:05 --task-id T304938` [04:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:17] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938 [04:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:05] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:18:39] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:20:33] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:25:29] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:27:45] RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:31] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:44] !log [Elastic] for future reference, we still need to fix the fact that we haven't told systemd that the prometheus-wmf-elasticsearch exporters need to start after the actual elasticsearch service [04:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:03] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:21] (manually restarted failing prometheus exporter units) [04:39:49] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:40:43] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:41:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [04:41:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [04:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24200 and previous config saved to /var/cache/conftool/dbconfig/20220407-044158-marostegui.json [04:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:01] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [04:42:29] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:45:01] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:30] (PS1) Marostegui: Revert "db1163: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/777776 [04:54:19] (CR) Marostegui: [C: +2] Revert "db1163: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/777776 (owner: Marostegui) [04:57:33] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2076 db2086:3317 db2086:3318 db2107 db2137:3314 db2137:3315 db2143 db2147 es2029 es2030 T305469', diff saved to https://phabricator.wikimedia.org/P24201 and previous config saved to /var/cache/conftool/dbconfig/20220407-050149-root.json [05:01:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:54] T305469: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 [05:04:15] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:04:51] PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:29] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:59] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:16:53] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:27] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:25:54] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:04] RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:29:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: 
CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:30:28] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:37:26] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:42:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24202 and previous config saved to /var/cache/conftool/dbconfig/20220407-054213-marostegui.json [05:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:17] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [05:43:12] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:12] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:44:00] (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:04] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:38] SRE, conftool: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (Joe) [05:45:56] SRE, conftool: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (Joe) p:Triage→High [05:53:56] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938 [05:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:20] RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:55:20] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:56:07] SRE, conftool: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 (Joe) [05:56:12] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:56:21] SRE, conftool: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 (Joe) p:Triage→Medium [05:57:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P24203 and previous config saved to 
/var/cache/conftool/dbconfig/20220407-055718-marostegui.json [05:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:14] !log [Elastic] Manually restarted elasticsearch exporters on `cloudelastic1004` and `elastic2054` [05:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:36] RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0600). [06:00:05] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938 [06:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:06] SRE, conftool, Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (Joe) Ipblock, per se, supports arbitrary scope names. What we need is to add support for thes other scopes in VCL. My proposal would be to ditch the `X-Pu... 
[06:12:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P24205 and previous config saved to /var/cache/conftool/dbconfig/20220407-061223-marostegui.json [06:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:42] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:44] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:46] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:44] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:02] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:19:14] RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:28] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:21:36] PROBLEM - BGP status on 
cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:25:48] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938 [06:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:28] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:16] !log [Elastic] Manually restarted elasticsearch exporters on `elastic2043` and `elastic2058` [06:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24206 and previous config saved to /var/cache/conftool/dbconfig/20220407-062728-marostegui.json [06:27:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [06:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:31] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [06:27:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [06:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24207 and previous config saved to /var/cache/conftool/dbconfig/20220407-062736-marostegui.json 
[06:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:48] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:48] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [06:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300775)', diff saved to https://phabricator.wikimedia.org/P24208 and previous config saved to /var/cache/conftool/dbconfig/20220407-064258-marostegui.json [06:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:02] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:43:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:58] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:12] (03PS1) 10Ladsgroup: Enable videojs on wiktionary wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778197 
(https://phabricator.wikimedia.org/T248418) [06:53:06] good morning [06:53:39] I am going to restart CI and Gerrit entirely starting at 7:00 UTC (7 minutes from now) [06:54:01] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [06:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:08] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1002.eqiad.wmnet with OS bullseye [06:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:46] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:56:31] (03CR) 10Ayounsi: Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [06:58:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24209 and previous config saved to /var/cache/conftool/dbconfig/20220407-065803-marostegui.json [06:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] hashar: Time to snap out of that daydream and deploy CI/Gerrit maintenance. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0700). [07:00:05] Amir1, apergos, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0700). [07:00:08] there is a trainee in the window but no patches scheduled, which might be a good thing, given that gerrit is set for 30 minutes of maintenance beginning now. 
[07:00:21] good morning [07:00:30] I'll catch the trainee if they show up in the google meet and explain things. they can reschedule. [07:00:33] hello hasha r [07:00:37] I apologize for the backport & config window hijack [07:00:43] but should be a fast operation :] [07:00:43] apergos: where is the meeting? I don't have the link [07:01:10] https://meet.google.com/ium-qmwp-wvd?authuser=0 but don't bother showing up [07:01:48] you should get it to show up on your calendar, ask Tyler [07:01:55] since you're listed for this window always [07:02:08] !log Restarting contint1001.wikimedia.org [07:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:10] PROBLEM - Host contint1001 is DOWN: PING CRITICAL - Packet loss = 100% [07:05:08] RECOVERY - Host contint1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [07:08:32] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:10:03] !log Restarting gerrit1001.wikimedia.org [07:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:21] !log Restarting contint2001.wikimedia.Org [07:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24210 and previous config saved to /var/cache/conftool/dbconfig/20220407-071308-marostegui.json [07:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:25] (03CR) 10Ayounsi: "Thanks! that's awesome." 
[puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:13:47] Apr 07 07:12:08 gerrit1001 apachectl[886]: (99)Cannot assign requested address: AH00072: make_sock: could not bind to address [2620:0:861:2:208:80:154:137]:80 [07:13:48] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:53] poor Apache [07:14:20] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:42] !log gerrit1001.wikimedia.org: restarted apache2 service [07:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:04] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:06] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Add to repo [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879) [07:17:21] !log CI and Gerrit are back up [07:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:14] (03PS3) 10Elukey: role::ml_k8s::master: change the codfw svc IP range [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) [07:19:40] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [07:20:24] (03PS4) 10Elukey: role::ml_k8s::master: change the codfw svc/pod IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) [07:20:48] (03PS2) 10Elukey: Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673) [07:21:30] hey TheresNoTime there are no patches for today, so I've commented on the training task, let's try again for next week. [07:22:06] (03PS3) 10Elukey: Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673) [07:23:44] (03PS1) 10Elukey: Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) [07:26:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:26:22] marostegui: is the large amount of lag on db1163 expected?
[07:26:31] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:28:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300775)', diff saved to https://phabricator.wikimedia.org/P24211 and previous config saved to /var/cache/conftool/dbconfig/20220407-072813-marostegui.json [07:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:19] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:28:19] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:29:07] JJMC89: checking [07:29:47] it is not [07:29:51] It shouldn't have been repooled [07:29:53] depooling it [07:30:01] Amir1: we need to check why it was repooled [07:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1163', diff saved to https://phabricator.wikimedia.org/P24212 and previous config saved to /var/cache/conftool/dbconfig/20220407-073013-root.json [07:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:44] I check [07:30:50] Amir1: I think I know why [07:30:57] Amir1: both schema changes overlapped [07:31:05] JJMC89: thanks for the heads up! [07:31:18] wait I thought you were done with s1 [07:31:18] is it s1? 
[07:31:31] Amir1: No, I had two hosts pending [07:31:33] I am now done [07:31:48] I started them yesterday and they finished today [07:32:01] no problem [07:32:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [07:32:07] https://phabricator.wikimedia.org/T300775#7837123 [07:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [07:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:13] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:33] marostegui: I am so sorry, I took the comment as "it is done" https://phabricator.wikimedia.org/T300775#7837123 [07:33:01] Amir1: Yeah, not sure what happened, as I see the host being repooled today too [07:33:18] it happens [07:33:31] should I stop my schema change?
[07:33:34] Amir1: No, I see what happened, the schema change did finish, but the host was still catching up [07:33:41] That is why I commented there [07:33:57] aaah, That's "finish" [07:35:21] it actually reminds me of a famous aviation accident in which there was a misunderstanding on what "take off" meant [07:35:45] and after that the rules changed [07:35:59] * Amir1 stops channeling his inner wikipedia [07:39:00] https://www.vintag.es/2022/03/tenerife-airport-disaster.html [07:41:46] (03CR) 10Ayounsi: [C: 03+1] Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [07:43:55] (03PS2) 10MMandere: site: Reimage cp3050 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777846 (https://phabricator.wikimedia.org/T290005) [07:44:13] !log depool cp3050 for reimage - T290005 [07:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:17] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:45:03] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24213 and previous config saved to /var/cache/conftool/dbconfig/20220407-074654-marostegui.json [07:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:58] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [07:48:33] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:54:52] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3050 as 
cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777846 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:55:07] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:55:55] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3050.esams.wmnet with OS buster [07:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:04] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3050.esams.wmnet with OS buster [08:00:04] jnuche and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0800). 
[08:00:37] !log depool cp6014 for reimage - T290005 [08:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:42] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24214 and previous config saved to /var/cache/conftool/dbconfig/20220407-080159-marostegui.json [08:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:14] there are some blockers :( [08:05:10] (03PS2) 10MMandere: site: Reimage cp6014 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777847 (https://phabricator.wikimedia.org/T290005) [08:06:13] eg https://phabricator.wikimedia.org/T305531 [08:06:35] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:17] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6014 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777847 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [08:09:49] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS buster [08:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:54] hmm processing [08:09:57] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster [08:10:13] (03PS1) 10Hashar: all wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778217 [08:10:15] (03CR) 10Hashar: [C: 03+2] all wikis to 
1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778217 (owner: 10Hashar) [08:11:08] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778217 (owner: 10Hashar) [08:13:00] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.6 refs T305212 [08:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:03] T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212 [08:14:25] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:15:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:23] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24215 and previous config saved to /var/cache/conftool/dbconfig/20220407-081704-marostegui.json [08:17:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [08:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [08:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24216 and previous config saved to /var/cache/conftool/dbconfig/20220407-081910-ladsgroup.json [08:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:19:43] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Add to repo [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:20:19] (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: Add to repo [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:21:23] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:23:32] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [08:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:53] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:23:57] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3050.esams.wmnet with reason: host reimage [08:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:39] (03CR) 10Klausman: [C: 03+1] Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [08:26:53] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage [08:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:24] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3050.esams.wmnet with reason: host reimage [08:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:13] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage [08:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24217 and previous config saved to /var/cache/conftool/dbconfig/20220407-083209-marostegui.json [08:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:13] T297189: Schema 
change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [08:32:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:33:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [08:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:41] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:35:55] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:08] (03CR) 10Cathal Mooney: "Thanks for the response, I'll submit a new patchset with those changes and push." 
[homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [08:38:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [08:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:19] (03PS2) 10Cathal Mooney: Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) [08:41:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P24218 and previous config saved to /var/cache/conftool/dbconfig/20220407-084103-root.json [08:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24219 and previous config saved to /var/cache/conftool/dbconfig/20220407-084140-marostegui.json [08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:45] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [08:42:39] (03CR) 10Cathal Mooney: [C: 03+2] Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777855 
(https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [08:43:35] (03Merged) 10jenkins-bot: Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [08:49:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1002.eqiad.wmnet with OS bullseye [08:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:50] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: bugfix remove rsyslog-set log.level from blackbox_exporter events [puppet] - 10https://gerrit.wikimedia.org/r/777877 (owner: 10Cwhite) [08:56:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P24220 and previous config saved to /var/cache/conftool/dbconfig/20220407-085608-root.json [08:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:41] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3050.esams.wmnet with OS buster [08:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:51] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3050.esams.wmnet with OS buster com... 
[08:59:00] (03CR) 10Filippo Giunchedi: WIP move core routers definitions to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:59:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2122.codfw.wmnet with reason: Rebooting for T303174 [08:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2122.codfw.wmnet with reason: Rebooting for T303174 [08:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:48] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2098.codfw.wmnet with OS bullseye [09:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:11] !log pool cp3050 with HAProxy as TLS termination layer - T290005 [09:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:14] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:01:42] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T305300)', diff saved to https://phabricator.wikimedia.org/P24221 and previous config saved to 
/var/cache/conftool/dbconfig/20220407-090201-ladsgroup.json [09:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:04] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [09:05:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2150.codfw.wmnet with reason: Rebooting for T303174 [09:05:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2150.codfw.wmnet with reason: Rebooting for T303174 [09:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:14] 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10fgiunchedi) Thanks @ssingh for kickstarting the discussion! My two cents as an owner (with o11y) of some VMs that will need upgrading (grafana, logstash, etc): I think our strategy when it comes to l... [09:08:10] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:10:00] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:11:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24222 and previous config saved to /var/cache/conftool/dbconfig/20220407-091112-root.json [09:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:23] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2098.codfw.wmnet with reason: host reimage [09:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:41] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2152.codfw.wmnet with reason: Rebooting for T303174 [09:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2152.codfw.wmnet with reason: Rebooting for T303174 [09:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:58] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:19] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2098.codfw.wmnet with reason: host reimage [09:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) 05Open→03Resolved Host reimaged correctly, all done! 
[09:16:02] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS buster [09:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:11] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster com... [09:20:12] !log pool cp6014 with HAProxy as TLS termination layer - T290005 [09:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:16] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:20:41] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 7 hosts with reason: Rebooting primary T303174 [09:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Rebooting primary T303174 [09:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2105.codfw.wmnet with reason: Rebooting for T303174 [09:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2105.codfw.wmnet with reason: Rebooting for T303174 [09:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:34] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 
[09:25:26] !log depool cp3053 for reimage - T290005 [09:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:25:32] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:25:44] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24223 and previous config saved to /var/cache/conftool/dbconfig/20220407-092616-root.json [09:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:08] (03PS2) 10MMandere: site: Reimage cp3053 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777848 (https://phabricator.wikimedia.org/T290005) [09:30:08] (03PS1) 10Elukey: kserve-inference: Allow prometheus to scrape istio sidecar's port [deployment-charts] - 10https://gerrit.wikimedia.org/r/778247 (https://phabricator.wikimedia.org/T297612) [09:30:32] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3053 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777848 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:30:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2123.codfw.wmnet with reason: Rebooting for T303174 [09:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2123.codfw.wmnet with reason: Rebooting for T303174 [09:30:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:08] (03CR) 10Klausman: [C: 03+1] kserve-inference: Allow prometheus to scrape istio sidecar's port [deployment-charts] - 10https://gerrit.wikimedia.org/r/778247 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:33:38] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3053.esams.wmnet with OS buster [09:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:48] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3053.esams.wmnet with OS buster [09:34:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T305300)', diff saved to https://phabricator.wikimedia.org/P24224 and previous config saved to /var/cache/conftool/dbconfig/20220407-093412-ladsgroup.json [09:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:15] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [09:34:22] !log depool cp6006 for reimage - T290005 [09:34:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Rebooting primary T303174 [09:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:25] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Rebooting primary T303174 [09:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:54] (03PS2) 10MMandere: 
site: Reimage cp6006 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777849 (https://phabricator.wikimedia.org/T290005) [09:35:28] (03PS1) 10Btullis: Correct the GMS port number that is in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/778249 (https://phabricator.wikimedia.org/T301454) [09:35:50] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2098.codfw.wmnet with OS bullseye [09:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:05] (03PS3) 10Mvolz: citoid: switch to native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [09:37:15] (03CR) 10jerkins-bot: [V: 04-1] citoid: switch to native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [09:37:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Rebooting primary T303174 [09:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Rebooting primary T303174 [09:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:17] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6006 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777849 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:39:32] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6006.drmrs.wmnet with OS buster [09:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:41] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster [09:40:03] (03PS4) 10Mvolz: citoid: switch to native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [09:40:07] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2129.codfw.wmnet with reason: Rebooting for T303174 [09:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:09] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2129.codfw.wmnet with reason: Rebooting for T303174 [09:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:24] (03CR) 10Elukey: [C: 03+2] kserve-inference: Allow prometheus to scrape istio sidecar's port [deployment-charts] - 10https://gerrit.wikimedia.org/r/778247 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:40:30] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24225 and previous config saved to /var/cache/conftool/dbconfig/20220407-094120-root.json [09:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24226 and previous config saved to /var/cache/conftool/dbconfig/20220407-094310-root.json [09:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:13] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1102.eqiad.wmnet with OS bullseye 
[09:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:00] (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:56] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:58] (03CR) 10Btullis: [C: 03+2] Correct the GMS port number that is in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/778249 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [09:49:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P24227 and previous config saved to /var/cache/conftool/dbconfig/20220407-094917-ladsgroup.json [09:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:49] (03Merged) 10jenkins-bot: Correct the GMS port number that is in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/778249 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [09:50:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[09:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:22] (03CR) 10Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [09:51:36] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1102.eqiad.wmnet with reason: host reimage [09:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:06] (03PS4) 10Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) [09:52:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24228 and previous config saved to /var/cache/conftool/dbconfig/20220407-095224-ladsgroup.json [09:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:52:54] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:29] (03PS5) 10Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - 10https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) [09:53:30] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:24] !log 
elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:58] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1102.eqiad.wmnet with reason: host reimage [09:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:12] (03PS1) 10Kevin Bazira: ml-services: add plwiki, ptwiki & rowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) [09:55:15] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1007.eqiad.wmnet [09:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24229 and previous config saved to /var/cache/conftool/dbconfig/20220407-095624-root.json [09:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:42] ACKNOWLEDGEMENT - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than 8 days ago: Most recent backup 2022-03-29 00:00:01 Jcrespo backup taking failed again - The acknowledgement expires at: 2022-04-08 09:56:13. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:56:56] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage [09:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24230 and previous config saved to /var/cache/conftool/dbconfig/20220407-095814-root.json [09:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:33] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2099.codfw.wmnet with OS bullseye [09:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:03] 10SRE-swift-storage: Refactor swift puppet code, particularly where swift_ring_manager config is stored - https://phabricator.wikimedia.org/T305617 (10MatthewVernon) [10:00:03] (03CR) 10Mvolz: [C: 03+2] Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [10:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1000). 
[10:00:21] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage [10:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:34] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3053.esams.wmnet with reason: host reimage [10:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:58] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3053.esams.wmnet with reason: host reimage [10:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P24231 and previous config saved to /var/cache/conftool/dbconfig/20220407-100423-ladsgroup.json [10:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:25] (03Merged) 10jenkins-bot: Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [10:04:51] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [10:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:54] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:21] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [10:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:08] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:43] !log mvolz@deploy1002 helmfile [codfw] START 
helmfile.d/services/zotero: apply [10:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24232 and previous config saved to /var/cache/conftool/dbconfig/20220407-100729-ladsgroup.json [10:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:38] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [10:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:12] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [10:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:21] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1007.eqiad.wmnet [10:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:36] (03PS1) 10Elukey: Increase namespace constraints for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/778254 [10:08:48] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1102.eqiad.wmnet with OS bullseye [10:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:51] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [10:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:17] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2099.codfw.wmnet with reason: host reimage [10:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:19] (03CR) 10Klausman: [C: 03+1] Increase namespace constraints for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/778254 (owner: 10Elukey) [10:12:53] !log jynus@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db2099.codfw.wmnet with reason: host reimage [10:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24233 and previous config saved to /var/cache/conftool/dbconfig/20220407-101318-root.json [10:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:36] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:15:56] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:16:06] (03CR) 10Mvolz: [C: 03+2] "Based on I78018d4e230 ; hopefully this works!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [10:16:30] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1116.eqiad.wmnet with OS bullseye [10:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:19] (03CR) 10Elukey: [C: 03+2] Increase namespace constraints for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/778254 (owner: 10Elukey) [10:19:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T305300)', diff saved to https://phabricator.wikimedia.org/P24234 and previous config saved to /var/cache/conftool/dbconfig/20220407-101928-ladsgroup.json [10:19:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:19:32] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [10:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24235 and previous config saved to /var/cache/conftool/dbconfig/20220407-101936-ladsgroup.json [10:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:27] (03Merged) 10jenkins-bot: citoid: switch to native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [10:20:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START 
helmfile.d/admin 'sync'. [10:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24236 and previous config saved to /var/cache/conftool/dbconfig/20220407-102234-ladsgroup.json [10:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:17] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:43] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1116.eqiad.wmnet with reason: host reimage [10:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:07] (03PS1) 10Btullis: Bump datahub version to use the containers with wmf-certicates [deployment-charts] - 10https://gerrit.wikimedia.org/r/778257 (https://phabricator.wikimedia.org/T301454) [10:25:27] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:45] (03CR) 10Klausman: [C: 03+1] ml-services: add plwiki, ptwiki & rowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [10:27:10] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2099.codfw.wmnet with OS bullseye [10:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:09] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1116.eqiad.wmnet with reason: host reimage [10:28:10] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24237 and previous config saved to /var/cache/conftool/dbconfig/20220407-102821-root.json [10:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:54] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:34:10] (03PS1) 10Filippo Giunchedi: sre: add alerts for exporter-specific unavailability [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) [10:35:31] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:34] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main [10:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:46] (03CR) 10Btullis: [C: 03+2] Bump datahub version to use the containers with wmf-certicates [deployment-charts] - 10https://gerrit.wikimedia.org/r/778257 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:36:06] (03PS2) 10JMeybohm: Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777021 [10:36:10] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:36:11] (03PS1) 10Filippo Giunchedi: thanos: add recording rules for exporter-specific availability [puppet] - 10https://gerrit.wikimedia.org/r/778261 (https://phabricator.wikimedia.org/T288726) [10:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:56] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: 
apply [10:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:22] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6006.drmrs.wmnet with OS buster [10:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:32] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster com... [10:37:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24238 and previous config saved to /var/cache/conftool/dbconfig/20220407-103739-ladsgroup.json [10:37:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:37:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [10:37:47] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:49] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kubemaster2002.codfw.wmnet with reason: reimage [10:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:55] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubemaster2002.codfw.wmnet with reason: reimage [10:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:11] (03CR) 10JMeybohm: [C: 03+2] Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777021 (owner: 10JMeybohm) [10:38:45] (JobUnavailable) resolved: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:18] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) [10:39:43] (03Merged) 10jenkins-bot: Bump datahub version to use the containers with wmf-certicates [deployment-charts] - 10https://gerrit.wikimedia.org/r/778257 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:40:26] !log pool cp6006 with HAProxy as TLS termination layer - T290005 [10:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:40:32] (03PS1) 10Giuseppe Lavagetto: requestctl: allow safer changes to the production VCL [software/conftool] - 10https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606) [10:41:44] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1116.eqiad.wmnet with OS bullseye [10:41:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:29] (03PS5) 10Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) [10:43:35] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:02] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2100.codfw.wmnet with OS bullseye [10:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:04] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:45:32] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:49:06] me [10:49:48] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:11] where would I go looking if ganeti (codfw) does not return my calls? 
(gnt-instance modify just hangs) [10:51:09] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add $schema field to w3creportingapi tests [puppet] - 10https://gerrit.wikimedia.org/r/776025 (owner: 10Cwhite) [10:51:10] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1139.eqiad.wmnet with OS bullseye [10:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:24] "waiting for locks" looks promising [10:51:49] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: replace all instances of @metadata.partition [puppet] - 10https://gerrit.wikimedia.org/r/777874 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [10:51:58] akosiaris: or anyone who knows metrics/grafana, could anyone help me with fixing metrics on codfw? [10:52:11] I want to fix metrics before I deploy to eqiad [10:52:43] The traffic metrics aren't working: https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid&from=now-15m&to=now&forceLogin&editPanel=10&refresh=5m [10:53:25] This is probably because the name changed, but when I fix the name it still doesn't seem to work. I know the metrics are making it to prometheus because I can see them in the prometheus browser! [10:53:37] https://thanos.wikimedia.org/graph?g0.expr=citoid_router_request_duration_seconds_count&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [10:54:19] if the metrics are on prometheus, then the only thing is to check them on grafana? 
[10:54:45] yeah, I just don't know how to fix grafana - obviously the query is wrong [10:54:55] but everything i try it's just "no data" [10:54:57] :) [10:55:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3053.esams.wmnet with OS buster [10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:07] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2100.codfw.wmnet with reason: host reimage [10:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:15] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3053.esams.wmnet with OS buster com... [10:55:40] wait actually I think I figured it out [10:55:44] current metric says: [10:55:46] sum(rate(service_runner_request_duration_seconds_count{service="$service"}[5m])) [10:55:52] well one of them [10:55:57] yeah [10:56:17] if it is only a variable change, it should be just correcting that [10:56:20] yeah I changed it to citoid router and that worked phew... 
[10:57:02] now it says "AnnotationQueryRunner failed" t[a] is not iterable [10:58:32] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2100.codfw.wmnet with reason: host reimage [10:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:34] !log pool cp3053 with HAProxy as TLS termination layer - T290005 [10:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:36] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:59:44] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1139.eqiad.wmnet with reason: host reimage [10:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:19] 10SRE, 10Traffic, 10WMF-Legal, 10Performance-Team (Radar), 10Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (10Nicholas_Perry) Hi all, we received some info from Google which may help inform this... 
[11:01:28] just fyi I'm going to run over my window, as no one is after me in the schedule [11:03:21] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1139.eqiad.wmnet with reason: host reimage [11:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:50] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2100.codfw.wmnet with OS bullseye [11:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:12] PROBLEM - Device not healthy -SMART- on aqs1007 is CRITICAL: cluster=aqs device={sdh,sdm} instance=aqs1007 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aqs1007&var-datasource=eqiad+prometheus/ops [11:15:25] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:03] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:45] (03PS5) 10Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) [11:17:19] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1139.eqiad.wmnet with OS bullseye [11:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:52] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2101.codfw.wmnet with OS bullseye [11:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:43] ok, I'm done deploying. 
[11:19:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24239 and previous config saved to /var/cache/conftool/dbconfig/20220407-111950-ladsgroup.json [11:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:54] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [11:22:10] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10Mvolz) [11:23:19] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1140.eqiad.wmnet with OS bullseye [11:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:30] !log depool cp3051 for reimage - T290005 [11:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:33] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:25:35] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10Mvolz) This is now deployed for citoid. I have updated grafana for the most part, however there are a few (minor) metrics this bro... [11:28:27] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2101.codfw.wmnet with reason: host reimage [11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:04] jouncebot: now [11:30:04] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [11:30:07] Cool. 
[11:30:16] (03PS2) 10MMandere: site: Reimage cp3051 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777850 (https://phabricator.wikimedia.org/T290005) [11:31:52] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1140.eqiad.wmnet with reason: host reimage [11:31:53] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2101.codfw.wmnet with reason: host reimage [11:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:12] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3051 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777850 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:32:25] !log jforrester@deploy1002 Started deploy [integration/docroot@d88e2fa]: d88e2fa19fd6 [WikiLambda] Fix link typo and re-group/re-word other links [11:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:34] !log jforrester@deploy1002 Finished deploy [integration/docroot@d88e2fa]: d88e2fa19fd6 [WikiLambda] Fix link typo and re-group/re-word other links (duration: 00m 09s) [11:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:20] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3051.esams.wmnet with OS buster [11:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3051.esams.wmnet with OS buster [11:34:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24240 
and previous config saved to /var/cache/conftool/dbconfig/20220407-113455-ladsgroup.json [11:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:17] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1140.eqiad.wmnet with reason: host reimage [11:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:25] !log depool cp6013 for reimage - T290005 [11:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:28] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:39:39] (03PS2) 10MMandere: site: Reimage cp6013 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777851 (https://phabricator.wikimedia.org/T290005) [11:41:09] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:44:19] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6013 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777851 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:45:40] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS buster [11:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:49] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster [11:46:02] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778277 [11:46:04] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778278 [11:46:34] !log 
jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2101.codfw.wmnet with OS bullseye [11:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:25] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:49:13] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1140.eqiad.wmnet with OS bullseye [11:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24241 and previous config saved to /var/cache/conftool/dbconfig/20220407-115002-ladsgroup.json [11:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:25] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:55:39] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:41] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:03:00] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3051.esams.wmnet with reason: host reimage [12:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:35] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage [12:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24242 and previous config saved to /var/cache/conftool/dbconfig/20220407-120507-ladsgroup.json [12:05:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:05:10] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [12:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24243 and previous config saved to /var/cache/conftool/dbconfig/20220407-120514-ladsgroup.json [12:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:24] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3051.esams.wmnet with reason: host reimage [12:06:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:47] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage [12:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:12:43] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:13:21] 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10Volans) >>! In T305589#7837526, @fgiunchedi wrote: > AIUI the decom cookbook doesn't support VMs yet (?) That's not actually correct, the decommission cookbook does support VMs since the start. What... [12:13:49] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:13] (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: install python3-git [puppet] - 10https://gerrit.wikimedia.org/r/778280 [12:19:22] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174 [12:19:23] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174 [12:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:52] (03CR) 10Volans: [C: 03+1] "I didn't test it, but changes looks sane to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/777899 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [12:23:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174 [12:23:29] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174 [12:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:12] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10Joe) >>! In T303857#7818920, @dancy wrote: > I have confirmed that being in the `deployment` group will all... [12:25:21] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:16] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3051.esams.wmnet with OS buster [12:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:25] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3051.esams.wmnet with OS buster com... 
[12:32:22] !log pool cp3051 with HAProxy as TLS termination layer - T290005 [12:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:25] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:34:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2078,2132].codfw.wmnet with reason: Rebooting primary T303174 [12:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2078,2132].codfw.wmnet with reason: Rebooting primary T303174 [12:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2132.codfw.wmnet with reason: Rebooting for T303174 [12:34:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2132.codfw.wmnet with reason: Rebooting for T303174 [12:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:07] 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10fgiunchedi) >>! In T305589#7837933, @Volans wrote: >>>! In T305589#7837526, @fgiunchedi wrote: >> AIUI the decom cookbook doesn't support VMs yet (?) > > That's not actually correct, the decommission... [12:37:40] 10SRE, 10conftool, 10Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (10CDanis) I suggest something simpler: Use a common prefix in the header name, with the name of the ipblock group as the suffix. X-SRE-Ipblock-Cloud X-SRE-I... 
[12:38:40] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) >>! In T205870#7837817, @Mvolz wrote: > This is now deployed for citoid. This is great to see! Thanks for your help @M... [12:40:09] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:40:11] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2078,2133].codfw.wmnet with reason: Rebooting primary T303174 [12:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:13] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2078,2133].codfw.wmnet with reason: Rebooting primary T303174 [12:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2133.codfw.wmnet with reason: Rebooting for T303174 [12:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2133.codfw.wmnet with reason: Rebooting for T303174 [12:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:09] PROBLEM - Host logstash2024 is DOWN: PING CRITICAL - Packet loss = 100% [12:44:14] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1145.eqiad.wmnet with OS bullseye [12:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:29] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:45:06] (03CR) 10Elukey: [C: 03+1] "LGTM, I checked thanos and all models are correctly listed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [12:45:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2134.codfw.wmnet with reason: Rebooting for T303174 [12:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2134.codfw.wmnet with reason: Rebooting for T303174 [12:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:53] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:43] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:46:49] 10SRE, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (10aborrero) [12:47:43] (03PS1) 10Giuseppe Lavagetto: admin: add mwbuilder to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/778284 (https://phabricator.wikimedia.org/T303857) [12:47:45] (03PS1) 10Giuseppe Lavagetto: mwdebug-deploy: run as mwbuilder, use release repository [puppet] - 10https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648) [12:48:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] external_clouds_vendors: install python3-git [puppet] - 10https://gerrit.wikimedia.org/r/778280 (owner: 10Giuseppe Lavagetto) [12:49:52] !log sudo gnt-cluster modify -H kvm:migration_downtime=3000 for ganeti01.svc.codfw.wmnet and ganeti01.svc.eqiad.wmnet to combat some logstash VM migration issues. 
[12:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2135.codfw.wmnet with reason: Rebooting for T303174 [12:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2135.codfw.wmnet with reason: Rebooting for T303174 [12:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS buster [12:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:15] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster com... 
[12:54:50] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34730/console" [puppet] - 10https://gerrit.wikimedia.org/r/778284 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [12:55:14] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2139.codfw.wmnet with OS bullseye [12:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:39] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1145.eqiad.wmnet with reason: host reimage [12:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:25] (03CR) 10CDanis: [C: 03+1] requestctl: allow safer changes to the production VCL [software/conftool] - 10https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [12:57:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Rebooting primary T303174 [12:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Rebooting primary T303174 [12:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:54] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] admin: add mwbuilder to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/778284 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [12:58:02] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2104.codfw.wmnet with reason: Rebooting for T303174 [12:58:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:30:00 on db2104.codfw.wmnet with reason: Rebooting for T303174 [12:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:56] !log depool cp6005 for reimage - T290005 [12:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:58] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:58:59] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1145.eqiad.wmnet with reason: host reimage [12:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:24] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:59:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1300). [13:00:05] nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:02:00] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:30] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:04:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2028.codfw.wmnet with reason: Rebooting for T303174 [13:05:00] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2028.codfw.wmnet with reason: Rebooting for T303174 [13:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24244 and previous config saved to /var/cache/conftool/dbconfig/20220407-130529-ladsgroup.json [13:05:31] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 2 others: Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10Joe) 05Open→03Resolved `lang=bash oblivian@deploy1002:~ $ sudo -u mwbuilder groups mwbuilder docker deployment ` [13:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:33] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [13:05:56] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:07:34] (03PS2) 10JMeybohm: Move kubemaster2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) [13:08:09] (03PS2) 10MMandere: site: Reimage cp6005 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777852 (https://phabricator.wikimedia.org/T290005) [13:08:33] (03PS2) 10Giuseppe Lavagetto: mwdebug-deploy: run as mwbuilder, use release repository [puppet] - 10https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648) [13:08:34] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubemaster2001.codfw.wmnet with reason: reimage [13:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubemaster2001.codfw.wmnet with reason: reimage [13:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:08] (03CR) 10JMeybohm: [C: 03+2] Move kubemaster2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [13:09:36] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6005 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777852 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [13:09:54] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0628 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:10:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2030.codfw.wmnet with reason: Rebooting for T303174 [13:10:07] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2139.codfw.wmnet with reason: host reimage [13:10:07] 
!log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2030.codfw.wmnet with reason: Rebooting for T303174 [13:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34731/console" [puppet] - 10https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto) [13:11:43] <_joe_> jouncebot: next [13:11:43] In 2 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600) [13:11:52] <_joe_> ok I got plenty time [13:11:59] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mwdebug-deploy: run as mwbuilder, use release repository [puppet] - 10https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto) [13:12:10] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6005.drmrs.wmnet with OS buster [13:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:18] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster [13:13:30] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2139.codfw.wmnet with reason: host reimage [13:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:33] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The 
following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:46] !log depool cp6012 for reimage - T290005 [13:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:49] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:14:05] <_joe_> uh jayme are you doing something with the codfw k8s cluster? [13:14:23] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1145.eqiad.wmnet with OS bullseye [13:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:31] yep,thats me again, sorry [13:14:39] _joe_: just reimaging masters [13:14:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2032.codfw.wmnet with reason: Rebooting for T303174 [13:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2032.codfw.wmnet with reason: Rebooting for T303174 [13:14:54] one-by-one obviously [13:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:06] <_joe_> jayme: why not all at the same time!?! 
[13:16:58] (PS2) MMandere: site: Reimage cp6012 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777853 (https://phabricator.wikimedia.org/T290005) [13:17:09] I'm too pansy \o/ ;) [13:17:39] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:47] (CR) MMandere: [C: +2] site: Reimage cp6012 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777853 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [13:19:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24245 and previous config saved to /var/cache/conftool/dbconfig/20220407-132034-ladsgroup.json [13:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:47] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS buster [13:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:00] (CR) Gehel: [C: +2] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/777875 (owner: Ryan Kemper) [13:21:01] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster [13:23:43] SRE, conftool: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 (CDanis) [13:23:47] SRE, conftool, Patch-For-Review: Make the VCL that goes to production from requestctl
safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (10CDanis) [13:24:46] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1150.eqiad.wmnet with OS bullseye [13:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:51] (03Merged) 10jenkins-bot: elastic: relforge needs --without-lvs [cookbooks] - 10https://gerrit.wikimedia.org/r/777875 (owner: 10Ryan Kemper) [13:29:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2026.codfw.wmnet with reason: Rebooting for T303174 [13:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2026.codfw.wmnet with reason: Rebooting for T303174 [13:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:55] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2139.codfw.wmnet with OS bullseye [13:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:14] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage [13:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:40] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage [13:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:11] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2031.codfw.wmnet with reason: Rebooting for T303174 [13:34:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2031.codfw.wmnet with reason: Rebooting for T303174 [13:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24246 and previous config saved to /var/cache/conftool/dbconfig/20220407-133539-ladsgroup.json [13:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:23] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1150.eqiad.wmnet with reason: host reimage [13:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:21] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [13:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:45] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: host reimage [13:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2033.codfw.wmnet with reason: Rebooting for T303174 [13:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2033.codfw.wmnet with reason: Rebooting for T303174 [13:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:45] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [13:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:19] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:44:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: allow safer changes to the 
production VCL [software/conftool] - 10https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [13:45:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2027.codfw.wmnet with reason: Rebooting for T303174 [13:45:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2027.codfw.wmnet with reason: Rebooting for T303174 [13:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:20] mvolz: I gather you solved the grafana issues you had? Or is there anything I can help with? [13:45:25] checking that haproxy alert [13:46:02] (03Merged) 10jenkins-bot: requestctl: allow safer changes to the production VCL [software/conftool] - 10https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [13:47:15] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:47:33] (03PS1) 10Giuseppe Lavagetto: Debian changelog update [software/conftool] - 10https://gerrit.wikimedia.org/r/778293 [13:48:49] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:49:05] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:49:31] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2141.codfw.wmnet with OS bullseye [13:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:53] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 
(T305300)', diff saved to https://phabricator.wikimedia.org/P24247 and previous config saved to /var/cache/conftool/dbconfig/20220407-135044-ladsgroup.json [13:50:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:50:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:49] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [13:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24248 and previous config saved to /var/cache/conftool/dbconfig/20220407-135052-ladsgroup.json [13:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:25] 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10ssingh) Thanks for the feedback @fgiunchedi and @Volans! >>! In T305589#7837933, @Volans wrote: >>>! In T305589#7837526, @fgiunchedi wrote: >> AIUI the decom cookbook doesn't support VMs yet (?) > >... 
[13:52:04] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:53:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2029.codfw.wmnet with reason: Rebooting for T303174 [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2029.codfw.wmnet with reason: Rebooting for T303174 [13:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:15] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1150.eqiad.wmnet with OS bullseye [13:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10drochford) [13:59:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10RhinosF1) @drochford: approving party will be whoever your manager is [14:02:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10RhinosF1) analytics-privatedata-users will need @Ottomata or @odimitrijevic's approval too. 
[14:02:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2034.codfw.wmnet with reason: Rebooting for T303174 [14:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:58] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2034.codfw.wmnet with reason: Rebooting for T303174 [14:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:34] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2141.codfw.wmnet with reason: host reimage [14:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:00] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:09] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938 [14:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:30] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: host reimage [14:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:42] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6005.drmrs.wmnet with OS buster [14:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:52] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster com... 
[14:08:21] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:43] !log pool cp6005 with HAProxy as TLS termination layer - T290005 [14:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:47] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:10:33] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2025.codfw.wmnet with reason: Rebooting for T303174 [14:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:34] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2025.codfw.wmnet with reason: Rebooting for T303174 [14:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:52] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS buster [14:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:01] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster com... 
[14:13:02] !log pool cp6012 with HAProxy as TLS termination layer - T290005 [14:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:43] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:07] PROBLEM - Check systemd state on elastic2042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:36] (03CR) 10CDanis: [C: 03+1] Debian changelog update [software/conftool] - 10https://gerrit.wikimedia.org/r/778293 (owner: 10Giuseppe Lavagetto) [14:18:30] (03PS2) 10MMandere: site: Reimage cp6004 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777854 (https://phabricator.wikimedia.org/T290005) [14:19:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2115.codfw.wmnet with reason: Rebooting for T303174 [14:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:19] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2115.codfw.wmnet with reason: Rebooting for T303174 [14:19:21] !log depool cp6004 for reimage - T290005 [14:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:20:49] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:17] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24249 and previous config saved to /var/cache/conftool/dbconfig/20220407-142117-ladsgroup.json [14:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:20] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [14:22:17] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6004 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777854 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:22:33] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2141.codfw.wmnet with OS bullseye [14:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:15] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6004.drmrs.wmnet with OS buster [14:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:25] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster [14:25:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2131.codfw.wmnet with reason: Rebooting for T303174 [14:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2131.codfw.wmnet with reason: Rebooting for T303174 [14:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:51] (03PS1) 10Krinkle: mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 
(https://phabricator.wikimedia.org/T302465) [14:28:02] RECOVERY - Check systemd state on elastic2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:14] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2143.codfw.wmnet with reason: Rebooting for T303174 [14:32:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2143.codfw.wmnet with reason: Rebooting for T303174 [14:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:56] 10SRE, 10Phabricator, 10SRE Observability (FY2021/2022-Q4), 10User-Ladsgroup: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10lmata) Thanks! [14:36:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P24250 and previous config saved to /var/cache/conftool/dbconfig/20220407-143622-ladsgroup.json [14:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:36] !log kormat@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P24251 and previous config saved to /var/cache/conftool/dbconfig/20220407-143635-kormat.json [14:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [14:38:58] (03CR) 10Kevin Bazira: ml-services: add plwiki, ptwiki & rowiki editquality isvcs (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [14:41:13] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage [14:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:38] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:02] PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:26] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:44:10] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage [14:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:32] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1171.eqiad.wmnet with OS bullseye [14:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:06] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:46:14] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:49:10] (03PS2) 10Volans: service: add new module to expose service::catalog [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 [14:50:15] 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10CDanis) [14:51:07] (03CR) 10Volans: "Replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [14:51:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P24252 and previous config saved to /var/cache/conftool/dbconfig/20220407-145127-ladsgroup.json [14:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:40] !log kormat@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P24253 and previous config saved to /var/cache/conftool/dbconfig/20220407-145139-kormat.json [14:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:56] !log kormat@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P24254 and previous config saved to /var/cache/conftool/dbconfig/20220407-145455-kormat.json [14:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2144.codfw.wmnet with reason: Rebooting for T303174 [14:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2144.codfw.wmnet with reason: Rebooting for T303174 [14:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:16] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
db1171.eqiad.wmnet with reason: host reimage [14:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:23] RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:35] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: host reimage [14:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:21] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778299 [15:06:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24255 and previous config saved to /var/cache/conftool/dbconfig/20220407-150632-ladsgroup.json [15:06:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [15:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [15:06:37] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [15:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24256 and previous config saved to /var/cache/conftool/dbconfig/20220407-150640-ladsgroup.json [15:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:56] (03PS1) 10MMandere: site: Reimage cp6011 as cache::text_haproxy [puppet] - 
https://gerrit.wikimedia.org/r/778300 (https://phabricator.wikimedia.org/T290005) [15:07:58] (PS1) MMandere: site: Reimage cp6003 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005) [15:08:00] (PS1) MMandere: site: Reimage cp6010 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) [15:08:02] (PS1) MMandere: site: Reimage cp6002 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005) [15:08:04] (PS1) MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) [15:08:06] (PS1) MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005) [15:11:25] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:12:35] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6004.drmrs.wmnet with OS buster [15:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:43] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster com...
[15:13:49] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:14:56] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1171.eqiad.wmnet with OS bullseye [15:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:03] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:18:43] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:18:55] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:23] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:09] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:43] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:32] !log pool cp6004 with HAProxy as TLS termination layer - T290005 [15:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:23:11] 10SRE, 10ops-codfw, 10DBA: 
codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:23:53] (03CR) 10Herron: [C: 03+1] sre: add alerts for exporter-specific unavailability [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:26:07] (03PS1) 10Giuseppe Lavagetto: mediaiki: add new member of the deployment group everywhere [puppet] - 10https://gerrit.wikimedia.org/r/778307 [15:28:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, even DRYer this way" [puppet] - 10https://gerrit.wikimedia.org/r/778307 (owner: 10Giuseppe Lavagetto) [15:29:23] (03PS2) 10Giuseppe Lavagetto: mediaiki: add new member of the deployment group everywhere [puppet] - 10https://gerrit.wikimedia.org/r/778307 [15:30:20] <_joe_> godog: I left behind a damn require that wasn't really needed, btw [15:30:40] sigh [15:30:43] <_joe_> but yeah if compilation is happ, this, I'll proceed [15:30:55] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34735/console" [puppet] - 10https://gerrit.wikimedia.org/r/778307 (owner: 10Giuseppe Lavagetto) [15:30:59] <_joe_> yeah pcc is now happy [15:31:02] <_joe_> going to merge it [15:31:19] SGTM [15:31:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediaiki: add new member of the deployment group everywhere [puppet] - 10https://gerrit.wikimedia.org/r/778307 (owner: 10Giuseppe Lavagetto) [15:32:31] <_joe_> sorry again, It just didn't pass through my mind that same group was everywhere basically [15:33:47] <_joe_> and yes, this fixes puppet [15:34:09] neato [15:34:20] <_joe_> jouncebot: next [15:34:20] In 0 hour(s) and 25 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600) [15:39:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T305300)', diff saved to 
https://phabricator.wikimedia.org/P24257 and previous config saved to /var/cache/conftool/dbconfig/20220407-153905-ladsgroup.json [15:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:11] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [15:39:17] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6011 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778300 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:39:29] (03CR) 10Herron: [C: 03+1] thanos: add recording rules for exporter-specific availability [puppet] - 10https://gerrit.wikimedia.org/r/778261 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:39:41] (03PS1) 10Btullis: Enable SSL/TLS for accessing the datahub-gms service [deployment-charts] - 10https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454) [15:39:47] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:41:11] (03CR) 10Herron: [C: 03+1] logstash: replace all instances of @metadata.partition [puppet] - 10https://gerrit.wikimedia.org/r/777874 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [15:43:01] (03CR) 10Herron: [C: 03+1] logstash: bugfix remove rsyslog-set log.level from blackbox_exporter events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777877 (owner: 10Cwhite) [15:43:03] (03CR) 10Vgutierrez: [C: 04-1] site: Reimage cp6010 as cache::text_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:43:14] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:23] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage 
cp6002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:44:00] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:38] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:10] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778310 [15:48:56] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:28] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:25] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add recording rules for exporter-specific availability [puppet] - 10https://gerrit.wikimedia.org/r/778261 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:51:10] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:16] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778310 (owner: 10Ahmon Dancy) [15:52:52] PROBLEM - Check systemd state on elastic2049 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:12] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:21] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778310 (owner: 10Ahmon Dancy) [15:54:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P24258 and previous config saved to /var/cache/conftool/dbconfig/20220407-155410-ladsgroup.json [15:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:08] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:56:02] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:04] (03CR) 10Herron: [C: 03+1] logstash: add $schema field to w3creportingapi tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776025 (owner: 10Cwhite) [15:58:10] (03PS2) 10MMandere: site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) [15:59:14] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:59:22] RECOVERY - Check systemd state on elastic2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:51] (03CR) 10Vgutierrez: [C: 04-1] site: Reimage cp6009 as cache::text_haproxy (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [16:00:04] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:02:24] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:08] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I very much like the idea to make static mode the default on our Wikimedia cluster, and list wikis that are allowed to use dynamic mode as" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: 10Krinkle) [16:03:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10drochford) [16:03:36] (03PS2) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) [16:04:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10drochford) Thanks @RhinosF1 - My manager is Jan Eissfeldt, but Jan does hot use Phabricator. I've updated the task above. [16:05:00] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [16:06:15] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [16:06:18] (03CR) 10Vivian Rook: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34738/" [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [16:06:41] (03PS1) 10Ahmon Dancy: Train dev fixups (again) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778314 [16:06:43] (03CR) 10Ahmon Dancy: [C: 03+2] Train dev fixups (again) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778314 (owner: 10Ahmon Dancy) [16:07:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10Zabe) >>! In T305634#7838700, @drochford wrote: > Thanks @RhinosF1 - My manager is Jan Eissfeldt, but Jan does hot use Phabricator. I've updated the task above. Their pha... [16:08:00] (03Merged) 10jenkins-bot: Train dev fixups (again) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778314 (owner: 10Ahmon Dancy) [16:08:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [16:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... 
[16:09:12] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001598 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:09:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P24259 and previous config saved to /var/cache/conftool/dbconfig/20220407-160916-ladsgroup.json [16:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:46] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:30] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:13:27] jouncebot now [16:13:27] For the next 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600) [16:14:23] dancy: nothing in that window, it's all yours if you need it [16:14:31] thx! 
[16:15:32] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:39] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [16:16:12] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [16:17:30] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (10Krinkle) [16:17:40] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (10Krinkle) @tstarling wrote a good summary of the issue at T285823: > […] Probab... [16:17:51] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1005.eqiad.wmnet with OS bullseye [16:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... 
[16:18:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [16:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [16:20:04] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [16:21:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10JanWMF) approved, I asked David to when it turned out neither JohnB nor I had access and I need a presentation based on it :) [16:22:53] (03Abandoned) 10Jgiannelos: Disable triggering tile pregeneration on OSM syncs [puppet] - 10https://gerrit.wikimedia.org/r/753111 (https://phabricator.wikimedia.org/T298246) (owner: 10Jgiannelos) [16:23:49] 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (10Krinkle) My hunch is that some code is (in)directly calling `logger->debug()` f... 
[16:24:12] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:24:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24260 and previous config saved to /var/cache/conftool/dbconfig/20220407-162421-ladsgroup.json [16:24:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:24:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:31] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [16:24:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24261 and previous config saved to /var/cache/conftool/dbconfig/20220407-162430-ladsgroup.json [16:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24262 and previous config saved to /var/cache/conftool/dbconfig/20220407-162537-ladsgroup.json [16:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:45] (03CR) 10David Caro: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [16:26:10] (03CR) 10Krinkle: List 
Kartographer static map exemptions and document+flip default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: 10Krinkle) [16:26:24] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:13] (03PS2) 10Krinkle: mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465) [16:30:23] (03CR) 10Vivian Rook: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [16:31:55] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10Mvolz) >>! In T205870#7838013, @fgiunchedi wrote: >>>! In T205870#7837817, @Mvolz wrote: >> This is now deployed for citoid. > > T... [16:33:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10RhinosF1) @JanWMF: Is there a deadline this needs to be done by then? 
[16:33:44] (03CR) 10Bking: [C: 03+2] wdqs: tune jvmquake settings (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/776857 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [16:34:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:40] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:35:26] PROBLEM - Host elastic2033 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:52] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale-full only: 1 (doc1001), Fresh: 107 jobs Jcrespo full backup of doc1001 failed, retrying - The acknowledgement expires at: 2022-04-08 12:35:20. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:36:32] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:45] (03CR) 10Andrew Bogott: [C: 04-1] "You are both right -- the manifest isn't version-specific but it has a version guard around it limiting things to Victoria." 
[puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [16:37:26] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:07] (03PS2) 10Andrew Bogott: add chunkeddriver.py.patch to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [16:40:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P24263 and previous config saved to /var/cache/conftool/dbconfig/20220407-164042-ladsgroup.json [16:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage [16:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:38] (03CR) 10Cwhite: [C: 03+2] logstash: add $schema field to w3creportingapi tests [puppet] - 10https://gerrit.wikimedia.org/r/776025 (owner: 10Cwhite) [16:43:34] (03CR) 10Cwhite: [C: 03+2] logstash: replace all instances of @metadata.partition [puppet] - 10https://gerrit.wikimedia.org/r/777874 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [16:45:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host 
reimage [16:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:13] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [16:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [16:49:02] btullis: heya - would be nearby? [16:49:07] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938 [16:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:41] joal: Yes, right here. [16:50:13] btullis: could you help me figure out the cache setup currently in eqiad? 
[16:50:21] btullis: I never know where to look :S [16:50:28] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938 [16:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:33] Sure thing. batcave? [16:50:35] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:39] btullis: pooled nodes, and cache-types (text of upload) [16:50:43] please :) [16:53:37] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P24264 and previous config saved to /var/cache/conftool/dbconfig/20220407-165547-ladsgroup.json [16:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye [16:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet 
with OS bullseye [16:56:25] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:53] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:01:19] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.303 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:06:30] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [17:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye... [17:06:39] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye [17:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye... 
[17:08:50] !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-codfw cluster: Reboot kafka nodes [17:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:09] !log herron@cumin1001 END (FAIL) - Cookbook sre.kafka.reboot-workers (exit_code=99) for Kafka logging-codfw cluster: Reboot kafka nodes [17:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [17:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [17:10:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24265 and previous config saved to /var/cache/conftool/dbconfig/20220407-171052-ladsgroup.json [17:10:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:55] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [17:10:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:10:58] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T305300)', diff saved to https://phabricator.wikimedia.org/P24266 and previous config saved to /var/cache/conftool/dbconfig/20220407-171105-ladsgroup.json [17:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:38] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:06] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T305300)', diff saved to https://phabricator.wikimedia.org/P24267 and previous config saved to /var/cache/conftool/dbconfig/20220407-171211-ladsgroup.json [17:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1005.eqiad.wmnet with OS bullseye [17:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... 
[17:14:07] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:32] ops-codfw, Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (bking) [17:14:55] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:35] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye [17:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:17] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye [17:16:32] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye [17:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:59] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye [17:17:14] !log
mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:06] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:26:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye [17:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:45] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye [17:27:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P24268 and previous config saved to /var/cache/conftool/dbconfig/20220407-172719-ladsgroup.json [17:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:02] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:29:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:29:48] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:31:13] !log [WDQS] T293862 Need to do a rolling restart of wdqs public; going to just roll a full deploy since it's equal work [17:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:16] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [17:31:26] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.338 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:31:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [17:31:36] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.110`. Pre-deploy tests passing on canary `wdqs1003` [17:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:45] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@0d95eca]: 0.3.110 [17:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:30] !log [WDQS Deploy] Tests passing following deploy of `0.3.110` on canary `wdqs1003`; proceeding to rest of fleet [17:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:18] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:34:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [17:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [17:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:06] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@0d95eca]: 0.3.110 (duration: 06m 21s) [17:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [17:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:42] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Cmjohnson) [17:39:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:39:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:02] SRE,
ops-eqiad, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Cmjohnson) Open→Resolved on-site work completed [17:40:46] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [17:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:50] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [17:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:53] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [17:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [17:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:37] SRE, Goal, MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (colewhite) >>! In T205870#7838787, @Mvolz wrote: > I looked into a bit ago and didn't make any progress, and I'm not going to be abl...
[17:42:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P24269 and previous config saved to /var/cache/conftool/dbconfig/20220407-174224-ladsgroup.json [17:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage [17:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:48] PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100% [17:43:56] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:11] !log T293862 Rolling restart of wdqs public is complete; new jvmquake settings have been uptaken on wdqs public hosts: `-agentpath:/usr/lib/libjvmquake.so=1000,5,0,warn=60,touch=/tmp/wdqs_blazegraph_jvmquake_warn_gc` [17:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:14] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [17:44:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:20] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:40] (CR) Cwhite: [C: +2] logstash: bugfix remove rsyslog-set log.level from blackbox_exporter events [puppet] - https://gerrit.wikimedia.org/r/777877 (owner:
Cwhite) [17:46:51] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1036.eqiad.wmnet [17:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage [17:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:16] RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:49:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:24] !log T293862 Removed touched files so that it'll be easier to see when the new jvmquake threshold is crossed: `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' "rm -fv '/tmp/wdqs_blazegraph_jvmquake_warn_gc'"` [17:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:28] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [17:50:31] (BlazegraphJvmQuakeWarnGC) resolved: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [17:51:05] SRE, SRE Observability: sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror': - https://phabricator.wikimedia.org/T305652 (herron) p:Triage→Medium [17:51:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason:
host reimage [17:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:50] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson) @fgiunchedi Do you recall how the disks are supposed to be set up and I can fix [17:51:58] (CR) Jforrester: [C: -1] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) (owner: Jforrester) [17:52:01] (PS1) Herron: sre.kafka.reboot-workers: add --skip-mirrormaker option [cookbooks] - https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) [17:52:13] (PS2) Herron: sre.kafka.reboot-workers: add --skip-mirrormaker option [cookbooks] - https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) [17:52:42] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1037.eqiad.wmnet [17:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:59] !log rebooting wtp103* servers [17:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] PROBLEM - Host wtp1036 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:55:29] RECOVERY - Host wtp1036 is UP: PING OK
- Packet loss = 0%, RTA = 2.00 ms [17:55:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:56:57] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.3562 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:56:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:57:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T305300)', diff saved to https://phabricator.wikimedia.org/P24270 and previous config saved to /var/cache/conftool/dbconfig/20220407-175730-ladsgroup.json [17:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:57:35] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [17:57:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [17:57:37] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:39] PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sda4.mount,srv-swift\x2dstorage-sdb3.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [17:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye [17:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:11] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye...
[17:58:23] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06849 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:58:25] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:59:30] logstashes were recently restarted, kafka lag should clear in a moment [17:59:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:48] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@0d95eca] (wcqs): Deploy 0.3.110 to WCQS [17:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:10] !log [WCQS Deploy] Tests look good following deploy of `0.3.110` to `wcqs1003.eqiad.wmnet`, proceeding to rest of fleet [18:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:55] herron: I'm seeing a huge spike in dropped logs. 
Looks to me like mediawiki dumped a lot of "Persisting session for unknown reason" logs from centralauth [18:00:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:01:13] PROBLEM - MD RAID on ms-be1068 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:01:14] ACKNOWLEDGEMENT - MD RAID on ms-be1068 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T305653 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:01:19] SRE, ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T305653 (ops-monitoring-bot) [18:01:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye [18:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:46] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@0d95eca] (wcqs): Deploy 0.3.110 to WCQS (duration: 01m 58s) [18:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:49] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye...
[18:02:00] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1036.eqiad.wmnet [18:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:39] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye [18:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:59] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:59] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye... [18:05:05] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:06:45] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:07:31] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1037.eqiad.wmnet [18:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:57] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1035.eqiad.wmnet [18:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:23] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1034.eqiad.wmnet [18:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:41] !log [WCQS Deploy] Restarted `wcqs-updater` across all hosts [18:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:48] !log [WCQS Deploy] Successful test query placed on commons-query.wikimedia.org, there's no relevant criticals in Icinga, and Grafana looks good. WCQS deploy complete [18:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:18] !log [WDQS Deploy] Deploy complete. 
Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [18:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:55] PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye [18:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:54] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye...
[18:13:09] PROBLEM - Host wtp1035 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:19] RECOVERY - Host wtp1035 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:13:39] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (Cmjohnson) [18:13:49] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:52] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (Cmjohnson) Open→Resolved [18:14:30] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson) [18:14:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:00] SRE, ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T305653 (Cmjohnson) Open→Invalid this is an re-image error [18:17:31] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:39] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1035.eqiad.wmnet [18:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:47] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1033.eqiad.wmnet [18:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:13] PROBLEM - Host wtp1034 is DOWN: PING
CRITICAL - Packet loss = 100% [18:21:48] RECOVERY - Host wtp1034 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [18:22:28] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1034.eqiad.wmnet [18:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:36] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1032.eqiad.wmnet [18:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:56] PROBLEM - Host wtp1033 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:32] RECOVERY - Host wtp1033 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:24:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:11] !log razzi@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster [18:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:06] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:12] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:56] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:12] !log dzahn@cumin2002 
conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1033.eqiad.wmnet [18:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:19] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1031.eqiad.wmnet [18:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:30:58] PROBLEM - Host wtp1032 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:36] !log [Elastic] Pooled `elastic1052` (likely was erroneously left depooled after https://phabricator.wikimedia.org/P19885) [18:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:30] RECOVERY - Host wtp1032 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:33:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:39] SRE, ops-codfw, Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (RKemper) [18:33:54] SRE, ops-codfw, Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (RKemper) Banned host like so: ` curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocat...
[18:34:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Cmjohnson) [18:34:53] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1032.eqiad.wmnet [18:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:02] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1030.eqiad.wmnet [18:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:37:12] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:12] PROBLEM - Host wtp1031 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:24] RECOVERY - Host wtp1031 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:38:34] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1031.eqiad.wmnet [18:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:00] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1029.eqiad.wmnet [18:39:01] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:09] (03PS1) 10Cmjohnson: Adding new elastic servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/778329 (https://phabricator.wikimedia.org/T299609) [18:42:16] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:16] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:28] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1030.eqiad.wmnet [18:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1089.mgmt.eqiad.wmnet with reboot policy FORCED [18:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:58] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1028.eqiad.wmnet [18:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1090.mgmt.eqiad.wmnet with reboot policy FORCED [18:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:08] (03PS9) 10Bking: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) [18:45:12] !log cmjohnson@cumin1001 START - Cookbook 
sre.hosts.provision for host elastic1091.mgmt.eqiad.wmnet with reboot policy FORCED [18:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:18] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:30] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:51] (03PS1) 10Volans: spicerack: simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331 [18:45:53] (03PS1) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 [18:45:53] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:45:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1092.mgmt.eqiad.wmnet with reboot policy FORCED [18:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:55] (03PS1) 10Volans: spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 [18:45:56] PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:26] (03CR) 10jerkins-bot: [V: 04-1] spicerack: 
simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans) [18:46:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1094.mgmt.eqiad.wmnet with reboot policy FORCED [18:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:06] (03CR) 10jerkins-bot: [V: 04-1] spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 (owner: 10Volans) [18:48:29] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1029.eqiad.wmnet [18:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:35] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1027.eqiad.wmnet [18:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:24] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:32] (03PS10) 10Gehel: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [18:50:34] PROBLEM - Host wtp1028 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:58] RECOVERY - Host wtp1028 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:51:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:16] (03PS11) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [18:53:50] !log dzahn@cumin2002 conftool action : set/pooled=yes; 
selector: dc=eqiad,name=wtp1028.eqiad.wmnet [18:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:57:54] RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:59:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10JanWMF) Thanks @RhinosF1; timely but no ironclad hard deadline, so we can certainly go proper process here :) [18:59:14] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1027.eqiad.wmnet [18:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:25] (03CR) 10jerkins-bot: [V: 04-1] elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [19:01:02] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:50] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1090.mgmt.eqiad.wmnet with reboot policy FORCED [19:01:51] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1089.mgmt.eqiad.wmnet with reboot policy FORCED [19:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1091.mgmt.eqiad.wmnet with reboot policy FORCED [19:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1092.mgmt.eqiad.wmnet with reboot policy FORCED [19:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1094.mgmt.eqiad.wmnet with reboot policy FORCED [19:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:11] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938 [19:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:53] (03PS12) 10Gehel: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [19:03:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1095.mgmt.eqiad.wmnet with reboot policy FORCED [19:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1096.mgmt.eqiad.wmnet with reboot policy FORCED [19:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:03:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1097.mgmt.eqiad.wmnet with reboot policy FORCED [19:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1098.mgmt.eqiad.wmnet with reboot policy FORCED [19:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1099.mgmt.eqiad.wmnet with reboot policy FORCED [19:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:06] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:26] !log razzi@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster [19:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:14] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:11:56] RECOVERY - Check systemd state on elastic2026 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:14:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:46] (03PS2) 10Cmjohnson: Adding new elastic servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/778329 (https://phabricator.wikimedia.org/T299609) [19:17:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1095.mgmt.eqiad.wmnet with reboot policy FORCED [19:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1096.mgmt.eqiad.wmnet with reboot policy FORCED [19:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1097.mgmt.eqiad.wmnet with reboot policy FORCED [19:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:27] 
(03PS1) 10Ryan Kemper: elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) [19:18:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:18:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED [19:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1101.mgmt.eqiad.wmnet with reboot policy FORCED [19:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1102.mgmt.eqiad.wmnet with reboot policy FORCED [19:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1098.mgmt.eqiad.wmnet with reboot policy FORCED [19:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1099.mgmt.eqiad.wmnet with reboot policy FORCED [19:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Cmjohnson) [19:21:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:22] (03CR) 10jerkins-bot: [V: 04-1] elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) (owner: 10Ryan Kemper) [19:22:38] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED [19:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED [19:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:56] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:07] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1005.eqiad.wmnet [19:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:30] (03PS13) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [19:28:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED [19:28:55] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:49] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1005.eqiad.wmnet [19:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:34] (03PS1) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) [19:34:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1101.mgmt.eqiad.wmnet with reboot policy FORCED [19:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:36] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1102.mgmt.eqiad.wmnet with reboot policy FORCED [19:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:38:12] (03CR) 10Btullis: [C: 03+2] Enable SSL/TLS for accessing the datahub-gms service [deployment-charts] - 10https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [19:39:22] (03CR) 10Dzahn: [C: 03+1] postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: 
10Zabe) [19:41:12] (03CR) 10Btullis: "I've added @muelenhoff as a reviewer primarily to sanity-check the jaas-ldap.conf file and general LDAP authentication configuration. Than" [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [19:42:21] (03Merged) 10jenkins-bot: Enable SSL/TLS for accessing the datahub-gms service [deployment-charts] - 10https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [19:42:51] (03PS2) 10Volans: spicerack: simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331 [19:44:06] (03PS3) 10Volans: spicerack: simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331 [19:44:12] (03PS2) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 [19:44:16] (03PS2) 10Volans: spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 [19:44:44] (03CR) 10jerkins-bot: [V: 04-1] spicerack: simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans) [19:45:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:45:31] (03CR) 10jerkins-bot: [V: 04-1] spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 (owner: 10Volans) [19:45:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1008.eqiad.wmnet [19:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:06] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [19:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:46:09] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main [19:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:33] (03PS4) 10Volans: spicerack: simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331 [19:46:35] (03PS3) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 [19:46:37] (03PS3) 10Volans: spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 [19:47:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [19:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:30] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main [19:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:18] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:50:12] (03CR) 10Volans: "PCC results seems to agree on the noop: https://puppet-compiler.wmflabs.org/pcc-worker1003/34740/" [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans) [19:50:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:51:06] (03PS1) 10Btullis: Bump datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/778348 [19:52:26] (03PS4) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 [19:52:28] (03PS4) 10Volans: spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 [19:54:06] 
(03PS2) 10Btullis: Bump datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/778348 [19:54:37] (03CR) 10Volans: "PCC seems to agree that is a noop on the template: https://puppet-compiler.wmflabs.org/pcc-worker1002/34742/" [puppet] - 10https://gerrit.wikimedia.org/r/778332 (owner: 10Volans) [19:55:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:40] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:58] (03CR) 10Volans: "PCC diff: https://puppet-compiler.wmflabs.org/pcc-worker1001/34743/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/778333 (owner: 10Volans) [19:57:01] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet [19:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:50] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1008.eqiad.wmnet [19:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1009.eqiad.wmnet [19:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] brennen: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T2000). 
[20:00:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:02:32] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet [20:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:18] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet [20:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:25] (03CR) 10Btullis: [C: 03+2] Bump datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/778348 (owner: 10Btullis) [20:04:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) These are racked but the switches are not in netbox yet. 
I am blocked [20:08:38] (03Merged) 10jenkins-bot: Bump datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/778348 (owner: 10Btullis) [20:08:53] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1009.eqiad.wmnet [20:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1089.eqiad.wmnet with OS bullseye [20:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1089.eqiad.... [20:13:20] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1090.eqiad.wmnet with OS bullseye [20:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1090.eqiad.... 
[20:17:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1091.eqiad.wmnet with OS bullseye [20:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1091.eqiad.... [20:18:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1092.eqiad.wmnet with OS bullseye [20:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1092.eqiad.... [20:21:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) @Cmjohnson The switches are in Netbox: https://netbox.wikimedia.org/dcim/devices/3931/ https://netbox.wikimedia.org/d... 
[20:21:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1094.eqiad.wmnet with OS bullseye [20:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1089.eqiad.wmnet with reason: host reimage [20:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1094.eqiad.... [20:23:36] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [20:24:37] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [20:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1089.eqiad.wmnet with reason: host reimage [20:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:26] (03PS1) 10Cwhite: logstash: reprioritize dlq filter [puppet] - 10https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088) [20:25:29] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:26:01] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [20:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:52] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: security updates - bking@cumin1001 - T304938 [20:26:53] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:19] (03PS2) 10Cwhite: logstash: reprioritize dlq filter [puppet] - 10https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088) [20:28:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Cmjohnson) @Jclark-ctr moved the DAC cable to the correct port, these should work now. I will image shortly [20:28:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1090.eqiad.wmnet with reason: host reimage [20:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED [20:29:04] (03CR) 10Vivian Rook: "If I'm reading Andrew's comment correctly the updated patch should get us potential access to wallaby, but we'll still need to update clou" [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [20:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1091.eqiad.wmnet with reason: host reimage [20:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1092.eqiad.wmnet with reason: host reimage [20:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1090.eqiad.wmnet with reason: host reimage [20:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 
2:00:00 on elastic1094.eqiad.wmnet with reason: host reimage [20:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:47] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) That is very cool, thanks! Would it be interesting to replicate similar beha... [20:33:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1089.eqiad.wmnet with OS bullseye [20:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1089.eqiad.wmne... [20:34:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1095.eqiad.wmnet with OS bullseye [20:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:38] (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans) [20:34:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1095.eqiad.... 
[20:34:48] (03PS1) 10Cwhite: thanos: fix yaml error [puppet] - 10https://gerrit.wikimedia.org/r/778354 (https://phabricator.wikimedia.org/T288726) [20:35:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1092.eqiad.wmnet with reason: host reimage [20:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:34] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [20:36:21] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [20:36:27] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/778333 (owner: 10Volans) [20:36:45] (03CR) 10Cwhite: [C: 03+2] thanos: fix yaml error [puppet] - 10https://gerrit.wikimedia.org/r/778354 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [20:37:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1094.eqiad.wmnet with reason: host reimage [20:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:05] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) This was updated, same issue on dumpsdata1007 and sent info to our Dell team. 
[20:39:11] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1091.eqiad.wmnet with reason: host reimage [20:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS buster [20:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster [20:40:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1090.eqiad.wmnet with OS bullseye [20:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1090.eqiad.wmne... [20:41:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1096.eqiad.wmnet with OS bullseye [20:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1096.eqiad.... [20:42:07] (03CR) 10Cwhite: [C: 03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [20:42:22] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:42:22] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [20:42:48] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet [20:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED [20:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:58] (03PS2) 10Ryan Kemper: elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) [20:44:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1092.eqiad.wmnet with OS bullseye [20:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1092.eqiad.wmne... 
[20:45:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1097.eqiad.wmnet with OS bullseye [20:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1097.eqiad.... [20:45:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1095.eqiad.wmnet with reason: host reimage [20:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:24] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 91 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:46:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1094.eqiad.wmnet with OS bullseye [20:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS buster [20:47:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1094.eqiad.wmne... 
[20:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS buster [20:48:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1098.eqiad.wmnet with OS bullseye [20:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1098.eqiad.... [20:49:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1095.eqiad.wmnet with reason: host reimage [20:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) Please note the partman part will fail due to the raid controller reordering the disk array numbers and puts SSDs as SDB. This was failing PXE for m... 
[20:51:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 58 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:52:01] PROBLEM - Check systemd state on elastic1062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1096.eqiad.wmnet with reason: host reimage [20:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:05] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS buster [20:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1091.eqiad.wmnet with OS bullseye [20:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:09] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS buster [20:54:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS buster executed with erro... 
[20:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1091.eqiad.wmne... [20:54:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster executed with erro... [20:54:39] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:55:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1099.eqiad.wmnet with OS bullseye [20:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1101.eqiad.wmnet with OS bullseye [20:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1099.eqiad.... [20:55:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1101.eqiad.... 
[20:55:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1100.eqiad.wmnet with OS bullseye [20:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1100.eqiad.... [20:56:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1096.eqiad.wmnet with reason: host reimage [20:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1102.eqiad.wmnet with OS bullseye [20:56:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1097.eqiad.wmnet with reason: host reimage [20:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1102.eqiad.... 
[20:59:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1095.eqiad.wmnet with OS bullseye [20:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1095.eqiad.wmne... [20:59:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1098.eqiad.wmnet with reason: host reimage [20:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1097.eqiad.wmnet with reason: host reimage [21:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1098.eqiad.wmnet with reason: host reimage [21:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1096.eqiad.wmnet with OS bullseye [21:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1096.eqiad.wmne... 
[21:06:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1099.eqiad.wmnet with reason: host reimage [21:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1101.eqiad.wmnet with reason: host reimage [21:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1100.eqiad.wmnet with reason: host reimage [21:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1102.eqiad.wmnet with reason: host reimage [21:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1097.eqiad.wmnet with OS bullseye [21:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1097.eqiad.wmne... 
[21:10:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1099.eqiad.wmnet with reason: host reimage [21:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1102.eqiad.wmnet with reason: host reimage [21:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1098.eqiad.wmnet with OS bullseye [21:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1098.eqiad.wmne... 
[21:15:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1100.eqiad.wmnet with reason: host reimage [21:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:50] PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:55] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1101.eqiad.wmnet with reason: host reimage [21:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:12] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:16] (03CR) 10Bking: [C: 03+1] elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) (owner: 10Ryan Kemper) [21:19:18] (03PS3) 10JHathaway: mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) [21:19:50] RECOVERY - Check systemd state on elastic1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1099.eqiad.wmnet with OS bullseye [21:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install 
elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1099.eqiad.wmne... [21:20:12] RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1102.eqiad.wmnet with OS bullseye [21:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1102.eqiad.wmne... [21:23:37] (03PS4) 10JHathaway: mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) [21:24:40] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1100.eqiad.wmnet with OS bullseye [21:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1100.eqiad.wmne... 
[21:26:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1101.eqiad.wmnet with OS bullseye [21:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1101.eqiad.wmne... [21:27:08] (03PS5) 10JHathaway: mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) [21:28:25] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34747/console" [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [21:29:59] (03CR) 10JHathaway: [V: 03+1 C: 03+2] mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [21:30:30] (03CR) 10JHathaway: [V: 03+1 C: 03+2] "pcc looks correct, https://puppet-compiler.wmflabs.org/pcc-worker1001/34747/" [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [21:30:54] (03CR) 10JHathaway: [V: 03+1 C: 03+2] mx: test rejecting email to legacy mailing list domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [21:32:54] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:10] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:40:42] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:42:00] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:06] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:26] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:20] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:50] (03CR) 10Volans: "addressed comment" [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans) [21:46:52] (03PS5) 10Volans: spicerack: simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331 [21:46:54] (03PS5) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 [21:46:56] (03PS5) 10Volans: spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 [21:47:28] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:12] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:54:32] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:55:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Cmjohnson) [21:56:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Cmjohnson) 05Open→03Resolved on-site work has been completed [21:57:25] 10SRE, 10Infrastructure-Foundations, 10netops: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10cmooney) [21:57:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) [21:58:58] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:07] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [22:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:42] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:51] (03PS2) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) [22:14:02] PROBLEM - Check systemd state on elastic1058 is 
CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:14:58] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:15:48] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:25] (03PS3) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) [22:17:00] jouncebot: nowandnext [22:17:00] No deployments scheduled for the next 8 hour(s) and 42 minute(s) [22:17:00] In 8 hour(s) and 42 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220408T0700) [22:25:04] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:10] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:40] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:04] PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:20] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:54] PROBLEM - Check systemd state on elastic1081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:04] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:28] RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:30] RECOVERY - Check systemd state on elastic1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:43:48] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:52] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:52] RECOVERY - Check systemd state on elastic1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:42] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:05] SRE-tools, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (lmata) [22:57:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:22] SRE, Observability-Metrics: Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (lmata) [23:00:53] SRE, SRE-OnFire (FY2021/2022-Q4), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for all wikis - 
https://phabricator.wikimedia.org/T202061 (lmata) [23:01:12] SRE, SRE-OnFire (FY2021/2022-Q4), Infrastructure-Foundations (FY2021/2022-Q4), SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (lmata) [23:03:03] SRE, Observability-Logging, SRE Observability (FY2021/2022-Q4): apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (lmata) [23:05:44] PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:46] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:46] PROBLEM - Check systemd state on elastic1083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:09] SRE, DNS, Traffic-Icebox, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Dzahn) [23:07:38] SRE, DNS, Traffic-Icebox, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Dzahn) added checkboxes, checked those that already resolve meanwhile [23:10:21] SRE, DNS, Traffic-Icebox, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Dzahn) [23:14:44] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: 
ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:48] SRE, Analytics-Radar, Traffic-Icebox, User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (Dzahn) @BBlack This sounds like a duplicate of T303464 (and/or /T302864) to me. Maybe you can just merge it. [23:16:04] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:19:02] RECOVERY - Check systemd state on elastic1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:30] RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:52] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:25:58] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:02] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 41, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:28:25] (PS1) Dzahn: phabricator: allow disabling ssh-phab service except on one host [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) [23:30:30] (CR) Dzahn: "we also don't want to apply the "interface::alias" from profile::phabricator::main but that only happens if $vcs_ip_v4 or $vcs_ip_v6 are s" [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: 
Dzahn) [23:32:32] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:35:19] (CR) Dzahn: "compiling PS1 shows how it's different between phab1001 and phab2001. in PS2 phab1001 and phab2001 will be the same, point being on phab10" [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn) [23:37:08] (PS2) Dzahn: phabricator: allow disabling ssh-phab service except on one host [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) [23:38:04] PROBLEM - Check systemd state on elastic1072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:54] PROBLEM - Check systemd state on elastic1070 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:58] PROBLEM - Check systemd state on elastic1073 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:08] RECOVERY - Check systemd state on elastic1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:10] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:52] RECOVERY - Check systemd state on elastic1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:18] (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:55:06] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:56] RECOVERY - Check systemd state on elastic1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state