[00:12:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T297189)', diff saved to https://phabricator.wikimedia.org/P24191 and previous config saved to /var/cache/conftool/dbconfig/20220407-001254-marostegui.json
[00:12:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:58] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[00:15:12] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:25] <wikibugs>	 (PS1) RLazarus: external_clouds_vendors: Support entity types besides "cloud" [puppet] - https://gerrit.wikimedia.org/r/777899 (https://phabricator.wikimedia.org/T305581)
[00:26:52] <wikibugs>	 (PS1) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465)
[00:26:54] <wikibugs>	 (PS1) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041)
[00:27:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P24192 and previous config saved to /var/cache/conftool/dbconfig/20220407-002759-marostegui.json
[00:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) (owner: Krinkle)
[00:28:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[00:32:01] <wikibugs>	 (CR) Krinkle: [C: +1] "Good to go. Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/776255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[00:32:09] <wikibugs>	 (PS2) Krinkle: Stop writing to $wmfUsingKubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[00:32:46] <wikibugs>	 (CR) Krinkle: [C: +1] "This must be staged and synced separately from the parent - Good to go!" [mediawiki-config] - https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[00:35:55] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:39:13] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P24193 and previous config saved to /var/cache/conftool/dbconfig/20220407-004304-marostegui.json
[00:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:15] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:30] <wikibugs>	 SRE, Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (ssingh) >>! In T305589#7836863, @Dzahn wrote: > My 2 cents:  Thanks for the feedback!  > cookbook not worth it in this case, likely more work to create and debug it than the actual time savings with i...
[00:58:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T297189)', diff saved to https://phabricator.wikimedia.org/P24194 and previous config saved to /var/cache/conftool/dbconfig/20220407-005809-marostegui.json
[00:58:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[00:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[00:58:14] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[00:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24195 and previous config saved to /var/cache/conftool/dbconfig/20220407-005817-marostegui.json
[00:58:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:52] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:18:48] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:28:40] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:38:40] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:38:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:26] <wikibugs>	 (PS2) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465)
[01:39:28] <wikibugs>	 (PS2) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041)
[01:40:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[01:40:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) (owner: Krinkle)
[01:41:07] <wikibugs>	 (CR) Krinkle: static.php: Restore short cache for temporary 'mismatch' response (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041) (owner: Krinkle)
[01:41:51] <wikibugs>	 (PS3) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465)
[01:41:53] <wikibugs>	 (PS3) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041)
[01:42:41] <wikibugs>	 (PS4) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465)
[01:42:43] <wikibugs>	 (PS4) Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T274041)
[01:43:01] <wikibugs>	 (CR) Krinkle: "@dancy These next two are a bit less trivial. Could use a second pair of eyes." [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[01:43:44] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:27] <wikibugs>	 (PS1) Krinkle: varnish: Expand static.php optimisation regarless of query string [puppet] - https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465)
[01:58:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24196 and previous config saved to /var/cache/conftool/dbconfig/20220407-015832-marostegui.json
[01:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:37] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[01:59:59] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:02:49] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:13:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24197 and previous config saved to /var/cache/conftool/dbconfig/20220407-021337-marostegui.json
[02:13:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:26:55] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24198 and previous config saved to /var/cache/conftool/dbconfig/20220407-022842-marostegui.json
[02:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[02:43:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24199 and previous config saved to /var/cache/conftool/dbconfig/20220407-024347-marostegui.json
[02:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:52] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[02:46:47] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:59:48] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:00:16] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:26] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:14:32] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:19:54] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:25:08] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:04] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:33:58] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:36:16] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:36:30] <icinga-wm>	 PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:38:16] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:41:42] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:44:36] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:21] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:07] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:09:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (paramita_das) Hi @Aklapper  @Ottomata, I am trying to open a SSH tunnel to connect to analytics clients using the command mentioned https://wikitech.wikimedia.or...
[04:09:21] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:13:08] <ryankemper>	 !log [Elastic] Beginning rolling reboot of codfw elastic to apply kernel security updates: `ryankemper@cumin1001:~$ sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster reboot" --reboot --nodes-per-run 3 --start-datetime 2022-04-07T04:09:05 --task-id T304938`
[04:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:13:17] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938
[04:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:05] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:39] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:20:33] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:29] <icinga-wm>	 PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:45] <icinga-wm>	 RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:31] <icinga-wm>	 PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:44] <ryankemper>	 !log [Elastic] for future reference, we still need to fix the fact that we haven't told systemd that the prometheus-wmf-elasticsearch exporters need to start after the actual elasticsearch service
[04:29:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:03] <icinga-wm>	 RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:21] <ryankemper>	 (manually restarted failing prometheus exporter units)
[04:39:49] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:43] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:41:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[04:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[04:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24200 and previous config saved to /var/cache/conftool/dbconfig/20220407-044158-marostegui.json
[04:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:01] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[04:42:29] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:45:01] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:30] <wikibugs>	 (PS1) Marostegui: Revert "db1163: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/777776
[04:54:19] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1163: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/777776 (owner: Marostegui)
[04:57:33] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2076 db2086:3317 db2086:3318 db2107 db2137:3314 db2137:3315 db2143 db2147 es2029 es2030 T305469', diff saved to https://phabricator.wikimedia.org/P24201 and previous config saved to /var/cache/conftool/dbconfig/20220407-050149-root.json
[05:01:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:54] <stashbot>	 T305469: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469
[05:04:15] <icinga-wm>	 PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:51] <icinga-wm>	 PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:29] <icinga-wm>	 PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:59] <icinga-wm>	 RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:53] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:27] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:25:54] <icinga-wm>	 RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:04] <icinga-wm>	 RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:28] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:26] <icinga-wm>	 RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:42:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24202 and previous config saved to /var/cache/conftool/dbconfig/20220407-054213-marostegui.json
[05:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:17] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[05:43:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:44:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:44:04] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:38] <wikibugs>	 SRE, conftool: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (Joe)
[05:45:56] <wikibugs>	 SRE, conftool: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (Joe) p:Triage→High
[05:53:56] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938
[05:53:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:20] <icinga-wm>	 RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:55:20] <icinga-wm>	 RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:07] <wikibugs>	 SRE, conftool: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 (Joe)
[05:56:12] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:21] <wikibugs>	 SRE, conftool: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 (Joe) p:Triage→Medium
[05:57:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P24203 and previous config saved to /var/cache/conftool/dbconfig/20220407-055718-marostegui.json
[05:57:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:14] <ryankemper>	 !log [Elastic] Manually restarted elasticsearch exporters on `cloudelastic1004` and `elastic2054`
[05:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:36] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0600).
[06:00:05] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938
[06:00:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:06] <wikibugs>	 SRE, conftool, Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (Joe) Ipblock, per se, supports arbitrary scope names.  What we need is to add support for these other scopes in VCL.  My proposal would be to ditch the `X-Pu...
[06:12:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P24205 and previous config saved to /var/cache/conftool/dbconfig/20220407-061223-marostegui.json
[06:12:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:42] <icinga-wm>	 PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:44] <icinga-wm>	 PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:46] <icinga-wm>	 PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:44] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:02] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:19:14] <icinga-wm>	 RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:28] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:21:36] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:25:48] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - ryankemper@cumin1001 - T304938
[06:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:28] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:16] <ryankemper>	 !log [Elastic] Manually restarted elasticsearch exporters on `elastic2043` and `elastic2058`
[06:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24206 and previous config saved to /var/cache/conftool/dbconfig/20220407-062728-marostegui.json
[06:27:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[06:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:31] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[06:27:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[06:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24207 and previous config saved to /var/cache/conftool/dbconfig/20220407-062736-marostegui.json
[06:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:48] <icinga-wm>	 RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:48] <icinga-wm>	 RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[06:42:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300775)', diff saved to https://phabricator.wikimedia.org/P24208 and previous config saved to /var/cache/conftool/dbconfig/20220407-064258-marostegui.json
[06:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:02] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[06:43:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[06:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[06:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:58] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:12] <wikibugs>	 (03PS1) 10Ladsgroup: Enable videojs on wiktionary wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778197 (https://phabricator.wikimedia.org/T248418)
[06:53:06] <hashar>	 good morning
[06:53:39] <hashar>	 I am going to restart CI and Gerrit entirely starting at 7:00 UTC (7 minutes from now)
[06:54:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye
[06:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:08] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1002.eqiad.wmnet with OS bullseye
[06:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:46] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:56:31] <wikibugs>	 (03CR) 10Ayounsi: Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[06:58:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24209 and previous config saved to /var/cache/conftool/dbconfig/20220407-065803-marostegui.json
[06:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 hashar: Time to snap out of that daydream and deploy CI/Gerrit maintenance. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0700).
[07:00:05] <jouncebot>	 Amir1, apergos, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0700).
[07:00:08] <apergos>	 there is a trainee in the window but no patches scheduled, which might be a good thing, given that gerrit is set for 30 minutes of maintenance beginning now. 
[07:00:21] <hashar>	 good morning
[07:00:30] <apergos>	 I'll catch the trainee if they show up in the google meet and explain things. they can reschedule.
[07:00:33] <apergos>	 hello hasha r
[07:00:37] <hashar>	 I apologize for the backport & config window hijack
[07:00:43] <hashar>	 but should be a fast operation :]
[07:00:43] <Amir1>	 apergos: where is the meeting? I don't have the link
[07:01:10] <apergos>	 https://meet.google.com/ium-qmwp-wvd?authuser=0  but don't bother showing up
[07:01:48] <apergos>	 you should get it to show up on your calendar, ask Tyler 
[07:01:55] <apergos>	 since you're listed for this window always
[07:02:08] <hashar>	 !log Restarting contint1001.wikimedia.org
[07:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:10] <icinga-wm>	 PROBLEM - Host contint1001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:05:08] <icinga-wm>	 RECOVERY - Host contint1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[07:08:32] <icinga-wm>	 PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:03] <hashar>	 !log Restarting gerrit1001.wikimedia.org
[07:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:21] <hashar>	 !log Restarting contint2001.wikimedia.org
[07:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24210 and previous config saved to /var/cache/conftool/dbconfig/20220407-071308-marostegui.json
[07:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:25] <wikibugs>	 (03CR) 10Ayounsi: "Thanks! that's awesome." [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[07:13:47] <hashar>	 Apr 07 07:12:08 gerrit1001 apachectl[886]: (99)Cannot assign requested address: AH00072: make_sock: could not bind to address [2620:0:861:2:208:80:154:137]:80
[07:13:48] <icinga-wm>	 RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:13:53] <hashar>	 poor Apache
[07:14:20] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:42] <hashar>	 !log gerrit1001.wikimedia.org: restarted apache2 service
[07:14:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:04] <icinga-wm>	 PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:06] <wikibugs>	 (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Add to repo [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879)
[07:17:21] <hashar>	 !log CI and Gerrit are back up
[07:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:14] <wikibugs>	 (03PS3) 10Elukey: role::ml_k8s::master: change the codfw svc IP range [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673)
[07:19:40] <wikibugs>	 (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui)
[07:20:24] <wikibugs>	 (03PS4) 10Elukey: role::ml_k8s::master: change the codfw svc/pod IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673)
[07:20:48] <wikibugs>	 (03PS2) 10Elukey: Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673)
[07:21:30] <apergos>	 hey TheresNoTime there are no patches for today, so I've commented on the training task, let's try again for next week.
[07:22:06] <wikibugs>	 (03PS3) 10Elukey: Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673)
[07:23:44] <wikibugs>	 (03PS1) 10Elukey: Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673)
[07:26:03] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:26:22] <JJMC89>	 marostegui: is the large amount of lag on db1163 expected?
[07:26:31] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300775)', diff saved to https://phabricator.wikimedia.org/P24211 and previous config saved to /var/cache/conftool/dbconfig/20220407-072813-marostegui.json
[07:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:19] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:28:19] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[07:29:07] <marostegui>	 JJMC89: checking
[07:29:47] <marostegui>	 it is not
[07:29:51] <marostegui>	 It shouldn't have been repooled
[07:29:53] <marostegui>	 depooling it
[07:30:01] <marostegui>	 Amir1: we need to check why it was repooled
[07:30:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1163', diff saved to https://phabricator.wikimedia.org/P24212 and previous config saved to /var/cache/conftool/dbconfig/20220407-073013-root.json
[07:30:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:44] <Amir1>	 I'll check
[07:30:50] <marostegui>	 Amir1: I think I know why
[07:30:57] <marostegui>	 Amir1: both schema changes overlapped
[07:31:05] <marostegui>	 JJMC89: thanks for the heads up!
[07:31:18] <Amir1>	 wait I thought you were done with s1
[07:31:18] <Amir1>	 is it s1?
[07:31:31] <marostegui>	 Amir1: No, I had two hosts pending
[07:31:33] <marostegui>	 I am now done
[07:31:48] <marostegui>	 I started them yesterday and they finished today
[07:32:01] <JJMC89>	 no problem
[07:32:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[07:32:07] <Amir1>	 https://phabricator.wikimedia.org/T300775#7837123
[07:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[07:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:13] <icinga-wm>	 RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:33] <Amir1>	 marostegui: I am so sorry, I took the comment as "it is done" https://phabricator.wikimedia.org/T300775#7837123
[07:33:01] <marostegui>	 Amir1: Yeah, not sure what happened, as I see the host being repooled today too
[07:33:18] <Amir1>	 it happens
[07:33:31] <Amir1>	 should I stop my schema change?
[07:33:34] <marostegui>	 Amir1: No, I see what happened, the schema change did finish, but the host was still catching up
[07:33:41] <marostegui>	 That is why I commented there
[07:33:57] <Amir1>	 aaah, That's "finish"
[07:35:21] <Amir1>	 it actually reminds me of a famous aviation accident in which there was a misunderstanding on what "take off" meant
[07:35:45] <Amir1>	 and after that the rules changed
[07:35:59] * Amir1 stops channeling his inner wikipedia 
[07:39:00] <Amir1>	 https://www.vintag.es/2022/03/tenerife-airport-disaster.html
[07:41:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey)
[07:43:55] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp3050 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777846 (https://phabricator.wikimedia.org/T290005)
[07:44:13] <mmandere>	 !log depool cp3050 for reimage - T290005
[07:44:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:17] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[07:45:03] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24213 and previous config saved to /var/cache/conftool/dbconfig/20220407-074654-marostegui.json
[07:46:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:58] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[07:48:33] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:54:52] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp3050 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777846 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[07:55:07] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:55] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3050.esams.wmnet with OS buster
[07:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:04] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3050.esams.wmnet with OS buster
[08:00:04] <jouncebot>	 jnuche and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T0800).
[08:00:37] <mmandere>	 !log depool cp6014 for reimage - T290005
[08:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:42] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[08:01:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24214 and previous config saved to /var/cache/conftool/dbconfig/20220407-080159-marostegui.json
[08:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:14] <hashar>	 there are some blockers :(
[08:05:10] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp6014 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777847 (https://phabricator.wikimedia.org/T290005)
[08:06:13] <hashar>	 eg https://phabricator.wikimedia.org/T305531
[08:06:35] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:17] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp6014 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777847 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[08:09:49] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS buster
[08:09:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:54] <hashar>	 hmm processing
[08:09:57] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster
[08:10:13] <wikibugs>	 (03PS1) 10Hashar: all wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778217
[08:10:15] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] all wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778217 (owner: 10Hashar)
[08:11:08] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778217 (owner: 10Hashar)
[08:13:00] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.6  refs T305212
[08:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:03] <stashbot>	 T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212
[08:14:25] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:15:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:15:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:23] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24215 and previous config saved to /var/cache/conftool/dbconfig/20220407-081704-marostegui.json
[08:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[08:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[08:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24216 and previous config saved to /var/cache/conftool/dbconfig/20220407-081910-ladsgroup.json
[08:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:13] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:13] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[08:19:43] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Add to repo [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui)
[08:20:19] <wikibugs>	 (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: Add to repo [software] - 10https://gerrit.wikimedia.org/r/778206 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui)
[08:21:23] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:23:32] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye
[08:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:23:57] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3050.esams.wmnet with reason: host reimage
[08:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:39] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey)
[08:26:53] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[08:26:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:24] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3050.esams.wmnet with reason: host reimage
[08:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:13] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[08:30:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T297189)', diff saved to https://phabricator.wikimedia.org/P24217 and previous config saved to /var/cache/conftool/dbconfig/20220407-083209-marostegui.json
[08:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:13] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[08:32:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:33:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage
[08:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:41] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:35:55] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:08] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks for the response, I'll submit a new patchset with those changes and push." [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[08:38:33] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage
[08:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:19] <wikibugs>	 (03PS2) 10Cathal Mooney: Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758)
[08:41:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P24218 and previous config saved to /var/cache/conftool/dbconfig/20220407-084103-root.json
[08:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24219 and previous config saved to /var/cache/conftool/dbconfig/20220407-084140-marostegui.json
[08:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:45] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[08:42:39] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[08:43:35] <wikibugs>	 (03Merged) 10jenkins-bot: Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[08:49:04] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1002.eqiad.wmnet with OS bullseye
[08:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: bugfix remove rsyslog-set log.level from blackbox_exporter events [puppet] - 10https://gerrit.wikimedia.org/r/777877 (owner: 10Cwhite)
[08:56:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P24220 and previous config saved to /var/cache/conftool/dbconfig/20220407-085608-root.json
[08:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:41] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3050.esams.wmnet with OS buster
[08:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:51] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3050.esams.wmnet with OS buster com...
[08:59:00] <wikibugs>	 (CR) Filippo Giunchedi: WIP move core routers definitions to hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:59:06] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2122.codfw.wmnet with reason: Rebooting for T303174
[08:59:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:08] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2122.codfw.wmnet with reason: Rebooting for T303174
[08:59:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:48] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2098.codfw.wmnet with OS bullseye
[09:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:11] <mmandere>	 !log pool cp3050 with HAProxy as TLS termination layer - T290005
[09:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:14] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:01:42] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[09:01:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[09:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T305300)', diff saved to https://phabricator.wikimedia.org/P24221 and previous config saved to /var/cache/conftool/dbconfig/20220407-090201-ladsgroup.json
[09:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:04] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[09:05:24] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2150.codfw.wmnet with reason: Rebooting for T303174
[09:05:26] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2150.codfw.wmnet with reason: Rebooting for T303174
[09:05:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:14] <wikibugs>	 SRE, Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (fgiunchedi) Thanks @ssingh for kickstarting the discussion!  My two cents as an owner (with o11y) of some VMs that will need upgrading (grafana, logstash, etc): I think our strategy when it comes to l...
[09:08:10] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:10:00] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:11:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24222 and previous config saved to /var/cache/conftool/dbconfig/20220407-091112-root.json
[09:11:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:23] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2098.codfw.wmnet with reason: host reimage
[09:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:41] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2152.codfw.wmnet with reason: Rebooting for T303174
[09:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:42] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2152.codfw.wmnet with reason: Rebooting for T303174
[09:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:58] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:19] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2098.codfw.wmnet with reason: host reimage
[09:14:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (elukey) Open→Resolved Host reimaged correctly, all done!
[09:16:02] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS buster
[09:16:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:11] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster com...
[09:20:12] <mmandere>	 !log pool cp6014 with HAProxy as TLS termination layer - T290005
[09:20:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:16] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:20:41] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 7 hosts with reason: Rebooting primary T303174
[09:20:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:46] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Rebooting primary T303174
[09:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:57] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2105.codfw.wmnet with reason: Rebooting for T303174
[09:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:59] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2105.codfw.wmnet with reason: Rebooting for T303174
[09:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:34] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:25:26] <mmandere>	 !log depool cp3053 for reimage - T290005
[09:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:29] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:25:32] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:25:44] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24223 and previous config saved to /var/cache/conftool/dbconfig/20220407-092616-root.json
[09:26:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:08] <wikibugs>	 (PS2) MMandere: site: Reimage cp3053 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777848 (https://phabricator.wikimedia.org/T290005)
[09:30:08] <wikibugs>	 (PS1) Elukey: kserve-inference: Allow prometheus to scrape istio sidecar's port [deployment-charts] - https://gerrit.wikimedia.org/r/778247 (https://phabricator.wikimedia.org/T297612)
[09:30:32] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp3053 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777848 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[09:30:37] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2123.codfw.wmnet with reason: Rebooting for T303174
[09:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:38] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2123.codfw.wmnet with reason: Rebooting for T303174
[09:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:08] <wikibugs>	 (CR) Klausman: [C: +1] kserve-inference: Allow prometheus to scrape istio sidecar's port [deployment-charts] - https://gerrit.wikimedia.org/r/778247 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[09:33:38] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3053.esams.wmnet with OS buster
[09:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:48] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3053.esams.wmnet with OS buster
[09:34:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T305300)', diff saved to https://phabricator.wikimedia.org/P24224 and previous config saved to /var/cache/conftool/dbconfig/20220407-093412-ladsgroup.json
[09:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:15] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[09:34:22] <mmandere>	 !log depool cp6006 for reimage - T290005
[09:34:24] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Rebooting primary T303174
[09:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:25] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:34:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:30] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Rebooting primary T303174
[09:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:54] <wikibugs>	 (PS2) MMandere: site: Reimage cp6006 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777849 (https://phabricator.wikimedia.org/T290005)
[09:35:28] <wikibugs>	 (PS1) Btullis: Correct the GMS port number that is in use [deployment-charts] - https://gerrit.wikimedia.org/r/778249 (https://phabricator.wikimedia.org/T301454)
[09:35:50] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2098.codfw.wmnet with OS bullseye
[09:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:05] <wikibugs>	 (PS3) Mvolz: citoid: switch to native prometheus metrics [deployment-charts] - https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: PipelineBot)
[09:37:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] citoid: switch to native prometheus metrics [deployment-charts] - https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: PipelineBot)
[09:37:58] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Rebooting primary T303174
[09:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:05] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Rebooting primary T303174
[09:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:17] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp6006 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777849 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[09:39:32] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6006.drmrs.wmnet with OS buster
[09:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:41] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster
[09:40:03] <wikibugs>	 (PS4) Mvolz: citoid: switch to native prometheus metrics [deployment-charts] - https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: PipelineBot)
[09:40:07] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2129.codfw.wmnet with reason: Rebooting for T303174
[09:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:09] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2129.codfw.wmnet with reason: Rebooting for T303174
[09:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:24] <wikibugs>	 (CR) Elukey: [C: +2] kserve-inference: Allow prometheus to scrape istio sidecar's port [deployment-charts] - https://gerrit.wikimedia.org/r/778247 (https://phabricator.wikimedia.org/T297612) (owner: Elukey)
[09:40:30] <icinga-wm>	 PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:41:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24225 and previous config saved to /var/cache/conftool/dbconfig/20220407-094120-root.json
[09:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P24226 and previous config saved to /var/cache/conftool/dbconfig/20220407-094310-root.json
[09:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:13] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1102.eqiad.wmnet with OS bullseye
[09:43:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:45:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:45:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:56] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:45:58] <wikibugs>	 (CR) Btullis: [C: +2] Correct the GMS port number that is in use [deployment-charts] - https://gerrit.wikimedia.org/r/778249 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[09:49:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P24227 and previous config saved to /var/cache/conftool/dbconfig/20220407-094917-ladsgroup.json
[09:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:49] <wikibugs>	 (Merged) jenkins-bot: Correct the GMS port number that is in use [deployment-charts] - https://gerrit.wikimedia.org/r/778249 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[09:50:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:50:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:22] <wikibugs>	 (CR) Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: Cathal Mooney)
[09:51:36] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1102.eqiad.wmnet with reason: host reimage
[09:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:06] <wikibugs>	 (PS4) Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553)
[09:52:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24228 and previous config saved to /var/cache/conftool/dbconfig/20220407-095224-ladsgroup.json
[09:52:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:27] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[09:52:54] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:29] <wikibugs>	 (PS5) Cathal Mooney: Modify cr-loopback Capirca definition to make it compatible with QFX [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553)
[09:53:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[09:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:54:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:58] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1102.eqiad.wmnet with reason: host reimage
[09:54:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:12] <wikibugs>	 (PS1) Kevin Bazira: ml-services: add plwiki, ptwiki & rowiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415)
[09:55:15] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1007.eqiad.wmnet
[09:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:56:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24229 and previous config saved to /var/cache/conftool/dbconfig/20220407-095624-root.json
[09:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:42] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than 8 days ago: Most recent backup 2022-03-29 00:00:01 Jcrespo backup taking failed again - The acknowledgement expires at: 2022-04-08 09:56:13. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[09:56:56] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage
[09:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24230 and previous config saved to /var/cache/conftool/dbconfig/20220407-095814-root.json
[09:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:33] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2099.codfw.wmnet with OS bullseye
[09:58:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:03] <wikibugs>	 SRE-swift-storage: Refactor swift puppet code, particularly where swift_ring_manager config is stored - https://phabricator.wikimedia.org/T305617 (MatthewVernon)
[10:00:03] <wikibugs>	 (CR) Mvolz: [C: +2] Update zotero to include get endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: Mvolz)
[10:00:05] <jouncebot>	 mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1000).
[10:00:21] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage
[10:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:34] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3053.esams.wmnet with reason: host reimage
[10:00:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:58] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3053.esams.wmnet with reason: host reimage
[10:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P24231 and previous config saved to /var/cache/conftool/dbconfig/20220407-100423-ladsgroup.json
[10:04:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:25] <wikibugs>	 (Merged) jenkins-bot: Update zotero to include get endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: Mvolz)
[10:04:51] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[10:04:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:54] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[10:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:21] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[10:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:08] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[10:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:43] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[10:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24232 and previous config saved to /var/cache/conftool/dbconfig/20220407-100729-ladsgroup.json
[10:07:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:38] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[10:07:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:12] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[10:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:21] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1007.eqiad.wmnet
[10:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:36] <wikibugs>	 (PS1) Elukey: Increase namespace constraints for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/778254
[10:08:48] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1102.eqiad.wmnet with OS bullseye
[10:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:51] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[10:08:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:17] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2099.codfw.wmnet with reason: host reimage
[10:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:19] <wikibugs>	 (CR) Klausman: [C: +1] Increase namespace constraints for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/778254 (owner: Elukey)
[10:12:53] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2099.codfw.wmnet with reason: host reimage
[10:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24233 and previous config saved to /var/cache/conftool/dbconfig/20220407-101318-root.json
[10:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:36] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:15:56] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:16:06] <wikibugs>	 (CR) Mvolz: [C: +2] "Based on I78018d4e230 ; hopefully this works!" [deployment-charts] - https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: PipelineBot)
[10:16:30] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1116.eqiad.wmnet with OS bullseye
[10:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:19] <wikibugs>	 (CR) Elukey: [C: +2] Increase namespace constraints for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/778254 (owner: Elukey)
[10:19:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T305300)', diff saved to https://phabricator.wikimedia.org/P24234 and previous config saved to /var/cache/conftool/dbconfig/20220407-101928-ladsgroup.json
[10:19:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:19:32] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[10:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24235 and previous config saved to /var/cache/conftool/dbconfig/20220407-101936-ladsgroup.json
[10:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:27] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: switch to native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot)
[10:20:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24236 and previous config saved to /var/cache/conftool/dbconfig/20220407-102234-ladsgroup.json
[10:22:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:17] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[10:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:43] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1116.eqiad.wmnet with reason: host reimage
[10:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:07] <wikibugs>	 (03PS1) 10Btullis: Bump datahub version to use the containers with wmf-certicates [deployment-charts] - 10https://gerrit.wikimedia.org/r/778257 (https://phabricator.wikimedia.org/T301454)
[10:25:27] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[10:25:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:45] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: add plwiki, ptwiki & rowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira)
[10:27:10] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2099.codfw.wmnet with OS bullseye
[10:27:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:09] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1116.eqiad.wmnet with reason: host reimage
[10:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24237 and previous config saved to /var/cache/conftool/dbconfig/20220407-102821-root.json
[10:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:34:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: add alerts for exporter-specific unavailability [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726)
[10:35:31] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:35:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:34] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main
[10:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:46] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Bump datahub version to use the containers with wmf-certicates [deployment-charts] - 10https://gerrit.wikimedia.org/r/778257 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[10:36:06] <wikibugs>	 (03PS2) 10JMeybohm: Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777021
[10:36:10] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[10:36:11] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: add recording rules for exporter-specific availability [puppet] - 10https://gerrit.wikimedia.org/r/778261 (https://phabricator.wikimedia.org/T288726)
[10:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:56] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[10:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:22] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6006.drmrs.wmnet with OS buster
[10:37:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:32] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster com...
[10:37:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24238 and previous config saved to /var/cache/conftool/dbconfig/20220407-103739-ladsgroup.json
[10:37:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:37:44] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[10:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[10:37:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:49] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kubemaster2002.codfw.wmnet with reason: reimage
[10:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:55] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubemaster2002.codfw.wmnet with reason: reimage
[10:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:11] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777021 (owner: 10JMeybohm)
[10:38:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:39:18] <wikibugs>	 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi)
[10:39:43] <wikibugs>	 (03Merged) 10jenkins-bot: Bump datahub version to use the containers with wmf-certicates [deployment-charts] - 10https://gerrit.wikimedia.org/r/778257 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[10:40:26] <mmandere>	 !log pool cp6006 with HAProxy as TLS termination layer - T290005
[10:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:29] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:40:32] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: requestctl: allow safer changes to the production VCL [software/conftool] - 10https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606)
[10:41:44] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1116.eqiad.wmnet with OS bullseye
[10:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:29] <wikibugs>	 (03PS5) 10Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249)
[10:43:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:02] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2100.codfw.wmnet with OS bullseye
[10:44:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:04] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:45:32] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[10:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:36] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:49:06] <jayme>	 me
[10:49:48] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:50:11] <jayme>	 where would I go looking if ganeti (codfw) does not return my calls? (gnt-instance modify just hangs)
[10:51:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add $schema field to w3creportingapi tests [puppet] - 10https://gerrit.wikimedia.org/r/776025 (owner: 10Cwhite)
[10:51:10] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1139.eqiad.wmnet with OS bullseye
[10:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:24] <jayme>	 "waiting for locks" looks promising
[10:51:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: replace all instances of @metadata.partition [puppet] - 10https://gerrit.wikimedia.org/r/777874 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[10:51:58] <mvolz>	 akosiaris: or anyone who knows metrics/grafana, could anyone help me with fixing metrics on codfw?
[10:52:11] <mvolz>	 I want to fix metrics before I deploy to eqiad
[10:52:43] <mvolz>	 The traffic metrics aren't working: https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid&from=now-15m&to=now&forceLogin&editPanel=10&refresh=5m
[10:53:25] <mvolz>	 This is probably because the name changed, but when I fix the name it still doesn't seem to work. I know the metrics are making it to prometheus because I can see them in the prometheus browser!
[10:53:37] <mvolz>	 https://thanos.wikimedia.org/graph?g0.expr=citoid_router_request_duration_seconds_count&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[10:54:19] <jynus>	 if the metrics are on prometheus, then the only thing is to check them on grafana?
[10:54:45] <mvolz>	 yeah, I just don't know how to fix grafana - obviously the query is wrong
[10:54:55] <mvolz>	 but everything i try it's just "no data"
[10:54:57] <mvolz>	 :)
[10:55:06] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3053.esams.wmnet with OS buster
[10:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:07] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2100.codfw.wmnet with reason: host reimage
[10:55:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:15] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3053.esams.wmnet with OS buster com...
[10:55:40] <mvolz>	 wait actually I think I figured it out
[10:55:44] <jynus>	 current metric says:
[10:55:46] <jynus>	 sum(rate(service_runner_request_duration_seconds_count{service="$service"}[5m]))
[10:55:52] <mvolz>	 well one of them
[10:55:57] <jynus>	 yeah
[10:56:17] <jynus>	 if it is only a variable change, it should be just correcting that
[10:56:20] <mvolz>	 yeah I changed it to citoid router and that worked phew... 
[10:57:02] <jynus>	 now it says "AnnotationQueryRunner failed" t[a] is not iterable
[10:58:32] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2100.codfw.wmnet with reason: host reimage
[10:58:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:34] <mmandere>	 !log pool cp3053 with HAProxy as TLS termination layer - T290005
[10:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:36] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:59:44] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1139.eqiad.wmnet with reason: host reimage
[10:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:19] <wikibugs>	 10SRE, 10Traffic, 10WMF-Legal, 10Performance-Team (Radar), 10Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (10Nicholas_Perry) Hi all, we received some info from Google which may help inform this...
[11:01:28] <mvolz>	 just fyi I'm going to run over my window, as no one is after me in the schedule
[11:03:21] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1139.eqiad.wmnet with reason: host reimage
[11:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:50] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2100.codfw.wmnet with OS bullseye
[11:12:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:12] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on aqs1007 is CRITICAL: cluster=aqs device={sdh,sdm} instance=aqs1007 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aqs1007&var-datasource=eqiad+prometheus/ops
[11:15:25] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:15:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:03] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:45] <wikibugs>	 (03PS5) 10Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465)
[11:17:19] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1139.eqiad.wmnet with OS bullseye
[11:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:52] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2101.codfw.wmnet with OS bullseye
[11:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:43] <mvolz>	 ok, I'm done deploying. 
[11:19:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24239 and previous config saved to /var/cache/conftool/dbconfig/20220407-111950-ladsgroup.json
[11:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:54] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[11:22:10] <wikibugs>	 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10Mvolz)
[11:23:19] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1140.eqiad.wmnet with OS bullseye
[11:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:30] <mmandere>	 !log depool cp3051 for reimage - T290005
[11:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:33] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:25:35] <wikibugs>	 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10Mvolz) This is now deployed for citoid.   I have updated grafana for the most part, however there are a few (minor) metrics this bro...
[11:28:27] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2101.codfw.wmnet with reason: host reimage
[11:28:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:04] <James_F>	 jouncebot: now
[11:30:04] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 29 minute(s)
[11:30:07] <James_F>	 Cool.
[11:30:16] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp3051 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777850 (https://phabricator.wikimedia.org/T290005)
[11:31:52] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1140.eqiad.wmnet with reason: host reimage
[11:31:53] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2101.codfw.wmnet with reason: host reimage
[11:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:12] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp3051 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777850 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[11:32:25] <logmsgbot>	 !log jforrester@deploy1002 Started deploy [integration/docroot@d88e2fa]: d88e2fa19fd6 [WikiLambda] Fix link typo and re-group/re-word other links
[11:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:34] <logmsgbot>	 !log jforrester@deploy1002 Finished deploy [integration/docroot@d88e2fa]: d88e2fa19fd6 [WikiLambda] Fix link typo and re-group/re-word other links (duration: 00m 09s)
[11:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:20] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3051.esams.wmnet with OS buster
[11:34:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:29] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3051.esams.wmnet with OS buster
[11:34:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24240 and previous config saved to /var/cache/conftool/dbconfig/20220407-113455-ladsgroup.json
[11:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:17] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1140.eqiad.wmnet with reason: host reimage
[11:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:25] <mmandere>	 !log depool cp6013 for reimage - T290005
[11:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:28] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:39:39] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp6013 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777851 (https://phabricator.wikimedia.org/T290005)
[11:41:09] <icinga-wm>	 RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:44:19] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp6013 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777851 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[11:45:40] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS buster
[11:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster
[11:46:02] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778277
[11:46:04] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778278
[11:46:34] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2101.codfw.wmnet with OS bullseye
[11:46:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:25] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:49:13] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1140.eqiad.wmnet with OS bullseye
[11:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24241 and previous config saved to /var/cache/conftool/dbconfig/20220407-115002-ladsgroup.json
[11:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:25] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:55:39] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:41] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:03:00] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3051.esams.wmnet with reason: host reimage
[12:03:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:35] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[12:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T305300)', diff saved to https://phabricator.wikimedia.org/P24242 and previous config saved to /var/cache/conftool/dbconfig/20220407-120507-ladsgroup.json
[12:05:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[12:05:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[12:05:10] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[12:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24243 and previous config saved to /var/cache/conftool/dbconfig/20220407-120514-ladsgroup.json
[12:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:24] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3051.esams.wmnet with reason: host reimage
[12:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:47] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[12:08:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:12:43] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:13:21] <wikibugs>	 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10Volans) >>! In T305589#7837526, @fgiunchedi wrote: > AIUI the decom cookbook doesn't support VMs yet (?)  That's not actually correct, the decommission cookbook does support VMs since the start. What...
[12:13:49] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: install python3-git [puppet] - 10https://gerrit.wikimedia.org/r/778280
[12:19:22] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174
[12:19:23] <logmsgbot>	 !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174
[12:19:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:52] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I didn't test it, but changes looks sane to me." [puppet] - 10https://gerrit.wikimedia.org/r/777899 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus)
[12:23:28] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174
[12:23:29] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2096.codfw.wmnet with reason: Rebooting for T303174
[12:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:12] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10Joe) >>! In T303857#7818920, @dancy wrote: > I have confirmed that being in the `deployment` group will all...
[12:25:21] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:16] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3051.esams.wmnet with OS buster
[12:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:25] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3051.esams.wmnet with OS buster com...
[12:32:22] <mmandere>	 !log pool cp3051 with HAProxy as TLS termination layer - T290005
[12:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:25] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[12:34:01] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2078,2132].codfw.wmnet with reason: Rebooting primary T303174
[12:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:04] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2078,2132].codfw.wmnet with reason: Rebooting primary T303174
[12:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:16] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2132.codfw.wmnet with reason: Rebooting for T303174
[12:34:18] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2132.codfw.wmnet with reason: Rebooting for T303174
[12:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:07] <wikibugs>	 SRE, Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (fgiunchedi) >>! In T305589#7837933, @Volans wrote: >>>! In T305589#7837526, @fgiunchedi wrote: >> AIUI the decom cookbook doesn't support VMs yet (?) >  > That's not actually correct, the decommission...
[12:37:40] <wikibugs>	 SRE, conftool, Patch-For-Review: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (CDanis) I suggest something simpler:  Use a common prefix in the header name, with the name of the ipblock group as the suffix.  X-SRE-Ipblock-Cloud X-SRE-I...
[12:38:40] <wikibugs>	 SRE, Goal, MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi) >>! In T205870#7837817, @Mvolz wrote: > This is now deployed for citoid.   This is great to see! Thanks for your help @M...
[12:40:09] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:40:11] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2078,2133].codfw.wmnet with reason: Rebooting primary T303174
[12:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:13] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2078,2133].codfw.wmnet with reason: Rebooting primary T303174
[12:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:27] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2133.codfw.wmnet with reason: Rebooting for T303174
[12:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:28] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2133.codfw.wmnet with reason: Rebooting for T303174
[12:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:09] <icinga-wm>	 PROBLEM - Host logstash2024 is DOWN: PING CRITICAL - Packet loss = 100%
[12:44:14] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1145.eqiad.wmnet with OS bullseye
[12:44:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:29] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:45:06] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM, I checked thanos and all models are correctly listed." [deployment-charts] - https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) (owner: Kevin Bazira)
[12:45:37] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2134.codfw.wmnet with reason: Rebooting for T303174
[12:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:39] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2134.codfw.wmnet with reason: Rebooting for T303174
[12:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:53] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:43] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:46:49] <wikibugs>	 SRE, ops-eqiad, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (aborrero)
[12:47:43] <wikibugs>	 (PS1) Giuseppe Lavagetto: admin: add mwbuilder to the deployment group [puppet] - https://gerrit.wikimedia.org/r/778284 (https://phabricator.wikimedia.org/T303857)
[12:47:45] <wikibugs>	 (PS1) Giuseppe Lavagetto: mwdebug-deploy: run as mwbuilder, use release repository [puppet] - https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648)
[12:48:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] external_clouds_vendors: install python3-git [puppet] - https://gerrit.wikimedia.org/r/778280 (owner: Giuseppe Lavagetto)
[12:49:52] <akosiaris>	 !log sudo gnt-cluster modify -H kvm:migration_downtime=3000 for ganeti01.svc.codfw.wmnet and ganeti01.svc.eqiad.wmnet to combat some logstash VM migration issues.
[12:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:34] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2135.codfw.wmnet with reason: Rebooting for T303174
[12:51:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:35] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2135.codfw.wmnet with reason: Rebooting for T303174
[12:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:06] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS buster
[12:52:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:15] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster com...
[12:54:50] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34730/console" [puppet] - https://gerrit.wikimedia.org/r/778284 (https://phabricator.wikimedia.org/T303857) (owner: Giuseppe Lavagetto)
[12:55:14] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2139.codfw.wmnet with OS bullseye
[12:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:39] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1145.eqiad.wmnet with reason: host reimage
[12:55:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:41] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:25] <wikibugs>	 (CR) CDanis: [C: +1] requestctl: allow safer changes to the production VCL [software/conftool] - https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto)
[12:57:43] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Rebooting primary T303174
[12:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:50] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Rebooting primary T303174
[12:57:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:54] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] admin: add mwbuilder to the deployment group [puppet] - https://gerrit.wikimedia.org/r/778284 (https://phabricator.wikimedia.org/T303857) (owner: Giuseppe Lavagetto)
[12:58:02] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2104.codfw.wmnet with reason: Rebooting for T303174
[12:58:04] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2104.codfw.wmnet with reason: Rebooting for T303174
[12:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:56] <mmandere>	 !log depool cp6005 for reimage - T290005
[12:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:58] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[12:58:59] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1145.eqiad.wmnet with reason: host reimage
[12:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:24] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:59:30] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1300).
[13:00:05] <jouncebot>	 nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:02:00] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:30] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:04:59] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2028.codfw.wmnet with reason: Rebooting for T303174
[13:05:00] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2028.codfw.wmnet with reason: Rebooting for T303174
[13:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24244 and previous config saved to /var/cache/conftool/dbconfig/20220407-130529-ladsgroup.json
[13:05:31] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, and 2 others: Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (Joe) Open→Resolved `lang=bash oblivian@deploy1002:~ $ sudo -u mwbuilder groups mwbuilder docker deployment `
[13:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:33] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[13:05:56] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:07:34] <wikibugs>	 (PS2) JMeybohm: Move kubemaster2001 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435)
[13:08:09] <wikibugs>	 (PS2) MMandere: site: Reimage cp6005 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777852 (https://phabricator.wikimedia.org/T290005)
[13:08:33] <wikibugs>	 (PS2) Giuseppe Lavagetto: mwdebug-deploy: run as mwbuilder, use release repository [puppet] - https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648)
[13:08:34] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubemaster2001.codfw.wmnet with reason: reimage
[13:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:36] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubemaster2001.codfw.wmnet with reason: reimage
[13:08:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:08] <wikibugs>	 (CR) JMeybohm: [C: +2] Move kubemaster2001 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[13:09:36] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp6005 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777852 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[13:09:54] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0628 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:10:06] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2030.codfw.wmnet with reason: Rebooting for T303174
[13:10:07] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2139.codfw.wmnet with reason: host reimage
[13:10:07] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2030.codfw.wmnet with reason: Rebooting for T303174
[13:10:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:23] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34731/console" [puppet] - https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[13:11:43] <_joe_>	 jouncebot: next
[13:11:43] <jouncebot>	 In 2 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600)
[13:11:52] <_joe_>	 ok I got plenty time
[13:11:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] mwdebug-deploy: run as mwbuilder, use release repository [puppet] - https://gerrit.wikimedia.org/r/778285 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[13:12:10] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6005.drmrs.wmnet with OS buster
[13:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:18] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster
[13:13:30] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2139.codfw.wmnet with reason: host reimage
[13:13:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:33] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:37] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:13:46] <mmandere>	 !log depool cp6012 for reimage - T290005
[13:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:49] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[13:14:05] <_joe_>	 uh jayme are you doing something with the codfw k8s cluster?
[13:14:23] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1145.eqiad.wmnet with OS bullseye
[13:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:31] <jayme>	 yep, that's me again, sorry
[13:14:39] <jayme>	 _joe_: just reimaging masters
[13:14:52] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2032.codfw.wmnet with reason: Rebooting for T303174
[13:14:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:53] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2032.codfw.wmnet with reason: Rebooting for T303174
[13:14:54] <jayme>	 one-by-one obviously
[13:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:06] <_joe_>	 jayme: why not all at the same time!?!
[13:16:58] <wikibugs>	 (PS2) MMandere: site: Reimage cp6012 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777853 (https://phabricator.wikimedia.org/T290005)
[13:17:09] <jayme>	 I'm too pansy \o/ ;)
[13:17:39] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:47] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp6012 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777853 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[13:19:57] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:20:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24245 and previous config saved to /var/cache/conftool/dbconfig/20220407-132034-ladsgroup.json
[13:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:47] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS buster
[13:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:00] <wikibugs>	 (CR) Gehel: [C: +2] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/777875 (owner: Ryan Kemper)
[13:21:01] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster
[13:23:43] <wikibugs>	 SRE, conftool: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 (CDanis)
[13:23:47] <wikibugs>	 SRE, conftool, Patch-For-Review: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (CDanis)
[13:24:46] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1150.eqiad.wmnet with OS bullseye
[13:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:51] <wikibugs>	 (Merged) jenkins-bot: elastic: relforge needs --without-lvs [cookbooks] - https://gerrit.wikimedia.org/r/777875 (owner: Ryan Kemper)
[13:29:39] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2026.codfw.wmnet with reason: Rebooting for T303174
[13:29:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:40] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2026.codfw.wmnet with reason: Rebooting for T303174
[13:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:55] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2139.codfw.wmnet with OS bullseye
[13:29:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:14] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage
[13:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:40] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage
[13:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:11] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2031.codfw.wmnet with reason: Rebooting for T303174
[13:34:12] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2031.codfw.wmnet with reason: Rebooting for T303174
[13:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24246 and previous config saved to /var/cache/conftool/dbconfig/20220407-133539-ladsgroup.json
[13:35:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:23] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1150.eqiad.wmnet with reason: host reimage
[13:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:21] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage
[13:37:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:45] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: host reimage
[13:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:44] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2033.codfw.wmnet with reason: Rebooting for T303174
[13:40:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:45] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2033.codfw.wmnet with reason: Rebooting for T303174
[13:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:45] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage
[13:41:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:19] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[13:44:12] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] requestctl: allow safer changes to the production VCL [software/conftool] - https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto)
[13:45:15] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2027.codfw.wmnet with reason: Rebooting for T303174
[13:45:17] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2027.codfw.wmnet with reason: Rebooting for T303174
[13:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:20] <akosiaris>	 mvolz: I gather you solved the grafana issues you had? Or is there anything I can help with?
[13:45:25] <marostegui>	 checking that haproxy alert
[13:46:02] <wikibugs>	 (Merged) jenkins-bot: requestctl: allow safer changes to the production VCL [software/conftool] - https://gerrit.wikimedia.org/r/778263 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto)
[13:47:15] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:47:33] <wikibugs>	 (PS1) Giuseppe Lavagetto: Debian changelog update [software/conftool] - https://gerrit.wikimedia.org/r/778293
[13:48:49] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:49:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:49:31] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2141.codfw.wmnet with OS bullseye
[13:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:53] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:50:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T305300)', diff saved to https://phabricator.wikimedia.org/P24247 and previous config saved to /var/cache/conftool/dbconfig/20220407-135044-ladsgroup.json
[13:50:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[13:50:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[13:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:49] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[13:50:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24248 and previous config saved to /var/cache/conftool/dbconfig/20220407-135052-ladsgroup.json
[13:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:25] <wikibugs>	 SRE, Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (ssingh) Thanks for the feedback @fgiunchedi and @Volans!  >>! In T305589#7837933, @Volans wrote: >>>! In T305589#7837526, @fgiunchedi wrote: >> AIUI the decom cookbook doesn't support VMs yet (?) >  >...
[13:52:04] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:53:16] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2029.codfw.wmnet with reason: Rebooting for T303174
[13:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:18] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2029.codfw.wmnet with reason: Rebooting for T303174
[13:53:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:15] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1150.eqiad.wmnet with OS bullseye
[13:55:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (drochford)
[13:59:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (RhinosF1) @drochford: approving party will be whoever your manager is
[14:02:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (RhinosF1) analytics-privatedata-users will need @Ottomata or @odimitrijevic's approval too.
[14:02:57] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2034.codfw.wmnet with reason: Rebooting for T303174
[14:02:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:58] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2034.codfw.wmnet with reason: Rebooting for T303174
[14:03:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:34] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2141.codfw.wmnet with reason: host reimage
[14:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:00] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:04:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938
[14:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:30] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: host reimage
[14:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:42] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6005.drmrs.wmnet with OS buster
[14:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:52] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster com...
[14:08:21] <icinga-wm>	 PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:08:43] <mmandere>	 !log pool cp6005 with HAProxy as TLS termination layer - T290005
[14:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:47] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[14:10:33] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2025.codfw.wmnet with reason: Rebooting for T303174
[14:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:34] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2025.codfw.wmnet with reason: Rebooting for T303174
[14:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:52] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS buster
[14:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:01] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster com...
[14:13:02] <mmandere>	 !log pool cp6012 with HAProxy as TLS termination layer - T290005
[14:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:43] <icinga-wm>	 PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:07] <icinga-wm>	 PROBLEM - Check systemd state on elastic2042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:36] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] Debian changelog update [software/conftool] - 10https://gerrit.wikimedia.org/r/778293 (owner: 10Giuseppe Lavagetto)
[14:18:30] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp6004 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777854 (https://phabricator.wikimedia.org/T290005)
[14:19:18] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2115.codfw.wmnet with reason: Rebooting for T303174
[14:19:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:19] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2115.codfw.wmnet with reason: Rebooting for T303174
[14:19:21] <mmandere>	 !log depool cp6004 for reimage - T290005
[14:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:24] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[14:20:49] <icinga-wm>	 RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24249 and previous config saved to /var/cache/conftool/dbconfig/20220407-142117-ladsgroup.json
[14:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:20] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[14:22:17] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp6004 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777854 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[14:22:33] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2141.codfw.wmnet with OS bullseye
[14:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:15] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6004.drmrs.wmnet with OS buster
[14:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:25] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster
[14:25:44] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2131.codfw.wmnet with reason: Rebooting for T303174
[14:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:46] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2131.codfw.wmnet with reason: Rebooting for T303174
[14:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:51] <wikibugs>	 (03PS1) 10Krinkle: mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465)
[14:28:02] <icinga-wm>	 RECOVERY - Check systemd state on elastic2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:14] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2143.codfw.wmnet with reason: Rebooting for T303174
[14:32:15] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2143.codfw.wmnet with reason: Rebooting for T303174
[14:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:56] <wikibugs>	 10SRE, 10Phabricator, 10SRE Observability (FY2021/2022-Q4), 10User-Ladsgroup: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10lmata) Thanks!
[14:36:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P24250 and previous config saved to /var/cache/conftool/dbconfig/20220407-143622-ladsgroup.json
[14:36:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:36] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P24251 and previous config saved to /var/cache/conftool/dbconfig/20220407-143635-kormat.json
[14:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[14:38:58] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: add plwiki, ptwiki & rowiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira)
[14:41:13] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage
[14:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:38] <icinga-wm>	 PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:02] <icinga-wm>	 PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:26] <icinga-wm>	 PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:44:10] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage
[14:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:32] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1171.eqiad.wmnet with OS bullseye
[14:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:46:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:49:10] <wikibugs>	 (03PS2) 10Volans: service: add new module to expose service::catalog [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904
[14:50:15] <wikibugs>	 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10CDanis)
[14:51:07] <wikibugs>	 (03CR) 10Volans: "Replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans)
[14:51:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P24252 and previous config saved to /var/cache/conftool/dbconfig/20220407-145127-ladsgroup.json
[14:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:40] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P24253 and previous config saved to /var/cache/conftool/dbconfig/20220407-145139-kormat.json
[14:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:56] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2143 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P24254 and previous config saved to /var/cache/conftool/dbconfig/20220407-145455-kormat.json
[14:54:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:14] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2144.codfw.wmnet with reason: Rebooting for T303174
[14:55:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:16] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2144.codfw.wmnet with reason: Rebooting for T303174
[14:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:16] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1171.eqiad.wmnet with reason: host reimage
[14:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:23] <icinga-wm>	 RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:35] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: host reimage
[14:59:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:21] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778299
[15:06:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24255 and previous config saved to /var/cache/conftool/dbconfig/20220407-150632-ladsgroup.json
[15:06:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[15:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[15:06:37] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[15:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24256 and previous config saved to /var/cache/conftool/dbconfig/20220407-150640-ladsgroup.json
[15:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:56] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp6011 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778300 (https://phabricator.wikimedia.org/T290005)
[15:07:58] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp6003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005)
[15:08:00] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005)
[15:08:02] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp6002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005)
[15:08:04] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005)
[15:08:06] <wikibugs>	 (03PS1) 10MMandere: site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005)
[15:11:25] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:12:35] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6004.drmrs.wmnet with OS buster
[15:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:43] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster com...
[15:13:49] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul)
[15:14:56] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1171.eqiad.wmnet with OS bullseye
[15:14:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:03] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul)
[15:18:43] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul)
[15:18:55] <icinga-wm>	 RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:23] <icinga-wm>	 PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:09] <icinga-wm>	 PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:43] <icinga-wm>	 PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:32] <mmandere>	 !log pool cp6004 with HAProxy as TLS termination layer - T290005
[15:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:37] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[15:23:11] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul)
[15:23:53] <wikibugs>	 (03CR) 10Herron: [C: 03+1] sre: add alerts for exporter-specific unavailability [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi)
[15:26:07] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediaiki: add new member of the deployment group everywhere [puppet] - 10https://gerrit.wikimedia.org/r/778307
[15:28:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, even DRYer this way" [puppet] - 10https://gerrit.wikimedia.org/r/778307 (owner: 10Giuseppe Lavagetto)
[15:29:23] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediaiki: add new member of the deployment group everywhere [puppet] - 10https://gerrit.wikimedia.org/r/778307
[15:30:20] <_joe_>	 godog: I left behind a damn require that wasn't really needed, btw
[15:30:40] <godog>	 sigh
[15:30:43] <_joe_>	 but yeah if compilation is happy with this, I'll proceed
[15:30:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34735/console" [puppet] - 10https://gerrit.wikimedia.org/r/778307 (owner: 10Giuseppe Lavagetto)
[15:30:59] <_joe_>	 yeah pcc is now happy
[15:31:02] <_joe_>	 going to merge it
[15:31:19] <godog>	 SGTM
[15:31:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediaiki: add new member of the deployment group everywhere [puppet] - 10https://gerrit.wikimedia.org/r/778307 (owner: 10Giuseppe Lavagetto)
[15:32:31] <_joe_>	 sorry again, it just didn't cross my mind that the same group was basically everywhere
[15:33:47] <_joe_>	 and yes, this fixes puppet
[15:34:09] <godog>	 neato
[15:34:20] <_joe_>	 jouncebot: next
[15:34:20] <jouncebot>	 In 0 hour(s) and 25 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600)
[15:39:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24257 and previous config saved to /var/cache/conftool/dbconfig/20220407-153905-ladsgroup.json
[15:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:11] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[15:39:17] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6011 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778300 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[15:39:29] <wikibugs>	 (03CR) 10Herron: [C: 03+1] thanos: add recording rules for exporter-specific availability [puppet] - 10https://gerrit.wikimedia.org/r/778261 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi)
[15:39:41] <wikibugs>	 (03PS1) 10Btullis: Enable SSL/TLS for accessing the datahub-gms service [deployment-charts] - 10https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454)
[15:39:47] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778301 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[15:41:11] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: replace all instances of @metadata.partition [puppet] - 10https://gerrit.wikimedia.org/r/777874 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[15:43:01] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: bugfix remove rsyslog-set log.level from blackbox_exporter events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777877 (owner: 10Cwhite)
[15:43:03] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] site: Reimage cp6010 as cache::text_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[15:43:14] <icinga-wm>	 RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:23] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778303 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[15:44:00] <icinga-wm>	 RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:45:38] <icinga-wm>	 RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:10] <wikibugs>	 (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778310
[15:48:56] <icinga-wm>	 PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:28] <icinga-wm>	 PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add recording rules for exporter-specific availability [puppet] - 10https://gerrit.wikimedia.org/r/778261 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi)
[15:51:10] <icinga-wm>	 RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:16] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778310 (owner: 10Ahmon Dancy)
[15:52:52] <icinga-wm>	 PROBLEM - Check systemd state on elastic2049 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:12] <icinga-wm>	 RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:21] <wikibugs>	 (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778310 (owner: 10Ahmon Dancy)
[15:54:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P24258 and previous config saved to /var/cache/conftool/dbconfig/20220407-155410-ladsgroup.json
[15:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:08] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul)
[15:56:02] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:04] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: add $schema field to w3creportingapi tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776025 (owner: 10Cwhite)
[15:58:10] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005)
[15:59:14] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778302 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[15:59:22] <icinga-wm>	 RECOVERY - Check systemd state on elastic2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:51] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] site: Reimage cp6009 as cache::text_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[16:00:04] <jouncebot>	 jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:24] <icinga-wm>	 RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:08] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I very much like the idea to make static mode the default on our Wikimedia cluster, and list wikis that are allowed to use dynamic mode as" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: 10Krinkle)
[16:03:22] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10drochford)
[16:03:36] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005)
[16:04:20] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10drochford) Thanks @RhinosF1 - My manager is Jan Eissfeldt, but Jan does not use Phabricator. I've updated the task above.
[16:05:00] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway)
[16:06:15] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul)
[16:06:18] <wikibugs>	 (03CR) 10Vivian Rook: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34738/" [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook)
[16:06:41] <wikibugs>	 (03PS1) 10Ahmon Dancy: Train dev fixups (again) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778314
[16:06:43] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Train dev fixups (again) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778314 (owner: 10Ahmon Dancy)
[16:07:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10Zabe) >>! In T305634#7838700, @drochford wrote: > Thanks @RhinosF1 - My manager is Jan Eissfeldt, but Jan does not use Phabricator. I've updated the task above.  Their pha...
[16:08:00] <wikibugs>	 (03Merged) 10jenkins-bot: Train dev fixups (again) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/778314 (owner: 10Ahmon Dancy)
[16:08:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye
[16:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit...
[16:09:12] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001598 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[16:09:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P24259 and previous config saved to /var/cache/conftool/dbconfig/20220407-160916-ladsgroup.json
[16:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:46] <icinga-wm>	 RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:12:30] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:13:27] <dancy>	 jouncebot now
[16:13:27] <jouncebot>	 For the next 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T1600)
[16:14:23] <rzl>	 dancy: nothing in that window, it's all yours if you need it
[16:14:31] <dancy>	 thx!
[16:15:32] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:39] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778304 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[16:16:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/778305 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[16:17:30] <wikibugs>	 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Debug hosts sometimes Fatal error:  "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (10Krinkle)
[16:17:40] <wikibugs>	 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Debug hosts sometimes Fatal error:  "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (10Krinkle) @tstarling wrote a good summary of the issue at T285823:  > […] Probab...
[16:17:51] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1005.eqiad.wmnet with OS bullseye
[16:17:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS...
[16:18:11] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye
[16:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit...
[16:20:04] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul)
[16:21:48] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10JanWMF) approved, I asked David to request access when it turned out neither JohnB nor I had it and I need a presentation based on it :)
[16:22:53] <wikibugs>	 (03Abandoned) 10Jgiannelos: Disable triggering tile pregeneration on OSM syncs [puppet] - 10https://gerrit.wikimedia.org/r/753111 (https://phabricator.wikimedia.org/T298246) (owner: 10Jgiannelos)
[16:23:49] <wikibugs>	 10SRE, 10MediaWiki-Debug-Logger, 10MediaWiki-General, 10Developer Productivity, and 2 others: Debug hosts sometimes Fatal error:  "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (10Krinkle) My hunch is that some code is (in)directly calling `logger->debug()` f...
[16:24:12] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:24:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24260 and previous config saved to /var/cache/conftool/dbconfig/20220407-162421-ladsgroup.json
[16:24:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[16:24:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[16:24:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:31] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[16:24:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24261 and previous config saved to /var/cache/conftool/dbconfig/20220407-162430-ladsgroup.json
[16:24:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24262 and previous config saved to /var/cache/conftool/dbconfig/20220407-162537-ladsgroup.json
[16:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:45] <wikibugs>	 (03CR) 10David Caro: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook)
[16:26:10] <wikibugs>	 (03CR) 10Krinkle: List Kartographer static map exemptions and document+flip default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: 10Krinkle)
[16:26:24] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:13] <wikibugs>	 (03PS2) 10Krinkle: mediawiki: Update httpbb tests for /static/current going away [puppet] - 10https://gerrit.wikimedia.org/r/778295 (https://phabricator.wikimedia.org/T302465)
[16:30:23] <wikibugs>	 (03CR) 10Vivian Rook: add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook)
[16:31:55] <wikibugs>	 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10Mvolz) >>! In T205870#7838013, @fgiunchedi wrote: >>>! In T205870#7837817, @Mvolz wrote: >> This is now deployed for citoid.  >  > T...
[16:33:14] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10RhinosF1) @JanWMF: Is there a deadline this needs to be done by then?
[16:33:44] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: tune jvmquake settings (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/776857 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse)
[16:34:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:34:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:35:26] <icinga-wm>	 PROBLEM - Host elastic2033 is DOWN: PING CRITICAL - Packet loss = 100%
[16:35:52] <icinga-wm>	 ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale-full only: 1 (doc1001), Fresh: 107 jobs Jcrespo full backup of doc1001 failed, retrying - The acknowledgement expires at: 2022-04-08 12:35:20. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[16:36:32] <icinga-wm>	 PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "You are both right -- the manifest isn't version-specific but it has a version guard around it limiting things to Victoria." [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook)
[16:37:26] <icinga-wm>	 PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:40:07] <wikibugs>	 (03PS2) 10Andrew Bogott: add chunkeddriver.py.patch to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook)
[16:40:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P24263 and previous config saved to /var/cache/conftool/dbconfig/20220407-164042-ladsgroup.json
[16:40:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:31] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage
[16:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:38] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add $schema field to w3creportingapi tests [puppet] - 10https://gerrit.wikimedia.org/r/776025 (owner: 10Cwhite)
[16:43:34] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: replace all instances of @metadata.partition [puppet] - 10https://gerrit.wikimedia.org/r/777874 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[16:45:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage
[16:45:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:13] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:48:47] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[16:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[16:49:02] <joal>	 btullis: heya - would you be nearby?
[16:49:07] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938
[16:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:49:41] <btullis>	 joal: Yes, right here.
[16:50:13] <joal>	 btullis: could you help me figure out the cache setup currently in eqiad?
[16:50:21] <joal>	 btullis: I never know where to look :S
[16:50:28] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938
[16:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:33] <btullis>	 Sure thing. batcave?
[16:50:35] <icinga-wm>	 RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:50:39] <joal>	 btullis: pooled nodes, and cache-types (text or upload)
[16:50:43] <joal>	 please :)
[16:53:37] <icinga-wm>	 RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:54:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P24264 and previous config saved to /var/cache/conftool/dbconfig/20220407-165547-ladsgroup.json
[16:55:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:59] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[16:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[16:56:25] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:59:53] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[17:01:19] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.303 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[17:06:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[17:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye...
[17:06:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[17:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye...
[17:08:50] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-codfw cluster: Reboot kafka nodes
[17:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:09] <logmsgbot>	 !log herron@cumin1001 END (FAIL) - Cookbook sre.kafka.reboot-workers (exit_code=99) for Kafka logging-codfw cluster: Reboot kafka nodes
[17:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:52] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[17:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[17:10:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T305300)', diff saved to https://phabricator.wikimedia.org/P24265 and previous config saved to /var/cache/conftool/dbconfig/20220407-171052-ladsgroup.json
[17:10:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[17:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:55] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[17:10:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[17:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T305300)', diff saved to https://phabricator.wikimedia.org/P24266 and previous config saved to /var/cache/conftool/dbconfig/20220407-171105-ladsgroup.json
[17:11:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:38] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[17:11:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:06] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[17:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T305300)', diff saved to https://phabricator.wikimedia.org/P24267 and previous config saved to /var/cache/conftool/dbconfig/20220407-171211-ladsgroup.json
[17:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1005.eqiad.wmnet with OS bullseye
[17:13:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS...
[17:14:07] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[17:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:32] <wikibugs>	 10ops-codfw, 10Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (10bking)
[17:14:55] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[17:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:35] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[17:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[17:16:32] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[17:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:53] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye
[17:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye
[17:17:14] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[17:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:06] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:26:19] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye
[17:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye
[17:27:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P24268 and previous config saved to /var/cache/conftool/dbconfig/20220407-172719-ladsgroup.json
[17:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:02] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[17:29:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:29:48] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:31:13] <ryankemper>	 !log [WDQS] T293862 Need to do a rolling restart of wdqs public; going to just roll a full deploy since it's equal work
[17:31:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:16] <stashbot>	 T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862
[17:31:26] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.338 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[17:31:36] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[17:31:36] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.110`. Pre-deploy tests passing on canary `wdqs1003`
[17:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:45] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@0d95eca]: 0.3.110
[17:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:30] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.110` on canary `wdqs1003`; proceeding to rest of fleet
[17:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:18] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:34:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:35:07] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[17:35:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[17:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:06] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@0d95eca]: 0.3.110 (duration: 06m 21s)
[17:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:20] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage
[17:38:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson)
[17:39:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:39:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:40:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) 05Open→03Resolved on-site work completed
[17:40:46] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[17:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:50] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[17:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:53] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[17:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[17:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:37] <wikibugs>	 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) >>! In T205870#7838787, @Mvolz wrote: > I looked into a bit ago and didn't make any progress, and I'm not going to be abl...
[17:42:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P24269 and previous config saved to /var/cache/conftool/dbconfig/20220407-174224-ladsgroup.json
[17:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:27] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: host reimage
[17:43:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:48] <icinga-wm>	 PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100%
[17:43:56] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:11] <ryankemper>	 !log T293862 Rolling restart of wdqs public is complete; new jvmquake settings have been uptaken on wdqs public hosts: `-agentpath:/usr/lib/libjvmquake.so=1000,5,0,warn=60,touch=/tmp/wdqs_blazegraph_jvmquake_warn_gc`
[17:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:14] <stashbot>	 T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862
[17:44:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:46:20] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:46:40] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: bugfix remove rsyslog-set log.level from blackbox_exporter events [puppet] - https://gerrit.wikimedia.org/r/777877 (owner: Cwhite)
[17:46:51] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1036.eqiad.wmnet
[17:46:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:37] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[17:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:16] <icinga-wm>	 RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:49:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:50:24] <ryankemper>	 !log T293862 Removed touched files so that it'll be easier to see when the new jvmquake threshold is crossed: `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' "rm -fv '/tmp/wdqs_blazegraph_jvmquake_warn_gc'"`
[17:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:28] <stashbot>	 T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862
[17:50:31] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) resolved: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[17:51:05] <wikibugs>	 SRE, SRE Observability: sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror': - https://phabricator.wikimedia.org/T305652 (herron) p:Triage→Medium
[17:51:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[17:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:50] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson) @fgiunchedi Do you recall how the disks are supposed to be set up and I can fix
[17:51:58] <wikibugs>	 (CR) Jforrester: [C: -1] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) (owner: Jforrester)
[17:52:01] <wikibugs>	 (PS1) Herron: sre.kafka.reboot-workers: add --skip-mirrormaker option [cookbooks] - https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652)
[17:52:13] <wikibugs>	 (PS2) Herron: sre.kafka.reboot-workers: add --skip-mirrormaker option [cookbooks] - https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652)
[17:52:42] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1037.eqiad.wmnet
[17:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:59] <mutante>	 !log rebooting wtp103* servers
[17:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:09] <icinga-wm>	 PROBLEM - Host wtp1036 is DOWN: PING CRITICAL - Packet loss = 100%
[17:54:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:54:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:55:29] <icinga-wm>	 RECOVERY - Host wtp1036 is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms
[17:55:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:56:57] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.3562 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:56:59] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:57:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T305300)', diff saved to https://phabricator.wikimedia.org/P24270 and previous config saved to /var/cache/conftool/dbconfig/20220407-175730-ladsgroup.json
[17:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[17:57:35] <stashbot>	 T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300
[17:57:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[17:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance
[17:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:39] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sda4.mount,srv-swift\x2dstorage-sdb3.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance
[17:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[17:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[17:57:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[17:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye...
[17:58:23] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06849 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:58:25] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:59:30] <herron>	 logstashes were recently restarted, kafka lag should clear in a moment
[17:59:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:59:48] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@0d95eca] (wcqs): Deploy 0.3.110 to WCQS
[17:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:10] <ryankemper>	 !log [WCQS Deploy] Tests look good following deploy of `0.3.110` to `wcqs1003.eqiad.wmnet`, proceeding to rest of fleet
[18:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:55] <cwhite>	 herron: I'm seeing a huge spike in dropped logs.  Looks to me like mediawiki dumped a lot of "Persisting session for unknown reason" logs from centralauth
[18:00:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:01:13] <icinga-wm>	 PROBLEM - MD RAID on ms-be1068 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:01:14] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be1068 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T305653 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:01:19] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T305653 (ops-monitoring-bot)
[18:01:43] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye
[18:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:46] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@0d95eca] (wcqs): Deploy 0.3.110 to WCQS (duration: 01m 58s)
[18:01:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye...
[18:02:00] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1036.eqiad.wmnet
[18:02:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:39] <icinga-wm>	 PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:04:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[18:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:59] <icinga-wm>	 PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:04:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye...
[18:05:05] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:06:45] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:07:31] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1037.eqiad.wmnet
[18:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:57] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1035.eqiad.wmnet
[18:07:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:23] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1034.eqiad.wmnet
[18:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:41] <ryankemper>	 !log [WCQS Deploy] Restarted `wcqs-updater` across all hosts
[18:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:48] <ryankemper>	 !log [WCQS Deploy] Successful test query placed on commons-query.wikimedia.org, there's no relevant criticals in Icinga, and Grafana looks good. WCQS deploy complete
[18:08:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:18] <ryankemper>	 !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good
[18:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:09:55] <icinga-wm>	 PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:12:49] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye
[18:12:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye...
[18:13:09] <icinga-wm>	 PROBLEM - Host wtp1035 is DOWN: PING CRITICAL - Packet loss = 100%
[18:13:19] <icinga-wm>	 RECOVERY - Host wtp1035 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[18:13:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (Cmjohnson)
[18:13:49] <icinga-wm>	 RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (Cmjohnson) Open→Resolved
[18:14:30] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson)
[18:14:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:16:00] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T305653 (Cmjohnson) Open→Invalid this is a re-image error
[18:17:31] <icinga-wm>	 RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:39] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1035.eqiad.wmnet
[18:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:47] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1033.eqiad.wmnet
[18:17:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:13] <icinga-wm>	 PROBLEM - Host wtp1034 is DOWN: PING CRITICAL - Packet loss = 100%
[18:21:48] <icinga-wm>	 RECOVERY - Host wtp1034 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[18:22:28] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1034.eqiad.wmnet
[18:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:36] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1032.eqiad.wmnet
[18:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:56] <icinga-wm>	 PROBLEM - Host wtp1033 is DOWN: PING CRITICAL - Packet loss = 100%
[18:23:56] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:24:32] <icinga-wm>	 RECOVERY - Host wtp1033 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[18:24:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:25:11] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster
[18:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:06] <icinga-wm>	 RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:56] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:12] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1033.eqiad.wmnet
[18:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:19] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1031.eqiad.wmnet
[18:29:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:30:58] <icinga-wm>	 PROBLEM - Host wtp1032 is DOWN: PING CRITICAL - Packet loss = 100%
[18:32:36] <ryankemper>	 !log [Elastic] Pooled `elastic1052` (likely was erroneously left depooled after https://phabricator.wikimedia.org/P19885)
[18:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:30] <icinga-wm>	 RECOVERY - Host wtp1032 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[18:33:37] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:39] <wikibugs>	 SRE, ops-codfw, Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (RKemper)
[18:33:54] <wikibugs>	 SRE, ops-codfw, Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (RKemper) Banned host like so:  ` curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocat...
[18:34:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:34:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (Cmjohnson)
[18:34:53] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1032.eqiad.wmnet
[18:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:02] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1030.eqiad.wmnet
[18:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:48] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:37:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:37:12] <icinga-wm>	 PROBLEM - Host wtp1031 is DOWN: PING CRITICAL - Packet loss = 100%
[18:37:24] <icinga-wm>	 RECOVERY - Host wtp1031 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[18:38:34] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1031.eqiad.wmnet
[18:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:00] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1029.eqiad.wmnet
[18:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:42:09] <wikibugs>	 (PS1) Cmjohnson: Adding new elastic servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/778329 (https://phabricator.wikimedia.org/T299609)
[18:42:16] <icinga-wm>	 RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:16] <icinga-wm>	 RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:28] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1030.eqiad.wmnet
[18:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:35] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1089.mgmt.eqiad.wmnet with reboot policy FORCED
[18:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:58] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1028.eqiad.wmnet
[18:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:20] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1090.mgmt.eqiad.wmnet with reboot policy FORCED
[18:44:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:08] <wikibugs>	 (PS9) Bking: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570)
[18:45:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1091.mgmt.eqiad.wmnet with reboot policy FORCED
[18:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:18] <icinga-wm>	 PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:30] <icinga-wm>	 PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:51] <wikibugs>	 (PS1) Volans: spicerack: simplify profile [puppet] - https://gerrit.wikimedia.org/r/778331
[18:45:53] <wikibugs>	 (PS1) Volans: service::catalog: add Spicerack comments [puppet] - https://gerrit.wikimedia.org/r/778332
[18:45:53] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:45:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1092.mgmt.eqiad.wmnet with reboot policy FORCED
[18:45:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:55] <wikibugs>	 (PS1) Volans: spicerack: install service::catalog configuration [puppet] - https://gerrit.wikimedia.org/r/778333
[18:45:56] <icinga-wm>	 PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] spicerack: simplify profile [puppet] - https://gerrit.wikimedia.org/r/778331 (owner: Volans)
[18:46:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1094.mgmt.eqiad.wmnet with reboot policy FORCED
[18:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] spicerack: install service::catalog configuration [puppet] - https://gerrit.wikimedia.org/r/778333 (owner: Volans)
[18:48:29] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1029.eqiad.wmnet
[18:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:35] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1027.eqiad.wmnet
[18:48:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:24] <icinga-wm>	 RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:50:32] <wikibugs>	 (PS10) Gehel: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[18:50:34] <icinga-wm>	 PROBLEM - Host wtp1028 is DOWN: PING CRITICAL - Packet loss = 100%
[18:50:58] <icinga-wm>	 RECOVERY - Host wtp1028 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[18:51:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:53:16] <wikibugs>	 (PS11) Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[18:53:50] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1028.eqiad.wmnet
[18:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:57:54] <icinga-wm>	 RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:58:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:59:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (JanWMF) Thanks @RhinosF1; timely but no ironclad hard deadline, so we can certainly go proper process here :)
[18:59:14] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1027.eqiad.wmnet
[18:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[19:01:02] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1090.mgmt.eqiad.wmnet with reboot policy FORCED
[19:01:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:54] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1089.mgmt.eqiad.wmnet with reboot policy FORCED
[19:01:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1091.mgmt.eqiad.wmnet with reboot policy FORCED
[19:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1092.mgmt.eqiad.wmnet with reboot policy FORCED
[19:02:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:04] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1094.mgmt.eqiad.wmnet with reboot policy FORCED
[19:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:11] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot - bking@cumin1001 - T304938
[19:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:53] <wikibugs>	 (PS12) Gehel: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[19:03:05] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1095.mgmt.eqiad.wmnet with reboot policy FORCED
[19:03:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:08] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1096.mgmt.eqiad.wmnet with reboot policy FORCED
[19:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:10] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1097.mgmt.eqiad.wmnet with reboot policy FORCED
[19:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1098.mgmt.eqiad.wmnet with reboot policy FORCED
[19:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:15] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1099.mgmt.eqiad.wmnet with reboot policy FORCED
[19:03:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:04:06] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:04:26] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster
[19:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:14] <icinga-wm>	 PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:11:56] <icinga-wm>	 RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:14:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:16:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:16:46] <wikibugs>	 (PS2) Cmjohnson: Adding new elastic servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/778329 (https://phabricator.wikimedia.org/T299609)
[19:17:52] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1095.mgmt.eqiad.wmnet with reboot policy FORCED
[19:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1096.mgmt.eqiad.wmnet with reboot policy FORCED
[19:17:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1097.mgmt.eqiad.wmnet with reboot policy FORCED
[19:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:27] <wikibugs>	 (PS1) Ryan Kemper: elastic: allow waiting for yellow instead of green [cookbooks] - https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570)
[19:18:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:18:56] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED
[19:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1101.mgmt.eqiad.wmnet with reboot policy FORCED
[19:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1102.mgmt.eqiad.wmnet with reboot policy FORCED
[19:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:10] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1098.mgmt.eqiad.wmnet with reboot policy FORCED
[19:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:13] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1099.mgmt.eqiad.wmnet with reboot policy FORCED
[19:19:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (Cmjohnson)
[19:21:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:22:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: allow waiting for yellow instead of green [cookbooks] - https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) (owner: Ryan Kemper)
[19:22:38] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED
[19:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:26] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED
[19:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:56] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:26:07] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1005.eqiad.wmnet
[19:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:26:30] <wikibugs>	 (PS13) Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[19:28:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED
[19:28:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:49] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1005.eqiad.wmnet
[19:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:34] <wikibugs>	 (PS1) Btullis: Configure LDAP authentication for DataHub [deployment-charts] - https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462)
[19:34:33] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1101.mgmt.eqiad.wmnet with reboot policy FORCED
[19:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:36] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1102.mgmt.eqiad.wmnet with reboot policy FORCED
[19:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:38:12] <wikibugs>	 (CR) Btullis: [C: +2] Enable SSL/TLS for accessing the datahub-gms service [deployment-charts] - https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[19:39:22] <wikibugs>	 (CR) Dzahn: [C: +1] postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[19:41:12] <wikibugs>	 (CR) Btullis: "I've added @muelenhoff as a reviewer primarily to sanity-check the jaas-ldap.conf file and general LDAP authentication configuration. Than" [deployment-charts] - https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: Btullis)
[19:42:21] <wikibugs>	 (Merged) jenkins-bot: Enable SSL/TLS for accessing the datahub-gms service [deployment-charts] - https://gerrit.wikimedia.org/r/778308 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[19:42:51] <wikibugs>	 (PS2) Volans: spicerack: simplify profile [puppet] - https://gerrit.wikimedia.org/r/778331
[19:44:06] <wikibugs>	 (PS3) Volans: spicerack: simplify profile [puppet] - https://gerrit.wikimedia.org/r/778331
[19:44:12] <wikibugs>	 (PS2) Volans: service::catalog: add Spicerack comments [puppet] - https://gerrit.wikimedia.org/r/778332
[19:44:16] <wikibugs>	 (PS2) Volans: spicerack: install service::catalog configuration [puppet] - https://gerrit.wikimedia.org/r/778333
[19:44:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] spicerack: simplify profile [puppet] - https://gerrit.wikimedia.org/r/778331 (owner: Volans)
[19:45:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:45:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] spicerack: install service::catalog configuration [puppet] - https://gerrit.wikimedia.org/r/778333 (owner: Volans)
[19:45:48] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1008.eqiad.wmnet
[19:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:06] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[19:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:09] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main
[19:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:33] <wikibugs>	 (PS4) Volans: spicerack: simplify profile [puppet] - https://gerrit.wikimedia.org/r/778331
[19:46:35] <wikibugs>	 (PS3) Volans: service::catalog: add Spicerack comments [puppet] - https://gerrit.wikimedia.org/r/778332
[19:46:37] <wikibugs>	 (PS3) Volans: spicerack: install service::catalog configuration [puppet] - https://gerrit.wikimedia.org/r/778333
[19:47:28] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[19:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main
[19:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:18] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:50:12] <wikibugs>	 (CR) Volans: "PCC results seems to agree on the noop: https://puppet-compiler.wmflabs.org/pcc-worker1003/34740/" [puppet] - https://gerrit.wikimedia.org/r/778331 (owner: Volans)
[19:50:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:51:06] <wikibugs>	 (PS1) Btullis: Bump datahub chart version [deployment-charts] - https://gerrit.wikimedia.org/r/778348
[19:52:26] <wikibugs>	 (PS4) Volans: service::catalog: add Spicerack comments [puppet] - https://gerrit.wikimedia.org/r/778332
[19:52:28] <wikibugs>	 (PS4) Volans: spicerack: install service::catalog configuration [puppet] - https://gerrit.wikimedia.org/r/778333
[19:54:06] <wikibugs>	 (PS2) Btullis: Bump datahub chart version [deployment-charts] - https://gerrit.wikimedia.org/r/778348
[19:54:37] <wikibugs>	 (CR) Volans: "PCC seems to agree that is a noop on the template: https://puppet-compiler.wmflabs.org/pcc-worker1002/34742/" [puppet] - https://gerrit.wikimedia.org/r/778332 (owner: Volans)
[19:55:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:55:40] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:55:58] <wikibugs>	 (CR) Volans: "PCC diff: https://puppet-compiler.wmflabs.org/pcc-worker1001/34743/cumin1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/778333 (owner: Volans)
[19:57:01] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet
[19:57:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:50] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1008.eqiad.wmnet
[19:57:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1009.eqiad.wmnet
[19:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 brennen: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220407T2000).
[20:00:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:02:32] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet
[20:02:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:18] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet
[20:03:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:25] <wikibugs>	 (CR) Btullis: [C: +2] Bump datahub chart version [deployment-charts] - https://gerrit.wikimedia.org/r/778348 (owner: Btullis)
[20:04:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (Cmjohnson) These are racked but the switches are not in netbox yet. I am blocked
[20:08:38] <wikibugs>	 (Merged) jenkins-bot: Bump datahub chart version [deployment-charts] - https://gerrit.wikimedia.org/r/778348 (owner: Btullis)
[20:08:53] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1009.eqiad.wmnet
[20:08:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:24] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1089.eqiad.wmnet with OS bullseye
[20:10:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1089.eqiad....
[20:13:20] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:17:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1090.eqiad.wmnet with OS bullseye
[20:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1090.eqiad....
[20:17:53] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1091.eqiad.wmnet with OS bullseye
[20:17:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1091.eqiad....
[20:18:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1092.eqiad.wmnet with OS bullseye
[20:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1092.eqiad....
[20:21:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (cmooney) @Cmjohnson The switches are in Netbox:  https://netbox.wikimedia.org/dcim/devices/3931/  https://netbox.wikimedia.org/d...
[20:21:44] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1094.eqiad.wmnet with OS bullseye
[20:21:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:53] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1089.eqiad.wmnet with reason: host reimage
[20:21:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1094.eqiad....
[20:23:36] <icinga-wm>	 PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100%
[20:24:37] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[20:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:14] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1089.eqiad.wmnet with reason: host reimage
[20:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:26] <wikibugs>	 (PS1) Cwhite: logstash: reprioritize dlq filter [puppet] - https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088)
[20:25:29] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:26:01] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[20:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: security updates - bking@cumin1001 - T304938
[20:26:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:19] <wikibugs>	 (PS2) Cwhite: logstash: reprioritize dlq filter [puppet] - https://gerrit.wikimedia.org/r/778353 (https://phabricator.wikimedia.org/T305088)
[20:28:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (Cmjohnson) @Jclark-ctr moved the DAC cable to the correct port, these should work now.  I will image shortly
[20:28:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1090.eqiad.wmnet with reason: host reimage
[20:28:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:04] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED
[20:29:04] <wikibugs>	 (CR) Vivian Rook: "If I'm reading Andrew's comment correctly the updated patch should get us potential access to wallaby, but we'll still need to update clou" [puppet] - https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: Vivian Rook)
[20:29:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:11] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1091.eqiad.wmnet with reason: host reimage
[20:29:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:46] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1092.eqiad.wmnet with reason: host reimage
[20:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1090.eqiad.wmnet with reason: host reimage
[20:32:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1094.eqiad.wmnet with reason: host reimage
[20:32:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:47] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (lmata) That is very cool, thanks! Would it be interesting to replicate similar beha...
[20:33:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1089.eqiad.wmnet with OS bullseye
[20:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1089.eqiad.wmne...
[20:34:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1095.eqiad.wmnet with OS bullseye
[20:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:38] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans)
[20:34:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1095.eqiad....
[20:34:48] <wikibugs>	 (03PS1) 10Cwhite: thanos: fix yaml error [puppet] - 10https://gerrit.wikimedia.org/r/778354 (https://phabricator.wikimedia.org/T288726)
[20:35:09] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1092.eqiad.wmnet with reason: host reimage
[20:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:34] <wikibugs>	 (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite)
[20:36:21] <wikibugs>	 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF)
[20:36:27] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/778333 (owner: 10Volans)
[20:36:45] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] thanos: fix yaml error [puppet] - 10https://gerrit.wikimedia.org/r/778354 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite)
[20:37:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1094.eqiad.wmnet with reason: host reimage
[20:37:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:05] <wikibugs>	 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) This was updated, same issue on dumpsdata1007 and sent info to our Dell team.
[20:39:11] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1091.eqiad.wmnet with reason: host reimage
[20:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:55] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS buster
[20:39:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster
[20:40:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1090.eqiad.wmnet with OS bullseye
[20:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1090.eqiad.wmne...
[20:41:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1096.eqiad.wmnet with OS bullseye
[20:41:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1096.eqiad....
[20:42:07] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron)
[20:42:22] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:42:22] <icinga-wm>	 RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[20:42:48] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet
[20:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:52] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1100.mgmt.eqiad.wmnet with reboot policy FORCED
[20:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:58] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570)
[20:44:55] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1092.eqiad.wmnet with OS bullseye
[20:44:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1092.eqiad.wmne...
[20:45:17] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1097.eqiad.wmnet with OS bullseye
[20:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1097.eqiad....
[20:45:53] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1095.eqiad.wmnet with reason: host reimage
[20:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:24] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 91 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:46:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1094.eqiad.wmnet with OS bullseye
[20:46:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS buster
[20:47:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1094.eqiad.wmne...
[20:47:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS buster
[20:48:22] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1098.eqiad.wmnet with OS bullseye
[20:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1098.eqiad....
[20:49:21] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1095.eqiad.wmnet with reason: host reimage
[20:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) Please note the partman part will fail due to the raid controller reordering the disk array numbers and puts SSDs as SDB.  This was failing PXE for m...
[20:51:31] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 58 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:52:01] <icinga-wm>	 PROBLEM - Check systemd state on elastic1062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:52:59] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1096.eqiad.wmnet with reason: host reimage
[20:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS buster
[20:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1091.eqiad.wmnet with OS bullseye
[20:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:09] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS buster
[20:54:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS buster executed with erro...
[20:54:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1091.eqiad.wmne...
[20:54:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster executed with erro...
[20:54:39] <icinga-wm>	 PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:55:36] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1099.eqiad.wmnet with OS bullseye
[20:55:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:39] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1101.eqiad.wmnet with OS bullseye
[20:55:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1099.eqiad....
[20:55:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1101.eqiad....
[20:55:50] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1100.eqiad.wmnet with OS bullseye
[20:55:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1100.eqiad....
[20:56:25] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1096.eqiad.wmnet with reason: host reimage
[20:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1102.eqiad.wmnet with OS bullseye
[20:56:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1097.eqiad.wmnet with reason: host reimage
[20:56:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1102.eqiad....
[20:59:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1095.eqiad.wmnet with OS bullseye
[20:59:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1095.eqiad.wmne...
[20:59:39] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1098.eqiad.wmnet with reason: host reimage
[20:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1097.eqiad.wmnet with reason: host reimage
[21:00:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1098.eqiad.wmnet with reason: host reimage
[21:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1096.eqiad.wmnet with OS bullseye
[21:05:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1096.eqiad.wmne...
[21:06:50] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1099.eqiad.wmnet with reason: host reimage
[21:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:55] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1101.eqiad.wmnet with reason: host reimage
[21:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1100.eqiad.wmnet with reason: host reimage
[21:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1102.eqiad.wmnet with reason: host reimage
[21:07:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:29] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1097.eqiad.wmnet with OS bullseye
[21:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1097.eqiad.wmne...
[21:10:17] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1099.eqiad.wmnet with reason: host reimage
[21:10:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:03] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1102.eqiad.wmnet with reason: host reimage
[21:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:23] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1098.eqiad.wmnet with OS bullseye
[21:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1098.eqiad.wmne...
[21:15:12] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1100.eqiad.wmnet with reason: host reimage
[21:15:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:50] <icinga-wm>	 PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:55] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1101.eqiad.wmnet with reason: host reimage
[21:16:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:16] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) (owner: 10Ryan Kemper)
[21:19:18] <wikibugs>	 (03PS3) 10JHathaway: mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472)
[21:19:50] <icinga-wm>	 RECOVERY - Check systemd state on elastic1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:19:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1099.eqiad.wmnet with OS bullseye
[21:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1099.eqiad.wmne...
[21:20:12] <icinga-wm>	 RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:25] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1102.eqiad.wmnet with OS bullseye
[21:23:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1102.eqiad.wmne...
[21:23:37] <wikibugs>	 (03PS4) 10JHathaway: mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472)
[21:24:40] <icinga-wm>	 PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:24:41] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1100.eqiad.wmnet with OS bullseye
[21:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1100.eqiad.wmne...
[21:26:28] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1101.eqiad.wmnet with OS bullseye
[21:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1101.eqiad.wmne...
[21:27:08] <wikibugs>	 (03PS5) 10JHathaway: mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472)
[21:28:25] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34747/console" [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway)
[21:29:59] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1 C: 03+2] mx: test rejecting email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway)
[21:30:30] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1 C: 03+2] "pcc looks correct, https://puppet-compiler.wmflabs.org/pcc-worker1001/34747/" [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway)
[21:30:54] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1 C: 03+2] mx: test rejecting email to legacy mailing list domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway)
[21:32:54] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:10] <icinga-wm>	 RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:40:42] <icinga-wm>	 RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:00] <icinga-wm>	 RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:06] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:26] <icinga-wm>	 PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:20] <icinga-wm>	 PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:50] <wikibugs>	 (03CR) 10Volans: "addressed comment" [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans)
[21:46:52] <wikibugs>	 (03PS5) 10Volans: spicerack: simplify profile [puppet] - 10https://gerrit.wikimedia.org/r/778331
[21:46:54] <wikibugs>	 (03PS5) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332
[21:46:56] <wikibugs>	 (03PS5) 10Volans: spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333
[21:47:28] <icinga-wm>	 RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:54:12] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:54:32] <icinga-wm>	 RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:55:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Cmjohnson)
[21:56:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Cmjohnson) 05Open→03Resolved on-site work has been completed
[21:57:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10cmooney)
[21:57:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney)
[21:58:58] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[22:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:05:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:42] <icinga-wm>	 RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:51] <wikibugs>	 (03PS2) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462)
[22:14:02] <icinga-wm>	 PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:58] <icinga-wm>	 PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:15:48] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:25] <wikibugs>	 (03PS3) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462)
[22:17:00] <Reedy>	 jouncebot: nowandnext
[22:17:00] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 42 minute(s)
[22:17:00] <jouncebot>	 In 8 hour(s) and 42 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220408T0700)
[22:25:04] <icinga-wm>	 PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:56] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:10] <icinga-wm>	 PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:40] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:04] <icinga-wm>	 PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:20] <icinga-wm>	 RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:54] <icinga-wm>	 PROBLEM - Check systemd state on elastic1081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:40:04] <icinga-wm>	 PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:40:28] <icinga-wm>	 RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:30] <icinga-wm>	 RECOVERY - Check systemd state on elastic1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:43:48] <icinga-wm>	 RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:52] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:52] <icinga-wm>	 RECOVERY - Check systemd state on elastic1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:55:42] <icinga-wm>	 RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:05] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (lmata)
[22:57:04] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:22] <wikibugs>	 SRE, Observability-Metrics: Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (lmata)
[23:00:53] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q4), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (lmata)
[23:01:12] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q4), Infrastructure-Foundations (FY2021/2022-Q4), SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (lmata)
[23:03:03] <wikibugs>	 SRE, Observability-Logging, SRE Observability (FY2021/2022-Q4): apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (lmata)
[23:05:44] <icinga-wm>	 PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:46] <icinga-wm>	 PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:46] <icinga-wm>	 PROBLEM - Check systemd state on elastic1083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:09] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Dzahn)
[23:07:38] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Dzahn) added checkboxes, checked those that already resolve meanwhile
[23:10:21] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Dzahn)
[23:14:44] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:48] <wikibugs>	 SRE, Analytics-Radar, Traffic-Icebox, User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (Dzahn) @BBlack This sounds like a duplicate of T303464 (and/or /T302864) to me. Maybe you can just merge it.
[23:16:04] <icinga-wm>	 RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:19:02] <icinga-wm>	 RECOVERY - Check systemd state on elastic1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:30] <icinga-wm>	 RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:24:52] <icinga-wm>	 PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:25:58] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:27:02] <icinga-wm>	 RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 41, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:28:25] <wikibugs>	 (PS1) Dzahn: phabricator: allow disabling ssh-phab service except on one host [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022)
[23:30:30] <wikibugs>	 (CR) Dzahn: "we also don't want to apply the "interface::alias" from profile::phabricator::main but that only happens if $vcs_ip_v4 or $vcs_ip_v6 are s" [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[23:32:32] <icinga-wm>	 RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:35:19] <wikibugs>	 (CR) Dzahn: "compiling PS1 shows how it's different between phab1001 and phab2001. in PS2 phab1001 and phab2001 will be the same, point being on phab10" [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[23:37:08] <wikibugs>	 (PS2) Dzahn: phabricator: allow disabling ssh-phab service except on one host [puppet] - https://gerrit.wikimedia.org/r/778366 (https://phabricator.wikimedia.org/T296022)
[23:38:04] <icinga-wm>	 PROBLEM - Check systemd state on elastic1072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:54] <icinga-wm>	 PROBLEM - Check systemd state on elastic1070 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:58] <icinga-wm>	 PROBLEM - Check systemd state on elastic1073 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:41:08] <icinga-wm>	 RECOVERY - Check systemd state on elastic1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:10] <icinga-wm>	 PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:52] <icinga-wm>	 RECOVERY - Check systemd state on elastic1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:55:06] <icinga-wm>	 RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:55:56] <icinga-wm>	 RECOVERY - Check systemd state on elastic1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state