Wikimedia IRC logs browser

2022-06-29 00:00:24	<mutante>	same here, going afk. need to drive
2022-06-29 00:01:45	<wikibugs>	(CR) Krinkle: "Note to self: Confirm with Joe that using this directly cross-dc is fine when we're multi-dc, incl w.r.t. gutter pool, and w.r.t. hashing " [mediawiki-config] - https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) (owner: Krinkle)
2022-06-29 00:05:02	<icinga-wm>	RECOVERY - Disk space on labweb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=labweb1001&var-datasource=eqiad+prometheus/ops
2022-06-29 00:05:14	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 00:06:48	<icinga-wm>	RECOVERY - Disk space on labweb1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=labweb1002&var-datasource=eqiad+prometheus/ops
2022-06-29 00:10:10	<icinga-wm>	PROBLEM - Check systemd state on es2033 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 00:17:59	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS bullseye
2022-06-29 00:18:03	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 00:18:05	<wikibugs>	SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudgw20...
2022-06-29 00:19:33	<wikibugs>	SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (Papaul)
2022-06-29 00:20:45	<wikibugs>	SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (Papaul) Open→Resolved @Andrew thanks for getting me the partman recipe info. This is complete.
2022-06-29 00:28:48	<icinga-wm>	PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-bscarone-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 00:33:20	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 00:41:37	<wikibugs>	SRE, ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (wiki_willy) a:AndrewBonamici→aborrero
2022-06-29 00:50:26	<wikibugs>	(CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36106/console"; [puppet] - https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: Ssingh)
2022-06-29 00:55:25	<wikibugs>	(PS3) Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574)
2022-06-29 00:56:02	<wikibugs>	(CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36107/console"; [puppet] - https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: Ssingh)
2022-06-29 01:00:02	<wikibugs>	(CR) Ssingh: [V: +1] "Not sure why PCC doesn't show the changed file for modules/bird/files/prometheus-bird-exporter.default, but well, that's the latest change" [puppet] - https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: Ssingh)
2022-06-29 01:05:48	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 01:14:00	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 01:14:28	<wikibugs>	(PS1) Ssingh: admin: allow sudo for jclark-ctr for cookbooks [puppet] - https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654)
2022-06-29 01:34:18	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 01:34:22	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 01:41:30	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 01:41:58	<icinga-wm>	PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 02:01:48	<icinga-wm>	PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 110 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
2022-06-29 02:01:58	<icinga-wm>	RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 02:05:52	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 02:07:02	<icinga-wm>	RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 69 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
2022-06-29 02:14:17	<jinxer-wm>	(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
2022-06-29 02:29:24	<icinga-wm>	PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 116 probes of 681 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
2022-06-29 02:32:41	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 02:33:13	<icinga-wm>	PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 115 probes of 672 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
2022-06-29 02:38:11	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 02:38:27	<icinga-wm>	RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 90 probes of 672 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
2022-06-29 02:42:21	<icinga-wm>	RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 02:51:07	<icinga-wm>	RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 57 probes of 681 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
2022-06-29 03:05:55	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 03:20:13	<icinga-wm>	PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
2022-06-29 03:24:43	<icinga-wm>	RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.097 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
2022-06-29 03:32:41	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 03:37:21	<wikibugs>	SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
2022-06-29 03:37:33	<wikibugs>	(CR) Andrea Denisse: [C: +2] loki: add loki as an optional grafana component [puppet] - https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
2022-06-29 03:41:53	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 04:31:30	<wikibugs>	(CR) Tim Starling: [C: +2] Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
2022-06-29 04:32:21	<wikibugs>	(Merged) jenkins-bot: Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
2022-06-29 04:32:26	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 04:36:12	<logmsgbot>	!log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648
2022-06-29 04:36:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 04:36:19	<stashbot>	T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
2022-06-29 04:37:25	<logmsgbot>	!log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgCentralAuthTokenCacheType -> mcrouter T278392 (duration: 03m 44s)
2022-06-29 04:37:29	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 04:37:30	<stashbot>	T278392: Storage solution for cross-datacenter tokens - https://phabricator.wikimedia.org/T278392
2022-06-29 04:37:54	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 04:39:33	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 04:39:35	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 04:40:33	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 04:40:34	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 04:40:36	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 04:40:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 04:44:14	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 04:44:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 05:17:18	<wikibugs>	SRE, Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG) Heyy thanks so much for all the work on this!!! just a few notes here from a super uninformed perspective, just on the off chance they might be useful... In...
2022-06-29 05:18:37	<wikibugs>	SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) Excellent, can you let me know a day and time that works for you to replace it? I can leave the host offline for you
2022-06-29 05:33:34	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 05:34:00	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 05:35:30	<icinga-wm>	RECOVERY - MariaDB read only es2 on es2033 is OK: Version 10.4.25-MariaDB-log, Uptime 38s, read_only: True, event_scheduler: True, 10.97 QPS, connection latency: 0.004945s, query latency: 0.068268s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
2022-06-29 05:36:28	<icinga-wm>	RECOVERY - mysqld processes on es2033 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
2022-06-29 05:40:40	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 05:46:41	<wikibugs>	SRE, ops-codfw, DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (Marostegui) Started a data check run
2022-06-29 05:56:19	<logmsgbot>	!log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648
2022-06-29 05:56:26	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 05:56:27	<stashbot>	T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
2022-06-29 05:59:46	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 06:01:07	<wikibugs>	SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) Thanks for those graphs Amir! Let me know today once you are around, I want to repeat the test on db1132 (10.6) changin...
2022-06-29 06:02:36	<logmsgbot>	!log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648
2022-06-29 06:02:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 06:02:42	<stashbot>	T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
2022-06-29 06:04:11	<logmsgbot>	!log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648
2022-06-29 06:04:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 06:05:41	<wikibugs>	(PS1) Marostegui: db21[53-74].yaml: Add files [puppet] - https://gerrit.wikimedia.org/r/809479 (https://phabricator.wikimedia.org/T311493)
2022-06-29 06:06:40	<wikibugs>	(CR) Marostegui: [C: +2] db21[53-74].yaml: Add files [puppet] - https://gerrit.wikimedia.org/r/809479 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
2022-06-29 06:09:50	<wikibugs>	(PS1) Marostegui: db21[53-74].yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/809480 (https://phabricator.wikimedia.org/T311493)
2022-06-29 06:10:33	<wikibugs>	(CR) Marostegui: [C: +2] db21[53-74].yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/809480 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
2022-06-29 06:12:57	<wikibugs>	SRE, RESTBase-API, Traffic, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Mitar) Hm, I am pretty sure that I am doing rate limiting correctly on my side, but I am hitting 429s after a brief time when trying to do 1000/10s rate limit to...
2022-06-29 06:14:36	<jinxer-wm>	(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
2022-06-29 06:29:00	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 06:33:42	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 06:40:22	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 06:46:24	<wikibugs>	(CR) Slyngshede: [C: +1] "Looks good. Thank you" [puppet] - https://gerrit.wikimedia.org/r/809181 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
2022-06-29 06:46:27	<wikibugs>	(CR) Slyngshede: [C: +2] logster: remove absented logster- cron [puppet] - https://gerrit.wikimedia.org/r/809181 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
2022-06-29 06:46:55	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30597 and previous config saved to /var/cache/conftool/dbconfig/20220629-064655-root.json
2022-06-29 06:47:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 06:54:01	<wikibugs>	(CR) Filippo Giunchedi: [C: +1] "LGTM! Nicely done, see inline for two non-blocking nits" [puppet] - https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
2022-06-29 06:57:07	<wikibugs>	ops-codfw, decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (Marostegui)
2022-06-29 06:57:30	<wikibugs>	ops-codfw, decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (Marostegui)
2022-06-29 06:57:42	<wikibugs>	SRE, Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (fgiunchedi) Keeping the instances SGTM @BCornwall, thanks for looking into it. Personally I'd recommend starting afresh with a Pontoon stack (i.e. keep...
2022-06-29 06:57:47	<wikibugs>	(CR) Ayounsi: "No idea about prometheus-bird-exporter.default neither." [puppet] - https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: Ssingh)
2022-06-29 06:58:04	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2071 T311589', diff saved to https://phabricator.wikimedia.org/P30598 and previous config saved to /var/cache/conftool/dbconfig/20220629-065804-root.json
2022-06-29 06:58:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 06:58:10	<stashbot>	T311589: decommission db2071 - https://phabricator.wikimedia.org/T311589
2022-06-29 07:00:05	<jouncebot>	Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T0700).
2022-06-29 07:00:05	<jouncebot>	No Gerrit patches in the queue for this window AFAICS.
2022-06-29 07:04:07	<wikibugs>	ops-codfw, decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (Marostegui)
2022-06-29 07:04:34	<wikibugs>	(PS1) Marostegui: mariadb: Decommission db2071 [puppet] - https://gerrit.wikimedia.org/r/809526 (https://phabricator.wikimedia.org/T311589)
2022-06-29 07:05:40	<XioNoX>	!log re-enabled bgp to telia in eqsin
2022-06-29 07:05:44	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:06:15	<logmsgbot>	!log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2071.codfw.wmnet
2022-06-29 07:06:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:08:32	<wikibugs>	(CR) Marostegui: [C: +2] mariadb: Decommission db2071 [puppet] - https://gerrit.wikimedia.org/r/809526 (https://phabricator.wikimedia.org/T311589) (owner: Marostegui)
2022-06-29 07:08:57	<wikibugs>	ops-codfw, decommission-hardware, Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (Marostegui)
2022-06-29 07:10:42	<logmsgbot>	!log marostegui@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 07:10:45	<wikibugs>	ops-codfw, decommission-hardware, Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (Marostegui)
2022-06-29 07:10:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:14:37	<logmsgbot>	!log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 07:14:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:16:17	<wikibugs>	(CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: Ssingh)
2022-06-29 07:17:47	<logmsgbot>	!log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2071.codfw.wmnet
2022-06-29 07:17:51	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:17:51	<wikibugs>	ops-codfw, decommission-hardware, Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2071.codfw.wmnet` - db2071.codfw.wmnet (FAIL) - Downtimed host on Icinga/Alert...
2022-06-29 07:18:48	<wikibugs>	ops-codfw, decommission-hardware, Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (Marostegui) a:Papaul
2022-06-29 07:19:27	<wikibugs>	ops-codfw, decommission-hardware, Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (Marostegui) @Papaul this is ready for you. Please note the failure above, make sure to wipe the disks yourself.
2022-06-29 07:20:46	<wikibugs>	(PS1) Ayounsi: Revert "eqsin: disable Telia transit" [homer/public] - https://gerrit.wikimedia.org/r/809353
2022-06-29 07:22:51	<wikibugs>	SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (ayounsi) Looks like there are no more errors. @robh could you check it one last time before replying to Telia? `cr3-eqsin> show interfaces xe-0/1/1 extensive \| match error [...] # Everything should show 0, es...
2022-06-29 07:24:42	<logmsgbot>	!log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts webperf1002.eqiad.wmnet
2022-06-29 07:24:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:25:52	<icinga-wm>	PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
2022-06-29 07:27:04	<icinga-wm>	PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
2022-06-29 07:27:54	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2071 from dbctl', diff saved to https://phabricator.wikimedia.org/P30600 and previous config saved to /var/cache/conftool/dbconfig/20220629-072753-marostegui.json
2022-06-29 07:27:58	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:30:01	<logmsgbot>	!log jmm@cumin2002 START - Cookbook sre.dns.netbox
2022-06-29 07:30:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:30:55	<icinga-wm>	RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
2022-06-29 07:32:07	<icinga-wm>	RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
2022-06-29 07:34:10	<logmsgbot>	!log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 07:34:11	<logmsgbot>	!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts webperf1002.eqiad.wmnet
2022-06-29 07:34:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:34:18	<wikibugs>	SRE, Performance-Team, Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `webperf1002.eqiad.wmnet` - webperf1002.eqiad.wmnet (PASS) - Downtimed host on Icinga...
2022-06-29 07:34:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:34:39	<wikibugs>	ops-codfw, decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (Marostegui)
2022-06-29 07:35:08	<wikibugs>	ops-codfw, decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (Marostegui)
2022-06-29 07:35:59	<wikibugs>	SRE, Performance-Team, Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (MoritzMuehlenhoff)
2022-06-29 07:37:23	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2075 T311591', diff saved to https://phabricator.wikimedia.org/P30601 and previous config saved to /var/cache/conftool/dbconfig/20220629-073722-root.json
2022-06-29 07:37:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:37:29	<stashbot>	T311591: decommission db2075 - https://phabricator.wikimedia.org/T311591
2022-06-29 07:37:58	<wikibugs>	(PS4) Urbanecm: GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: Kosta Harlan)
2022-06-29 07:38:01	<wikibugs>	(CR) Urbanecm: [C: +2] GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: Kosta Harlan)
2022-06-29 07:38:11	<wikibugs>	(PS3) Urbanecm: GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: Kosta Harlan)
2022-06-29 07:38:13	<wikibugs>	(CR) Urbanecm: [C: +2] GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: Kosta Harlan)
2022-06-29 07:38:16	<wikibugs>	(PS1) Marostegui: instances.yaml: Remove db2075 from dbctl [puppet] - https://gerrit.wikimedia.org/r/809528 (https://phabricator.wikimedia.org/T311591)
2022-06-29 07:38:56	<wikibugs>	(CR) Marostegui: [C: +2] instances.yaml: Remove db2075 from dbctl [puppet] - https://gerrit.wikimedia.org/r/809528 (https://phabricator.wikimedia.org/T311591) (owner: Marostegui)
2022-06-29 07:39:19	<wikibugs>	(Merged) jenkins-bot: GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: Kosta Harlan)
2022-06-29 07:39:19	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2075 from dbctl T311591', diff saved to https://phabricator.wikimedia.org/P30602 and previous config saved to /var/cache/conftool/dbconfig/20220629-073919-root.json
2022-06-29 07:39:23	<wikibugs>	(Merged) jenkins-bot: GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: Kosta Harlan)
2022-06-29 07:39:25	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:40:10	<wikibugs>	(CR) Ayounsi: [C: +2] Revert "eqsin: disable Telia transit" [homer/public] - https://gerrit.wikimedia.org/r/809353 (owner: Ayounsi)
2022-06-29 07:40:25	<marostegui>	!log dbmaint s1@codfw T311475
2022-06-29 07:40:26	<marostegui>	!log dbmaint s@codfw T311475
2022-06-29 07:40:29	<marostegui>	!log dbmaint s5@codfw T311475
2022-06-29 07:40:30	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:40:30	<stashbot>	T311475: Decommission db[2071-2092] - https://phabricator.wikimedia.org/T311475
2022-06-29 07:40:34	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:40:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:41:00	<wikibugs>	('Merged) ''jenkins-bot: Revert "eqsin: disable Telia transit" [homer/public] - ''https://gerrit.wikimedia.org/r/809353 (owner: ''Ayounsi)'
2022-06-29 07:43:08	<wikibugs>	('PS1) ''Marostegui: mariadb: Remove db2075 from puppet [puppet] - ''https://gerrit.wikimedia.org/r/809531 (https://phabricator.wikimedia.org/T311591)'
2022-06-29 07:43:15	<logmsgbot>	!log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2075.codfw.wmnet
2022-06-29 07:43:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:44:12	<wikibugs>	'ops-codfw, ''decommission-hardware, ''Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Marostegui)'
2022-06-29 07:44:36	<wikibugs>	'SRE, ''Performance-Team, ''Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (''MoritzMuehlenhoff) ''Open→''Resolved This is complete'
2022-06-29 07:45:25	<logmsgbot>	!log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 143c3fd: d5afd97: Remove unused GEHomepageSuggestedEditsRequiresOptIn and GEHomepageSuggestedEditsTopicsRequiresOptIn (T308209, T308208) (duration: 03m 22s)
2022-06-29 07:45:32	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:45:32	<stashbot>	T308209: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn - https://phabricator.wikimedia.org/T308209
2022-06-29 07:45:33	<stashbot>	T308208: Remove GEHomepageSuggestedEditsRequiresOptIn - https://phabricator.wikimedia.org/T308208
2022-06-29 07:46:12	<wikibugs>	('PS2) ''Urbanecm: Remove wgGEMentorDashboardBetaMode [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808263'
2022-06-29 07:46:12	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 07:46:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:46:17	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] Remove wgGEMentorDashboardBetaMode [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808263 (owner: ''Urbanecm)'
2022-06-29 07:46:29	<wikibugs>	('PS2) ''Urbanecm: [beta] Remove wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808264'
2022-06-29 07:46:32	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] [beta] Remove wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808264 (owner: ''Urbanecm)'
2022-06-29 07:46:57	<logmsgbot>	!log marostegui@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 07:47:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:47:03	<wikibugs>	('Merged) ''jenkins-bot: Remove wgGEMentorDashboardBetaMode [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808263 (owner: ''Urbanecm)'
2022-06-29 07:47:11	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 07:47:12	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 07:47:15	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:47:18	<wikibugs>	('Merged) ''jenkins-bot: [beta] Remove wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808264 (owner: ''Urbanecm)'
2022-06-29 07:47:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:47:47	<wikibugs>	('CR) ''Urbanecm: [C: ''+1] "lgtm" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: ''MewOphaswongse)'
2022-06-29 07:48:11	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 07:48:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:48:26	<wikibugs>	('PS2) ''Urbanecm: Add GEMentorProvider to configuration [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808267 (https://phabricator.wikimedia.org/T310905)'
2022-06-29 07:48:30	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] Add GEMentorProvider to configuration [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808267 (https://phabricator.wikimedia.org/T310905) (owner: ''Urbanecm)'
2022-06-29 07:48:36	<wikibugs>	('PS2) ''Urbanecm: [beta] Growth: Enable structured mentor list at cswiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905)'
2022-06-29 07:48:42	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] [beta] Growth: Enable structured mentor list at cswiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905) (owner: ''Urbanecm)'
2022-06-29 07:49:18	<wikibugs>	('Merged) ''jenkins-bot: Add GEMentorProvider to configuration [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808267 (https://phabricator.wikimedia.org/T310905) (owner: ''Urbanecm)'
2022-06-29 07:50:47	<wikibugs>	('CR) ''Marostegui: [C: ''+2] mariadb: Remove db2075 from puppet [puppet] - ''https://gerrit.wikimedia.org/r/809531 (https://phabricator.wikimedia.org/T311591) (owner: ''Marostegui)'
2022-06-29 07:50:57	<logmsgbot>	!log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 07:51:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:51:18	<wikibugs>	'ops-codfw, ''decommission-hardware, ''Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Marostegui)'
2022-06-29 07:51:47	<logmsgbot>	!log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1d1b9cf: Remove wgGEMentorDashboardBetaMode (duration: 03m 34s)
2022-06-29 07:51:51	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:53:13	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 07:53:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:54:06	<logmsgbot>	!log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2075.codfw.wmnet
2022-06-29 07:54:09	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 07:54:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:54:11	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 07:54:11	<logmsgbot>	!log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
2022-06-29 07:54:11	<wikibugs>	'ops-codfw, ''decommission-hardware, ''Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2075.codfw.wmnet` - db2075.codfw.wmnet (FAIL) - Downtimed host on Icinga/Alert...'
2022-06-29 07:54:14	<logmsgbot>	!log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
2022-06-29 07:54:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:54:17	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:54:21	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:54:25	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:54:26	<wikibugs>	'ops-codfw, ''decommission-hardware, ''Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Marostegui)'
2022-06-29 07:54:57	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 07:55:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:55:24	<wikibugs>	'ops-codfw, ''decommission-hardware, ''Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Marostegui) a:''Papaul @Papaul this is ready for you. Please note the failure above, make sure to wipe the disks yourself.'
2022-06-29 07:55:50	<logmsgbot>	!log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5a583804: Add GEMentorProvider to configuration (T310905) (duration: 03m 40s)
2022-06-29 07:55:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 07:55:56	<stashbot>	T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905
2022-06-29 07:59:04	<urbanecm>	done
2022-06-29 08:00:05	<jouncebot>	dduvall and hashar: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T0800).
2022-06-29 08:00:06	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 08:00:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:00:56	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 08:00:58	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 08:01:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:01:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:01:46	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 08:01:49	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:03:28	<wikibugs>	('PS1) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 08:06:04	<wikibugs>	('CR) ''CI reject: [V: ''-1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 08:10:20	<wikibugs>	('PS2) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 08:12:11	<wikibugs>	'SRE, ''Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (''MoritzMuehlenhoff) Can't we just import the Cassandra 4 debs and use those? The work needs to happen at some point anyway and it's a fresh cluster. Buster is almost three years old, going into LT...'
2022-06-29 08:13:39	<wikibugs>	('PS1) ''Elukey: role::ml_k8s::worker::staging: add calico-cni config [puppet] - ''https://gerrit.wikimedia.org/r/809534 (https://phabricator.wikimedia.org/T302195)'
2022-06-29 08:14:54	<wikibugs>	('CR) ''Elukey: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36110/console" [puppet] - ''https://gerrit.wikimedia.org/r/809534 (https://phabricator.wikimedia.org/T302195) (owner: ''Elukey)'
2022-06-29 08:15:21	<wikibugs>	('CR) ''Elukey: [V: ''+1 C: ''+2] role::ml_k8s::worker::staging: add calico-cni config [puppet] - ''https://gerrit.wikimedia.org/r/809534 (https://phabricator.wikimedia.org/T302195) (owner: ''Elukey)'
2022-06-29 08:18:43	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36109/console" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 08:25:27	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, ''Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''BTullis) I'm going to try updating the RAID controller firmware, then the BIOS on stat1010, to see if either of these fixes the drive ordering issue....'
2022-06-29 08:30:11	<wikibugs>	'SRE-tools, ''Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (''Marostegui)'
2022-06-29 08:31:15	<wikibugs>	('PS1) ''Filippo Giunchedi: prometheus: add initial blackbox dns probes for wikipedia [puppet] - ''https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860)'
2022-06-29 08:31:17	<wikibugs>	('PS1) ''Filippo Giunchedi: prometheus: probe DNS for (www).wikipedia.org [puppet] - ''https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860)'
2022-06-29 08:31:37	<logmsgbot>	!log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
2022-06-29 08:31:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:32:08	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 08:32:12	<wikibugs>	('CR) ''CI reject: [V: ''-1] prometheus: probe DNS for (www).wikipedia.org [puppet] - ''https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) (owner: ''Filippo Giunchedi)'
2022-06-29 08:39:15	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 08:39:47	<wikibugs>	'SRE-tools, ''Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (''MoritzMuehlenhoff) Maybe we need to run "swapoff -a" prior to the wipefs call?'
2022-06-29 08:43:10	<logmsgbot>	!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
2022-06-29 08:43:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:47:42	<logmsgbot>	!log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on idp-test1002.wikimedia.org with reason: webauthn tests
2022-06-29 08:47:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:48:08	<logmsgbot>	!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on idp-test1002.wikimedia.org with reason: webauthn tests
2022-06-29 08:48:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 08:51:06	<wikibugs>	'SRE-tools, ''Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (''Marostegui) I have a few more hosts to decommission. I can try to do so, but we'd not know whether it helped or it would have just worked without it too :) Up t...'
2022-06-29 08:55:09	<wikibugs>	'SRE, ''ops-eqiad, ''DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (''Jclark-ctr) I can do it today if you can offline it'
2022-06-29 08:58:29	<wikibugs>	'SRE, ''ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (''Jclark-ctr) a:''aborrero→''Andrew'
2022-06-29 08:58:54	<wikibugs>	('PS3) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 08:59:46	<wikibugs>	('CR) ''CI reject: [V: ''-1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 09:00:57	<wikibugs>	('PS4) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 09:01:21	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1173 for on-site maintenance T310595', diff saved to https://phabricator.wikimedia.org/P30603 and previous config saved to /var/cache/conftool/dbconfig/20220629-090120-root.json
2022-06-29 09:01:26	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 09:01:27	<stashbot>	T310595: db1173 won't boot up - https://phabricator.wikimedia.org/T310595
2022-06-29 09:02:03	<wikibugs>	('PS1) ''Marostegui: db1173: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/809540 (https://phabricator.wikimedia.org/T310595)'
2022-06-29 09:03:02	<wikibugs>	('CR) ''Marostegui: [C: ''+2] db1173: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/809540 (https://phabricator.wikimedia.org/T310595) (owner: ''Marostegui)'
2022-06-29 09:03:05	<wikibugs>	'SRE, ''ops-eqiad, ''DBA, ''Patch-For-Review: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (''Marostegui) @Jclark-ctr host offline, you can proceed whenever you want. Once you are done, please power it back on and I will take it from there. Thanks a lot!'
2022-06-29 09:03:36	<wikibugs>	'SRE-tools, ''Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (''MoritzMuehlenhoff) >>! In T311593#8036155, @Marostegui wrote: > I have a few more hosts to decommission. I can try to do so, but we'd not know whether it helped...'
2022-06-29 09:08:53	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36111/console" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 09:09:55	<wikibugs>	('PS7) ''Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - ''https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)'
2022-06-29 09:10:32	<wikibugs>	('CR) ''CI reject: [V: ''-1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - ''https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: ''Vlad.shapik)'
2022-06-29 09:14:58	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36113/console" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 09:22:13	<wikibugs>	('PS5) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 09:23:43	<wikibugs>	'SRE-tools, ''Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (''Marostegui) Wilco!'
2022-06-29 09:24:54	<wikibugs>	('CR) ''CI reject: [V: ''-1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 09:27:51	<wikibugs>	('CR) ''Vgutierrez: [C: ''+1] trafficserver: 9.x upgrade: separate metric current_client_connections [puppet] - ''https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 09:29:45	<wikibugs>	('CR) ''Vgutierrez: [C: ''+1] trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - ''https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 09:33:29	<wikibugs>	('CR) ''Vgutierrez: [C: ''+1] trafficserver: 9.x upgrade: remove wmf-tls log format [puppet] - ''https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 09:34:15	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 09:34:21	<wikibugs>	('PS1) ''David Caro: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - ''https://gerrit.wikimedia.org/r/809543'
2022-06-29 09:34:56	<wikibugs>	('PS6) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 09:37:34	<wikibugs>	('CR) ''CI reject: [V: ''-1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 09:38:28	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 with some weight to get it warmed up', diff saved to https://phabricator.wikimedia.org/P30605 and previous config saved to /var/cache/conftool/dbconfig/20220629-093826-root.json
2022-06-29 09:38:33	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 09:41:01	<wikibugs>	('CR) ''CI reject: [V: ''-1] wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - ''https://gerrit.wikimedia.org/r/809543 (owner: ''David Caro)'
2022-06-29 09:41:21	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 09:42:23	<wikibugs>	('PS7) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 09:51:54	<wikibugs>	('PS8) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 09:53:12	<logmsgbot>	!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
2022-06-29 09:53:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 09:53:37	<logmsgbot>	!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
2022-06-29 09:53:38	<logmsgbot>	!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance
2022-06-29 09:53:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 09:53:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 09:53:58	<logmsgbot>	!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance
2022-06-29 09:54:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 09:59:12	<wikibugs>	('PS9) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 10:00:13	<wikibugs>	('PS1) ''Kosta Harlan: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467)'
2022-06-29 10:00:26	<wikibugs>	('PS1) ''Kosta Harlan: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809550 (https://phabricator.wikimedia.org/T311467)'
2022-06-29 10:03:12	<logmsgbot>	!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
2022-06-29 10:03:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 10:03:35	<icinga-wm>	PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 10:03:36	<logmsgbot>	!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
2022-06-29 10:03:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 10:03:42	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T309311)', diff saved to https://phabricator.wikimedia.org/P30606 and previous config saved to /var/cache/conftool/dbconfig/20220629-100341-ladsgroup.json
2022-06-29 10:03:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 10:03:46	<stashbot>	T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
2022-06-29 10:04:08	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36116/console" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 10:08:02	<wikibugs>	('PS10) ''Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533'
2022-06-29 10:12:56	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36117/console" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 10:14:36	<jinxer-wm>	(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
2022-06-29 10:16:55	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T309311)', diff saved to https://phabricator.wikimedia.org/P30607 and previous config saved to /var/cache/conftool/dbconfig/20220629-101655-ladsgroup.json
2022-06-29 10:16:59	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 10:17:00	<stashbot>	T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
2022-06-29 10:26:49	<wikibugs>	('CR) ''Klausman: [C: ''+1] "Overall, LGTM with two small bits." [deployment-charts] - ''https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: ''Hnowlan)'
2022-06-29 10:28:42	<wikibugs>	'SRE, ''API Platform, ''Traffic, ''VisualEditor, and 2 others: Find out if Varnish is messing with ETags, and what to do about it. - https://phabricator.wikimedia.org/T310904 (''daniel) p:''Triage→''Medium'
2022-06-29 10:32:00	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P30608 and previous config saved to /var/cache/conftool/dbconfig/20220629-103200-ladsgroup.json
2022-06-29 10:32:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 10:33:33	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 10:40:43	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 10:44:42	<wikibugs>	'Puppet, ''Beta-Cluster-Infrastructure, ''Infrastructure-Foundations, ''Product-Infrastructure-Team-Backlog, ''VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mob... - https://phabricator.wikimedia.org/T259812'
2022-06-29 10:45:49	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 10:47:05	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P30610 and previous config saved to /var/cache/conftool/dbconfig/20220629-104705-ladsgroup.json
2022-06-29 10:47:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 10:48:46	<wikibugs>	('PS4) ''Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - ''https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574)'
2022-06-29 10:48:52	<wikibugs>	('CR) ''Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: ''Ssingh)'
2022-06-29 10:49:01	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36118/console" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (owner: ''Slyngshede)'
2022-06-29 10:49:29	<wikibugs>	('CR) ''Ssingh: [V: ''+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36119/console" [puppet] - ''https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: ''Ssingh)'
2022-06-29 10:51:50	<wikibugs>	('PS11) ''Slyngshede: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288)'
2022-06-29 10:53:43	<wikibugs>	('CR) ''Ssingh: "[Commenting to indicate that this needs backward compatibility]" [puppet] - ''https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 10:53:46	<wikibugs>	('CR) ''Ssingh: "[Commenting to indicate that this needs backward compatibility]" [puppet] - ''https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 10:56:18	<wikibugs>	('CR) ''Slyngshede: "Not sure if this is the right way to go about adding the Ganeti metrics, but there seemed to be no existing way to do so." [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 10:59:00	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P30612 and previous config saved to /var/cache/conftool/dbconfig/20220629-105859-root.json
2022-06-29 10:59:03	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:01:14	<wikibugs>	('CR) ''Slyngshede: "Looks good." [puppet] - ''https://gerrit.wikimedia.org/r/809179 (https://phabricator.wikimedia.org/T273673) (owner: ''Zabe)'
2022-06-29 11:01:17	<wikibugs>	('CR) ''Slyngshede: [C: ''+2] snapshot: remove absented dumps-timechecker cron [puppet] - ''https://gerrit.wikimedia.org/r/809179 (https://phabricator.wikimedia.org/T273673) (owner: ''Zabe)'
2022-06-29 11:02:01	<wikibugs>	('CR) ''Slyngshede: "Looks good." [puppet] - ''https://gerrit.wikimedia.org/r/809178 (https://phabricator.wikimedia.org/T273673) (owner: ''Zabe)'
2022-06-29 11:02:02	<wikibugs>	('CR) ''Slyngshede: [C: ''+2] dumps: remove absented dumps-fetches-wikitech cron [puppet] - ''https://gerrit.wikimedia.org/r/809178 (https://phabricator.wikimedia.org/T273673) (owner: ''Zabe)'
2022-06-29 11:02:11	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T309311)', diff saved to https://phabricator.wikimedia.org/P30613 and previous config saved to /var/cache/conftool/dbconfig/20220629-110210-ladsgroup.json
2022-06-29 11:02:12	<logmsgbot>	!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
2022-06-29 11:02:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:02:17	<stashbot>	T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
2022-06-29 11:02:20	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:02:25	<logmsgbot>	!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
2022-06-29 11:02:29	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:11:12	<logmsgbot>	!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
2022-06-29 11:11:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:11:36	<logmsgbot>	!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
2022-06-29 11:11:40	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:14:04	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P30614 and previous config saved to /var/cache/conftool/dbconfig/20220629-111403-root.json
2022-06-29 11:14:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:17:49	<wikibugs>	('CR) ''Muehlenhoff: "The bean errors mentioned in the patch description are unrelated, this was ultimately an error caused by misleading CAS documentation (for " [software/cas-overlay-template] - ''https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) (owner: ''Muehlenhoff)'
2022-06-29 11:20:36	<logmsgbot>	!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
2022-06-29 11:20:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:20:49	<logmsgbot>	!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
2022-06-29 11:20:53	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:20:54	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T309311)', diff saved to https://phabricator.wikimedia.org/P30615 and previous config saved to /var/cache/conftool/dbconfig/20220629-112054-ladsgroup.json
2022-06-29 11:20:59	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:21:00	<stashbot>	T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
2022-06-29 11:26:00	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1005.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:26:00	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudrabbit1001.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:26:00	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:26:00	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudservices1005.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:26:00	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudrabbit1002.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:26:00	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudrabbit1003.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:26:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:26:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:26:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:26:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:26:20	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:26:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:26:39	<wikibugs>	('CR) ''Filippo Giunchedi: "LGTM overall, see inline" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 11:29:02	<wikibugs>	('PS12) ''Slyngshede: profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288)'
2022-06-29 11:29:07	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P30616 and previous config saved to /var/cache/conftool/dbconfig/20220629-112907-root.json
2022-06-29 11:29:11	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:30:56	<wikibugs>	('PS2) ''Filippo Giunchedi: prometheus: probe DNS for (www).wikipedia.org [puppet] - ''https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860)'
2022-06-29 11:32:08	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T309311)', diff saved to https://phabricator.wikimedia.org/P30617 and previous config saved to /var/cache/conftool/dbconfig/20220629-113207-ladsgroup.json
2022-06-29 11:32:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:32:14	<stashbot>	T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
2022-06-29 11:32:59	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 11:33:55	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36120/console" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 11:35:01	<wikibugs>	('CR) ''Muehlenhoff: profile::prometheus::ops add ganeti cluster targets (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 11:35:34	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''Cmjohnson)'
2022-06-29 11:38:30	<wikibugs>	('CR) ''Slyngshede: profile::prometheus::ops add ganeti cluster targets (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 11:42:01	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 11:44:03	<wikibugs>	('CR) ''Filippo Giunchedi: [C: ''+1] "LGTM!" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 11:44:11	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P30618 and previous config saved to /var/cache/conftool/dbconfig/20220629-114411-root.json
2022-06-29 11:44:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:47:13	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P30619 and previous config saved to /var/cache/conftool/dbconfig/20220629-114712-ladsgroup.json
2022-06-29 11:47:17	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:47:54	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudrabbit1002.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:47:57	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:48:00	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudrabbit1003.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:48:01	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudrabbit1001.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:48:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:48:03	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudservices1005.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:48:06	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:48:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:48:41	<logmsgbot>	!log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:48:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:49:14	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1005.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:49:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:50:01	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:50:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:52:37	<logmsgbot>	!log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 11:52:40	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 11:53:07	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (''Cmjohnson)'
2022-06-29 11:54:34	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (''Cmjohnson) @Jclark-ctr Can you verify the mgmt cable is connected for cloudnet1006.'
2022-06-29 11:56:59	<wikibugs>	('CR) ''Ayounsi: "The DNS check is from before my time. Leaving it to Brandon as it's DNS related and he will know better than me on how best to monitor it." [puppet] - ''https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: ''Filippo Giunchedi)'
2022-06-29 12:02:18	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P30620 and previous config saved to /var/cache/conftool/dbconfig/20220629-120217-ladsgroup.json
2022-06-29 12:02:23	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:04:58	<icinga-wm>	RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 12:06:33	<icinga-wm>	PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 12:08:42	<wikibugs>	('PS5) ''Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - ''https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847)'
2022-06-29 12:08:44	<wikibugs>	('PS4) ''Filippo Giunchedi: WIP irc check via blackbox [puppet] - ''https://gerrit.wikimedia.org/r/805815'
2022-06-29 12:08:46	<wikibugs>	('PS1) ''Filippo Giunchedi: prometheus: adjust check::http params based on distro [puppet] - ''https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847)'
2022-06-29 12:10:05	<wikibugs>	('CR) ''Filippo Giunchedi: "Ideally we run Bullseye everywhere (https://phabricator.wikimedia.org/T309979) in the meantime adjust options accordingly" [puppet] - ''https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: ''Filippo Giunchedi)'
2022-06-29 12:13:54	<wikibugs>	('CR) ''Filippo Giunchedi: "+ traffic folks CC'd (feel free to review/comment!)" [puppet] - ''https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: ''Filippo Giunchedi)'
2022-06-29 12:15:13	<wikibugs>	('CR) ''Filippo Giunchedi: "This is the "deployment" of the probes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/809535 and will be performed from all sites" [puppet] - ''https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) (owner: ''Filippo Giunchedi)'
2022-06-29 12:17:23	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T309311)', diff saved to https://phabricator.wikimedia.org/P30621 and previous config saved to /var/cache/conftool/dbconfig/20220629-121722-ladsgroup.json
2022-06-29 12:17:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:17:30	<stashbot>	T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
2022-06-29 12:24:48	<logmsgbot>	!log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d]: Regular analytics weekly train [analytics/refinery@2f5987d]
2022-06-29 12:24:52	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:25:57	<logmsgbot>	!log mforns@deploy1002 Finished deploy [analytics/refinery@2f5987d]: Regular analytics weekly train [analytics/refinery@2f5987d] (duration: 01m 08s)
2022-06-29 12:26:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:26:22	<wikibugs>	('PS1) ''Jcrespo: Prepare for 0.1.3 release [software/mediabackups] - ''https://gerrit.wikimedia.org/r/809588 (https://phabricator.wikimedia.org/T311215)'
2022-06-29 12:26:25	<wikibugs>	('PS1) ''Jcrespo: cli: Change logging to log on a different file each [software/mediabackups] - ''https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215)'
2022-06-29 12:26:43	<logmsgbot>	!log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d] (thin): Regular analytics weekly train THIN [analytics/refinery@2f5987d]
2022-06-29 12:26:48	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:26:51	<logmsgbot>	!log mforns@deploy1002 Finished deploy [analytics/refinery@2f5987d] (thin): Regular analytics weekly train THIN [analytics/refinery@2f5987d] (duration: 00m 07s)
2022-06-29 12:26:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:27:02	<logmsgbot>	!log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f5987d]
2022-06-29 12:27:06	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:28:13	<wikibugs>	('PS2) ''Jcrespo: cli: Change logging to log on a different file each [software/mediabackups] - ''https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215)'
2022-06-29 12:34:09	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 12:34:35	<logmsgbot>	!log mforns@deploy1002 Finished deploy [analytics/refinery@2f5987d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f5987d] (duration: 07m 32s)
2022-06-29 12:34:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:34:51	<icinga-wm>	PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 12:34:54	<wikibugs>	('CR) ''Slyngshede: [C: ''+2] profile::prometheus::ops add ganeti cluster targets [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 12:41:17	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 12:47:26	<logmsgbot>	!log otto@deploy1002 Started deploy [analytics/refinery@2f5987d]: (no justification provided)
2022-06-29 12:47:29	<logmsgbot>	!log otto@deploy1002 Finished deploy [analytics/refinery@2f5987d]: (no justification provided) (duration: 00m 03s)
2022-06-29 12:47:30	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:47:34	<logmsgbot>	!log otto@deploy1002 Started deploy [analytics/refinery@2f5987d]: (no justification provided)
2022-06-29 12:47:35	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:47:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:49:34	<logmsgbot>	!log otto@deploy1002 Finished deploy [analytics/refinery@2f5987d]: (no justification provided) (duration: 02m 00s)
2022-06-29 12:49:38	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:53:51	<logmsgbot>	!log otto@deploy1002 Started deploy [analytics/refinery@2f5987d]: (no justification provided)
2022-06-29 12:53:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:54:29	<logmsgbot>	!log otto@deploy1002 Finished deploy [analytics/refinery@2f5987d]: (no justification provided) (duration: 00m 37s)
2022-06-29 12:54:32	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 12:54:49	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,DELETE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 12:56:05	<icinga-wm>	RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 109 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
2022-06-29 12:56:48	<XioNoX>	sukhe: I'm here
2022-06-29 12:57:20	<sukhe>	XioNoX: hello!
2022-06-29 12:57:28	<sukhe>	waiting for your final review of the patch
2022-06-29 12:57:35	<sukhe>	and then we can get started
2022-06-29 13:00:05	<jouncebot>	RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T1300).
2022-06-29 13:00:05	<jouncebot>	MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2022-06-29 13:00:17	<wikibugs>	('PS1) ''Marostegui: instances.yaml: Remove db2081 [puppet] - ''https://gerrit.wikimedia.org/r/809592 (https://phabricator.wikimedia.org/T311475)'
2022-06-29 13:00:25	<urbanecm>	i can deploy today
2022-06-29 13:00:27	<urbanecm>	hi MatmaRex
2022-06-29 13:00:31	<MatmaRex>	hi
2022-06-29 13:00:47	<urbanecm>	MatmaRex: would you prefer to test them separately, or at once?
2022-06-29 13:00:48	<XioNoX>	sukhe: re-checking but I think I was fine with it
2022-06-29 13:00:58	<sukhe>	thanks!
2022-06-29 13:01:13	<sukhe>	I am preparing the other stuff, please take your time
2022-06-29 13:01:28	<wikibugs>	('CR) ''Filippo Giunchedi: [C: ''+1] "I forgot one fundamental thing: you'll need to add the corresponding job definition for prometheus to pick up, e.g. like $trafficserver_jo" [puppet] - ''https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: ''Slyngshede)'
2022-06-29 13:01:38	<MatmaRex>	urbanecm: either is fine
2022-06-29 13:01:42	<urbanecm>	okay, thanks
2022-06-29 13:01:46	<wikibugs>	('PS2) ''Urbanecm: Enable DiscussionTools newtopictool at enwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809010 (https://phabricator.wikimedia.org/T311023) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:01:50	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] Enable DiscussionTools newtopictool at enwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809010 (https://phabricator.wikimedia.org/T311023) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:02:00	<wikibugs>	('PS3) ''Urbanecm: Enable DiscussionTools on mobile at partner wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809012 (https://phabricator.wikimedia.org/T298221) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:02:02	<wikibugs>	('CR) ''Ayounsi: [C: ''+1] "ship it!" [puppet] - ''https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: ''Ssingh)'
2022-06-29 13:02:04	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] Enable DiscussionTools on mobile at partner wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809012 (https://phabricator.wikimedia.org/T298221) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:02:05	<XioNoX>	sukhe: +1
2022-06-29 13:02:10	<MatmaRex>	i'm a little distracted so please give me a few more minutes to test
2022-06-29 13:02:14	<sukhe>	XioNoX: here's to a third time lucky :P
2022-06-29 13:02:27	<icinga-wm>	RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 13:02:41	<wikibugs>	('PS3) ''Urbanecm: Enable DiscussionTools visualenhancements at mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809011 (https://phabricator.wikimedia.org/T310960) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:02:42	<sukhe>	!log sudo cumin -d 'P{R:Class = bird}' 'disable-puppet "PLEASE DO NOT enable Puppet: deploying T310574"'
2022-06-29 13:02:47	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:02:48	<stashbot>	T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574
2022-06-29 13:02:59	<wikibugs>	('Merged) ''jenkins-bot: Enable DiscussionTools newtopictool at enwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809010 (https://phabricator.wikimedia.org/T311023) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:03:03	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] Enable DiscussionTools visualenhancements at mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809011 (https://phabricator.wikimedia.org/T310960) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:03:06	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS buster
2022-06-29 13:03:08	<wikibugs>	('Merged) ''jenkins-bot: Enable DiscussionTools on mobile at partner wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809012 (https://phabricator.wikimedia.org/T298221) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:03:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:03:11	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 13:03:15	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS...'
2022-06-29 13:03:26	<wikibugs>	('PS2) ''Urbanecm: Enable DiscussionTools on mobile at mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809223 (https://phabricator.wikimedia.org/T310960) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:03:29	<wikibugs>	('CR) ''Urbanecm: [C: ''+2] Enable DiscussionTools on mobile at mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809223 (https://phabricator.wikimedia.org/T310960) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:04:02	<wikibugs>	('Merged) ''jenkins-bot: Enable DiscussionTools visualenhancements at mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809011 (https://phabricator.wikimedia.org/T310960) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:04:28	<sukhe>	XioNoX: going with durum1001
2022-06-29 13:04:32	<wikibugs>	('Merged) ''jenkins-bot: Enable DiscussionTools on mobile at mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809223 (https://phabricator.wikimedia.org/T310960) (owner: ''Bartosz Dziewoński)'
2022-06-29 13:04:41	<XioNoX>	yay!
2022-06-29 13:04:41	<sukhe>	no package updates required on this host
2022-06-29 13:04:44	<wikibugs>	('CR) ''Ssingh: [V: ''+1 C: ''+2] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - ''https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: ''Ssingh)'
2022-06-29 13:04:58	<MatmaRex>	urbanecm: sorry, i'm away for 10 minutes
2022-06-29 13:05:09	<urbanecm>	no problem MatmaRex, I'll pull to mwdebug and wait for you to return
2022-06-29 13:05:14	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 13:05:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:05:30	<urbanecm>	MatmaRex: pulled to mwdebug1001, ready for you to test once you're back.
2022-06-29 13:06:04	<sukhe>	XioNoX: PCC wouldn't show it but
2022-06-29 13:06:04	<sukhe>	+ARGS="-bird.v2 -format.new=true"
2022-06-29 13:06:06	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 13:06:06	<sukhe>	it picked it up
2022-06-29 13:06:07	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 13:06:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:06:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:06:28	<XioNoX>	nice
2022-06-29 13:06:43	<logmsgbot>	!log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1025.eqiad.wmnet with OS buster
2022-06-29 13:06:46	<XioNoX>	sukhe: let's see if everything falls into place on its own or if there is anything to kick
2022-06-29 13:06:47	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:06:49	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS bus...'
2022-06-29 13:07:05	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 13:07:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:07:14	<wikibugs>	('CR) ''Marostegui: [C: ''+2] instances.yaml: Remove db2081 [puppet] - ''https://gerrit.wikimedia.org/r/809592 (https://phabricator.wikimedia.org/T311475) (owner: ''Marostegui)'
2022-06-29 13:07:22	<sukhe>	oops interesting
2022-06-29 13:07:32	<sukhe>	bird.conf line 8
2022-06-29 13:07:35	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1027.eqiad.wmnet with OS buster
2022-06-29 13:07:41	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS...'
2022-06-29 13:07:42	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:07:42	<logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2081 from dbctl T311475', diff saved to https://phabricator.wikimedia.org/P30622 and previous config saved to /var/cache/conftool/dbconfig/20220629-130741-marostegui.json
2022-06-29 13:07:43	<sukhe>	let me confirm if we got this right from yesterday
2022-06-29 13:07:47	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:07:47	<stashbot>	T311475: Decommission db[2071-2092] - https://phabricator.wikimedia.org/T311475
2022-06-29 13:08:17	<icinga-wm>	PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:08:19	<sukhe>	weird
2022-06-29 13:08:26	<sukhe>	not sure why it's complaining! it's the same line from yesterday
2022-06-29 13:08:42	<XioNoX>	checking
2022-06-29 13:09:10	<logmsgbot>	!log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2081.codfw.wmnet
2022-06-29 13:09:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:09:22	<XioNoX>	sukhe: protocol device I think
2022-06-29 13:09:47	<sukhe>	XioNoX: but it matches https://gerrit.wikimedia.org/r/c/operations/puppet/+/809205/2/modules/bird/templates/bird_anycast.conf.erb?
2022-06-29 13:10:23	<wikibugs>	('PS1) ''Marostegui: mariadb: Remove db2081 from puppet [puppet] - ''https://gerrit.wikimedia.org/r/809593 (https://phabricator.wikimedia.org/T311623)'
2022-06-29 13:10:35	<XioNoX>	yeah
2022-06-29 13:11:02	<sukhe>	sukhe@durum1001:~$ apt-cache policy bird
2022-06-29 13:11:02	<sukhe>	bird: Installed: (none)
2022-06-29 13:11:06	<sukhe>	so we are definitely on bird2, that's not the issue
2022-06-29 13:11:17	<icinga-wm>	PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:11:28	<sukhe>	clean puppet run, with the exception of bird2 failing but that's what we are talking about right now
2022-06-29 13:11:32	<XioNoX>	sukhe: yeah I think it's direct, not device, dunno how I made that oversight
2022-06-29 13:11:47	<wikibugs>	('PS1) ''Slyngshede: profile::prometheus::ops enable Ganeti metric scraping. [puppet] - ''https://gerrit.wikimedia.org/r/809594'
2022-06-29 13:12:02	<sukhe>	don't worry about it, you are not alone :)
2022-06-29 13:12:04	<sukhe>	fixing
2022-06-29 13:12:09	<sukhe>	can you check the routers please for now?
2022-06-29 13:12:11	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 13:12:11	<XioNoX>	doing it manually to test
2022-06-29 13:12:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:12:16	<sukhe>	I updated it manually
2022-06-29 13:12:28	<wikibugs>	'SRE-tools, ''Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (''Marostegui) I ran swapoff -a on db2081 and it went fine. Could be coincidence or it could've been the fix. Hard to know. However, I guess it doesn't hurt to inc...'
2022-06-29 13:12:54	<logmsgbot>	!log marostegui@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 13:12:55	<logmsgbot>	!log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648
2022-06-29 13:12:58	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:13:01	<MatmaRex>	urbanecm: sorry, looking now
2022-06-29 13:13:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:13:05	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 13:13:05	<stashbot>	T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
2022-06-29 13:13:06	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 13:13:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:13:10	<urbanecm>	no problem. let me know how it goes :)
2022-06-29 13:13:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:13:13	<sukhe>	XioNoX: but wait, we already have direct on line 37
2022-06-29 13:13:18	<sukhe>	removing that one
2022-06-29 13:13:25	<XioNoX>	sukhe: wait I'm editing it too
2022-06-29 13:13:31	<sukhe>	ok waiting
2022-06-29 13:13:41	<icinga-wm>	RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:13:48	<XioNoX>	sukhe: alright there we are
2022-06-29 13:14:05	<MatmaRex>	urbanecm: 809010 looks good
2022-06-29 13:14:14	<XioNoX>	sukhe: so we need an empty `protocol device`
2022-06-29 13:14:27	<urbanecm>	MatmaRex: ack. I'll sync them at once, so waiting for all of them to be checked.
2022-06-29 13:14:33	<XioNoX>	and then protocol direct with v4/v6
2022-06-29 13:14:40	<wikibugs>	('PS1) ''Ssingh: bird: update bird.conf (replace protocol device with direct) [puppet] - ''https://gerrit.wikimedia.org/r/809595'
2022-06-29 13:14:49	<XioNoX>	sukhe: I inverted the two
2022-06-29 13:15:15	<XioNoX>	and the v4 and v6 prefixes are there
2022-06-29 13:15:20	<sukhe>	XioNoX: patch out, let's review and do another Puppet run to be sure?!
2022-06-29 13:15:27	<sukhe>	https://gerrit.wikimedia.org/r/c/operations/puppet/+/809595 <--
2022-06-29 13:15:34	<XioNoX>	yep
2022-06-29 13:16:12	<wikibugs>	('CR) ''Ayounsi: "One comment" [puppet] - ''https://gerrit.wikimedia.org/r/809595 (owner: ''Ssingh)'
2022-06-29 13:16:16	<MatmaRex>	809012 seems okay
2022-06-29 13:16:28	<wikibugs>	('CR) ''Ayounsi: bird: update bird.conf (replace protocol device with direct) (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/809595 (owner: ''Ssingh)'
2022-06-29 13:16:35	<XioNoX>	sukhe: one comment then lgtm
2022-06-29 13:16:36	<wikibugs>	('CR) ''Marostegui: [C: ''+2] mariadb: Remove db2081 from puppet [puppet] - ''https://gerrit.wikimedia.org/r/809593 (https://phabricator.wikimedia.org/T311623) (owner: ''Marostegui)'
2022-06-29 13:16:37	<sukhe>	aaha
2022-06-29 13:16:45	<logmsgbot>	!log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 13:16:49	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:16:54	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 13:16:59	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:17:00	<MatmaRex>	and the last mediawikiwiki changes look good too
2022-06-29 13:17:03	<MatmaRex>	urbanecm: all look good
2022-06-29 13:17:09	<sukhe>	XioNoX: this bird better fly now :P
2022-06-29 13:17:09	<wikibugs>	('PS2) ''Ssingh: bird: update bird.conf (replace protocol device with direct) [puppet] - ''https://gerrit.wikimedia.org/r/809595'
2022-06-29 13:17:16	<urbanecm>	MatmaRex: okay, thanks. syncing!
2022-06-29 13:17:31	<XioNoX>	sukhe: everything flies with enough thrust
2022-06-29 13:17:46	<sukhe>	that sounds about right, given the current circumstances :D
2022-06-29 13:18:27	<sukhe>	ok I am merging this then
2022-06-29 13:18:29	<logmsgbot>	!log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2081.codfw.wmnet
2022-06-29 13:18:33	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:18:42	<wikibugs>	('CR) ''Ayounsi: [C: ''+1] bird: update bird.conf (replace protocol device with direct) (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/809595 (owner: ''Ssingh)'
2022-06-29 13:18:46	<XioNoX>	sukhe: +1
2022-06-29 13:18:49	<wikibugs>	('CR) ''Ssingh: [C: ''+2] bird: update bird.conf (replace protocol device with direct) (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/809595 (owner: ''Ssingh)'
2022-06-29 13:18:58	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage
2022-06-29 13:19:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:19:25	<wikibugs>	'ops-codfw, ''decommission-hardware, ''Patch-For-Review: decommission db2081 - https://phabricator.wikimedia.org/T311623 (''Marostegui) This is ready for on-site steps!'
2022-06-29 13:20:07	<sukhe>	XioNoX: looks good!
2022-06-29 13:20:11	<sukhe>	can you confirm the router side?
2022-06-29 13:20:15	<XioNoX>	checking
2022-06-29 13:20:26	<XioNoX>	yep, all good!
2022-06-29 13:20:31	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1026.eqiad.wmnet with OS buster
2022-06-29 13:20:34	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36121/console" [puppet] - ''https://gerrit.wikimedia.org/r/809594 (owner: ''Slyngshede)'
2022-06-29 13:20:35	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:20:37	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS...'
2022-06-29 13:21:18	<XioNoX>	sukhe: the prometheus exporter is working fine too
2022-06-29 13:21:22	<sukhe>	nice!
2022-06-29 13:22:30	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage
2022-06-29 13:22:34	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:23:34	<logmsgbot>	!log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 78fe6a15: 9f76648: 897e69c7: 977e57b: DiscussionTools config changes (T310960, T298221, T311023) (duration: 03m 38s)
2022-06-29 13:23:40	<XioNoX>	sukhe: what's the next host?
2022-06-29 13:23:46	<urbanecm>	MatmaRex: and all live. anything else i can do for you today?
2022-06-29 13:23:48	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:23:49	<stashbot>	T311023: Enable new topic tool by default on enwiki - https://phabricator.wikimedia.org/T311023
2022-06-29 13:23:49	<stashbot>	T298221: [Config Change] Offer mobile Reply and New Discussion Tools at partner wikis - https://phabricator.wikimedia.org/T298221
2022-06-29 13:23:49	<sukhe>	XioNoX: so all looks good?
2022-06-29 13:23:49	<stashbot>	T310960: [Config Change] Make all DiscussionTools available by default at mediawiki.org - https://phabricator.wikimedia.org/T310960
2022-06-29 13:23:55	<sukhe>	we can do durum1002, to be extra sure
2022-06-29 13:23:56	<MatmaRex>	urbanecm: thank you!
2022-06-29 13:23:58	<sukhe>	then roll out to all durums
2022-06-29 13:24:02	<sukhe>	and then A:wikidough
2022-06-29 13:24:04	<urbanecm>	no problem :)
2022-06-29 13:24:14	<sukhe>	these should go fairly quickly, I will use debdeploy
2022-06-29 13:24:33	<XioNoX>	sukhe: yeah everything is good, routers and icinga happy
2022-06-29 13:24:42	<XioNoX>	sukhe: +1
2022-06-29 13:24:47	<sukhe>	yayay
2022-06-29 13:24:52	<sukhe>	ok let's try durum1002
2022-06-29 13:26:00	<wikibugs>	('PS1) ''Muehlenhoff: Disable swap before running wipefs [cookbooks] - ''https://gerrit.wikimedia.org/r/809599 (https://phabricator.wikimedia.org/T311593)'
2022-06-29 13:26:09	<XioNoX>	go for it
2022-06-29 13:26:48	<sukhe>	https://puppetboard.wikimedia.org/report/durum1002.eqiad.wmnet/1e625a6a37f0a350b28416bfdcd8773e349ab8e4
2022-06-29 13:26:51	<sukhe>	looks good
2022-06-29 13:26:55	<wikibugs>	'SRE-OnFire, ''DBA, ''Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (''Ladsgroup) https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/normal.106.svg and https://people.wikimedia.org/~ladsgroup/m...'
2022-06-29 13:27:06	<sukhe>	can you verify the routers for this too?
2022-06-29 13:27:13	<sukhe>	just to be extra safe
2022-06-29 13:28:35	<XioNoX>	sukhe: checked and all good, bgp, bfd
2022-06-29 13:28:40	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''Cmjohnson) @jclark cloudcephosd1025 states no cable, can you verify the cable and/or the port please'
2022-06-29 13:28:42	<sukhe>	nice!
2022-06-29 13:28:43	<sukhe>	ok then
2022-06-29 13:28:49	<sukhe>	I think we can do a A:durum deploy and move on
2022-06-29 13:29:04	<XioNoX>	sukhe: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=durum1002&service=DPKG
2022-06-29 13:29:08	<XioNoX>	probably need a re-check
2022-06-29 13:29:26	<sukhe>	checking
2022-06-29 13:29:54	<icinga-wm>	RECOVERY - DPKG on durum1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
2022-06-29 13:29:55	<XioNoX>	(scheduled a re-check)
2022-06-29 13:29:57	<XioNoX>	yep
2022-06-29 13:29:58	<wikibugs>	('PS2) ''Filippo Giunchedi: profile::prometheus::ops enable Ganeti metric scraping. [puppet] - ''https://gerrit.wikimedia.org/r/809594 (owner: ''Slyngshede)'
2022-06-29 13:30:00	<sukhe>	interesting
2022-06-29 13:30:00	<XioNoX>	sukhe: ^
2022-06-29 13:30:04	<sukhe>	though I wonder why that happened at all
2022-06-29 13:30:15	<sukhe>	let me check the puppet run once more
2022-06-29 13:31:03	<wikibugs>	'SRE-OnFire, ''DBA, ''Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (''Marostegui) ^ Those two graphs are between db1132 (with performance_schema disabled) and a normal 10.4 host. They look _a lot_ more...'
2022-06-29 13:31:17	<sukhe>	hm ok
2022-06-29 13:31:20	<sukhe>	nothing that I can find
2022-06-29 13:31:33	<sukhe>	let's go ahead with the rest of the durum hosts for now :)
2022-06-29 13:31:39	<XioNoX>	+!
2022-06-29 13:31:40	<XioNoX>	1
2022-06-29 13:31:50	<sukhe>	running apt update on A:durum followed by debdeploy
2022-06-29 13:32:08	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage
2022-06-29 13:32:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:32:43	<wikibugs>	'SRE-swift-storage, ''Lift-Wing, ''Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (''klausman)'
2022-06-29 13:33:10	<wikibugs>	('CR) ''Filippo Giunchedi: "LGTM, see inline" [puppet] - ''https://gerrit.wikimedia.org/r/809594 (owner: ''Slyngshede)'
2022-06-29 13:34:01	<sukhe>	ok running puppet agent
2022-06-29 13:35:00	<sukhe>	waits
2022-06-29 13:35:21	<XioNoX>	sukhe: is apt update needed or just to be safe?
2022-06-29 13:35:28	<sukhe>	yep, did that
2022-06-29 13:35:31	<sukhe>	and then debdeployed
2022-06-29 13:35:35	<sukhe>	and now running puppet agent
2022-06-29 13:35:37	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage
2022-06-29 13:35:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:35:50	<XioNoX>	sukhe: but I'm wondering if it's strictly needed or not
2022-06-29 13:35:57	<XioNoX>	I'm not familiar with the deb side of things
2022-06-29 13:36:23	<sukhe>	XioNoX: I don't think it was required as debmonitor was already showing the upgrade, but https://wikitech-static.wikimedia.org/wiki/Software_deployment
2022-06-29 13:36:28	<sukhe>	> Note that debdeploy won't run apt update, so if you uploaded the new version very recently, it won't make any changes
2022-06-29 13:36:49	<icinga-wm>	PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:36:53	<sukhe>	oh oh
2022-06-29 13:37:01	<sukhe>	esams
2022-06-29 13:37:03	<sukhe>	checking
2022-06-29 13:37:34	<sukhe>	bird looks OK though
2022-06-29 13:37:45	<icinga-wm>	PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:37:51	<icinga-wm>	PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:37:53	<XioNoX>	v4 is good, v6 not
2022-06-29 13:38:05	<icinga-wm>	PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:38:07	<icinga-wm>	PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:38:23	<icinga-wm>	PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:38:24	<sukhe>	hmm bird6 (!)
2022-06-29 13:38:57	<icinga-wm>	PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:39:29	<icinga-wm>	PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:39:34	<sukhe>	ha Puppet failed
2022-06-29 13:39:59	<XioNoX>	sukhe: bird6 is down but bird is still on 1.6
2022-06-29 13:40:11	<XioNoX>	looking at durum3001
2022-06-29 13:40:29	<XioNoX>	should I try a force puppet run or you're on it?
2022-06-29 13:40:32	<sukhe>	doing it
2022-06-29 13:40:50	<XioNoX>	let's see if it fixes it
2022-06-29 13:41:14	<XioNoX>	Package[bird2] failure purged present
2022-06-29 13:41:29	<XioNoX>	it can't install bird2
2022-06-29 13:42:08	<XioNoX>	maybe because of The following packages will be DOWNGRADED: prometheus-bird-exporter
2022-06-29 13:42:09	<XioNoX>	?
2022-06-29 13:42:28	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (''Papaul)'
2022-06-29 13:42:38	<XioNoX>	yeah E: Packages were downgraded and -y was used without --allow-downgrades.
2022-06-29 13:42:40	<sukhe>	I think for some reason the debdeploy for prometheus-bird-exporter didn't run
2022-06-29 13:42:46	<sukhe>	so that's probably it then
2022-06-29 13:42:54	<sukhe>	checking
2022-06-29 13:43:00	<XioNoX>	ok
2022-06-29 13:43:07	<icinga-wm>	RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:43:09	<sukhe>	we should see recoveries on esams
2022-06-29 13:43:10	<sukhe>	ok great
2022-06-29 13:43:13	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Papaul)'
2022-06-29 13:43:23	<XioNoX>	it's great that v4 stayed up through it
2022-06-29 13:43:33	<sukhe>	:D
2022-06-29 13:43:58	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (''Papaul)'
2022-06-29 13:44:34	<XioNoX>	confirmed that everything is working now in esams
2022-06-29 13:45:29	<sukhe>	thanks, checking ulsfo quickly
2022-06-29 13:45:34	<sukhe>	to make sure changes propagated
2022-06-29 13:45:49	<icinga-wm>	RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 13:47:04	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.dns.netbox
2022-06-29 13:47:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:47:27	<icinga-wm>	RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:47:31	<XioNoX>	alright
2022-06-29 13:47:41	<XioNoX>	sukhe: did it recover on its own or you did something?
2022-06-29 13:47:52	<sukhe>	XioNoX: just ran install via cumin and puppet agent again :)
2022-06-29 13:47:55	<icinga-wm>	RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:48:21	<icinga-wm>	RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:48:21	<sukhe>	it seems like I ignored the version number in prometheus-bird-exporter and hence it is a downgrade
2022-06-29 13:48:29	<sukhe>	but that's OK, we can fix it later quickly
2022-06-29 13:48:54	<sukhe>	for now, everything should be fine with durum
2022-06-29 13:49:02	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1028.eqiad.wmnet with OS buster
2022-06-29 13:49:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:49:06	<sukhe>	Wikidough is next and should go a bit more smoothly
2022-06-29 13:49:07	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS...'
2022-06-29 13:49:17	<icinga-wm>	RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:49:17	<icinga-wm>	RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 95, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:49:20	<sukhe>	can you confirm the routers for me please?
2022-06-29 13:49:35	<sukhe>	durums look good
2022-06-29 13:50:01	<icinga-wm>	RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2022-06-29 13:50:10	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1029.eqiad.wmnet with OS buster
2022-06-29 13:50:12	<XioNoX>	sukhe: checking drmrs as a random one, I trust icinga for the ithers
2022-06-29 13:50:14	<XioNoX>	others
2022-06-29 13:50:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:50:16	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS...'
2022-06-29 13:50:28	<sukhe>	Icinga looks ok for durum*, apart from the puppet runs, which should clear up
2022-06-29 13:50:40	<sukhe>	(already cleared on puppetboard)
2022-06-29 13:50:56	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 13:50:57	<wikibugs>	('PS1) ''Btullis: Update the partman recipe for use with the new stat servers [puppet] - ''https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399)'
2022-06-29 13:51:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:51:01	<XioNoX>	sukhe: drmrs lgtm
2022-06-29 13:51:12	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (''Papaul)'
2022-06-29 13:51:29	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1030.eqiad.wmnet with OS buster
2022-06-29 13:51:30	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Papaul)'
2022-06-29 13:51:34	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:51:36	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS...'
2022-06-29 13:51:53	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (''Papaul)'
2022-06-29 13:52:00	<sukhe>	ok
2022-06-29 13:52:04	<sukhe>	A:wikidough is next
2022-06-29 13:52:13	<sukhe>	this time it should go smoothly to inspire confidence for internal recursors :P
2022-06-29 13:52:31	<sukhe>	XioNoX: ready to proceed?
2022-06-29 13:52:35	<sukhe>	this will be a fun one
2022-06-29 13:52:45	<XioNoX>	sukhe: yep
2022-06-29 13:52:53	<sukhe>	deep breaths!
2022-06-29 13:52:58	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS buster
2022-06-29 13:53:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:53:04	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS...'
2022-06-29 13:53:32	<XioNoX>	sukhe: which host first?
2022-06-29 13:54:07	<sukhe>	https://debmonitor.wikimedia.org/search?q=doh
2022-06-29 13:54:08	<sukhe>	looks clean
2022-06-29 13:54:12	<sukhe>	packages upgraded
2022-06-29 13:54:22	<sukhe>	doh1001
2022-06-29 13:54:26	<sukhe>	starting now with puppet run
2022-06-29 13:54:37	<wikibugs>	('PS3) ''Jcrespo: cli: Change logging to log on a different file each [software/mediabackups] - ''https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215)'
2022-06-29 13:54:41	<wikibugs>	('CR) ''Filippo Giunchedi: [C: ''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399) (owner: ''Btullis)'
2022-06-29 13:54:57	<XioNoX>	alright
2022-06-29 13:54:58	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1032.eqiad.wmnet with OS buster
2022-06-29 13:55:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:55:05	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS...'
2022-06-29 13:55:07	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1033.eqiad.wmnet with OS buster
2022-06-29 13:55:11	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:55:14	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS...'
2022-06-29 13:55:16	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1034.eqiad.wmnet with OS buster
2022-06-29 13:55:20	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:55:22	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS...'
2022-06-29 13:56:07	<icinga-wm>	PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 13:56:17	<wikibugs>	('PS1) ''Marostegui: wmnet: Update x1-master CNAME [dns] - ''https://gerrit.wikimedia.org/r/809605 (https://phabricator.wikimedia.org/T300472)'
2022-06-29 13:56:21	<XioNoX>	sukhe: v4 and v6 are down
2022-06-29 13:56:25	<XioNoX>	up now
2022-06-29 13:56:30	<sukhe>	probably during the restart?
2022-06-29 13:56:58	<sukhe>	looks ok otherwise
2022-06-29 13:57:00	<XioNoX>	sukhe: receiving 1 v4 and 1 v6 prefix
2022-06-29 13:57:01	<wikibugs>	('CR) ''Muehlenhoff: [C: ''+1] "Nice!" [puppet] - ''https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399) (owner: ''Btullis)'
2022-06-29 13:57:02	<XioNoX>	yep
2022-06-29 13:57:03	<XioNoX>	lgtm
2022-06-29 13:57:03	<sukhe>	nice
2022-06-29 13:57:04	<sukhe>	!
2022-06-29 13:57:13	<sukhe>	clean puppet run too
2022-06-29 13:57:28	<XioNoX>	sukhe: I have a meeting in 3min, but will keep an eye in here
2022-06-29 13:57:32	<sukhe>	thanks
2022-06-29 13:57:33	<sukhe>	so
2022-06-29 13:57:36	<sukhe>	Wikidough then?
2022-06-29 13:57:41	<sukhe>	internal recursors or centrallog?
2022-06-29 13:57:42	<XioNoX>	and do the verification
2022-06-29 13:57:50	<sukhe>	both internal recursors and centrallog have different configs
2022-06-29 13:57:52	<sukhe>	namely, just V4
2022-06-29 13:58:04	<XioNoX>	sukhe: wikidough, then centrallog
2022-06-29 13:58:08	<sukhe>	cool
2022-06-29 13:58:08	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1027.eqiad.wmnet with OS buster
2022-06-29 13:58:11	<XioNoX>	then dns
2022-06-29 13:58:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 13:58:14	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''Cmjohnson) @Jclark-ctr cloudcephosd1031 same thing, no cable, can you check this as well'
2022-06-29 13:58:15	<sukhe>	ok awesome
2022-06-29 13:58:18	<sukhe>	good luck to us!
2022-06-29 13:58:20	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS bus...'
2022-06-29 13:58:20	<sukhe>	doing wikidough now
2022-06-29 13:59:31	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''BTullis) The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has identified the cause of the reversed device names. Essentia...'
2022-06-29 13:59:57	<wikibugs>	('CR) ''Marostegui: [C: ''-2] "Wait for the failover day" [dns] - ''https://gerrit.wikimedia.org/r/809605 (https://phabricator.wikimedia.org/T300472) (owner: ''Marostegui)'
2022-06-29 14:00:15	<logmsgbot>	!log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1031.eqiad.wmnet with OS buster
2022-06-29 14:00:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:00:21	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS bus...'
2022-06-29 14:00:36	<wikibugs>	('PS1) ''Marostegui: mariadb: Promote db1120 to x1 master [puppet] - ''https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472)'
2022-06-29 14:00:40	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1028.eqiad.wmnet with reason: host reimage
2022-06-29 14:00:44	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:00:46	<wikibugs>	('CR) ''Btullis: [C: ''+2] Update the partman recipe for use with the new stat servers [puppet] - ''https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399) (owner: ''Btullis)'
2022-06-29 14:01:32	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1029.eqiad.wmnet with reason: host reimage
2022-06-29 14:01:36	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:01:40	<sukhe>	!log sudo cumin -b 1 -s 5 'A:wikidough' 'run-puppet-agent -q'
2022-06-29 14:01:44	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:01:47	<wikibugs>	('CR) ''Marostegui: [C: ''-2] "Wait for the failover day" [puppet] - ''https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) (owner: ''Marostegui)'
2022-06-29 14:02:09	<icinga-wm>	PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 14:02:20	<sukhe>	^ should recover
2022-06-29 14:02:27	<wikibugs>	('CR) ''Ladsgroup: [C: ''+1] wmnet: Update x1-master CNAME [dns] - ''https://gerrit.wikimedia.org/r/809605 (https://phabricator.wikimedia.org/T300472) (owner: ''Marostegui)'
2022-06-29 14:02:52	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage
2022-06-29 14:02:56	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:03:24	<logmsgbot>	!log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage
2022-06-29 14:03:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:04:15	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1028.eqiad.wmnet with reason: host reimage
2022-06-29 14:04:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:04:34	<XioNoX>	sukhe: it's all established on the device
2022-06-29 14:04:38	<XioNoX>	so yeah should recover
2022-06-29 14:04:40	<sukhe>	yep
2022-06-29 14:04:42	<sukhe>	going smooth so far
2022-06-29 14:04:46	<sukhe>	Ok to proceed on 12 hosts? Enter the number of affected hosts to confirm or "q" to quit 12
2022-06-29 14:04:50	<sukhe>	PASS |█████████████████████████████ | 33% (4/12) [05:20<11:37, 87.13s/hosts]
2022-06-29 14:04:58	<logmsgbot>	!log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1010.eqiad.wmnet with OS bullseye
2022-06-29 14:05:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:05:05	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors: - stat101...'
2022-06-29 14:05:06	<XioNoX>	and receiving the prefixes there too
2022-06-29 14:05:32	<sukhe>	nice
2022-06-29 14:06:19	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1032.eqiad.wmnet with reason: host reimage
2022-06-29 14:06:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:06:26	<wikibugs>	'SRE, ''Infrastructure-Foundations, ''CAS-SSO: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (''MoritzMuehlenhoff) >>! In T311236#8023441, @MoritzMuehlenhoff wrote: >>>! In T311236#8023420, @jbond wrote: >>>but that bails out with a bean error related to the fasterxml par...'
2022-06-29 14:06:32	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage
2022-06-29 14:06:36	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:06:37	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1034.eqiad.wmnet with reason: host reimage
2022-06-29 14:06:40	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:06:51	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1029.eqiad.wmnet with reason: host reimage
2022-06-29 14:06:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:07:06	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''BTullis) I'm updating the BIOS as well from version 2.13.3 to version 2.14.2 since it was marked as urgent by Dell. {F35286599,width=80%}'
2022-06-29 14:07:31	<wikibugs>	('CR) ''Ladsgroup: [C: ''+1] mariadb: Promote db1120 to x1 master [puppet] - ''https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) (owner: ''Marostegui)'
2022-06-29 14:09:48	<XioNoX>	sukhe: it won't show a recovery here as it's currently in warning for a different thing
2022-06-29 14:09:49	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1032.eqiad.wmnet with reason: host reimage
2022-06-29 14:09:53	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:10:00	<XioNoX>	sukhe: see the warnings in https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=bgp
2022-06-29 14:10:23	<sukhe>	nice timing :)
2022-06-29 14:11:33	<sukhe>	XioNoX: I just confirmed the version issue with prometheus-bird-exporter with moritzm
2022-06-29 14:11:35	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage
2022-06-29 14:11:43	<sukhe>	we will fix it after, but yeah, it doesn't affect any functionality
2022-06-29 14:11:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:12:12	<wikibugs>	('PS1) ''Hokwelum: make script send mail whenever there is output [puppet] - ''https://gerrit.wikimedia.org/r/809611'
2022-06-29 14:12:45	<logmsgbot>	!log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host stat1010.eqiad.wmnet with OS bullseye
2022-06-29 14:12:46	<wikibugs>	('CR) ''CI reject: [V: ''-1] make script send mail whenever there is output [puppet] - ''https://gerrit.wikimedia.org/r/809611 (owner: ''Hokwelum)'
2022-06-29 14:12:49	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:12:54	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye'
2022-06-29 14:13:33	<icinga-wm>	PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 14:13:46	<sukhe>	^ expected
2022-06-29 14:13:48	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''BTullis)'
2022-06-29 14:13:56	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1026.eqiad.wmnet with OS buster
2022-06-29 14:14:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:14:02	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS bus...'
2022-06-29 14:14:21	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1034.eqiad.wmnet with reason: host reimage
2022-06-29 14:14:23	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:14:25	<wikibugs>	('PS2) ''Hokwelum: make script send mail whenever there is output [puppet] - ''https://gerrit.wikimedia.org/r/809611'
2022-06-29 14:14:36	<jinxer-wm>	(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
2022-06-29 14:18:18	<sukhe>	XioNoX: A:wikidough done, no issues at my end
2022-06-29 14:18:25	<sukhe>	I will wait for confirmation from you before moving to centrallog
2022-06-29 14:18:45	<sukhe>	> tests/test_dns.py::test_show_information Your resolver is doh1001
2022-06-29 14:19:38	<XioNoX>	sukhe: nice! yeah let's move to centrallog
2022-06-29 14:20:17	<sukhe>	oh just two hosts here
2022-06-29 14:20:22	<sukhe>	in which case I will do one by one
2022-06-29 14:21:43	<wikibugs>	('CR) ''ArielGlenn: [C: ''+2] make script send mail whenever there is output [puppet] - ''https://gerrit.wikimedia.org/r/809611 (owner: ''Hokwelum)'
2022-06-29 14:22:55	<XioNoX>	sukhe: yeah, and I'm fully back
2022-06-29 14:24:06	<sukhe>	clean puppet run on centrallog1001
2022-06-29 14:24:12	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1028.eqiad.wmnet with OS buster
2022-06-29 14:24:13	<sukhe>	this was the first host with just the IPv4 config
2022-06-29 14:24:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:24:18	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS bus...'
2022-06-29 14:24:36	<sukhe>	no bird6 on this host anyway so yeah
2022-06-29 14:25:00	<wikibugs>	('PS1) ''Majavah: Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - ''https://gerrit.wikimedia.org/r/809557'
2022-06-29 14:25:24	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1030.eqiad.wmnet with OS buster
2022-06-29 14:25:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:25:34	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS bus...'
2022-06-29 14:26:04	<XioNoX>	sukhe: easy :)
2022-06-29 14:26:07	<MatmaRex>	urbanecm: did my config changes get reverted, or not deployed? i do not see them any more
2022-06-29 14:26:33	<sukhe>	XioNoX: looking good?
2022-06-29 14:26:33	<wikibugs>	('CR) ''Andrew Bogott: [C: ''+2] Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - ''https://gerrit.wikimedia.org/r/809557 (owner: ''Majavah)'
2022-06-29 14:26:38	<sukhe>	I have no tests for centrallog so hard to say :)
2022-06-29 14:26:46	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (''cmooney) Are we proceeding with the deploy of cloudnet1005/cloudnet1006 prior to T310546 and T310547 being completed? As discussed we'd be relat...'
2022-06-29 14:26:47	<XioNoX>	sukhe: yep, prefix is being advertised
2022-06-29 14:26:51	<sukhe>	nice
2022-06-29 14:26:52	<sukhe>	ok
2022-06-29 14:26:57	<sukhe>	so centrallog2002 is a special case then
2022-06-29 14:26:59	<sukhe>	it's on bullseye
2022-06-29 14:27:14	<sukhe>	we need to build anycast-hc for bullseye then
2022-06-29 14:27:41	<sukhe>	prometheus-bird-exporter is good
2022-06-29 14:27:41	<sukhe>	Depends: bird | bird2, libc6 (>= 2.4)
2022-06-29 14:27:55	<sukhe>	we can skip this one for now then, or I can build and import the package quickly
2022-06-29 14:27:56	<XioNoX>	ohhh
2022-06-29 14:27:58	<sukhe>	yep :)
2022-06-29 14:28:12	<XioNoX>	yeah bird2 and the exporter are in upstream but not anycast-hc
2022-06-29 14:28:18	<sukhe>	yep
2022-06-29 14:28:38	<XioNoX>	sukhe: up to you, are any of the dns hosts on bullseye?
2022-06-29 14:29:03	<XioNoX>	drmrs is still buster
2022-06-29 14:29:20	<XioNoX>	so I guess it's the only one
2022-06-29 14:29:29	<XioNoX>	sukhe: +1 to skip it for now and test dns
2022-06-29 14:29:38	<sukhe>	XioNoX: yep, this is the only one
2022-06-29 14:29:40	<sukhe>	ok
2022-06-29 14:29:47	<XioNoX>	dunno how much time/effort it is to build it
2022-06-29 14:29:56	<sukhe>	not much
2022-06-29 14:30:01	<sukhe>	so I will get it done today after this
2022-06-29 14:30:07	<sukhe>	since we should not keep puppet disabled for too long
2022-06-29 14:30:15	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1034.eqiad.wmnet with OS buster
2022-06-29 14:30:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:30:23	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS bus...'
2022-06-29 14:30:31	<XioNoX>	sukhe: yep
2022-06-29 14:30:36	<sukhe>	next is A:dns-rec :)
2022-06-29 14:30:46	<MatmaRex>	can anyone help me investigate a problem with the backports/configs from last window? i can see their effect when using any mwdebug server, but not when accessing the sites normally
2022-06-29 14:30:59	<XioNoX>	sukhe: should be as smooth as centrallog1002
2022-06-29 14:31:07	<sukhe>	:D
2022-06-29 14:31:15	<sukhe>	ok I am going to proceed with one to start with
2022-06-29 14:31:21	<sukhe>	any favourites? if not, dns1001 it is
2022-06-29 14:31:28	<sukhe>	I am a bit biased towards eqiad, what can I say
2022-06-29 14:31:31	<XioNoX>	sukhe: 1001 is good
2022-06-29 14:31:34	<sukhe>	ok!
2022-06-29 14:31:44	<XioNoX>	especially as it's the one that will get the most care
2022-06-29 14:32:00	<MatmaRex>	e.g. when visiting https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&action=edit&section=new , you should get a different form
2022-06-29 14:32:47	<wikibugs>	('CR) ''Muehlenhoff: [C: ''+2] Inline ganeti::kvm [puppet] - ''https://gerrit.wikimedia.org/r/809157 (owner: ''Muehlenhoff)'
2022-06-29 14:33:17	<MatmaRex>	the changes are still there in git, so i think they were not deployed properly
2022-06-29 14:33:22	<sukhe>	XioNoX: running agent on dns1001
2022-06-29 14:34:10	<XioNoX>	alright
2022-06-29 14:34:14	<icinga-wm>	PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2022-06-29 14:34:23	<MatmaRex>	any deployers around?
2022-06-29 14:34:26	<sukhe>	^ expected, hopefully
2022-06-29 14:34:54	<sukhe>	XioNoX: done
2022-06-29 14:34:59	<XioNoX>	sukhe: confirmed1
2022-06-29 14:35:00	<XioNoX>	!
2022-06-29 14:35:08	<sukhe>	!
2022-06-29 14:35:13	<XioNoX>	both prefixes are being received
2022-06-29 14:35:16	<sukhe>	phew
2022-06-29 14:35:27	<Lucas_WMDE>	MatmaRex: I’m here but didn’t pay attention during the window
2022-06-29 14:35:36	<sukhe>	XioNoX: I will do one more manually :)
2022-06-29 14:35:38	<sukhe>	then we can run cumin
2022-06-29 14:35:39	<dancy>	MatMaRex: I'm around
2022-06-29 14:35:45	<sukhe>	dns1002
2022-06-29 14:36:03	<XioNoX>	sukhe: sounds good!
2022-06-29 14:36:03	<MatmaRex>	Lucas_WMDE: dancy: thanks. it seems to me that my patches are deployed on all mwdebug servers, but not on normal servers
2022-06-29 14:36:15	<MatmaRex>	e.g. try visiting https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&action=edit&section=new
2022-06-29 14:36:20	<MatmaRex>	(while logged out)
2022-06-29 14:36:27	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1029.eqiad.wmnet with OS buster
2022-06-29 14:36:28	<MatmaRex>	you'll get a different interface depending on mwdebug or not
2022-06-29 14:36:30	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:36:33	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS bus...'
2022-06-29 14:36:38	<XioNoX>	sukhe: even if they break everywhere else, having 2 working servers are enough to prevent a meltdown
2022-06-29 14:36:53	<dancy>	MatMaRex: Can you point me to the commit in question?
2022-06-29 14:36:55	<sukhe>	oh yeah, the magic of anycast :)
2022-06-29 14:37:19	<MatmaRex>	dancy: the last 4 commits. e.g. https://gerrit.wikimedia.org/r/c/809010/
2022-06-29 14:37:33	<Lucas_WMDE>	MatmaRex: I seem to get the same interface with both
2022-06-29 14:37:42	<Lucas_WMDE>	is “Welcome to the village pump for technical issues” with the large info icon the new interface?
2022-06-29 14:38:01	<MatmaRex>	Lucas_WMDE: if you're logged in, you might have some preferences enabled that affect it, try logged out
2022-06-29 14:38:04	<sukhe>	XioNoX: done
2022-06-29 14:38:06	<sukhe>	dns1002
2022-06-29 14:38:07	<Lucas_WMDE>	I’m in a private window
2022-06-29 14:38:24	<Lucas_WMDE>	and got that interface when I first loaded the page, before touching the WikimediaDebug extension
2022-06-29 14:38:38	<MatmaRex>	old: https://i.imgur.com/HzO89iI.png new: https://i.imgur.com/kWNnG5K.png
2022-06-29 14:38:54	<XioNoX>	sukhe: confirmed working!
2022-06-29 14:38:54	<Lucas_WMDE>	then I get new without WikimediaDebug
2022-06-29 14:39:02	<sukhe>	XioNoX: :D
2022-06-29 14:39:04	<sukhe>	ok then
2022-06-29 14:39:12	<MatmaRex>	there are people reporting it at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#Doing_this too
2022-06-29 14:39:12	<sukhe>	running on A:dns-rec, batch of 1, 5 seconds apart
2022-06-29 14:39:17	<Lucas_WMDE>	and I think afterwards I briefly got old, don’t remember if with or without WikimediaDebug, before reloading
2022-06-29 14:39:23	<Lucas_WMDE>	so to me it feels like it might be flaky
2022-06-29 14:39:23	<XioNoX>	sukhe: godspeed
2022-06-29 14:39:25	<urbanecm>	MatmaRex: i definitely didn't revert them
2022-06-29 14:39:26	<MatmaRex>	i guess might depend on the location? or not synced to all servers?
2022-06-29 14:39:38	<urbanecm>	But yesterday a sync only affected half of fleet
2022-06-29 14:39:42	<urbanecm>	So perhaps it happened again
2022-06-29 14:39:47	<MatmaRex>	or it's a caching thing but it shouldn't affect this
2022-06-29 14:40:03	<Lucas_WMDE>	I just got the old interface from mw1413
2022-06-29 14:40:16	<urbanecm>	I shutdown my laptop already, Lucas_WMDE if you can just resync IS.php, that'd be helpful
2022-06-29 14:40:27	<dancy>	Lemme look at the initializesettings.php file on mw1413 first.
2022-06-29 14:40:32	<wikibugs>	'SRE, ''Generated Data Platform, ''Image-Suggestions, ''serviceops, ''Service-deployment-requests: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (''akosiaris) I think this is done?'
2022-06-29 14:40:32	<urbanecm>	Okay
2022-06-29 14:40:34	<Lucas_WMDE>	now I got the new interface from mw1391
2022-06-29 14:40:37	<Lucas_WMDE>	dancy: ack
2022-06-29 14:40:48	<urbanecm>	dancy: note mw1413 had a very similar issue yesterday with my deploy
2022-06-29 14:40:50	<Lucas_WMDE>	otherwise yeah resync sounds sensible
2022-06-29 14:40:51	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1033.eqiad.wmnet with OS buster
2022-06-29 14:40:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:40:57	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS bus...'
2022-06-29 14:41:03	<dancy>	urbanecm: I saw that. Very odd.
2022-06-29 14:41:18	<dancy>	The main thing that has changed recently is disabling opcache revalidation and enabling php-fpm restarts.
2022-06-29 14:41:22	<Lucas_WMDE>	urbanecm: do you remember if scap printed any warnings? (the SAL entry looks normal, at least)
2022-06-29 14:41:26	<dancy>	(unconditional restarts, that is)
2022-06-29 14:41:28	<wikibugs>	'SRE, ''Generated Data Platform, ''Image-Suggestions, ''serviceops, ''Patch-For-Review: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (''akosiaris) This is lacking just the entrypoint.sh item, right?'
2022-06-29 14:41:36	<wikibugs>	'SRE, ''Generated Data Platform, ''Image-Suggestions, ''serviceops, ''Patch-For-Review: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (''akosiaris)'
2022-06-29 14:41:40	<wikibugs>	'SRE, ''Generated Data Platform, ''Image-Suggestions, ''serviceops: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (''akosiaris)'
2022-06-29 14:42:23	<wikibugs>	('PS1) ''Muehlenhoff: ganeti: Add SPDX headers [puppet] - ''https://gerrit.wikimedia.org/r/809616 (https://phabricator.wikimedia.org/T308013)'
2022-06-29 14:43:34	<dancy>	mw1413 has the right wmf-config/InitialiseSettings.php file
2022-06-29 14:43:47	<Lucas_WMDE>	looks like that to me too
2022-06-29 14:43:54	<Lucas_WMDE>	(same sha256sum as mw1391)
2022-06-29 14:44:02	<icinga-wm>	PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
2022-06-29 14:44:15	<Lucas_WMDE>	is it worth restarting php-fpm again?
2022-06-29 14:44:39	<dancy>	I'll run sync-wikiversions
2022-06-29 14:45:09	<XioNoX>	sukhe: all smooth?
2022-06-29 14:45:12	<dancy>	In progress.
2022-06-29 14:45:19	<wikibugs>	('PS1) ''Majavah: Revert "Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq""" [puppet] - ''https://gerrit.wikimedia.org/r/809560'
2022-06-29 14:45:25	<sukhe>	XioNoX: so far yep, running agent on A:dns-rec :)
2022-06-29 14:45:42	<sukhe>	batches of 2, 2 seconds apart. I thought 5 seconds was a bit too much
2022-06-29 14:45:51	<icinga-wm>	RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
2022-06-29 14:46:19	<sukhe>	dns2001 is done, you can check that
2022-06-29 14:46:59	<wikibugs>	('CR) ''Andrew Bogott: [C: ''+2] Revert "Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq""" [puppet] - ''https://gerrit.wikimedia.org/r/809560 (owner: ''Majavah)'
2022-06-29 14:48:06	<XioNoX>	alright
2022-06-29 14:48:57	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1032.eqiad.wmnet with OS buster
2022-06-29 14:49:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:49:04	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS bus...'
2022-06-29 14:49:08	<logmsgbot>	!log dancy@deploy1002 rebuilt and synchronized wikiversions files: Debugging
2022-06-29 14:49:11	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 14:49:18	<XioNoX>	sukhe: all good for 2001
2022-06-29 14:49:19	<wikibugs>	'SRE, ''WMF-General-or-Unknown, ''WMF-Legal, ''Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (''MoritzMuehlenhoff) >>! In T67270#8016999, @jbond wrote: >> In such cases it might make sense to align such files by relicensing to Apache 2 > sta...'
2022-06-29 14:49:25	<dancy>	OK php-fpm restarts finished
2022-06-29 14:49:33	<dancy>	MatMaRex: Any changes?
2022-06-29 14:49:43	<sukhe>	XioNoX: 6/12 :)
2022-06-29 14:49:46	<wikibugs>	('CR) ''Alexandros Kosiaris: [C: ''+2] Setup .gitconfig for mwpresync system user [puppet] - ''https://gerrit.wikimedia.org/r/809297 (https://phabricator.wikimedia.org/T303857) (owner: ''Ahmon Dancy)'
2022-06-29 14:49:47	<dancy>	oh, I just saw things change when I reloaded the page.
2022-06-29 14:50:24	<MatmaRex>	dancy: thanks, looks fixed to me
2022-06-29 14:50:33	<Lucas_WMDE>	I also seem to be getting the new interface consistently now
2022-06-29 14:50:37	<dancy>	so that leaves us with... "what the hell"
2022-06-29 14:50:38	<sukhe>	XioNoX: after this, that leaves us with authdns1001 and authdns2001
2022-06-29 14:50:40	<Lucas_WMDE>	thanks dancy!
2022-06-29 14:51:07	<XioNoX>	sukhe: those two should be the exact same as dnsXXX
2022-06-29 14:51:10	<XioNoX>	just different name
2022-06-29 14:51:10	<sukhe>	yep
2022-06-29 14:51:27	<sukhe>	should I start them in parallel? I don't see any issues
2022-06-29 14:51:46	<MatmaRex>	so what happened here? the files were synced everywhere, but the web service or something didn't "refresh" them?
2022-06-29 14:52:10	<XioNoX>	sukhe: sure
2022-06-29 14:52:36	<dancy>	That's what it seems like. There is caching involved w/ the data in InitialiseSettings.php.
2022-06-29 14:52:58	<dancy>	and someone was telling me about a possibility of bad caching interactions. I'll get more info.
2022-06-29 14:53:19	<MatmaRex>	huh. thanks
2022-06-29 14:54:06	<sukhe>	XioNoX: now running agent on authdns1001
2022-06-29 14:54:43	<XioNoX>	alright
2022-06-29 14:55:49	<sukhe>	dns-rec all done
2022-06-29 14:55:58	<sukhe>	authdns1001 too
2022-06-29 14:56:48	<XioNoX>	awesome
2022-06-29 14:57:18	<XioNoX>	confirmed all good
2022-06-29 14:57:30	<sukhe>	! :D
2022-06-29 14:57:37	<sukhe>	ok let's do the final one
2022-06-29 14:57:49	<sukhe>	and for centrallog, I will do it later in the day
2022-06-29 14:57:52	<sukhe>	I don't think you have to be around for that
2022-06-29 14:57:58	<sukhe>	doing authdns2001
2022-06-29 14:59:22	<wikibugs>	('PS1) ''Clare Ming: Add jawiki, zhwikinews to pilot wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419)'
2022-06-29 15:00:31	<sukhe>	XioNoX: all done :)
2022-06-29 15:00:39	<XioNoX>	sukhe: woot
2022-06-29 15:01:17	<sukhe>	phew!
2022-06-29 15:01:47	<sukhe>	that was something. I guess third time is the charm really is something
2022-06-29 15:01:51	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (''Papaul)'
2022-06-29 15:02:08	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (''Papaul) ''Open→''Resolved Complete'
2022-06-29 15:02:14	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Papaul)'
2022-06-29 15:02:16	<sukhe>	XioNoX: thanks for the help and patience! I am around to monitor things and I will push centrallog as well shortly
2022-06-29 15:02:27	<sukhe>	and then we can fix the prometheus-bird thing, but that's not urgent in any form
2022-06-29 15:02:35	<XioNoX>	sukhe: no, thank you!
2022-06-29 15:02:38	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (''Papaul)'
2022-06-29 15:02:43	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (''Papaul) ''Open→''Resolved complete'
2022-06-29 15:02:50	<wikibugs>	'SRE, ''ops-codfw, ''decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (''Papaul) ''Open→''Resolved complete'
2022-06-29 15:07:52	<icinga-wm>	RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 15:23:28	<wikibugs>	('PS1) ''Muehlenhoff: calico: Assign SPDX headers [puppet] - ''https://gerrit.wikimedia.org/r/809624 (https://phabricator.wikimedia.org/T308013)'
2022-06-29 15:23:30	<wikibugs>	('PS1) ''Muehlenhoff: ores: Assign SPDX headers [puppet] - ''https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013)'
2022-06-29 15:23:32	<wikibugs>	('PS1) ''Muehlenhoff: rancid: Assign SPDX headers [puppet] - ''https://gerrit.wikimedia.org/r/809626 (https://phabricator.wikimedia.org/T308013)'
2022-06-29 15:25:47	<sukhe>	!log upload anycast-healthchecker 0.8.2-1wm1 to apt.wm.o (bullseye) - T310574
2022-06-29 15:25:51	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 15:25:53	<stashbot>	T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574
2022-06-29 15:26:26	<wikibugs>	('PS1) ''Muehlenhoff: jupyterhub: Assign SPDX headers [puppet] - ''https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013)'
2022-06-29 15:31:25	<wikibugs>	('CR) ''Jdlrobson: [C: ''+1] "LGTM" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419) (owner: ''Clare Ming)'
2022-06-29 15:31:35	<wikibugs>	('PS1) ''Muehlenhoff: Add Paul Norman to contributors [puppet] - ''https://gerrit.wikimedia.org/r/809629 (https://phabricator.wikimedia.org/T308013)'
2022-06-29 15:33:32	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 15:43:53	<wikibugs>	('PS3) ''Lucas Werkmeister (WMDE): Increase weights on the language selector statement boosts [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: ''DCausse)'
2022-06-29 15:44:01	<Lucas_WMDE>	since things are looking quiet at the moment, I’ll deploy ^
2022-06-29 15:44:16	<Lucas_WMDE>	shouldn’t have any effect yet, just one less thing to deploy later :)
2022-06-29 15:44:17	<hashar>	Lucas_WMDE: +1 :)
2022-06-29 15:44:46	<wikibugs>	('CR) ''Hashar: [C: ''+1] Increase weights on the language selector statement boosts [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: ''DCausse)'
2022-06-29 15:45:54	<wikibugs>	('CR) ''Lucas Werkmeister (WMDE): [C: ''+2] Increase weights on the language selector statement boosts [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: ''DCausse)'
2022-06-29 15:46:19	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''Papaul)'
2022-06-29 15:46:29	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, ''Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''RobH) >>! In T307399#8037215, @BTullis wrote: > The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has ident...'
2022-06-29 15:47:19	<wikibugs>	('Merged) ''jenkins-bot: Increase weights on the language selector statement boosts [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: ''DCausse)'
2022-06-29 15:48:22	<Lucas_WMDE>	syncing
2022-06-29 15:48:38	<wikibugs>	'SRE, ''Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (''Eevans) >>! In T310980#8036017, @MoritzMuehlenhoff wrote: > Can't we just import the Cassandra 4 debs and use those? The work needs to happen at some point anyway and it's a fresh cluster. Buster...'
2022-06-29 15:48:41	<Lucas_WMDE>	(briefly tested on mwdebug1001 that searchEntities.php didn’t crash)
2022-06-29 15:50:19	<wikibugs>	'SRE, ''ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (''RobH) ''Open→''Resolved Indeed, I see no more errors since Arelion investigated earlier this week and since then the errors have cleared up. This is great that its resolved, but not so great in that no...'
2022-06-29 15:51:40	<logmsgbot>	!log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:808941|Increase weights on the language selector statement boosts (T307869)]] (expected to be a no-op) (duration: 03m 21s)
2022-06-29 15:51:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 15:51:47	<stashbot>	T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869
2022-06-29 15:53:09	<Lucas_WMDE>	done
2022-06-29 15:53:43	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 15:53:47	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 15:54:42	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 15:54:43	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 15:54:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 15:54:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 15:54:59	<wikibugs>	'SRE, ''Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (''elukey) I'd be willing to work on it, but my fear is that it becomes a projects in itself that takes a long time to finish (without proper planning and resource allocation). If everybody agrees...'
2022-06-29 15:55:41	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 15:55:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 15:58:03	<wikibugs>	'SRE, ''DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (''BTullis) I have been doing some testing of an install of a server with an H750 card under ticket T307399 recently. One thing that I have ascertained, with the help of @fgiunchedi, is that the swapping o...'
2022-06-29 15:58:55	<wikibugs>	('CR) ''Ottomata: [C: ''+1] jupyterhub: Assign SPDX headers [puppet] - ''https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013) (owner: ''Muehlenhoff)'
2022-06-29 16:00:37	<wikibugs>	('PS1) ''Majavah: Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - ''https://gerrit.wikimedia.org/r/809633'
2022-06-29 16:01:28	<wikibugs>	('CR) ''Herron: [C: ''+1] "LGTM, please see optional nit inline" [puppet] - ''https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: ''Cwhite)'
2022-06-29 16:07:54	<wikibugs>	('PS1) ''Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643)'
2022-06-29 16:10:14	<icinga-wm>	PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 16:14:28	<wikibugs>	('PS2) ''Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643)'
2022-06-29 16:15:28	<wikibugs>	('CR) ''CI reject: [V: ''-1] [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 16:15:59	<wikibugs>	('PS3) ''Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643)'
2022-06-29 16:22:08	<logmsgbot>	!log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648
2022-06-29 16:22:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 16:22:15	<stashbot>	T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
2022-06-29 16:22:54	<wikibugs>	('PS3) ''Cwhite: loki: add loki as an optional grafana component [puppet] - ''https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826)'
2022-06-29 16:23:25	<wikibugs>	('CR) ''Cwhite: loki: add loki as an optional grafana component (''3 comments) [puppet] - ''https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: ''Cwhite)'
2022-06-29 16:24:13	<wikibugs>	('PS4) ''Cwhite: loki: add loki as an optional grafana component [puppet] - ''https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826)'
2022-06-29 16:25:11	<wikibugs>	('PS5) ''Cwhite: loki: add loki as an optional grafana component [puppet] - ''https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826)'
2022-06-29 16:27:07	<sukhe>	please note that Puppet is intentionally disabled on centrallog2002 (as also in the disable message)
2022-06-29 16:27:23	<sukhe>	I will enable it in the afternoon, but please don't do it before that as it will break anycast. thank you :)
2022-06-29 16:28:40	<icinga-wm>	RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 16:33:03	<wikibugs>	('PS1) ''PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/809637'
2022-06-29 16:33:10	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 16:36:13	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298560)', diff saved to https://phabricator.wikimedia.org/P30624 and previous config saved to /var/cache/conftool/dbconfig/20220629-163612-ladsgroup.json
2022-06-29 16:36:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 16:36:20	<stashbot>	T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
2022-06-29 16:42:42	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 16:42:58	<wikibugs>	('PS1) ''Eevans: Assign new password to Cassandra superuser [labs/private] - ''https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652)'
2022-06-29 16:43:26	<wikibugs>	('PS2) ''Eevans: Assign new password to Cassandra superuser [labs/private] - ''https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652)'
2022-06-29 16:48:12	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''Papaul)'
2022-06-29 16:51:03	<wikibugs>	('CR) ''Dduvall: [C: ''+2] blubberoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/809637 (owner: ''PipelineBot)'
2022-06-29 16:51:18	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30625 and previous config saved to /var/cache/conftool/dbconfig/20220629-165117-ladsgroup.json
2022-06-29 16:51:22	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 16:51:34	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering, ''Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (''BTullis) I tried the `partman/custom/kafka-jumbo.cfg` partman recipe on this how, but it didn't seem to be applied. When I checked the log I saw thi...'
2022-06-29 16:51:41	<wikibugs>	('PS1) ''Btullis: Reduce the minimum size of /srv in the kafka-jumbo recipe [puppet] - ''https://gerrit.wikimedia.org/r/809640 (https://phabricator.wikimedia.org/T307399)'
2022-06-29 16:54:04	<wikibugs>	('Merged) ''jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/809637 (owner: ''PipelineBot)'
2022-06-29 16:56:14	<wikibugs>	('PS1) ''RobH: testing h750 recipes [puppet] - ''https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T297913)'
2022-06-29 16:56:45	<wikibugs>	('CR) ''Btullis: [C: ''+2] Reduce the minimum size of /srv in the kafka-jumbo recipe [puppet] - ''https://gerrit.wikimedia.org/r/809640 (https://phabricator.wikimedia.org/T307399) (owner: ''Btullis)'
2022-06-29 16:56:50	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (''RobH) >>! In T297913#8037888, @gerritbot wrote: > Change 809641 had a related patch set uploaded (by RobH; author: RobH): > %%%[operations/puppet@production] testing h750 recipes%%%...'
2022-06-29 16:57:08	<wikibugs>	('PS2) ''RobH: testing h750 recipes [puppet] - ''https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T302937)'
2022-06-29 16:57:16	<wikibugs>	('PS3) ''RobH: testing h750 recipes [puppet] - ''https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T302937)'
2022-06-29 16:58:01	<logmsgbot>	!log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
2022-06-29 16:58:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 16:58:15	<wikibugs>	('CR) ''RobH: [C: ''+2] testing h750 recipes [puppet] - ''https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T302937) (owner: ''RobH)'
2022-06-29 16:58:25	<logmsgbot>	!log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
2022-06-29 16:58:29	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 16:59:10	<logmsgbot>	!log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply
2022-06-29 16:59:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 16:59:46	<logmsgbot>	!log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply
2022-06-29 16:59:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:00:00	<logmsgbot>	!log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply
2022-06-29 17:00:03	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:00:30	<logmsgbot>	!log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply
2022-06-29 17:00:34	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:04:04	<logmsgbot>	!log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye
2022-06-29 17:04:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:04:11	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye'
2022-06-29 17:06:23	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30626 and previous config saved to /var/cache/conftool/dbconfig/20220629-170622-ladsgroup.json
2022-06-29 17:06:27	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:09:43	<wikibugs>	('PS3) ''Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - ''https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956)'
2022-06-29 17:10:10	<wikibugs>	('CR) ''Hnowlan: api-gateway: allow discovery services to set custom rate limits (''5 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: ''Hnowlan)'
2022-06-29 17:11:30	<icinga-wm>	RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 17:15:36	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (''RobH) >>! In T302937#8032403, @fgiunchedi wrote: > As mentioned at the SRE meeting @BTullis is also looking into this for DSE hosts (review at https://gerrit.wikimedia.org/r/c/operations/puppet/+/8...'
2022-06-29 17:17:08	<sukhe>	python3-anycast-healthchecker : Depends: python3-pythonjsonlogger but it is not going to be installed
2022-06-29 17:17:24	<sukhe>	except the package it seems like was called python3-json-logger, with the same upstream source
2022-06-29 17:17:29	<sukhe>	time for another package rebuild :]
2022-06-29 17:17:39	<sukhe>	sharing this because Icinga will complain about dpkg status shortly
2022-06-29 17:17:44	<sukhe>	on centrallog2001
2022-06-29 17:18:05	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (''MoritzMuehlenhoff) >>! In T302937#8037948, @RobH wrote: >>>! In T302937#8032403, @fgiunchedi wrote: >> As mentioned at the SRE meeting @BTullis is also looking into this for DSE hosts (review at ht...'
2022-06-29 17:18:15	<logmsgbot>	!log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1007.eqiad.wmnet with OS bullseye
2022-06-29 17:18:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:18:21	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (FAIL) - Removed f...'
2022-06-29 17:19:11	<logmsgbot>	!log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye
2022-06-29 17:19:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:19:18	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye'
2022-06-29 17:21:28	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298560)', diff saved to https://phabricator.wikimedia.org/P30627 and previous config saved to /var/cache/conftool/dbconfig/20220629-172127-ladsgroup.json
2022-06-29 17:21:29	<logmsgbot>	!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
2022-06-29 17:21:33	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:21:35	<stashbot>	T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
2022-06-29 17:21:38	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:21:43	<logmsgbot>	!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
2022-06-29 17:21:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:22:08	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''Papaul)'
2022-06-29 17:31:00	<icinga-wm>	PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 2054 MB (3% inode=97%): /tmp 2054 MB (3% inode=97%): /var/tmp 2054 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
2022-06-29 17:31:14	<sukhe>	!log running puppet agent on centrallog2002 to finalize T310574
2022-06-29 17:31:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:31:19	<stashbot>	T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574
2022-06-29 17:31:23	<logmsgbot>	!log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage
2022-06-29 17:31:26	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:31:46	<robh>	yes my server, rise... rissssssssseeeeee
2022-06-29 17:33:06	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 17:34:53	<logmsgbot>	!log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage
2022-06-29 17:34:56	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:38:59	<wikibugs>	'SRE, ''Infrastructure-Foundations, ''Traffic, ''netops: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (''ssingh) ` ===== NODE GROUP ===== (40) authdns[1001,2001].wikimedia.org,c...'
2022-06-29 17:40:20	<wikibugs>	('PS3) ''Slyngshede: profile::prometheus::ops enable Ganeti metric scraping. [puppet] - ''https://gerrit.wikimedia.org/r/809594'
2022-06-29 17:45:09	<wikibugs>	('CR) ''Slyngshede: profile::prometheus::ops enable Ganeti metric scraping. (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/809594 (owner: ''Slyngshede)'
2022-06-29 17:45:42	<wikibugs>	('PS4) ''MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - ''https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099)'
2022-06-29 17:46:13	<wikibugs>	'SRE, ''Infrastructure-Foundations, ''CAS-SSO, ''Patch-For-Review: Update CAS to 6.5 - https://phabricator.wikimedia.org/T311235 (''ssingh) p:''Triage→''Medium'
2022-06-29 17:48:35	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (''RobH)'
2022-06-29 17:48:41	<wikibugs>	('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36124/console" [puppet] - ''https://gerrit.wikimedia.org/r/809594 (owner: ''Slyngshede)'
2022-06-29 17:51:01	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.dns.netbox
2022-06-29 17:51:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:53:54	<wikibugs>	('CR) ''Jsn.sherman: [C: ''+1] "LGTM!" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 17:54:23	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''Cmjohnson)'
2022-06-29 17:54:29	<wikibugs>	('CR) ''EllenR: [C: ''+1] "lgtm" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 17:54:46	<wikibugs>	('PS1) ''Dduvall: gitlab_runner: Allow internal docker DNS traffic [puppet] - ''https://gerrit.wikimedia.org/r/809650 (https://phabricator.wikimedia.org/T311241)'
2022-06-29 17:54:48	<wikibugs>	('CR) ''Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 17:55:15	<wikibugs>	('CR) ''Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 17:55:26	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 17:55:29	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 17:59:19	<wikibugs>	('PS1) ''Ebernhardson: metastore: Remove versioning from saneitize updates [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809564'
2022-06-29 17:59:24	<wikibugs>	('CR) ''EllenR: [C: ''+1] "mine are too!" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 17:59:43	<wikibugs>	('CR) ''Dduvall: "dzahn was able to get things working again with a docker/firewall restart, so I'm not sure this is still necessary." [puppet] - ''https://gerrit.wikimedia.org/r/809650 (https://phabricator.wikimedia.org/T311241) (owner: ''Dduvall)'
2022-06-29 18:00:05	<jouncebot>	dduvall and hashar: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T1800).
2022-06-29 18:00:06	<jouncebot>	dduvall and hashar: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T1800)
2022-06-29 18:02:17	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 18:02:21	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:02:36	<wikibugs>	('PS1) ''Dduvall: group1 wikis to 1.39.0-wmf.18 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809652 (https://phabricator.wikimedia.org/T308071)'
2022-06-29 18:02:38	<wikibugs>	('CR) ''Dduvall: [C: ''+2] group1 wikis to 1.39.0-wmf.18 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809652 (https://phabricator.wikimedia.org/T308071) (owner: ''Dduvall)'
2022-06-29 18:03:20	<wikibugs>	('Merged) ''jenkins-bot: group1 wikis to 1.39.0-wmf.18 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809652 (https://phabricator.wikimedia.org/T308071) (owner: ''Dduvall)'
2022-06-29 18:04:22	<JJMC89>	db1128 is pooled and has a large amount of lag
2022-06-29 18:05:07	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 18:06:01	<JJMC89>	Amir1: ^
2022-06-29 18:06:41	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 18:06:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:07:05	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 18:07:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:07:25	<logmsgbot>	!log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.18 refs T308071
2022-06-29 18:07:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:07:29	<stashbot>	T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071
2022-06-29 18:08:00	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 18:08:01	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 18:08:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:08:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:08:23	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 18:08:54	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 18:08:58	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:09:47	<logmsgbot>	!log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye
2022-06-29 18:09:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:09:53	<wikibugs>	'SRE, ''DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (FAIL) - Removed from Puppet and PuppetD...'
2022-06-29 18:11:00	<logmsgbot>	!log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.18 refs T308071 (duration: 03m 35s)
2022-06-29 18:11:06	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:11:28	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1051.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:11:29	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1049.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:11:29	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1048.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:11:29	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1052.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:11:29	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1053.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:11:29	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1050.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:11:33	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:11:37	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:11:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:11:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:11:48	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:11:49	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (''RobH) So post dumpsdata1007 install it fails puppet due to megaraid monitoring items it seems?'
2022-06-29 18:11:52	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:12:35	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (''Cmjohnson)'
2022-06-29 18:12:37	<JJMC89>	Amir1: db1128 is pooled and has a large amount of lag
2022-06-29 18:12:39	<logmsgbot>	!log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye
2022-06-29 18:12:43	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:12:45	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye'
2022-06-29 18:12:50	<Amir1>	JJMC89: on it
2022-06-29 18:13:59	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 18:14:03	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:14:36	<jinxer-wm>	(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
2022-06-29 18:14:39	<logmsgbot>	!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1128', diff saved to https://phabricator.wikimedia.org/P30628 and previous config saved to /var/cache/conftool/dbconfig/20220629-181438-ladsgroup.json
2022-06-29 18:14:43	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:14:56	<Amir1>	JJMC89: depooled now
2022-06-29 18:14:57	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 18:14:59	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 18:15:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:15:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:15:14	<JJMC89>	thanks
2022-06-29 18:15:42	<logmsgbot>	!log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye
2022-06-29 18:15:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:15:48	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata10...'
2022-06-29 18:18:31	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (''MoritzMuehlenhoff) >>! In T297913#8038074, @RobH wrote: > So post dumpsdata1007 install it fails puppet due to megaraid monitoring items it seems? That's expected, we still need to...'
2022-06-29 18:18:45	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 18:18:49	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:19:29	<icinga-wm>	RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 18:19:49	<wikibugs>	'SRE, ''LDAP-Access-Requests, ''Release-Engineering-Team (Radar): Grant Access to wmf for demon - https://phabricator.wikimedia.org/T311661 (''thcipriani)'
2022-06-29 18:20:07	<wikibugs>	('PS3) ''Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - ''https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651)'
2022-06-29 18:20:56	<wikibugs>	('CR) ''Ssingh: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36125/console" [puppet] - ''https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 18:21:41	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.dns.netbox
2022-06-29 18:21:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:24:11	<logmsgbot>	!log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:24:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:25:39	<wikibugs>	('PS3) ''Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - ''https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651)'
2022-06-29 18:25:55	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''Papaul)'
2022-06-29 18:26:29	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 18:26:33	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:27:06	<logmsgbot>	!log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648
2022-06-29 18:27:11	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:27:11	<stashbot>	T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
2022-06-29 18:27:25	<wikibugs>	('CR) ''Ssingh: [V: ''+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36126/console"; [puppet] - ''https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 18:27:26	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2153.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 18:27:30	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:27:50	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (''RobH)'
2022-06-29 18:28:05	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1050.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:28:08	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1053.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:28:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:28:11	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1052.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:28:12	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1048.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:28:12	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1051.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:28:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:28:14	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1049.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 18:28:15	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:28:20	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:28:23	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:28:27	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:28:28	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2154.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 18:28:32	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:29:47	<wikibugs>	('CR) ''Ssingh: [V: ''+1] "Rebased on top of I95c0009bc06 to allow for backward compatibility." [puppet] - ''https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 18:29:53	<wikibugs>	('CR) ''Ssingh: [V: ''+1] "Rebased on top of I95c0009bc06 to allow for backward compatibility." [puppet] - ''https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: ''Ssingh)'
2022-06-29 18:31:22	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 18:31:26	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:33:46	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (''RobH) Something is very wrong with dumps1006, when I go to set it up, it doesn't see a 10G NIC, only the 1G. Rather than pollute this setup task, I'll create a high prio...'
2022-06-29 18:34:24	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 18:34:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:48:35	<wikibugs>	('PS1) ''Cmjohnson: Adding new cloudvirt hosts to site.pp and netboot.cfg [puppet] - ''https://gerrit.wikimedia.org/r/809656 (https://phabricator.wikimedia.org/T299574)'
2022-06-29 18:49:10	<wikibugs>	('CR) ''CI reject: [V: ''-1] Adding new cloudvirt hosts to site.pp and netboot.cfg [puppet] - ''https://gerrit.wikimedia.org/r/809656 (https://phabricator.wikimedia.org/T299574) (owner: ''Cmjohnson)'
2022-06-29 18:49:21	<wikibugs>	('Abandoned) ''Cmjohnson: Adding new cloudvirt hosts to site.pp and netboot.cfg [puppet] - ''https://gerrit.wikimedia.org/r/809656 (https://phabricator.wikimedia.org/T299574) (owner: ''Cmjohnson)'
2022-06-29 18:54:42	<wikibugs>	'SRE, ''Infrastructure-Foundations, ''Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (''Krenair)'
2022-06-29 18:56:02	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2154.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 18:56:06	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:56:14	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2153.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 18:56:17	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 18:58:50	<wikibugs>	('PS1) ''Cmjohnson: Adding cloudvirt servers cloudvirt servers [puppet] - ''https://gerrit.wikimedia.org/r/809659 (https://phabricator.wikimedia.org/T305194)'
2022-06-29 18:59:25	<wikibugs>	('CR) ''CI reject: [V: ''-1] Adding cloudvirt servers cloudvirt servers [puppet] - ''https://gerrit.wikimedia.org/r/809659 (https://phabricator.wikimedia.org/T305194) (owner: ''Cmjohnson)'
2022-06-29 19:03:47	<wikibugs>	('Abandoned) ''Cmjohnson: Adding cloudvirt servers cloudvirt servers [puppet] - ''https://gerrit.wikimedia.org/r/809659 (https://phabricator.wikimedia.org/T305194) (owner: ''Cmjohnson)'
2022-06-29 19:05:42	<wikibugs>	('PS1) ''Cmjohnson: Adding cloudvirts to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194)'
2022-06-29 19:06:16	<wikibugs>	('CR) ''CI reject: [V: ''-1] Adding cloudvirts to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194) (owner: ''Cmjohnson)'
2022-06-29 19:08:34	<wikibugs>	('PS2) ''Cmjohnson: Adding cloudvirts to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194)'
2022-06-29 19:09:36	<wikibugs>	('PS3) ''Cmjohnson: Adding cloudvirts to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194)'
2022-06-29 19:10:22	<wikibugs>	('CR) ''Cmjohnson: [C: ''+2] Adding cloudvirts to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194) (owner: ''Cmjohnson)'
2022-06-29 19:10:58	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2155.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:11:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:11:18	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:11:21	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:11:41	<wikibugs>	('PS1) ''Cmjohnson: Adding cloudvirts to netboot [puppet] - ''https://gerrit.wikimedia.org/r/809661 (https://phabricator.wikimedia.org/T305194)'
2022-06-29 19:12:31	<wikibugs>	('CR) ''Cmjohnson: [C: ''+2] Adding cloudvirts to netboot [puppet] - ''https://gerrit.wikimedia.org/r/809661 (https://phabricator.wikimedia.org/T305194) (owner: ''Cmjohnson)'
2022-06-29 19:15:41	<logmsgbot>	!log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:15:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:15:58	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:16:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:19:48	<wikibugs>	('PS1) ''BryanDavis: developer-portal: Bump container version to 2022-06-28-153911-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/809665'
2022-06-29 19:22:07	<logmsgbot>	!log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:22:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:23:46	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:23:49	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:23:53	<wikibugs>	('PS1) ''Zabe: Stop setting wgCentralAuthAutoNew [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809666 (https://phabricator.wikimedia.org/T257079)'
2022-06-29 19:24:06	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Patch-For-Review, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (''Cmjohnson)'
2022-06-29 19:24:38	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Patch-For-Review, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''Cmjohnson)'
2022-06-29 19:25:21	<wikibugs>	('CR) ''BryanDavis: [C: ''+2] developer-portal: Bump container version to 2022-06-28-153911-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/809665 (owner: ''BryanDavis)'
2022-06-29 19:26:36	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (''RobH) a:''RobH→''Jclark-ctr Ok, I updated the bios and then foolishly updated idrac, and now https implementation is broken for idrac. {F35287417} @Jclark-ctr (or @...'
2022-06-29 19:26:50	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2155.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:26:54	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:28:03	<logmsgbot>	!log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:28:06	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:28:49	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:28:53	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:29:22	<wikibugs>	('Merged) ''jenkins-bot: developer-portal: Bump container version to 2022-06-28-153911-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/809665 (owner: ''BryanDavis)'
2022-06-29 19:30:42	<wikibugs>	'SRE, ''DC-Ops, ''Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (''RobH) >>! In T297913#8038091, @MoritzMuehlenhoff wrote: >>>! In T297913#8038074, @RobH wrote: >> So post dumpsdata1007 install it fails puppet due to megaraid monitoring items it se...'
2022-06-29 19:31:04	<logmsgbot>	!log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
2022-06-29 19:31:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:31:28	<logmsgbot>	!log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
2022-06-29 19:31:32	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:31:45	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (''RobH)'
2022-06-29 19:32:04	<logmsgbot>	!log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
2022-06-29 19:32:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:32:41	<logmsgbot>	!log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
2022-06-29 19:32:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:32:54	<logmsgbot>	!log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
2022-06-29 19:32:57	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:33:57	<logmsgbot>	!log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
2022-06-29 19:34:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:34:15	<icinga-wm>	RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 19:35:01	<wikibugs>	('PS1) ''Zabe: Stop setting wgBabelCentralApi [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079)'
2022-06-29 19:36:04	<logmsgbot>	!log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:36:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:36:23	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1048.eqiad.wmnet with OS bullseye
2022-06-29 19:36:26	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:36:32	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1048.eqiad.wmnet w...'
2022-06-29 19:36:59	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2158.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:37:03	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:37:21	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2157.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 19:37:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:41:21	<icinga-wm>	PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 19:46:20	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1049.eqiad.wmnet with OS bullseye
2022-06-29 19:46:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:46:26	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1049.eqiad.wmnet w...'
2022-06-29 19:46:31	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1050.eqiad.wmnet with OS bullseye
2022-06-29 19:46:35	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:46:37	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1050.eqiad.wmnet w...'
2022-06-29 19:46:42	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1051.eqiad.wmnet with OS bullseye
2022-06-29 19:46:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:46:48	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1051.eqiad.wmnet w...'
2022-06-29 19:47:01	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1052.eqiad.wmnet with OS bullseye
2022-06-29 19:47:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:47:05	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1052.eqiad.wmnet w...'
2022-06-29 19:47:10	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bullseye
2022-06-29 19:47:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:47:16	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1053.eqiad.wmnet w...'
2022-06-29 19:49:48	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
2022-06-29 19:49:52	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:53:15	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
2022-06-29 19:53:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:56:17	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (''Cmjohnson) @ayounsi @taavi @Andrew Has a determination on public vs private VLAN been decided?'
2022-06-29 19:57:13	<wikibugs>	('PS1) ''Bartosz Dziewoński: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809688 (https://phabricator.wikimedia.org/T311665)'
2022-06-29 19:57:25	<wikibugs>	('CR) ''Dzahn: [C: ''+2] gitlab_runner: Allow internal docker DNS traffic [puppet] - ''https://gerrit.wikimedia.org/r/809650 (https://phabricator.wikimedia.org/T311241) (owner: ''Dduvall)'
2022-06-29 19:57:33	<wikibugs>	('PS1) ''Bartosz Dziewoński: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809689 (https://phabricator.wikimedia.org/T311665)'
2022-06-29 19:57:43	<wikibugs>	('PS1) ''Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597)'
2022-06-29 19:57:46	<wikibugs>	('CR) ''Samtar: "Code is sound, but recommend holding off merging this per T310974#8034803" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: ''Stang)'
2022-06-29 19:57:53	<wikibugs>	('PS2) ''Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - ''https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: ''Stang)'
2022-06-29 19:57:59	<wikibugs>	('PS1) ''Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597)'
2022-06-29 19:58:26	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (''Cmjohnson) @ayounsi @Andrew Has a determination on public vs private VLAN been decided? Also, @andrew which partman recipe...'
2022-06-29 19:59:45	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
2022-06-29 19:59:49	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 19:59:53	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage
2022-06-29 19:59:55	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage
2022-06-29 19:59:56	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:00:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:00:05	<jouncebot>	RoanKattouw, Urbanecm, and cjming: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T2000).
2022-06-29 20:00:05	<jouncebot>	cjming, mewoph, eigyan, ebernhardson, zabe, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2022-06-29 20:00:20	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
2022-06-29 20:00:22	<eigyan>	Greetings Everyone!
2022-06-29 20:00:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:00:35	<cjming>	hi all - i can deploy since i'm on the list
2022-06-29 20:00:47	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (''Cmjohnson) @cmooney are you requesting cloudnets to be moved to a different switch or is this an open discussion with @Andrew'
2022-06-29 20:01:08	<zabe>	gey
2022-06-29 20:01:12	<cjming>	i'll do them in order so I'll start with mine
2022-06-29 20:01:12	<zabe>	s/gey/hey
2022-06-29 20:01:15	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage
2022-06-29 20:01:18	<mewoph>	cjming: perfect thanks!
2022-06-29 20:01:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:01:32	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] Add jawiki, zhwikinews to pilot wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419) (owner: ''Clare Ming)'
2022-06-29 20:02:04	<MatmaRex>	hi
2022-06-29 20:02:20	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2157.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 20:02:22	<wikibugs>	('Merged) ''jenkins-bot: Add jawiki, zhwikinews to pilot wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419) (owner: ''Clare Ming)'
2022-06-29 20:02:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:02:24	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2158.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 20:02:27	<logmsgbot>	!log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage
2022-06-29 20:02:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:02:30	<MatmaRex>	i'm aware that i overloaded the window slightly and that i am late, so if it can't be done, it's okay if you drop my patches
2022-06-29 20:02:32	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:02:37	<MatmaRex>	(although i hope we can include them)
2022-06-29 20:02:54	<ebernhardson>	\o
2022-06-29 20:03:22	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
2022-06-29 20:03:26	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:03:58	<wikibugs>	('PS1) ''Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525)'
2022-06-29 20:04:16	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2159.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 20:04:20	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:04:54	<cjming>	thanks MatmaRex: let's see where we end up during the window -- hopefully we can do all of your patches
2022-06-29 20:05:07	<wikibugs>	('CR) ''Ottomata: [V: ''+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36127/console"; [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: ''Ottomata)'
2022-06-29 20:05:28	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage
2022-06-29 20:05:32	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:05:57	<wikibugs>	('CR) ''BPirkle: [C: ''+1] "LGTM, I made an additional comment on naming, but am happy to merge without changes if, after a bit more thought, you're happy with it as-" [deployment-charts] - ''https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: ''Hnowlan)'
2022-06-29 20:06:09	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809550 (https://phabricator.wikimedia.org/T311467) (owner: ''Kosta Harlan)'
2022-06-29 20:06:11	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''Papaul)'
2022-06-29 20:06:31	<cjming>	mewoph: starting on your patches
2022-06-29 20:07:04	<mewoph>	cjming: 👍
2022-06-29 20:07:26	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage
2022-06-29 20:07:30	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:07:34	<logmsgbot>	!log cjming@deploy1002 Synchronized wmf-config/config: Config: [[gerrit:809620\|Add jawiki, zhwikinews to pilot wikis (T311419)]] (duration: 03m 31s)
2022-06-29 20:07:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:07:39	<stashbot>	T311419: Change default skin on jawiki, zhwikinews to Vector (2022) - https://phabricator.wikimedia.org/T311419
2022-06-29 20:08:12	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2160.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 20:08:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:09:30	<wikibugs>	('PS1) ''Papaul: Add new db nodes to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927)'
2022-06-29 20:09:55	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 20:09:58	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
2022-06-29 20:09:59	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:10:03	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:10:05	<wikibugs>	('CR) ''CI reject: [V: ''-1] Add new db nodes to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927) (owner: ''Papaul)'
2022-06-29 20:10:54	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 20:10:55	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 20:10:57	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:11:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:11:02	<wikibugs>	('PS2) ''Papaul: Add new db nodes to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927)'
2022-06-29 20:11:02	<mewoph>	cjming: is it ok to +2 the wmf17 patch at the same time or do we have to wait for the other one to be merged first? our tests take ~20min and i don't want to take up the entire window since there are a lot of patches scheduled
2022-06-29 20:11:05	<logmsgbot>	!log cjming@deploy1002 Synchronized dblists/desktop-improvements.dblist: Config: [[gerrit:809620\|Add jawiki, zhwikinews to pilot wikis (T311419)]] (duration: 03m 23s)
2022-06-29 20:11:23	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:11:30	<cjming>	mewoph: i was just thinking to sync them together -- let's do it
2022-06-29 20:11:43	<mewoph>	cjming: cool thanks!
2022-06-29 20:11:44	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467) (owner: ''Kosta Harlan)'
2022-06-29 20:11:56	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 20:12:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:12:42	<wikibugs>	('PS2) ''Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525)'
2022-06-29 20:13:41	<wikibugs>	('CR) ''Ottomata: [V: ''+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36128/console" [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: ''Ottomata)'
2022-06-29 20:14:04	<wikibugs>	'SRE, ''serviceops-collab, ''Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (''Dzahn)'
2022-06-29 20:14:32	<wikibugs>	'SRE, ''serviceops-collab, ''Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (''Dzahn) @Arnoldokoth Any reason to keep it open?'
2022-06-29 20:15:38	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1048.eqiad.wmnet with OS bullseye
2022-06-29 20:15:42	<wikibugs>	('PS3) ''Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525)'
2022-06-29 20:15:42	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:15:44	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1048.eqiad.wmnet with...'
2022-06-29 20:15:51	<wikibugs>	('PS4) ''Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525)'
2022-06-29 20:16:39	<mutante>	!log LDAP - mwmaint1002 - added demon to wmf group (T311661)
2022-06-29 20:16:43	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:16:44	<stashbot>	T311661: Grant Access to wmf for demon - https://phabricator.wikimedia.org/T311661
2022-06-29 20:17:08	<wikibugs>	('CR) ''Ottomata: [V: ''+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36129/console" [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: ''Ottomata)'
2022-06-29 20:17:23	<wikibugs>	'SRE, ''LDAP-Access-Requests, ''Release-Engineering-Team (Radar): Grant Access to wmf for demon - https://phabricator.wikimedia.org/T311661 (''Dzahn) ''Open→''Resolved a:''Dzahn done. Chad is already in shell access group so there is no puppet code change needed for this. added to wmf'
2022-06-29 20:19:18	<wikibugs>	('CR) ''Ottomata: [V: ''+1 C: ''+2] analytics test cluster presto - configure iceberg with kerberos support [puppet] - ''https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: ''Ottomata)'
2022-06-29 20:20:19	<cjming>	20 mins for CI -- oof
2022-06-29 20:21:10	<mutante>	!log restarting docker on all 6 gitlab-runners via cumin T311241
2022-06-29 20:21:15	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:21:17	<stashbot>	T311241: DNS/networking not working on Trusted Runners - https://phabricator.wikimedia.org/T311241
2022-06-29 20:21:47	<MatmaRex>	cjming: you could +2 several patches at once, so that the CI can run in parallel
2022-06-29 20:22:14	<MatmaRex>	oh, you already talked about it above
2022-06-29 20:23:49	<cjming>	MatmaRex: ya - thanks -- when we get to yours, presumably we can +2 all of them at the same time
2022-06-29 20:24:05	<MatmaRex>	yeah
2022-06-29 20:24:06	<wikibugs>	('Merged) ''jenkins-bot: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809550 (https://phabricator.wikimedia.org/T311467) (owner: ''Kosta Harlan)'
2022-06-29 20:24:59	<icinga-wm>	PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 20:25:00	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1049.eqiad.wmnet with OS bullseye
2022-06-29 20:25:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:25:07	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1049.eqiad.wmnet with...'
2022-06-29 20:25:28	<cjming>	mewoph: your .18 patch is on mwdebug1002 - can you test?
2022-06-29 20:25:37	<mewoph>	cjming: looking now
2022-06-29 20:25:40	<MatmaRex>	well, you could even +2 them now, AFAIK nothing bad will happen if they finish and get merged early while you're deploying something else, the actual deployment is still manual, right? (as long as we don't forget about them after the deployment window ends)
2022-06-29 20:26:57	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS buster
2022-06-29 20:26:58	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1050.eqiad.wmnet with OS bullseye
2022-06-29 20:27:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:27:03	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster'
2022-06-29 20:27:06	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:27:09	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1050.eqiad.wmnet with...'
2022-06-29 20:27:09	<icinga-wm>	RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 20:28:08	<cjming>	MatmaRex: the only thing that concerns me is if something needs reverting - so i tend to do them linearly to keep what i'm doing straight which isn't usually an issue with a few patches in the window -- with this many, i'm not sure
2022-06-29 20:28:26	<mewoph>	cjming: lgtm
2022-06-29 20:28:31	<cjming>	cool -syncing
2022-06-29 20:28:49	<MatmaRex>	yeah, makes sense
2022-06-29 20:29:40	<cjming>	mewoph: your .17 patch -- in zuul is that a non-voting failure?
2022-06-29 20:30:39	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''Cmjohnson)'
2022-06-29 20:31:01	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1051.eqiad.wmnet with OS bullseye
2022-06-29 20:31:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:31:06	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1051.eqiad.wmnet with...'
2022-06-29 20:32:09	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 20:32:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:32:21	<mewoph>	cjming: i think it's voting but the test that failed is not related to the change
2022-06-29 20:32:29	<logmsgbot>	!log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/GrowthExperiments/modules/ext.growthExperiments.StructuredTask/TargetInitializer.js: Backport: [[gerrit:809550|Structured task: Add 'cancel' to the list of allowed commands (T311467)]] (duration: 03m 37s)
2022-06-29 20:32:34	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:32:34	<stashbot>	T311467: [wmf.17-mobile] "Suggestions" label has no padding - https://phabricator.wikimedia.org/T311467
2022-06-29 20:32:54	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 20:32:55	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 20:32:57	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:33:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:33:11	<cjming>	mewoph: your .18 patch should be live
2022-06-29 20:33:40	<cjming>	eigyan: i think you're here - i'm going to go ahead and start your patch
2022-06-29 20:33:55	<eigyan>	great..thanks mewoph
2022-06-29 20:33:56	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 20:34:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:34:55	<wikibugs>	('CR) ''CI reject: [V: ''-1] Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467) (owner: ''Kosta Harlan)'
2022-06-29 20:36:00	<cjming>	mewoph: i'll try again -- in the meantime i'll continue with the next patch
2022-06-29 20:36:09	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2160.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 20:36:11	<mewoph>	cjming: we can skip the wmf17 for this window, don't want to take up any more time and wmf18 is going out tmr anyway :(
2022-06-29 20:36:12	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2159.mgmt.codfw.wmnet with reboot policy FORCED
2022-06-29 20:36:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:36:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:36:26	<icinga-wm>	PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 20:36:39	<cjming>	mewoph: ok -- lmk if you change your mind - i can check with you later
2022-06-29 20:36:50	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 20:37:32	<logmsgbot>	!log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS buster
2022-06-29 20:37:36	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:37:37	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec...'
2022-06-29 20:38:09	<wikibugs>	('Merged) ''jenkins-bot: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: ''Eigyan)'
2022-06-29 20:38:50	<cjming>	eigyan: your patch should be on mwdebug1002 - can you check?
2022-06-29 20:39:08	<eigyan>	checking now cjming thank you!
2022-06-29 20:41:02	<wikibugs>	('CR) ''Papaul: [C: ''+2] Add new db nodes to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927) (owner: ''Papaul)'
2022-06-29 20:41:08	<eigyan>	cjming everything looks 💯
2022-06-29 20:41:16	<cjming>	cool - syncing now
2022-06-29 20:41:28	<cjming>	ebernhardson: i think you're here - is yours a no-op?
2022-06-29 20:41:35	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS buster
2022-06-29 20:41:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:41:41	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster'
2022-06-29 20:41:56	<ebernhardson>	cjming: hmm, it shouldn't be. sec
2022-06-29 20:42:07	<wikibugs>	('CR) ''Cwhite: [C: ''+2] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1003/36130/" [puppet] - ''https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: ''Cwhite)'
2022-06-29 20:42:09	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] metastore: Remove versioning from saneitize updates [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809564 (owner: ''Ebernhardson)'
2022-06-29 20:42:10	<ebernhardson>	cjming: maybe someone already deployed the change and i didn't notice
2022-06-29 20:42:31	<cjming>	ebernhardson: merging yours now
2022-06-29 20:42:36	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2153.codfw.wmnet with OS bullseye
2022-06-29 20:42:40	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:42:48	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops, ''Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2153.codfw.wmnet with OS...'
2022-06-29 20:42:54	<ebernhardson>	cjming: thanks. Yet again this is something i can't really test on mwdebug, it runs from the job queue every 2 hours
2022-06-29 20:43:15	<cjming>	ebernhardson: got it - then i'll go ahead and sync
2022-06-29 20:43:24	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS buster
2022-06-29 20:43:28	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:43:30	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS...'
2022-06-29 20:43:43	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops, ''Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''Papaul)'
2022-06-29 20:44:02	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 20:44:06	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:45:01	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 20:45:02	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 20:45:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:45:08	<logmsgbot>	!log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809634|[wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. (T311643)]] (duration: 03m 25s)
2022-06-29 20:45:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:45:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:45:15	<stashbot>	T311643: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. - https://phabricator.wikimedia.org/T311643
2022-06-29 20:45:18	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS buster
2022-06-29 20:45:23	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:45:25	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS...'
2022-06-29 20:45:27	<wikibugs>	('CR) ''Cwhite: "Incorporated feedback from the meeting today." [puppet] - ''https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: ''Cwhite)'
2022-06-29 20:45:28	<cjming>	eigyan: your change should be live
2022-06-29 20:45:49	<eigyan>	Excellent! thank you cjming!
2022-06-29 20:45:54	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 20:45:55	<cjming>	np!
2022-06-29 20:45:58	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:46:30	<cjming>	zabe: i think you're here too? i will do yours next
2022-06-29 20:46:53	<zabe>	ok
2022-06-29 20:47:08	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS bullseye
2022-06-29 20:47:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:47:14	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1052.eqiad.wmnet with...'
2022-06-29 20:48:31	<logmsgbot>	!log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1007.eqiad.wmnet with OS buster
2022-06-29 20:48:35	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:48:36	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster exec...'
2022-06-29 20:49:20	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye
2022-06-29 20:49:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:49:26	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye'
2022-06-29 20:51:07	<cjming>	MatmaRex: can you rebase your .17 patches? and when Erik's patch is merged, can you also rebase your .18 patches?
2022-06-29 20:51:28	<MatmaRex>	sure
2022-06-29 20:51:46	<cjming>	if they need it - maybe they don't
2022-06-29 20:51:53	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 20:51:55	<wikibugs>	('PS2) ''Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597)'
2022-06-29 20:51:57	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:52:06	<wikibugs>	('PS2) ''Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597)'
2022-06-29 20:54:22	<wikibugs>	('CR) ''Andrew Bogott: openstack: make enc-cli authenticate via keystone (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: ''Majavah)'
2022-06-29 20:54:30	<logmsgbot>	!log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 20:54:34	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:54:49	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage
2022-06-29 20:54:53	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:55:56	<icinga-wm>	PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 20:56:32	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS bullseye
2022-06-29 20:56:36	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:56:39	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1053.eqiad.wmnet with...'
2022-06-29 20:56:42	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage
2022-06-29 20:56:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:58:14	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage
2022-06-29 20:58:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 20:58:43	<icinga-wm>	RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 20:59:11	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''Cmjohnson) ''Open→''Resolved'
2022-06-29 20:59:14	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (''Cmjohnson) pinging @andrew so he knows the base image has been completed. Resolving the task.'
2022-06-29 20:59:18	<wikibugs>	'SRE, ''Infrastructure-Foundations, ''netops, ''Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (''Cmjohnson)'
2022-06-29 20:59:44	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (''Cmjohnson)'
2022-06-29 20:59:55	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (''Cmjohnson) ''Open→''Resolved pinging @andrew so he knows the base image has been completed. Resolving the task.'
2022-06-29 21:00:48	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage
2022-06-29 21:00:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:01:44	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2153.codfw.wmnet with reason: host reimage
2022-06-29 21:01:48	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:02:21	<icinga-wm>	PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 21:02:29	<wikibugs>	('Merged) ''jenkins-bot: metastore: Remove versioning from saneitize updates [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809564 (owner: ''Ebernhardson)'
2022-06-29 21:02:47	<logmsgbot>	!log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 21:02:51	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:02:53	<icinga-wm>	PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
2022-06-29 21:03:31	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] Stop setting wgCentralAuthAutoNew [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809666 (https://phabricator.wikimedia.org/T257079) (owner: ''Zabe)'
2022-06-29 21:04:01	<icinga-wm>	RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
2022-06-29 21:04:53	<wikibugs>	('Merged) ''jenkins-bot: Stop setting wgCentralAuthAutoNew [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809666 (https://phabricator.wikimedia.org/T257079) (owner: ''Zabe)'
2022-06-29 21:05:05	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: host reimage
2022-06-29 21:05:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:05:22	<wikibugs>	('PS2) ''Clare Ming: Stop setting wgBabelCentralApi [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079) (owner: ''Zabe)'
2022-06-29 21:05:36	<wikibugs>	('PS1) ''Cwhite: grafana: ldap parameter expects a hash by default [puppet] - ''https://gerrit.wikimedia.org/r/809682'
2022-06-29 21:06:28	<cjming>	zabe: your 1st patch is on mwdebug1002 - not sure if it's testable
2022-06-29 21:06:52	<zabe>	cjming, it's not really testable
2022-06-29 21:06:57	<logmsgbot>	!log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/CirrusSearch/includes/MetaStore/MetaSaneitizeJobStore.php: Backport: [[gerrit:809564|metastore: Remove versioning from saneitize updates]] (duration: 03m 35s)
2022-06-29 21:07:00	<cjming>	then i will sync
2022-06-29 21:07:01	<zabe>	I will keep an eye on logstash afterwards
2022-06-29 21:07:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:07:31	<cjming>	ebernhardson: your change should be live
2022-06-29 21:07:55	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] Stop setting wgBabelCentralApi [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079) (owner: ''Zabe)'
2022-06-29 21:08:02	<ebernhardson>	cjming: thanks!
2022-06-29 21:08:08	<cjming>	np!
2022-06-29 21:08:44	<wikibugs>	('Merged) ''jenkins-bot: Stop setting wgBabelCentralApi [mediawiki-config] - ''https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079) (owner: ''Zabe)'
2022-06-29 21:09:31	<wikibugs>	('CR) ''Cwhite: [C: ''+2] grafana: ldap parameter expects a hash by default [puppet] - ''https://gerrit.wikimedia.org/r/809682 (owner: ''Cwhite)'
2022-06-29 21:09:58	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 21:10:02	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:10:42	<cjming>	ok MatmaRex: let's do yours if you're still around and up for it
2022-06-29 21:10:45	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 21:10:47	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:11:03	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 21:11:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:11:16	<logmsgbot>	!log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:809666|Stop setting wgCentralAuthAutoNew (T257079)]] (duration: 03m 28s)
2022-06-29 21:11:20	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:11:20	<stashbot>	T257079: Audit all mismatched/unused wmf-config settings - https://phabricator.wikimedia.org/T257079
2022-06-29 21:12:04	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:12:05	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 21:12:07	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:12:11	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:13:05	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:13:08	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 21:13:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:13:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:13:17	<cjming>	MatmaRex: do you want to do your patches still?
2022-06-29 21:13:39	<MatmaRex>	cjming: sure, if you're okay with it
2022-06-29 21:14:01	<cjming>	i am
2022-06-29 21:14:12	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809688 (https://phabricator.wikimedia.org/T311665) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:14:19	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809689 (https://phabricator.wikimedia.org/T311665) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:15:08	<logmsgbot>	!log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:809671|Stop setting wgBabelCentralApi (T257079)]] (duration: 03m 30s)
2022-06-29 21:15:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:15:21	<cjming>	zabe: both your patches should be live
2022-06-29 21:15:29	<zabe>	thanks :)
2022-06-29 21:17:04	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 21:17:08	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:17:35	<cjming>	MatmaRex: I +2'd your 1st 2 patches -- do the 2nd 2 need to be rebased after merge or can i go ahead and merge them now too?
2022-06-29 21:18:01	<MatmaRex>	cjming: no, you can +2 them, they should get merged
2022-06-29 21:18:07	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 21:18:10	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:18:11	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:18:13	<wikibugs>	('CR) ''Clare Ming: [C: ''+2] New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:18:30	<MatmaRex>	thanks
2022-06-29 21:19:05	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:19:06	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 21:19:10	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:19:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:20:05	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:20:07	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:20:23	<wikibugs>	('Merged) ''jenkins-bot: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809688 (https://phabricator.wikimedia.org/T311665) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:20:29	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 21:20:33	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:20:55	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 21:20:59	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:21:08	<wikibugs>	('Merged) ''jenkins-bot: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809689 (https://phabricator.wikimedia.org/T311665) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:21:14	<icinga-wm>	RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 21:21:27	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1496.eqiad.wmnet with OS buster
2022-06-29 21:21:31	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:21:33	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1496.eqiad.wmnet with OS buster'
2022-06-29 21:22:12	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1475.eqiad.wmnet with OS buster
2022-06-29 21:22:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:22:18	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1475.eqiad.wmnet with OS buster'
2022-06-29 21:23:50	<cjming>	MatmaRex: your 1st 2 patches are on mwdebug1002 if they're testable
2022-06-29 21:23:51	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mw1477.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 21:23:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:24:19	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2153.codfw.wmnet with OS bullseye
2022-06-29 21:24:23	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:24:26	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops, ''Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2153.codfw.wmnet with OS bul...'
2022-06-29 21:24:39	<MatmaRex>	cjming: yep, looking
2022-06-29 21:25:10	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 21:25:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:25:24	<wikibugs>	('Merged) ''jenkins-bot: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - ''https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:25:27	<wikibugs>	('Merged) ''jenkins-bot: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - ''https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597) (owner: ''Bartosz Dziewoński)'
2022-06-29 21:25:59	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:26:00	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 21:26:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:26:04	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:26:19	<MatmaRex>	cjming: looks good
2022-06-29 21:26:26	<cjming>	syncing
2022-06-29 21:26:46	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:26:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:27:02	<icinga-wm>	RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 21:28:36	<cjming>	MatmaRex: and your last 2 patches are up on mwdebug1002
2022-06-29 21:29:04	<MatmaRex>	cjming: also looks good!
2022-06-29 21:29:20	<cjming>	cool - will sync those as well then -- i'll ping you here when all of them are live
2022-06-29 21:30:09	<logmsgbot>	!log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/DiscussionTools/modules/NewTopicController.js: Backport: [[gerrit:809688|New topic hint: Avoid error about section editing when opened from diff (T311665)]] (duration: 03m 35s)
2022-06-29 21:30:14	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:30:14	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2154.codfw.wmnet with OS bullseye
2022-06-29 21:30:15	<stashbot>	T311665: Clicking New section and then legacy mode from diff view gives a confusing error message "Section editing not supported" - https://phabricator.wikimedia.org/T311665
2022-06-29 21:30:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:30:22	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops, ''Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS...'
2022-06-29 21:30:47	<wikibugs>	('PS1) ''Cwhite: beta-logs: add minimal grafana config [puppet] - ''https://gerrit.wikimedia.org/r/809706 (https://phabricator.wikimedia.org/T222826)'
2022-06-29 21:31:49	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 21:31:52	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:32:01	<icinga-wm>	PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 21:32:25	<icinga-wm>	PROBLEM - Host db1173.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
2022-06-29 21:32:37	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:32:38	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 21:32:41	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:32:45	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:32:50	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1496.eqiad.wmnet with reason: host reimage
2022-06-29 21:32:53	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:33:08	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1475.eqiad.wmnet with reason: host reimage
2022-06-29 21:33:11	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:33:23	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:33:26	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:34:07	<logmsgbot>	!log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/DiscussionTools/modules/NewTopicController.js: Backport: [[gerrit:809689|New topic hint: Avoid error about section editing when opened from diff (T311665)]] (duration: 03m 43s)
2022-06-29 21:34:12	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:34:47	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1025.eqiad.wmnet with OS buster
2022-06-29 21:34:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:34:53	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS bus...'
2022-06-29 21:35:55	<logmsgbot>	!log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS bullseye
2022-06-29 21:35:59	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:36:05	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex...'
2022-06-29 21:36:21	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1496.eqiad.wmnet with reason: host reimage
2022-06-29 21:36:24	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:36:42	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw1477.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 21:36:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:37:29	<wikibugs>	'SRE, ''ops-eqiad, ''DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (''Jclark-ctr) Replaced Dimm A7 powered host on'
2022-06-29 21:37:38	<icinga-wm>	RECOVERY - Host db1173.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
2022-06-29 21:37:47	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1477.eqiad.wmnet with OS buster
2022-06-29 21:37:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:37:51	<logmsgbot>	!log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/DiscussionTools/modules/dt.ui.NewTopicController.less: Backport: [[gerrit:809690|New topic hint: Add clear:both (T311597)]] (duration: 03m 24s)
2022-06-29 21:37:53	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1477.eqiad.wmnet with OS buster'
2022-06-29 21:37:56	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:37:56	<stashbot>	T311597: Legacy section=new hint may overlap with floating boxes - https://phabricator.wikimedia.org/T311597
2022-06-29 21:37:57	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1475.eqiad.wmnet with reason: host reimage
2022-06-29 21:38:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:38:26	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
2022-06-29 21:38:30	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:39:09	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:39:10	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
2022-06-29 21:39:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:39:16	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:39:52	<logmsgbot>	!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
2022-06-29 21:39:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:41:37	<logmsgbot>	!log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/DiscussionTools/modules/dt.ui.NewTopicController.less: Backport: [[gerrit:809691|New topic hint: Add clear:both (T311597)]] (duration: 03m 27s)
2022-06-29 21:41:42	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:41:44	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1031.eqiad.wmnet with OS buster
2022-06-29 21:41:48	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:41:49	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS bus...'
2022-06-29 21:42:11	<cjming>	MatmaRex: all your changes should be live
2022-06-29 21:42:17	<MatmaRex>	thanks!
2022-06-29 21:42:30	<cjming>	np!
2022-06-29 21:42:54	<cjming>	!log end of UTC late backport window
2022-06-29 21:42:56	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:43:18	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''Cmjohnson)'
2022-06-29 21:43:56	<wikibugs>	'SRE, ''Infrastructure-Foundations, ''netops, ''Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (''Cmjohnson)'
2022-06-29 21:44:24	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (''Cmjohnson) ''Open→''Resolved pinging @andrew to notify the task has been resolved'
2022-06-29 21:48:40	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1477.eqiad.wmnet with reason: host reimage
2022-06-29 21:48:44	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:49:08	<icinga-wm>	RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 21:49:25	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2154.codfw.wmnet with reason: host reimage
2022-06-29 21:49:29	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:49:54	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 21:52:11	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1477.eqiad.wmnet with reason: host reimage
2022-06-29 21:52:15	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:53:23	<icinga-wm>	PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 21:54:46	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: host reimage
2022-06-29 21:54:49	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:55:49	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1496.eqiad.wmnet with OS buster
2022-06-29 21:55:53	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 21:55:55	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1496.eqiad.wmnet with OS buster completed: - mw1496 (PASS) -...'
2022-06-29 21:58:16	<wikibugs>	('CR) ''Cwhite: [C: ''+2] beta-logs: add minimal grafana config [puppet] - ''https://gerrit.wikimedia.org/r/809706 (https://phabricator.wikimedia.org/T222826) (owner: ''Cwhite)'
2022-06-29 22:02:56	<wikibugs>	('PS1) ''Cwhite: loki: add ferm rule to control api access [puppet] - ''https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826)'
2022-06-29 22:07:08	<icinga-wm>	RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 22:08:11	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1475.eqiad.wmnet with OS buster
2022-06-29 22:08:15	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:08:16	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1475.eqiad.wmnet with OS buster completed: - mw1475 (PASS) -...'
2022-06-29 22:09:40	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (''Cmjohnson) @BTullis Robh figured out the workaround to get the right raid volume to boot first. I tried on an-presto1006 and everything seeme...'
2022-06-29 22:11:50	<icinga-wm>	PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 22:12:11	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''Cmjohnson)'
2022-06-29 22:14:15	<wikibugs>	'SRE, ''ops-eqiad, ''DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (''Cmjohnson) ''Open→''Resolved @Jclark-ctr completed the task, we will send the broken parts back to Dell'
2022-06-29 22:14:36	<jinxer-wm>	(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
2022-06-29 22:16:29	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 22:16:33	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:18:06	<wikibugs>	('PS1) ''Ahmon Dancy: Allow mwbuilder group to access mwdeploy key [puppet] - ''https://gerrit.wikimedia.org/r/809712 (https://phabricator.wikimedia.org/T310395)'
2022-06-29 22:19:16	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (''Cmjohnson)'
2022-06-29 22:19:48	<icinga-wm>	PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2022-06-29 22:19:58	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 22:22:02	<icinga-wm>	RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2022-06-29 22:25:08	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
2022-06-29 22:25:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:26:01	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2154.codfw.wmnet with OS bullseye
2022-06-29 22:26:02	<icinga-wm>	RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 22:26:05	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:26:07	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye completed: - db2...'
2022-06-29 22:27:18	<icinga-wm>	PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (33) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, cloudvirt1051, cloudvirt1052, db2154, gitlab1001, gitlab1003, gitlab1004, gitlab2001, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe20
2022-06-29 22:27:18	<icinga-wm>	e2012, mw1475, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
2022-06-29 22:29:27	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2022-06-29 22:29:31	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:30:53	<icinga-wm>	PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 22:31:14	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1477.eqiad.wmnet with OS buster
2022-06-29 22:31:18	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:31:21	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1477.eqiad.wmnet with OS buster completed: - mw1477 (PASS) -...'
2022-06-29 22:33:16	<wikibugs>	('CR) ''Cwhite: "This change is ready for review." [puppet] - ''https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: ''Cwhite)'
2022-06-29 22:33:53	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 22:34:46	<logmsgbot>	!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 22:34:50	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:35:57	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (''Cmjohnson)'
2022-06-29 22:36:14	<wikibugs>	'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (''Cmjohnson) ''Open→''Resolved @Dzahn I am not sure if this is you but these are installed. Resolving the task'
2022-06-29 22:37:39	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 22:37:43	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:39:15	<icinga-wm>	PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 22:41:32	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bullseye
2022-06-29 22:41:35	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 22:41:38	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye'
2022-06-29 22:58:12	<wikibugs>	('PS1) ''BryanDavis: striker: connect docker container directly to host network [puppet] - ''https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469)'
2022-06-29 22:59:36	<wikibugs>	('PS1) ''Cmjohnson: Adding new cloudnet, cloudrabbit and cloudservice nodes to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809715 (https://phabricator.wikimedia.org/T304888)'
2022-06-29 23:00:19	<wikibugs>	('CR) ''Cmjohnson: [C: ''+2] Adding new cloudnet, cloudrabbit and cloudservice nodes to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/809715 (https://phabricator.wikimedia.org/T304888) (owner: ''Cmjohnson)'
2022-06-29 23:01:34	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2155.codfw.wmnet with reason: host reimage
2022-06-29 23:01:38	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:01:39	<wikibugs>	('CR) ''BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36132/" [puppet] - ''https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469) (owner: ''BryanDavis)'
2022-06-29 23:04:17	<subbu>	where do logs go on a production host? I'm trying to look at logs emitted to the 'Parsoid' channel on a parsoid host (i think probably info or others that aren't in logstash).
2022-06-29 23:05:10	<logmsgbot>	!log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2155.codfw.wmnet with reason: host reimage
2022-06-29 23:05:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:05:14	<subbu>	i logged onto wtp1046 and looked at /var/log/mediawiki/ but nothing useful there.
2022-06-29 23:07:11	<subbu>	will look on wikitech for docs
2022-06-29 23:08:22	<TimStarling>	subbu: mwlog1002 /srv/mw-log
2022-06-29 23:08:40	<subbu>	yes, found it. :) thanks.
2022-06-29 23:19:28	<wikibugs>	('CR) ''BryanDavis: striker: connect docker container directly to host network (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469) (owner: ''BryanDavis)'
2022-06-29 23:23:09	<icinga-wm>	RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 23:28:59	<icinga-wm>	PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 23:30:14	<logmsgbot>	!log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648
2022-06-29 23:30:19	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:30:20	<stashbot>	T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
2022-06-29 23:34:05	<logmsgbot>	!log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 23:34:05	<logmsgbot>	!log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED
2022-06-29 23:34:09	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:34:13	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:40:23	<icinga-wm>	RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2022-06-29 23:40:30	<subbu>	Did something change with commons y'day? I started looking at a 2x spike in the volume of Parsoid events (not fatals or errors) as of y'day ... found that for commonswiki, slow-parsoid parses (> 3s) spiked by 15x since y'day (but note that new code didn't go out to Parsoid hosts till today) ... see https://logstash.wikimedia.org/goto/bff9cfa00d556f53c143780858d84adf
2022-06-29 23:41:36	<subbu>	But it turns out the core/legacy parser shows the same thing (5x spike). https://logstash.wikimedia.org/goto/09e7b301cbc0451e32e9b90246ad27ba
2022-06-29 23:45:59	<subbu>	TimStarling, Krinkle ^ fyi.
2022-06-29 23:47:51	<TimStarling>	that is concerning
2022-06-29 23:49:15	<icinga-wm>	PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
2022-06-29 23:50:22	<subbu>	Since commons didn't get new code till today, it couldn't be a code change.
2022-06-29 23:50:46	<TimStarling>	could be a commons template or module change
2022-06-29 23:50:52	<logmsgbot>	!log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db2155.codfw.wmnet with OS bullseye
2022-06-29 23:50:55	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:50:57	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye completed: - db2...'
2022-06-29 23:51:00	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye executed with er...'
2022-06-29 23:51:16	<subbu>	aah ... okay ... so, it is just triggering a large volume of parses likely.
2022-06-29 23:51:17	<TimStarling>	previewing one affected page, I can reproduce 4.4s total as in the logs, of which 3.2s is Lua
2022-06-29 23:52:18	<TimStarling>	https://commons.wikimedia.org/wiki/Category:Ahmedabad -- note right floating infobox
2022-06-29 23:52:23	<wikibugs>	('PS1) ''Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - ''https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666)'
2022-06-29 23:53:42	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2154.codfw.wmnet with OS bullseye
2022-06-29 23:53:46	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:53:48	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye'
2022-06-29 23:53:55	<wikibugs>	('CR) ''CI reject: [V: ''-1] wmcs-enc-cli.py: fix args passed to requests.post [puppet] - ''https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: ''Andrew Bogott)'
2022-06-29 23:55:31	<TimStarling>	we really need a template profiler
2022-06-29 23:55:36	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2157.codfw.wmnet with OS bullseye
2022-06-29 23:55:39	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:55:43	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2157.codfw.wmnet with OS bullseye'
2022-06-29 23:55:52	<logmsgbot>	!log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2154.codfw.wmnet with OS bullseye
2022-06-29 23:55:56	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:55:57	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye executed with er...'
2022-06-29 23:56:01	<icinga-wm>	RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2022-06-29 23:56:55	<logmsgbot>	!log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bullseye
2022-06-29 23:56:59	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2022-06-29 23:57:00	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye'
2022-06-29 23:58:52	<wikibugs>	'SRE, ''ops-codfw, ''DBA, ''DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (''Papaul)'

Wikimedia IRC logs browser - #wikimedia-operations