[00:00:21] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:52] <jinxer-wm>	 (ProbeDown) firing: (60) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:04:07] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:31] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:52] <jinxer-wm>	 (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:08:15] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:23] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:36] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[00:23:29] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[00:30:51] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:59] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:58] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969986
[00:39:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969986 (owner: TrainBranchBot)
[00:42:29] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:55] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:29] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[00:58:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[00:59:21] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969986 (owner: TrainBranchBot)
[01:03:50] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T350095 (phaultfinder)
[01:03:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:04:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:09:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:13:59] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:15:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:18:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:20:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:24:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:27:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:29:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:31:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:36:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:45:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:51:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on elastic2047:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:58:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:59:54] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0200)
[02:03:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:04:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:06:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 44.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:07:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.3 [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/969987 (https://phabricator.wikimedia.org/T348356)
[02:07:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.3 [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/969987 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot)
[02:09:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:11:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:21:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:24:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.3 [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/969987 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot)
[02:25:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:31:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on elastic1070:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:32:41] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:49] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on elastic1070:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:37:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:38:44] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:47:55] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:52:05] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0300)
[03:01:37] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970008 (https://phabricator.wikimedia.org/T348356)
[03:01:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970008 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot)
[03:02:26] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970008 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot)
[03:02:52] <logmsgbot>	 !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.3  refs T348356
[03:02:58] <stashbot>	 T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356
[03:04:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:14:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[03:16:59] <jinxer-wm>	 (PuppetFailure) firing: (4) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:19:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[03:20:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:39:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:42:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on kubetcd1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:45:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:50:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:51:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:53:36] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.3  refs T348356 (duration: 50m 44s)
[03:53:41] <stashbot>	 T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356
[03:55:52] <logmsgbot>	 !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.1 (duration: 02m 14s)
[03:57:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[03:59:11] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[04:03:25] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:07:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on kubetcd1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:08:06] <jinxer-wm>	 (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:08:51] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[04:12:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on kubetcd1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:13:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:14:37] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:16:59] <jinxer-wm>	 (PuppetFailure) firing: (5) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:20:59] <jinxer-wm>	 (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:22:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:24:17] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.858 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:38:33] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:59] <jinxer-wm>	 (PuppetFailure) firing: (6) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:42:45] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:52:59] <jinxer-wm>	 (PuppetFailure) firing: (4) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:54:49] <wikibugs>	 (PS2) KartikMistry: Update MinT to 2023-10-31-044726-production [deployment-charts] - https://gerrit.wikimedia.org/r/968388 (https://phabricator.wikimedia.org/T333969)
[04:58:15] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:59:29] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:01:59] <jinxer-wm>	 (PuppetFailure) firing: (7) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:07:59] <jinxer-wm>	 (PuppetFailure) firing: (5) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:09:11] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:21] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:25] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:37] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:59] <jinxer-wm>	 (PuppetFailure) firing: (6) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:39:41] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:59] <jinxer-wm>	 (PuppetFailure) firing: (7) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:43:49] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:53] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:57:59] <jinxer-wm>	 (PuppetFailure) firing: (12) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:59:01] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0600).
[06:01:39] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:02:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[06:02:59] <jinxer-wm>	 (PuppetFailure) firing: (14) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:05:49] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[06:07:59] <jinxer-wm>	 (PuppetFailure) firing: (15) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:14:53] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[06:21:36] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/970033
[06:22:42] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/970033 (owner: Marostegui)
[06:23:23] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/970033 (owner: Marostegui)
[06:24:27] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:970033|ProductionServices.php: Promote pc2014 to pc1 master]]
[06:25:56] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:970033|ProductionServices.php: Promote pc2014 to pc1 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:26:01] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[06:26:21] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969772
[06:27:37] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:29:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on ml-cache1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:31:17] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:970033|ProductionServices.php: Promote pc2014 to pc1 master]] (duration: 06m 50s)
[06:32:59] <jinxer-wm>	 (PuppetFailure) firing: (16) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:33:29] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 34 hosts with reason: Primary switchover s4 T349820
[06:33:36] <stashbot>	 T349820: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T349820
[06:33:57] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 34 hosts with reason: Primary switchover s4 T349820
[06:33:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on mw1415:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:34:08] <wikibugs>	 (PS1) Marostegui: pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/970202
[06:35:29] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set db2140 with weight 0 T349820', diff saved to https://phabricator.wikimedia.org/P53068 and previous config saved to /var/cache/conftool/dbconfig/20231031-063647-arnaudb.json
[06:37:41] <icinga-wm>	 PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:37:59] <jinxer-wm>	 (PuppetFailure) firing: (17) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:37:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:40:20] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969772 (owner: Marostegui)
[06:40:25] <icinga-wm>	 RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:41:05] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969772 (owner: Marostegui)
[06:42:06] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969772|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]]
[06:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:42:56] <wikibugs>	 (CR) Marostegui: [C: +2] pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/970202 (owner: Marostegui)
[06:42:59] <jinxer-wm>	 (PuppetFailure) firing: (18) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:43:24] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969772|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:44:03] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[06:49:19] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969772|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] (duration: 07m 12s)
[06:58:10] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 44.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:01:21] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/968968 (https://phabricator.wikimedia.org/T349820) (owner: Gerrit maintenance bot)
[07:01:24] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/968968 (https://phabricator.wikimedia.org/T349820) (owner: Gerrit maintenance bot)
[07:02:47] <arnaudb>	 !log Starting s4 codfw failover from db2179 to db2140 - T349820
[07:02:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:52] <stashbot>	 T349820: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T349820
[07:03:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on mw1349:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:04:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T349820', diff saved to https://phabricator.wikimedia.org/P53070 and previous config saved to /var/cache/conftool/dbconfig/20231031-070405-arnaudb.json
[07:05:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Promote db2140 to s4 primary and set section read-write T349820', diff saved to https://phabricator.wikimedia.org/P53071 and previous config saved to /var/cache/conftool/dbconfig/20231031-070549-arnaudb.json
[07:07:35] <wikibugs>	 (CR) Marostegui: [C: +1] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/968969 (https://phabricator.wikimedia.org/T349820) (owner: Gerrit maintenance bot)
[07:08:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:09:58] <wikibugs>	 (PS2) Arnaudb: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/968969 (https://phabricator.wikimedia.org/T349820) (owner: Gerrit maintenance bot)
[07:12:09] <wikibugs>	 (CR) Arnaudb: [C: +2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/968969 (https://phabricator.wikimedia.org/T349820) (owner: Gerrit maintenance bot)
[07:19:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 weight mimic old db2140', diff saved to https://phabricator.wikimedia.org/P53072 and previous config saved to /var/cache/conftool/dbconfig/20231031-071938-arnaudb.json
[07:30:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 depooling from API and pooling in db2140', diff saved to https://phabricator.wikimedia.org/P53073 and previous config saved to /var/cache/conftool/dbconfig/20231031-073023-arnaudb.json
[07:33:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 weight rebalancing', diff saved to https://phabricator.wikimedia.org/P53074 and previous config saved to /var/cache/conftool/dbconfig/20231031-073312-arnaudb.json
[07:36:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 weight rebalancing - depooled', diff saved to https://phabricator.wikimedia.org/P53075 and previous config saved to /var/cache/conftool/dbconfig/20231031-073652-arnaudb.json
[07:37:59] <jinxer-wm>	 (PuppetFailure) firing: (19) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:38:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 15%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53076 and previous config saved to /var/cache/conftool/dbconfig/20231031-073822-arnaudb.json
[07:39:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:42:59] <jinxer-wm>	 (PuppetFailure) firing: (20) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:47:05] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add weekly-update script [deployment-charts] - https://gerrit.wikimedia.org/r/970204 (https://phabricator.wikimedia.org/T344478)
[07:50:46] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:50:50] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:50:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:51:52] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:56] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:51:59] <jinxer-wm>	 (PuppetFailure) firing: (9) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:53:27] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 30%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53077 and previous config saved to /var/cache/conftool/dbconfig/20231031-075327-arnaudb.json
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:02:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:07:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:08:07] <jinxer-wm>	 (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:08:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 45%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53078 and previous config saved to /var/cache/conftool/dbconfig/20231031-080832-arnaudb.json
[08:11:03] <wikibugs>	 SRE-OnFire, Observability-Metrics, Sustainability (Incident Followup), User-fgiunchedi: ThanosCompactHalted error on overlapping blocks - https://phabricator.wikimedia.org/T335406 (fgiunchedi) Open→Resolved We require a replica label now as per {T350002}, resolving
[08:13:49] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[08:19:22] <wikibugs>	 (CR) Slyngshede: [C: +2] P:monitoring remove remainders of check_eth. [puppet] - https://gerrit.wikimedia.org/r/969721 (owner: Slyngshede)
[08:21:17] <jinxer-wm>	 (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:23:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 60%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53079 and previous config saved to /var/cache/conftool/dbconfig/20231031-082336-arnaudb.json
[08:29:51] <wikibugs>	 (PS1) Majavah: P:pki: use wmf-ca-certificates [puppet] - https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111)
[08:30:45] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] P:pki: use wmf-ca-certificates [puppet] - https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111) (owner: Majavah)
[08:31:03] <wikibugs>	 (CR) Jbond: [C: +2] P:pki: use wmf-ca-certificates [puppet] - https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111) (owner: Majavah)
[08:31:38] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/245/con" [puppet] - https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111) (owner: Majavah)
[08:33:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on krb2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:34:34] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:26] <icinga-wm>	 RECOVERY - Check systemd state on pki2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:59] <jinxer-wm>	 (PuppetFailure) firing: (21) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:38:42] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53080 and previous config saved to /var/cache/conftool/dbconfig/20231031-083841-arnaudb.json
[08:40:42] <icinga-wm>	 RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:19] <wikibugs>	 (CR) Ayounsi: "Thanks. I like the approach as it doesn't use "nerd knobs" nor adds much complexity in the policies." [homer/public] - https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) (owner: Cathal Mooney)
[08:53:01] <wikibugs>	 (CR) Brouberol: [C: +2] Enable the management of the skein certificate via Puppet [puppet] - https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[08:53:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 90%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53081 and previous config saved to /var/cache/conftool/dbconfig/20231031-085346-arnaudb.json
[08:56:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 config append', diff saved to https://phabricator.wikimedia.org/P53082 and previous config saved to /var/cache/conftool/dbconfig/20231031-085615-arnaudb.json
[08:56:56] <wikibugs>	 (PS1) Hashar: puppet_compiler: always send CORS header even on 404 [puppet] - https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003)
[08:57:11] <wikibugs>	 (CR) CI reject: [V: -1] puppet_compiler: always send CORS header even on 404 [puppet] - https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003) (owner: Hashar)
[08:57:17] <wikibugs>	 (CR) Brouberol: [C: +2] Enable the management of the skein certificate via Puppet on one instance [puppet] - https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[08:57:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53083 and previous config saved to /var/cache/conftool/dbconfig/20231031-085740-arnaudb.json
[08:57:52] <jinxer-wm>	 (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:57:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:59:07] <wikibugs>	 (PS2) Hashar: puppet_compiler: always send CORS header even on 404 [puppet] - https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003)
[09:00:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[09:00:59] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:01:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[09:01:16] <jinxer-wm>	 (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:01:59] <jinxer-wm>	 (PuppetFailure) resolved: (9) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:02:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:02:59] <jinxer-wm>	 (PuppetFailure) firing: (21) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:03:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on krb2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:03:59] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on mw1349:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:04:59] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on ml-cache1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:05:43] <wikibugs>	 (PS1) Arnaudb: mariadb: db1130 db1230 swap hosts [puppet] - https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036)
[09:05:59] <jinxer-wm>	 (PuppetFailure) resolved: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:06:59] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:07:59] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:07:59] <jinxer-wm>	 (PuppetFailure) resolved: (21) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:08:15] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, minor nit inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692 (owner: Ayounsi)
[09:10:15] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/969749 (owner: Ayounsi)
[09:12:00] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/969752 (owner: Ayounsi)
[09:12:42] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:13:46] <wikibugs>	 (PS1) Stevemunene: switch druid host to run data_purge job [puppet] - https://gerrit.wikimedia.org/r/970272 (https://phabricator.wikimedia.org/T336042)
[09:14:04] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:14:57] <volans>	 arnaudb: FYI ^^^ (diff is related to the host you're working on)
[09:16:09] <wikibugs>	 (CR) Volans: [C: +1] "This change needs to be communicated to DCOps before deploying" [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692 (owner: Ayounsi)
[09:16:15] <wikibugs>	 (CR) Volans: [C: +1] "This change needs to be communicated to DCOps before deploying" [software/netbox-extras] - https://gerrit.wikimedia.org/r/969749 (owner: Ayounsi)
[09:18:20] <wikibugs>	 (PS2) Elukey: services: update ChangeProp's eqiad Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950)
[09:18:50] <wikibugs>	 (CR) Elukey: "Updated the docker image to one with improved (debug) logging." [deployment-charts] - https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[09:20:31] <wikibugs>	 SRE, Wikimedia-Mailing-lists: New mailing list request for Project Korikath - https://phabricator.wikimedia.org/T349429 (Ladsgroup) Open→Resolved a:Ladsgroup Created, https://lists.wikimedia.org/postorius/lists/korikath.lists.wikimedia.org. I made it as a public mailing list, feel free to c...
[09:22:38] <wikibugs>	 (PS1) Majavah: Fix cloud-public definitions [homer/public] - https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114)
[09:23:33] <wikibugs>	 (CR) Ayounsi: [C: +1] Fix cloud-public definitions [homer/public] - https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114) (owner: Majavah)
[09:23:54] <wikibugs>	 (CR) Majavah: [C: +2] Fix cloud-public definitions [homer/public] - https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114) (owner: Majavah)
[09:24:35] <wikibugs>	 (Merged) jenkins-bot: Fix cloud-public definitions [homer/public] - https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114) (owner: Majavah)
[09:29:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:32:51] <wikibugs>	 (PS1) Majavah: cr-cloud: Move allow-public below deny-to-private-subnets [homer/public] - https://gerrit.wikimedia.org/r/970275
[09:34:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set ', diff saved to https://phabricator.wikimedia.org/P53084 and previous config saved to /var/cache/conftool/dbconfig/20231031-093448-arnaudb.json
[09:34:58] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53085 and previous config saved to /var/cache/conftool/dbconfig/20231031-093457-arnaudb.json
[09:35:38] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:38:25] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond)
[09:38:33] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond)
[09:39:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'set db1230 as a depooled host', diff saved to https://phabricator.wikimedia.org/P53086 and previous config saved to /var/cache/conftool/dbconfig/20231031-093919-arnaudb.json
[09:39:34] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:41:57] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: always send CORS header even on 404 [puppet] - https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003) (owner: Hashar)
[09:45:50] <wikibugs>	 (PS1) Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950)
[09:46:06] <wikibugs>	 (PS2) Arnaudb: mariadb: db1130 db1230 swap hosts [puppet] - https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036)
[09:47:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'set db1230 as a depooled host', diff saved to https://phabricator.wikimedia.org/P53087 and previous config saved to /var/cache/conftool/dbconfig/20231031-094737-arnaudb.json
[09:50:32] <wikibugs>	 (PS7) Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692
[09:50:34] <wikibugs>	 (PS4) Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - https://gerrit.wikimedia.org/r/969749
[09:50:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[09:50:36] <wikibugs>	 (PS3) Ayounsi: provision_server: don't show servers with a primary IP [software/netbox-extras] - https://gerrit.wikimedia.org/r/969752
[09:50:45] <wikibugs>	 (CR) Ayounsi: Ask for port # and type instead of interface name (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692 (owner: Ayounsi)
[09:50:48] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[09:50:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T343198)', diff saved to https://phabricator.wikimedia.org/P53088 and previous config saved to /var/cache/conftool/dbconfig/20231031-095054-arnaudb.json
[09:50:59] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[09:52:14] <wikibugs>	 (PS6) Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547)
[09:54:28] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond)
[09:57:53] <wikibugs>	 (CR) Marostegui: "Don't depool db1130 yet" [puppet] - https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[09:58:09] <wikibugs>	 (PS1) Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030)
[09:58:14] <wikibugs>	 (CR) DCausse: [C: +1] cirrus updater: Re-enable the .* route for mwapi [deployment-charts] - https://gerrit.wikimedia.org/r/969209 (owner: Ebernhardson)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1000)
[10:00:21] <wikibugs>	 (PS3) Arnaudb: mariadb: db1130 db1230 swap hosts [puppet] - https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036)
[10:00:47] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) The last successful sign in eqiad was at 2023-10-30T21:19:14 and in codfw at 2023-10-30T23:04:02
[10:00:52] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: db1130 db1230 swap hosts [puppet] - https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[10:01:01] <wikibugs>	 (PS2) Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030)
[10:01:40] <wikibugs>	 (PS3) Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030)
[10:01:55] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: db1130 db1230 swap hosts [puppet] - https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[10:02:40] <wikibugs>	 (CR) Volans: "LGTM, but needs another coordinated change, one in this same repo, another one in the cookbooks" [software/netbox-extras] - https://gerrit.wikimedia.org/r/969319 (owner: Ayounsi)
[10:03:42] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond)
[10:04:27] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692 (owner: Ayounsi)
[10:06:55] <wikibugs>	 (CR) Vgutierrez: [C: +1] "looking good, just wondering if it's worth maintaining all the hiera apparatus introduced to be able to switch the ssl_client_certificate" [puppet] - https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[10:07:32] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add weekly-update script [docker-images/production-images] - https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478)
[10:08:44] <wikibugs>	 (PS1) Slyngshede: Bump Bitu version to 0.0.2 [software/bitu] - https://gerrit.wikimedia.org/r/970281
[10:10:11] <wikibugs>	 (CR) Giuseppe Lavagetto: Add weekly-update script (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: Giuseppe Lavagetto)
[10:11:02] <wikibugs>	 (CR) Majavah: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: Cathal Mooney)
[10:11:04] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond)
[10:12:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:13:21] <wikibugs>	 (CR) Majavah: [C: +1] "the config looks fine, let me know when the endpoints are live and this can be deployed" [puppet] - https://gerrit.wikimedia.org/r/967963 (https://phabricator.wikimedia.org/T337390) (owner: Raymond Ndibe)
[10:13:52] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:14:08] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:16:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:17:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'set db1230 as a depooled host', diff saved to https://phabricator.wikimedia.org/P53089 and previous config saved to /var/cache/conftool/dbconfig/20231031-101750-arnaudb.json
[10:18:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53090 and previous config saved to /var/cache/conftool/dbconfig/20231031-101829-arnaudb.json
[10:19:18] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:19:36] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:19:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:22:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 5%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53091 and previous config saved to /var/cache/conftool/dbconfig/20231031-102259-arnaudb.json
[10:23:34] <wikibugs>	 (CR) Ayounsi: [C: +1] Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: Cathal Mooney)
[10:23:44] <wikibugs>	 (PS1) Aklapper: Correct Gerrit Privacy Policy [puppet] - https://gerrit.wikimedia.org/r/970283 (https://phabricator.wikimedia.org/T350124)
[10:26:55] <wikibugs>	 (PS1) Arnaudb: mariadb: db1127 && db1227 notifications reenabling [puppet] - https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036)
[10:27:36] <wikibugs>	 (CR) Volans: "approach looks good, couple of comments/questions inline" [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: Jbond)
[10:28:30] <wikibugs>	 (PS2) Arnaudb: mariadb: db1227 notifications reenabling, disabling on db1127 [puppet] - https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036)
[10:28:45] <wikibugs>	 (CR) Hnowlan: [C: +1] "Looks reasonable compared to prod jobrunner config." [deployment-charts] - https://gerrit.wikimedia.org/r/968955 (https://phabricator.wikimedia.org/T349796) (owner: Giuseppe Lavagetto)
[10:30:02] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: db1227 notifications reenabling, disabling on db1127 [puppet] - https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[10:30:12] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: db1227 notifications reenabling, disabling on db1127 [puppet] - https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[10:31:15] <wikibugs>	 (CR) Vgutierrez: [C: +1] "looking good for me in terms of NOOP for acme chief clients in our production environment." [puppet] - https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[10:31:18] <wikibugs>	 (CR) Ayounsi: "No pb, but maybe safer to do netbox-dev first." [puppet] - https://gerrit.wikimedia.org/r/969331 (owner: Muehlenhoff)
[10:32:57] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) It seems apache reloads at 00:00 every night. I believe this is what caused the issue. The pki certificates were rotated to puppet7 at 17...
[10:33:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53092 and previous config saved to /var/cache/conftool/dbconfig/20231031-103334-arnaudb.json
[10:33:57] <wikibugs>	 (PS4) Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030)
[10:34:38] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[10:34:41] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) Open→In progress p:Triage→Medium
[10:36:35] <wikibugs>	 (CR) Majavah: [C: +1] Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: Cathal Mooney)
[10:37:00] <wikibugs>	 (Abandoned) Giuseppe Lavagetto: Add weekly-update script [deployment-charts] - https://gerrit.wikimedia.org/r/970204 (https://phabricator.wikimedia.org/T344478) (owner: Giuseppe Lavagetto)
[10:37:29] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1007.eqiad.wmnet with OS bookworm
[10:38:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 10%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53093 and previous config saved to /var/cache/conftool/dbconfig/20231031-103804-arnaudb.json
[10:38:49] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond)
[10:41:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add weekly-update script [docker-images/production-images] - https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: Giuseppe Lavagetto)
[10:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:42:56] <wikibugs>	 (PS1) Slyngshede: P:base enable ethtool data collection [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312)
[10:44:04] <wikibugs>	 (PS1) Aklapper: Correct IDP Privacy Policy [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/970330 (https://phabricator.wikimedia.org/T350129)
[10:45:03] <wikibugs>	 (PS1) Brouberol: Generate an RSA2048-encrypted private key for Skein [puppet] - https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398)
[10:45:49] <wikibugs>	 (PS2) Brouberol: Generate an RSA2048-encrypted private key for Skein [puppet] - https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398)
[10:47:55] <wikibugs>	 (PS1) Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078)
[10:48:31] <wikibugs>	 (CR) Slyngshede: "It might be beneficial if you would take a look at the prometheus::ethtool_exporter" [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[10:48:35] <wikibugs>	 (CR) Ayounsi: "FYI we don't need to enable it on VMs." [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[10:48:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 20%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53094 and previous config saved to /var/cache/conftool/dbconfig/20231031-104839-arnaudb.json
[10:48:51] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[10:49:10] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond)
[10:50:30] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage
[10:50:49] <wikibugs>	 (PS3) Brouberol: Generate an RSA 4096-encrypted private key for Skein [puppet] - https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398)
[10:52:08] <wikibugs>	 (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/249/con" [puppet] - https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[10:52:30] <wikibugs>	 (CR) Brouberol: Generate an RSA 4096-encrypted private key for Skein [puppet] - https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[10:53:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 20%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53095 and previous config saved to /var/cache/conftool/dbconfig/20231031-105308-arnaudb.json
[10:53:11] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage
[11:00:12] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: Cathal Mooney)
[11:00:47] <wikibugs>	 (Merged) jenkins-bot: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: Cathal Mooney)
[11:03:37] <wikibugs>	 (PS2) Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078)
[11:03:39] <wikibugs>	 (CR) Volans: [C: -1] "I think we can avoid to hardcode them" [cookbooks] - https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[11:03:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 30%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53096 and previous config saved to /var/cache/conftool/dbconfig/20231031-110344-arnaudb.json
[11:04:41] <wikibugs>	 (CR) Vgutierrez: "this is kinda co" [puppet] - https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:08:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 30%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53097 and previous config saved to /var/cache/conftool/dbconfig/20231031-110813-arnaudb.json
[11:09:44] <wikibugs>	 (CR) Vgutierrez: Basic retry mechanism for specific kafka errors (2 comments) [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur)
[11:10:47] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:12:52] <jinxer-wm>	 (ProbeDown) resolved: (40) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:14:29] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:13] <wikibugs>	 (PS1) Majavah: diffscan: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/970335
[11:15:15] <wikibugs>	 (PS1) Majavah: P:diffscan: add support for configuring multiple instances [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653)
[11:15:23] <wikibugs>	 (PS1) Majavah: hieradata: lock down ssh and node-exporter on cloudvirts [puppet] - https://gerrit.wikimedia.org/r/970337
[11:15:39] <wikibugs>	 (CR) CI reject: [V: -1] diffscan: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/970335 (owner: Majavah)
[11:15:46] <wikibugs>	 (CR) CI reject: [V: -1] P:diffscan: add support for configuring multiple instances [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: Majavah)
[11:16:52] <wikibugs>	 (PS1) Jbond: pki::multirootca: Add puppet_rsa to multirootca [puppet] - https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118)
[11:16:58] <wikibugs>	 (PS3) Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078)
[11:17:06] <wikibugs>	 (CR) Fabfur: Basic retry mechanism for specific kafka errors (2 comments) [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur)
[11:18:05] <wikibugs>	 (PS2) Jbond: pki::multirootca: Add puppet_rsa to multirootca [puppet] - https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118)
[11:18:24] <wikibugs>	 (PS2) Majavah: diffscan: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/970335
[11:18:26] <wikibugs>	 (PS2) Majavah: P:diffscan: add support for configuring multiple instances [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653)
[11:18:28] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/253/con" [puppet] - https://gerrit.wikimedia.org/r/970337 (owner: Majavah)
[11:18:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 40%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53098 and previous config saved to /var/cache/conftool/dbconfig/20231031-111849-arnaudb.json
[11:18:51] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:03] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:11] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/255/con" [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: Majavah)
[11:23:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 40%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53099 and previous config saved to /var/cache/conftool/dbconfig/20231031-112318-arnaudb.json
[11:24:23] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1007.eqiad.wmnet with OS bookworm
[11:24:48] <wikibugs>	 (PS3) Majavah: diffscan: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/970335
[11:24:49] <wikibugs>	 (PS3) Majavah: P:diffscan: add support for configuring multiple instances [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653)
[11:25:09] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:57] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/256/con" [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: Majavah)
[11:26:28] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "Should be safe from what I understand nothing connects to these services apart from on the 10.x IP.  Might there be connections from local" [puppet] - https://gerrit.wikimedia.org/r/970337 (owner: Majavah)
[11:27:17] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] hieradata: lock down ssh and node-exporter on cloudvirts [puppet] - https://gerrit.wikimedia.org/r/970337 (owner: Majavah)
[11:27:42] <wikibugs>	 (CR) Kamila Součková: [C: +1] service_proxy: add rest-gateway to listeners [puppet] - https://gerrit.wikimedia.org/r/968617 (https://phabricator.wikimedia.org/T348731) (owner: Hnowlan)
[11:28:12] <wikibugs>	 (PS2) Slyngshede: P:base enable ethtool data collection [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312)
[11:28:14] <wikibugs>	 (CR) Jbond: "thanks see inline" [puppet] - https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:30:10] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[11:31:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:40] <wikibugs>	 (PS1) Giuseppe Lavagetto: Review access change [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/970349
[11:31:57] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Review access change [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/970349 (owner: Giuseppe Lavagetto)
[11:31:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:32:47] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/259/console" [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[11:32:50] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/258/con" [puppet] - https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118) (owner: Jbond)
[11:33:39] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] pki::multirootca: Add puppet_rsa to multirootca [puppet] - https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118) (owner: Jbond)
[11:33:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53101 and previous config saved to /var/cache/conftool/dbconfig/20231031-113353-arnaudb.json
[11:36:33] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/260/console" [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[11:36:44] <wikibugs>	 (PS4) Majavah: P:diffscan: add support for configuring multiple instances [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653)
[11:38:05] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/261/console" [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[11:38:12] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/262/console" [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: Majavah)
[11:38:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53102 and previous config saved to /var/cache/conftool/dbconfig/20231031-113823-arnaudb.json
[11:40:10] <jinxer-wm>	 (ProbeDown) firing: (19) Service pki2002:443 has failed probes (http_PKI_cassandra_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:41:19] <wikibugs>	 (CR) Hnowlan: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/263/con" [puppet] - https://gerrit.wikimedia.org/r/968617 (https://phabricator.wikimedia.org/T348731) (owner: Hnowlan)
[11:45:10] <jinxer-wm>	 (ProbeDown) firing: (40) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:48:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 60%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53103 and previous config saved to /var/cache/conftool/dbconfig/20231031-114858-arnaudb.json
[11:50:10] <jinxer-wm>	 (ProbeDown) firing: (42) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:51:27] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 60%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53104 and previous config saved to /var/cache/conftool/dbconfig/20231031-115328-arnaudb.json
[11:55:38] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:55] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:58:34] <wikibugs>	 SRE, Wikimedia-Etherpad, collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.4 - https://phabricator.wikimedia.org/T316421 (LSobanski)
[11:58:39] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:58:41] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:59:16] <wikibugs>	 SRE, Wikimedia-Etherpad, collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.4 - https://phabricator.wikimedia.org/T316421 (LSobanski) I updated the description to reflect the new Etherpad release (1.9.4). See below for a list of changes:  * Compatibility changes ** Log4js has been updated t...
[11:59:51] <wikibugs>	 (PS1) Jbond: pki::multirootca: Add parameter so pki can generate its certs [puppet] - https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1200)
[12:00:28] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 7.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:00:40] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:00:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:01:32] <wikibugs>	 (PS1) Jbond: pki: move pki1001 back to puppet5 [puppet] - https://gerrit.wikimedia.org/r/970340
[12:01:43] <wikibugs>	 (CR) Jbond: [C: +2] pki: move pki1001 back to puppet5 [puppet] - https://gerrit.wikimedia.org/r/970340 (owner: Jbond)
[12:04:04] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 70%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53105 and previous config saved to /var/cache/conftool/dbconfig/20231031-120403-arnaudb.json
[12:05:49] <wikibugs>	 (PS1) Cathal Mooney: Do not NAT traffic from cloud VPS to cloud-private, and filter ports [puppet] - https://gerrit.wikimedia.org/r/970341 (https://phabricator.wikimedia.org/T350132)
[12:06:26] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:00] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:07:06] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:08:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 70%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53106 and previous config saved to /var/cache/conftool/dbconfig/20231031-120833-arnaudb.json
[12:09:08] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:09:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:10:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:13:54] <wikibugs>	 (03PS2) 10Jbond: pki::multirootca: Add parameter so pki can generate its certs [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118)
[12:15:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/265/con" [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond)
[12:15:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[12:15:54] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:16:36] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] "Ack. Thanks for the heads up." [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking)
[12:17:05] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "Overall a lot neater, nice." [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[12:17:33] <wikibugs>	 (03PS3) 10Jbond: pki::multirootca: Add parameter so pki can generate its certs [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118)
[12:18:22] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.342 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:50] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond)
[12:19:02] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:19:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 80%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53107 and previous config saved to /var/cache/conftool/dbconfig/20231031-121908-arnaudb.json
[12:20:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::multirootca: Add parameter so pki can generate its certs [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond)
[12:21:52] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:35] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] services: update ChangeProp's eqiad Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[12:23:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 80%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53108 and previous config saved to /var/cache/conftool/dbconfig/20231031-122338-arnaudb.json
[12:24:53] <wikibugs>	 (03PS2) 10Cathal Mooney: Do not NAT traffic from cloud VPS to cloud-private, and filter ports [puppet] - 10https://gerrit.wikimedia.org/r/970341 (https://phabricator.wikimedia.org/T350132)
[12:25:10] <jinxer-wm>	 (ProbeDown) resolved: (42) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:25:45] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Generate an RSA 4096-encrypted private key for Skein [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol)
[12:25:48] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: db1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53113 and previous config saved to /var/cache/conftool/dbconfig/20231031-125348-arnaudb.json
[12:55:40] <jinxer-wm>	 (ProbeDown) resolved: (24) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:56:10] <wikibugs>	 (03PS1) 10Jbond: cfssl::ocsp: use client mtls certs if present [puppet] - 10https://gerrit.wikimedia.org/r/970369 (https://phabricator.wikimedia.org/T350118)
[12:56:13] <wikibugs>	 (03CR) 10Elukey: changeprop: allow to specify consumer/producer kafka settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[12:57:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/268/con" [puppet] - 10https://gerrit.wikimedia.org/r/970369 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond)
[12:58:50] <wikibugs>	 (03PS2) 10Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950)
[12:59:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::ocsp: use client mtls certs if present [puppet] - 10https://gerrit.wikimedia.org/r/970369 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond)
[12:59:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[13:04:34] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:30] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] Correct Gerrit Privacy Policy [puppet] - 10https://gerrit.wikimedia.org/r/970283 (https://phabricator.wikimedia.org/T350124) (owner: 10Aklapper)
[13:05:52] <ihurbain>	 o/ is the afternoon backport window happening? 
[13:06:08] <ihurbain>	 (or did i get lost in time change whirlpool?)
[13:06:45] <RhinosF1>	 jouncebot: nowandnext
[13:06:46] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1300)
[13:06:46] <jouncebot>	 In 1 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1500)
[13:06:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on pki2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:07:15] <ihurbain>	 hah. thanks :D
[13:08:12] <RhinosF1>	 RoanKattouw, Lucas_WMDE, urbanecm, awight, TheresNoTime, taavi: It's time to deploy and jouncebot broke
[13:08:20] <RhinosF1>	 ihurbain: there
[13:08:27] <ihurbain>	 RhinosF1: thank you kindly! :)
[13:08:39] <wikibugs>	 (03PS1) 10Ottomata: eventgate chart - debug mode: add some perf settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970371 (https://phabricator.wikimedia.org/T347477)
[13:08:59] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) a:03jcrespo
[13:09:14] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:09:54] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate chart - debug mode: add some perf settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970371 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata)
[13:10:51] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate chart - debug mode: add some perf settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970371 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata)
[13:11:05] <wikibugs>	 (03PS3) 10Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950)
[13:12:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] P:base enable ethtool data collection [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[13:13:25] <wikibugs>	 (03PS1) 10Ottomata: eventgate chart - fix missing comma [deployment-charts] - 10https://gerrit.wikimedia.org/r/970372 (https://phabricator.wikimedia.org/T347477)
[13:13:54] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate chart - fix missing comma [deployment-charts] - 10https://gerrit.wikimedia.org/r/970372 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata)
[13:14:50] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:50] <wikibugs>	 (03PS4) 10Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950)
[13:15:05] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate chart - fix missing comma [deployment-charts] - 10https://gerrit.wikimedia.org/r/970372 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata)
[13:15:11] <TheresNoTime>	 ihurbain: I can deploy :)
[13:15:21] <ihurbain>	 woot!
[13:15:25] <ihurbain>	 i'm around & ready :)
[13:15:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:16:33] <wikibugs>	 (03Merged) 10jenkins-bot: Roll-out Parsoid Kartographer support for all English language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:16:39] <RhinosF1>	 ty TheresNoTime 
[13:16:57] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:969168|Roll-out Parsoid Kartographer support for all English language wikis (T342871)]]
[13:17:02] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:17:39] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[13:18:19] <logmsgbot>	 !log samtar@deploy2002 ihurbain and samtar: Backport for [[gerrit:969168|Roll-out Parsoid Kartographer support for all English language wikis (T342871)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:18:34] <TheresNoTime>	 ihurbain: live on mwdebug, can you test? :)
[13:18:38] <ihurbain>	 testing
[13:19:02] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:29] <wikibugs>	 (03CR) 10Elukey: changeprop: allow to specify consumer/producer kafka settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[13:22:16] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: New mailing list request for Project Korikath - https://phabricator.wikimedia.org/T349429 (10Mrb_Rafi) Thanks a lot for the support, @Ladsgroup
[13:22:19] <ihurbain>	 TheresNoTime: we happy, ship it!
[13:22:24] <logmsgbot>	 !log samtar@deploy2002 ihurbain and samtar: Continuing with sync
[13:22:27] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Update flink-session-cluster to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969343 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[13:22:29] <TheresNoTime>	 :D
[13:22:40] <ihurbain>	 TheresNoTime: thank you very much :)
[13:23:36] <TheresNoTime>	 you're very welcome :) it'll take a few minutes to be live, I'll ping you again just to double-check it's still working okay
[13:23:46] <ihurbain>	 ack :)
[13:27:15] <wikibugs>	 (03PS1) 10Ottomata: eventgate chart - remove --prof-process flag from debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/970374 (https://phabricator.wikimedia.org/T347477)
[13:27:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) We should discuss this a bit- as this changes not only the initial hypothesis, but also the restrictions of your proje...
[13:27:47] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:969168|Roll-out Parsoid Kartographer support for all English language wikis (T342871)]] (duration: 10m 49s)
[13:27:47] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[13:27:51] <TheresNoTime>	 ihurbain: live on prod :) 
[13:27:52] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:28:02] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate chart - remove --prof-process flag from debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/970374 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata)
[13:28:04] <ihurbain>	 shiny! :)
[13:29:34] <ihurbain>	 it still seems to be working okay.
[13:29:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:29:55] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate chart - remove --prof-process flag from debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/970374 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata)
[13:30:29] <TheresNoTime>	 \o/
[13:30:50] <TheresNoTime>	 !log close UTC afternoon backport window
[13:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:27] <wikibugs>	 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10herron)
[13:35:59] <wikibugs>	 (03PS1) 10Jbond: etcd::client::globalconfig: switch to wmf-ca-certificate [puppet] - 10https://gerrit.wikimedia.org/r/970377 (https://phabricator.wikimedia.org/T350147)
[13:36:22] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[13:36:35] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[13:38:40] <wikibugs>	 (03PS1) 10Ayounsi: Add MoveServersUplinks Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129)
[13:40:02] <wikibugs>	 (03CR) 10Xcollazo: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/970272 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene)
[13:41:20] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:30] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[13:51:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] etcd::client::globalconfig: switch to wmf-ca-certificate [puppet] - 10https://gerrit.wikimedia.org/r/970377 (https://phabricator.wikimedia.org/T350147) (owner: 10Jbond)
[13:52:18] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[13:58:16] <sbassett>	 Hey folks - are backport window deploys complete?  There’s a quick sec patch update I’d like to get out now, if possible...
[13:58:43] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[13:58:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[13:59:01] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:01:52] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: db1131 decomission [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141)
[14:02:59] <wikibugs>	 (03CR) 10Arnaudb: "this is supposed to be steps 1 to 5 of https://phabricator.wikimedia.org/T350141" [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb)
[14:05:24] <wikibugs>	 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10herron)
[14:06:50] <sbassett>	 !log Deployed updated security mitigation for T348828
[14:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:52] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:07] <wikibugs>	 (03CR) 10Marostegui: "We approach this in a different way, keep in mind that the template is quite generic and might not fit our needs." [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb)
[14:10:12] <wikibugs>	 (03PS9) 10Herron: prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807)
[14:11:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Volans) @cmooney thanks for the summary, couple of questions:  1) will the migration be performed rack by rack as opposed to s...
[14:13:05] <sukhe>	 !log install4002:/etc/dhcp/automation/ttyS1-115200 rm cp4052.conf
[14:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew)
[14:19:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) @Volans to get the prefix ge vs xe maybe use the rack. In codfw we have only 10g servers racked in 10g rack and th...
[14:20:17] <wikibugs>	 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10herron)
[14:21:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10ayounsi) > will the migration be performed rack by rack as opposed to server by server? yep  > For multi-unit servers we pick...
[14:23:45] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) firing: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[14:23:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) yes we always pick the lower numbering unit for 2U host.
[14:23:54] <jynus>	 here
[14:24:10] <jynus>	 mgmt so should not be a huge issue, acking
[14:24:20] <hnowlan>	 here also 
[14:24:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:25:08] <jynus>	 I know topranks was working on something related to cloud
[14:25:20] <jynus>	 maybe there was a spike of traffic or a reboot or something?
[14:25:23] <XioNoX>	 taavi, andrewbogott, anything going on in WMCS?
[14:25:32] <urandom>	 Making breakfast, but here if needed 
[14:25:46] <XioNoX>	 the F4-D5 link is saturating: https://librenms.wikimedia.org/device/device=242/tab=port/port=25230/
[14:26:13] <taavi>	 hm, andrew is doing something with Ceph which might explain it
[14:26:17] <andrewbogott>	 XioNoX: I'm rebalancing a couple of ceph nodes but nothing that we haven't done 100 times before
[14:26:32] <XioNoX>	 yeah probably related
[14:26:59] <taavi>	 jynus: mgmt in the alert means that the switch monitoring data is polled via the management network, not that the alert is about management network traffic
[14:27:08] <jynus>	 I get it now
[14:27:10] <andrewbogott>	 I'm also not 100% sure how to stop it or throttle it (and I'm in another meeting) do I need to drop everything and look at this?
[14:27:25] <jynus>	 andrewbogott: if cloud is happy we are happy
[14:27:36] <andrewbogott>	 ok! I think we're still good.  thanks
[14:27:40] * andrewbogott back to meeting
[14:27:46] <XioNoX>	 andrewbogott: it's up to you, there is congestion on one of the links, if nothing else alerts that's probably fine to wait
[14:28:45] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) resolved: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[14:29:29] <jynus>	 ok, then as we know the most probable root cause for that, I will not give it more thought
[14:29:56] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder)
[14:30:40] <jynus>	 ^what's the right way to go about that
[14:31:05] <jynus>	 we just leave it there, right?
[14:31:15] <jynus>	 the ticket, I mean
[14:35:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T343198)', diff saved to https://phabricator.wikimedia.org/P53116 and previous config saved to /var/cache/conftool/dbconfig/20231031-143545-arnaudb.json
[14:35:58] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[14:36:14] <topranks>	 jynus: I'll take a look at the ticket about spine switch discards 
[14:36:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[14:36:36] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10ayounsi)
[14:36:44] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10cmooney) a:05Jclark-ctr→03cmooney
[14:36:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi)
[14:37:04] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10ayounsi)
[14:37:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:19] <wikibugs>	 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10Jclark-ctr) Configured  idrac manually and verified connection on switch
[14:38:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:52] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10cmooney) a:05cmooney→03Jclark-ctr
[14:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:42:39] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10cmooney) Yeah I think we still need to look at this, further errors on the link today.  Seems somewhat related to throughput, but we are miles away from capacity (peaks under 2Gb/sec).  I'd say worth trying an optic swap on one...
[14:44:48] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "LGTM! I agree with @Xcollazo's remark." [puppet] - 10https://gerrit.wikimedia.org/r/970272 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene)
[14:45:26] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: docker::builder: add system to properly perform a weekly update [puppet] - 10https://gerrit.wikimedia.org/r/970391 (https://phabricator.wikimedia.org/T344478)
[14:45:28] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: docker::builder: switch systemd timer to our new script [puppet] - 10https://gerrit.wikimedia.org/r/970392 (https://phabricator.wikimedia.org/T344478)
[14:45:48] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:46] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:47:02] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:47:20] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:47:43] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add fake ssh private key for docker::builder [labs/private] - 10https://gerrit.wikimedia.org/r/970393
[14:47:46] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:48:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add fake ssh private key for docker::builder [labs/private] - 10https://gerrit.wikimedia.org/r/970393 (owner: 10Giuseppe Lavagetto)
[14:48:10] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:49:50] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:50:10] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:50:36] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:50:52] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53117 and previous config saved to /var/cache/conftool/dbconfig/20231031-145052-arnaudb.json
[14:53:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:48] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:55:12] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:55:28] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:55:46] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:56:07] <wikibugs>	 (PS1) Giuseppe Lavagetto: docker::builder: strings must be strings in yaml [labs/private] - https://gerrit.wikimedia.org/r/970395
[14:56:20] <wikibugs>	 (PS4) Majavah: diffscan: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/970335
[14:56:24] <wikibugs>	 (PS5) Majavah: P:diffscan: add support for configuring multiple instances [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653)
[14:56:28] <wikibugs>	 (PS1) Majavah: P:diffscan: add scan for WMCS infrastructure addresses [puppet] - https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653)
[14:56:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] docker::builder: strings must be strings in yaml [labs/private] - https://gerrit.wikimedia.org/r/970395 (owner: Giuseppe Lavagetto)
[14:57:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[14:57:11] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/278/con" [puppet] - https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) (owner: Majavah)
[14:57:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) >>! In T327938#9234691, @Volans wrote: > @cmooney adding a note here to not forget. We'll need to check how it will work for Ganeti VMs, in particular the makev...
[14:58:14] <wikibugs>	 (Abandoned) Jdlrobson: [Visual change] Normalize small font sizes in Vector 2022 [skins/Vector] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/968314 (https://phabricator.wikimedia.org/T346062) (owner: Jdlrobson)
[14:58:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/279/con" [puppet] - https://gerrit.wikimedia.org/r/970391 (https://phabricator.wikimedia.org/T344478) (owner: Giuseppe Lavagetto)
[14:59:02] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:59:24] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:59:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:59:58] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:03:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:28] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm
[15:04:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[15:05:34] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[15:05:47] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[15:05:49] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:05:52] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[15:05:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53118 and previous config saved to /var/cache/conftool/dbconfig/20231031-150558-arnaudb.json
[15:08:19] <wikibugs>	 (CR) Brouberol: [V: +1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/970378 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[15:11:13] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[15:11:16] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:15:03] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (ayounsi) There will be special usecase, but if we can tackle all the regular servers (eg. 1 uplink, 1 IP, 1 , then we will be in a great spot.  The ideal/cleanest is to go through a re...
[15:15:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:33] <wikibugs>	 (PS1) Jforrester: wikifunctions: Bump orchestrator to image 2023-10-31-024528 [deployment-charts] - https://gerrit.wikimedia.org/r/970398 (https://phabricator.wikimedia.org/T350034)
[15:20:55] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Bump orchestrator to image 2023-10-31-024528 [deployment-charts] - https://gerrit.wikimedia.org/r/970398 (https://phabricator.wikimedia.org/T350034) (owner: Jforrester)
[15:21:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T343198)', diff saved to https://phabricator.wikimedia.org/P53119 and previous config saved to /var/cache/conftool/dbconfig/20231031-152105-arnaudb.json
[15:21:14] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[15:21:46] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Bump orchestrator to image 2023-10-31-024528 [deployment-charts] - https://gerrit.wikimedia.org/r/970398 (https://phabricator.wikimedia.org/T350034) (owner: Jforrester)
[15:22:13] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:22:53] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:23:53] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:23:58] <wikibugs>	 (CR) BCornwall: [C: +1] "Nice job!" [puppet] - https://gerrit.wikimedia.org/r/967950 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[15:24:02] <wikibugs>	 (CR) Hnowlan: [C: +1] changeprop: allow to specify consumer/producer kafka settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[15:24:18] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm
[15:25:02] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:25:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:25:19] <wikibugs>	 (CR) Elukey: [C: +2] changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[15:26:16] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:28:35] <wikibugs>	 (PS2) Arnaudb: mariadb: db1131 decomission [puppet] - https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141)
[15:28:57] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[15:29:12] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[15:29:32] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:27] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1131.eqiad.wmnet
[15:35:13] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: db1131 decomission [puppet] - https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: Arnaudb)
[15:37:37] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/970378 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[15:38:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[15:41:01] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1131.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:42:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1131.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:42:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:42:25] <logmsgbot>	 !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db1131.eqiad.wmnet
[15:43:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[15:44:16] <wikibugs>	 (CR) Brouberol: [V: +1 C: +2] Convert the Skein private key to the PKCS#8 format [puppet] - https://gerrit.wikimedia.org/r/970378 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[15:46:54] <wikibugs>	 (PS1) Brouberol: Fix typo in unless condition [puppet] - https://gerrit.wikimedia.org/r/970401
[15:48:27] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: db1131 decomission [puppet] - https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: Arnaudb)
[15:48:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:38] <wikibugs>	 (CR) Vgutierrez: Basic retry mechanism for specific kafka errors (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur)
[15:49:51] <wikibugs>	 (CR) Brouberol: [C: +2] Fix typo in unless condition [puppet] - https://gerrit.wikimedia.org/r/970401 (owner: Brouberol)
[15:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:51:57] <wikibugs>	 (CR) Cathal Mooney: "LGTM overall, a few comments on the approach but good to go.  The one on the interface naming I think we do need to tackle, not 100% sure " [software/netbox-extras] - https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: Ayounsi)
[15:52:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'discard db1131', diff saved to https://phabricator.wikimedia.org/P53120 and previous config saved to /var/cache/conftool/dbconfig/20231031-155253-arnaudb.json
[15:53:15] <wikibugs>	 (PS1) Filippo Giunchedi: team-sre: ignore systemd_unit_.+_owner stale textfile [alerts] - https://gerrit.wikimedia.org/r/970402 (https://phabricator.wikimedia.org/T349176)
[15:54:30] <wikibugs>	 (PS1) Brouberol: Fix puppet error by providing the openssl absolute path [puppet] - https://gerrit.wikimedia.org/r/970403
[15:55:15] <wikibugs>	 (CR) Brouberol: "Sorry about the quifix PR. This slipped through PCC." [puppet] - https://gerrit.wikimedia.org/r/970403 (owner: Brouberol)
[15:57:04] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (ABran-WMF) db1131 is ready to be handled (T350141)
[15:58:48] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review: decommission db1131.eqiad.wmnet - https://phabricator.wikimedia.org/T350141 (ABran-WMF)
[16:00:05] <jouncebot>	 jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1600)
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:02] <wikibugs>	 (PS7) Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547)
[16:04:06] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm
[16:06:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[16:07:31] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (Volans) >>! In T348129#9295072, @ayounsi wrote: >> this way there is no check to ensure that reality corresponds to what we do...
[16:08:46] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[16:10:55] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1002 - taavi@cumin1001"
[16:11:44] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1002 - taavi@cumin1001"
[16:11:44] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:12:30] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED
[16:15:22] <logmsgbot>	 !log taavi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED
[16:15:42] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
[16:15:56] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, Data-Platform-SRE, cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
[16:18:47] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:42] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye
[16:20:49] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye
[16:22:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[16:23:11] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[16:23:26] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[16:25:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[16:27:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (ayounsi) > What I mean is that this way it might be harder to catch mistakes, if a host has been plugged into a different port...
[16:27:28] <logmsgbot>	 !log taavi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
[16:27:41] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, Data-Platform-SRE, cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with...
[16:28:35] <wikibugs>	 (CR) Fabfur: Basic retry mechanism for specific kafka errors (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur)
[16:28:51] <wikibugs>	 (PS4) Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078)
[16:29:53] <wikibugs>	 (PS1) Jbond: pki: switch to cfssl certificate [puppet] - https://gerrit.wikimedia.org/r/970407 (https://phabricator.wikimedia.org/T349619)
[16:30:51] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1104.eqiad.wmnet with OS bullseye
[16:30:51] <wikibugs>	 (CR) Jbond: [C: +2] pki: switch to cfssl certificate [puppet] - https://gerrit.wikimedia.org/r/970407 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[16:30:57] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye executed with errors: - cp1104 (**FAIL**)   - Downtimed on Icinga/...
[16:31:11] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:25] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:43] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:57] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye
[16:32:01] <wikibugs>	 (PS2) Brouberol: Fix puppet error by providing the openssl absolute path [puppet] - https://gerrit.wikimedia.org/r/970403 (https://phabricator.wikimedia.org/T329398)
[16:32:02] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye
[16:32:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:10] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/970403 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[16:33:23] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:55] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:43] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:46] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[16:35:00] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[16:35:24] <wikibugs>	 (CR) Brouberol: [C: +2] Fix puppet error by providing the openssl absolute path [puppet] - https://gerrit.wikimedia.org/r/970403 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[16:40:02] <wikibugs>	 (PS3) Elukey: services: update ChangeProp's eqiad Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950)
[16:41:44] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[16:43:14] <wikibugs>	 (CR) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: Jbond)
[16:43:26] <wikibugs>	 (PS1) Brouberol: Hide skein private key diff in puppet logs [puppet] - https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398)
[16:43:38] <wikibugs>	 (PS10) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[16:44:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1103.eqiad.wmnet with OS bullseye
[16:45:17] <wikibugs>	 (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/281/con" [puppet] - https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[16:45:19] <wikibugs>	 (PS1) Jbond: pki1001: move back to puppet7 [puppet] - https://gerrit.wikimedia.org/r/970409
[16:46:09] <wikibugs>	 (CR) Jbond: [C: +2] pki1001: move back to puppet7 [puppet] - https://gerrit.wikimedia.org/r/970409 (owner: Jbond)
[16:46:26] <wikibugs>	 (CR) Elukey: [C: +2] services: update ChangeProp's eqiad Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[16:48:40] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: Jbond)
[16:49:20] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[16:49:57] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[16:50:14] <wikibugs>	 (PS1) Stevemunene: Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - https://gerrit.wikimedia.org/r/970360
[16:50:36] <wikibugs>	 (CR) CI reject: [V: -1] Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - https://gerrit.wikimedia.org/r/970360 (owner: Stevemunene)
[16:51:41] <wikibugs>	 (PS2) Stevemunene: Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - https://gerrit.wikimedia.org/r/970360
[16:52:29] <wikibugs>	 (PS1) Ssingh: Release dnsdist 1.8.2-1+wmf12u1 [debs/dnsdist] - https://gerrit.wikimedia.org/r/970413
[16:52:35] <wikibugs>	 (CR) Ryan Kemper: [C: +1] Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - https://gerrit.wikimedia.org/r/970360 (owner: Stevemunene)
[16:56:28] <TheresNoTime>	 jouncebot: nowandnext
[16:56:28] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1600)
[16:56:29] <jouncebot>	 In 0 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1700)
[16:56:42] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/970412 (https://phabricator.wikimedia.org/T347435) (owner: Samtar)
[16:57:14] <wikibugs>	 (CR) Stevemunene: [C: +2] Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - https://gerrit.wikimedia.org/r/970360 (owner: Stevemunene)
[16:57:24] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta [mediawiki-config] - https://gerrit.wikimedia.org/r/970412 (https://phabricator.wikimedia.org/T347435) (owner: Samtar)
[16:58:45] <wikibugs>	 (CR) Ebernhardson: [C: +1] "nothing seems obviously wrong, although I do wonder about the deployment process. I haven't verified if any of the names/paths here (swift" [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[16:59:38] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1700)
[17:00:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:01:44] <godog>	 that is going to be resolved soon ^
[17:04:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:05:02] <wikibugs>	 (CR) Vgutierrez: "looking good" [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur)
[17:08:01] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:47] <jinxer-wm>	 (Device rebooted) firing: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[17:12:11] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:37] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/970402 (https://phabricator.wikimedia.org/T349176) (owner: Filippo Giunchedi)
[17:14:47] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-a2-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[17:16:00] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[17:16:18] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) In progress→Resolved a:jbond This is fixed now
[17:17:53] <wikibugs>	 (PS1) Ryan Kemper: Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde""" [puppet] - https://gerrit.wikimedia.org/r/970361
[17:19:09] <wikibugs>	 (CR) Stevemunene: [C: +1] Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde""" [puppet] - https://gerrit.wikimedia.org/r/970361 (owner: Ryan Kemper)
[17:19:33] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde""" [puppet] - https://gerrit.wikimedia.org/r/970361 (owner: Ryan Kemper)
[17:19:48] <wikibugs>	 (PS11) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[17:27:58] <Krinkle>	 !log krinkle@deploy2002:/srv/mediawiki/private: fix untracked warning for readme.FatalErrorSettings.php
[17:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:09] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[17:42:22] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[17:42:24] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[17:42:40] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[17:43:14] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[17:43:33] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[17:48:30] <wikibugs>	 (PS2) Ayounsi: Add MoveServersUplinks Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129)
[17:48:53] <wikibugs>	 (CR) Ayounsi: Add MoveServersUplinks Netbox script (4 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: Ayounsi)
[17:49:01] <wikibugs>	 (CR) Ssingh: [C: +2] Release dnsdist 1.8.2-1+wmf12u1 [debs/dnsdist] - https://gerrit.wikimedia.org/r/970413 (owner: Ssingh)
[17:51:03] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:47] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1002
[17:51:49] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt-wdqs1002
[17:52:10] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1104.eqiad.wmnet with OS bullseye
[17:52:15] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye executed with errors: - cp1104 (**FAIL**)   - Removed from Puppet...
[17:55:13] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:20] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED
[17:59:24] <logmsgbot>	 !log taavi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED
[18:00:04] <jouncebot>	 dduvall and dancy: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1800).
[18:00:14] <dancy>	 o/
[18:04:20] <sukhe>	 !log reprepro -C component/dnsdist include bookworm-wikimedia dnsdist_1.8.2-1+wmf12u1_amd64.changes
[18:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:20] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (ayounsi) It's live on netbox-next: https://netbox-next.wikimedia.org/extras/scripts/move_server.MoveServersUplinks/  See that...
[18:05:48] <wikibugs>	 (PS3) Ayounsi: Add MoveServersUplinks Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129)
[18:09:03] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:11:24] <wikibugs>	 (PS1) Kamila Součková: Initial commit of kube-state-metrics chart from prometheus-community [deployment-charts] - https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625)
[18:13:11] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:16:50] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970426 (https://phabricator.wikimedia.org/T348356)
[18:16:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970426 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot)
[18:17:27] <dduvall>	 dancy: o/
[18:17:48] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970426 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot)
[18:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-msearch-daemon@0.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:22:48] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye
[18:22:54] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye
[18:23:44] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[18:24:05] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.3  refs T348356
[18:24:10] <stashbot>	 T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356
[18:25:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[18:27:01] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:11] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:08] <mutante>	 imagines a world where monitoring knows that right now is deployment window and therefore does not check but keeps checking once it's over
[18:32:57] <dancy>	 Dzahn!  Welcome back.
[18:33:02] <RhinosF1>	 mutante: you're back!
[18:33:09] <mutante>	 thanks dancy and RhinosF1 :)
[18:33:12] <RhinosF1>	 I have about 3 things to tell you
[18:33:22] <RhinosF1>	 mutante: when is too soon to annoy you
[18:33:25] <mutante>	 i just made a ticket in upstream phorge :)
[18:33:36] <mutante>	 RhinosF1: ping me, it's ok
[18:33:39] <mutante>	 i mean.. PM
[18:37:57] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[18:39:08] <wikibugs>	 SRE, Acme-chief, Traffic, Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (CodeReviewBot) brett merged https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3  Update dependencies to match Bookworm versions
[18:39:15] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (CodeReviewBot) brett merged https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3  Update dependencies to match Bookworm versions
[18:40:57] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[18:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:51:44] <wikibugs>	 (PS1) FNegri: Add component/prometheus-openstack-exporter to bookworm [puppet] - https://gerrit.wikimedia.org/r/970430 (https://phabricator.wikimedia.org/T350154)
[18:55:50] <wikibugs>	 (CR) Majavah: [C: +1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/970430 (https://phabricator.wikimedia.org/T350154) (owner: FNegri)
[18:57:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[18:59:45] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1104.eqiad.wmnet with OS bullseye
[18:59:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye completed: - cp1104 (**PASS**)   - Remo...
[19:00:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[19:01:09] <wikibugs>	 (CR) FNegri: [C: +2] Add component/prometheus-openstack-exporter to bookworm [puppet] - https://gerrit.wikimedia.org/r/970430 (https://phabricator.wikimedia.org/T350154) (owner: FNegri)
[19:01:27] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye
[19:01:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye
[19:02:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[19:04:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:09:26] <kimberly_sarabia>	 hello
[19:10:08] <kimberly_sarabia>	 oops sorry, I'm an hour early
[19:12:07] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1105.eqiad.wmnet with OS bullseye
[19:12:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL*...
[19:12:26] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye
[19:12:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye
[19:16:16] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:04] <wikibugs>	 (CR) Gehel: [C: -1] "Looks mostly good, but 2 minor comments inline." [puppet] - https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[19:49:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:55:49] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[19:56:02] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[19:57:14] <wikibugs>	 (PS1) Ryan Kemper: cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075)
[19:57:54] <wikibugs>	 (PS2) Ryan Kemper: cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075)
[19:59:01] <wikibugs>	 (CR) Bking: [C: +1] cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075) (owner: Ryan Kemper)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T2000).
[20:00:04] <jouncebot>	 kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:58] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:01] * TheresNoTime can deploy
[20:01:11] <kimberly_sarabia>	 hello
[20:01:12] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[20:02:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544) (owner: Kimberly Sarabia)
[20:02:13] <wikibugs>	 (CR) Cwhite: [C: -1] "The 1.0.0-2 template file will be removed from the host and a 1.0.0-3 will be added correctly.  However, the logstash output is configured" [puppet] - https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) (owner: Herron)
[20:02:25] <wikibugs>	 (CR) Ryan Kemper: [C: +2] cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075) (owner: Ryan Kemper)
[20:02:43] <wikibugs>	 (Merged) jenkins-bot: Deploy vector 2022 to non-English Wikibooks, etc [mediawiki-config] - https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544) (owner: Kimberly Sarabia)
[20:03:08] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:969971|Deploy vector 2022 to non-English Wikibooks, etc (T349544)]]
[20:03:14] <stashbot>	 T349544: Deployment of Vector 2022 to non-English Wikibooks, Wikinews,  Wikiquotes, Wikiversity, and metawiki - https://phabricator.wikimedia.org/T349544
[20:04:29] <logmsgbot>	 !log samtar@deploy2002 samtar and ksarabia: Backport for [[gerrit:969971|Deploy vector 2022 to non-English Wikibooks, etc (T349544)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:04:43] <TheresNoTime>	 kimberly_sarabia: live on mwdebug, is this something you can test?
[20:05:30] <kimberly_sarabia>	 Yes, one moment
[20:05:35] <wikibugs>	 (CR) Cwhite: "This change LGTM! Please deploy I4c9c280a142aa07983bfed65158ff6c4a2aeb1e4 before this one to pre-provision the field." [puppet] - https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: Herron)
[20:05:39] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:05:55] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:08:35] <kimberly_sarabia>	 TheresNoTime: LGTM
[20:08:41] <logmsgbot>	 !log samtar@deploy2002 samtar and ksarabia: Continuing with sync
[20:08:47] <TheresNoTime>	 ack :)
[20:11:36] <icinga-wm>	 RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:12:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_mjolnir-kafka-msearch-daemon@0.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:13:59] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:969971|Deploy vector 2022 to non-English Wikibooks, etc (T349544)]] (duration: 10m 51s)
[20:14:04] <stashbot>	 T349544: Deployment of Vector 2022 to non-English Wikibooks, Wikinews,  Wikiquotes, Wikiversity, and metawiki - https://phabricator.wikimedia.org/T349544
[20:14:13] <TheresNoTime>	 kimberly_sarabia: live on prod :)
[20:14:18] <kimberly_sarabia>	 Thanks!
[20:16:26] <TheresNoTime>	 !log close UTC late backport window
[20:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:42] <wikibugs>	 (CR) Cwhite: [C: +1] Switch arclamp to nftables [puppet] - https://gerrit.wikimedia.org/r/969328 (owner: Muehlenhoff)
[20:32:37] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1105.eqiad.wmnet with OS bullseye
[20:32:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL*...
[20:34:12] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye
[20:40:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye
[20:45:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:34] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for postal32 - https://phabricator.wikimedia.org/T348197 (Aklapper) Stalled→Declined Unfortunately closing this Phabricator task as no further information has been provided.
[20:52:12] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:42] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage
[20:58:46] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage
[21:11:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to  releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (Aklapper) Stalled→Declined Unfortunately closing this Phabricator task as no further information has been provided.  @lojo_wmde / @darthmon_wmde: After you have provided t...
[21:15:54] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:26] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1105.eqiad.wmnet with OS bullseye
[21:17:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye completed: - cp1105 (**PASS**)   - Remo...
[21:21:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye
[21:21:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye
[21:23:01] <wikibugs>	 (CR) Fabfur: Basic retry mechanism for specific kafka errors (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur)
[21:28:33] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1106.eqiad.wmnet with OS bullseye
[21:28:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye executed with errors: - cp1106 (**FAIL*...
[21:28:43] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye
[21:28:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye
[21:37:49] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1103.eqiad.wmnet
[21:37:59] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp1103.eqiad.wmnet
[21:38:08] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1103.eqiad.wmnet
[21:38:59] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1106.eqiad.wmnet with OS bullseye
[21:39:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye executed with errors: - cp1106 (**FAIL*...
[21:39:16] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye
[21:39:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye
[21:46:51] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1103.eqiad.wmnet
[21:47:22] <wikibugs>	 (PS5) Samtar: Enable LoginNotify seen subnets table [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[21:53:38] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye
[21:53:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye
[21:54:25] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage
[21:54:52] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:56:12] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:57:22] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:57:30] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage
[21:58:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[21:58:20] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:41] <wikibugs>	 (CR) Jdlrobson: [C: +1] mobile: Add MobileUrlCallback [mediawiki-config] - https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza)
[22:02:12] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye
[22:02:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*...
[22:02:31] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye
[22:02:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye
[22:05:34] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye
[22:05:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye
[22:16:54] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1107.eqiad.wmnet with OS bullseye
[22:16:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*...
[22:17:27] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye
[22:17:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye
[22:17:39] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1108.eqiad.wmnet with OS bullseye
[22:17:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL*...
[22:17:52] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye
[22:17:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye
[22:18:06] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1106.eqiad.wmnet with OS bullseye
[22:18:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye completed: - cp1106 (**PASS**)   - Remo...
[22:19:57] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye
[22:20:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye
[22:21:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[22:24:11] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1108.eqiad.wmnet with OS bullseye
[22:24:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL*...
[22:24:24] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye
[22:24:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye
[22:24:40] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye
[22:24:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL*...
[22:24:46] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye
[22:24:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*...
[22:24:58] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye
[22:25:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye
[22:25:07] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye
[22:25:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye
[22:28:18] <wikibugs>	 SRE, Acme-chief, Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (BCornwall) a:BCornwall
[22:28:50] <wikibugs>	 SRE, Acme-chief, Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (BCornwall) Open→In progress
[22:28:54] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[22:30:40] <wikibugs>	 SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120 (BCornwall) Open→In progress
[22:33:34] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1108.eqiad.wmnet with OS bullseye
[22:33:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL*...
[22:33:44] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye
[22:33:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye
[22:33:51] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye
[22:33:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL*...
[22:34:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye
[22:34:20] <wikibugs>	 SRE, Traffic, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BCornwall) In progress→Resolved Closing due to lack of response. Please re-open if you'd like to re-ignite discussion. Thanks!
[22:34:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye
[22:38:35] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye
[22:38:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*...
[22:38:45] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye
[22:38:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye
[22:41:34] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:45:02] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:50] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage
[22:49:17] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage
[22:52:05] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage
[22:53:53] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage
[22:54:24] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage
[22:57:27] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage
[23:01:44] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye
[23:01:46] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye
[23:01:48] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[23:01:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye
[23:01:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye
[23:01:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye
[23:08:13] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1110.eqiad.wmnet with OS bullseye
[23:08:17] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS bullseye
[23:08:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL*...
[23:08:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye executed with errors: - cp1111 (**FAIL*...
[23:08:28] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[23:08:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL*...
[23:08:38] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye
[23:08:41] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye
[23:08:42] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[23:08:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye
[23:08:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye
[23:08:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye
[23:09:49] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1108.eqiad.wmnet with OS bullseye
[23:09:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye completed: - cp1108 (**PASS**)   - Remo...
[23:12:37] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1109.eqiad.wmnet with OS bullseye
[23:12:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye completed: - cp1109 (**PASS**)   - Remo...
[23:14:52] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[23:14:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL*...
[23:14:58] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1110.eqiad.wmnet with OS bullseye
[23:15:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL*...
[23:15:10] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS bullseye
[23:15:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye executed with errors: - cp1111 (**FAIL*...
[23:15:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye
[23:15:30] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye
[23:15:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye
[23:15:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye
[23:15:41] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[23:15:41] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1107.eqiad.wmnet with OS bullseye
[23:15:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye completed: - cp1107 (**PASS**)   - Remo...
[23:22:53] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS bullseye
[23:22:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye executed with errors: - cp1111 (**FAIL*...
[23:23:00] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[23:23:04] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye
[23:23:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL*...
[23:23:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye
[23:23:12] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[23:23:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye
[23:27:01] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[23:30:33] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage
[23:30:34] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:32] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage
[23:38:05] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage
[23:38:18] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[23:41:00] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage
[23:43:20] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[23:50:02] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:51:31] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1110.eqiad.wmnet with OS bullseye
[23:51:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye completed: - cp1110 (**PASS**)   - Remo...
[23:53:20] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:59:13] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1111.eqiad.wmnet with OS bullseye
[23:59:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye completed: - cp1111 (**PASS**)   - Remo...