[00:00:21] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:52] (ProbeDown) firing: (60) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:04:07] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:31] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:52] (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:08:15] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:23] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:36] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [00:23:29] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [00:30:51] 
RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:59] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:58] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969986 [00:39:00] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969986 (owner: TrainBranchBot) [00:42:29] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:45:55] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:29] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [00:58:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [00:59:21] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969986 (owner: TrainBranchBot) [01:03:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T350095 (phaultfinder) [01:03:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using 
more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:04:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:09:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:13:59] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:15:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:18:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:20:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:24:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:27:59] (PuppetFailure) firing: Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:29:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:31:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:36:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:45:59] (PuppetFailure) firing: Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:51:59] (PuppetFailure) firing: Puppet has failed on elastic2047:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:58:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:59:54] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0200) [02:03:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:04:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [02:06:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 44.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:07:31] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.3 [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/969987 (https://phabricator.wikimedia.org/T348356) [02:07:33] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.3 [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/969987 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot) [02:09:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [02:11:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:21:59] (PuppetFailure) firing: Puppet has failed on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:24:49] (Merged) jenkins-bot: 
Branch commit for wmf/1.42.0-wmf.3 [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/969987 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot) [02:25:59] (PuppetFailure) firing: (2) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:31:59] (PuppetFailure) firing: (2) Puppet has failed on elastic1070:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:32:41] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:49] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:59] (PuppetFailure) firing: (3) Puppet has failed on elastic1070:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:37:59] (PuppetFailure) firing: (2) Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:38:44] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:47:55] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:05] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0300) [03:01:37] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970008 (https://phabricator.wikimedia.org/T348356) [03:01:39] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970008 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot) [03:02:26] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/970008 (https://phabricator.wikimedia.org/T348356) (owner: TrainBranchBot) [03:02:52] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.3 refs T348356 [03:02:58] T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356 [03:04:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:14:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using 
more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [03:16:59] (PuppetFailure) firing: (4) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:19:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [03:20:59] (PuppetFailure) firing: (3) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:39:52] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:42:59] (PuppetFailure) firing: Puppet has failed on kubetcd1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:45:59] (PuppetFailure) firing: Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:50:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:51:59] (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:53:36] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.3 refs T348356 (duration: 50m 44s) [03:53:41] T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356 [03:55:52] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.1 (duration: 02m 14s) [03:57:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [03:59:11] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [04:03:25] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:59] (PuppetFailure) firing: (2) Puppet has failed on kubetcd1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:08:06] (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:51] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [04:12:59] (PuppetFailure) firing: (3) Puppet has failed on kubetcd1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:13:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:14:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:16:59] (PuppetFailure) firing: (5) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:20:59] (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:22:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:24:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: 
HTTP/1.1 200 OK - 8572 bytes in 1.858 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:38:33] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:41:59] (PuppetFailure) firing: (6) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:42:45] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:52:59] (PuppetFailure) firing: (4) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:54:49] (PS2) KartikMistry: Update MinT to 2023-10-31-044726-production [deployment-charts] - https://gerrit.wikimedia.org/r/968388 (https://phabricator.wikimedia.org/T333969) [04:58:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:59:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:01:59] (PuppetFailure) firing: (7) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:07:59] (PuppetFailure) firing: (5) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:09:11] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:21] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:25] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:37] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:59] (PuppetFailure) firing: (6) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:39:41] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:42:59] (PuppetFailure) firing: (7) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:43:49] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:53] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:57:59] (PuppetFailure) firing: (12) Puppet has failed 
on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:59:01] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0600) [06:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0600). [06:01:39] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:02:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [06:02:59] (PuppetFailure) firing: (14) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:05:49] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [06:07:59] (PuppetFailure) firing: (15) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:14:53] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [06:21:36] (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/970033 [06:22:42] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/970033 (owner: Marostegui) [06:23:23] (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/970033 (owner: Marostegui) [06:24:27] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:970033|ProductionServices.php: Promote pc2014 to pc1 master]] [06:25:56] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:970033|ProductionServices.php: Promote pc2014 to pc1 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:26:01] !log marostegui@deploy2002 marostegui: Continuing with sync [06:26:21] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969772 [06:27:37] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:29:59] (PuppetFailure) firing: Puppet has failed on ml-cache1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure 
[06:31:17] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:970033|ProductionServices.php: Promote pc2014 to pc1 master]] (duration: 06m 50s) [06:32:59] (PuppetFailure) firing: (16) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:33:29] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 34 hosts with reason: Primary switchover s4 T349820 [06:33:36] T349820: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T349820 [06:33:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 34 hosts with reason: Primary switchover s4 T349820 [06:33:59] (PuppetFailure) firing: Puppet has failed on mw1415:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:34:08] (PS1) Marostegui: pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/970202 [06:35:29] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set db2140 with weight 0 T349820', diff saved to https://phabricator.wikimedia.org/P53068 and previous config saved to /var/cache/conftool/dbconfig/20231031-063647-arnaudb.json [06:37:41] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [06:37:59] (PuppetFailure) firing: (17) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:37:59] (PuppetFailure) firing: (2) Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:40:20] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969772 (owner: Marostegui) [06:40:25] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [06:41:05] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969772 (owner: Marostegui) [06:42:06] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969772|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] [06:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:42:56] (CR) 
10Marostegui: [C: 03+2] pc2014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/970202 (owner: 10Marostegui) [06:42:59] (PuppetFailure) firing: (18) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:43:24] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969772|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:44:03] !log marostegui@deploy2002 marostegui: Continuing with sync [06:49:19] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969772|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] (duration: 07m 12s) [06:58:10] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 44.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:01:21] (03CR) 10Marostegui: [C: 03+1] mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/968968 (https://phabricator.wikimedia.org/T349820) (owner: 10Gerrit maintenance bot) [07:01:24] (03CR) 10Arnaudb: [C: 03+2] mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/968968 (https://phabricator.wikimedia.org/T349820) (owner: 10Gerrit maintenance bot) [07:02:47] !log Starting s4 codfw failover from db2179 to db2140 - T349820 [07:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:52] T349820: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T349820 [07:03:59] (PuppetFailure) firing: (2) Puppet has failed on mw1349:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:04:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T349820', diff saved to https://phabricator.wikimedia.org/P53070 and previous config saved to /var/cache/conftool/dbconfig/20231031-070405-arnaudb.json [07:05:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Promote db2140 to s4 primary and set section read-write T349820', diff saved to https://phabricator.wikimedia.org/P53071 and previous config saved to /var/cache/conftool/dbconfig/20231031-070549-arnaudb.json [07:07:35] (03CR) 10Marostegui: [C: 03+1] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/968969 (https://phabricator.wikimedia.org/T349820) (owner: 10Gerrit maintenance bot) [07:08:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:09:58] (03PS2) 10Arnaudb: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/968969 (https://phabricator.wikimedia.org/T349820) (owner: 10Gerrit maintenance bot) [07:12:09] (03CR) 10Arnaudb: [C: 03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/968969 (https://phabricator.wikimedia.org/T349820) (owner: 10Gerrit maintenance bot) [07:19:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 weight mimic old db2140', diff saved to https://phabricator.wikimedia.org/P53072 and previous config saved to /var/cache/conftool/dbconfig/20231031-071938-arnaudb.json [07:30:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 depooling from API and pooling in db2140', diff saved to https://phabricator.wikimedia.org/P53073 and previous config 
saved to /var/cache/conftool/dbconfig/20231031-073023-arnaudb.json [07:33:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 weight rebalancing', diff saved to https://phabricator.wikimedia.org/P53074 and previous config saved to /var/cache/conftool/dbconfig/20231031-073312-arnaudb.json [07:36:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 weight rebalancing - depooled', diff saved to https://phabricator.wikimedia.org/P53075 and previous config saved to /var/cache/conftool/dbconfig/20231031-073652-arnaudb.json [07:37:59] (PuppetFailure) firing: (19) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:38:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 15%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53076 and previous config saved to /var/cache/conftool/dbconfig/20231031-073822-arnaudb.json [07:39:52] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:42:59] (PuppetFailure) firing: (20) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:47:05] (03PS1) 10Giuseppe Lavagetto: Add weekly-update script [deployment-charts] - 10https://gerrit.wikimedia.org/r/970204 (https://phabricator.wikimedia.org/T344478) [07:50:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:50:50] PROBLEM - mailman archives on 
lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:50:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:51:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:51:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:51:59] (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:51:59] (PuppetFailure) firing: (9) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:53:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 30%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53077 and previous config saved to /var/cache/conftool/dbconfig/20231031-075327-arnaudb.json [08:00:05] Amir1, Urbanecm, and taavi: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:02:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:07:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:08:07] (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:08:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 45%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53078 and previous config saved to /var/cache/conftool/dbconfig/20231031-080832-arnaudb.json [08:11:03] 10SRE-OnFire, 10Observability-Metrics, 10Sustainability (Incident Followup), 10User-fgiunchedi: ThanosCompactHalted error on overlapping blocks - https://phabricator.wikimedia.org/T335406 (10fgiunchedi) 05Open→03Resolved We require a replica label now as per {T350002}, resolving [08:13:49] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [08:19:22] (03CR) 10Slyngshede: [C: 03+2] P:monitoring remove remainders of check_eth. 
[puppet] - 10https://gerrit.wikimedia.org/r/969721 (owner: 10Slyngshede) [08:21:17] (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:23:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 60%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53079 and previous config saved to /var/cache/conftool/dbconfig/20231031-082336-arnaudb.json [08:29:51] (03PS1) 10Majavah: P:pki: use wmf-ca-certificates [puppet] - 10https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111) [08:30:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:pki: use wmf-ca-certificates [puppet] - 10https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111) (owner: 10Majavah) [08:31:03] (03CR) 10Jbond: [C: 03+2] P:pki: use wmf-ca-certificates [puppet] - 10https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111) (owner: 10Majavah) [08:31:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/245/con" [puppet] - 10https://gerrit.wikimedia.org/r/970267 (https://phabricator.wikimedia.org/T350111) (owner: 10Majavah) [08:33:59] (PuppetFailure) firing: Puppet has failed on krb2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:34:34] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:26] RECOVERY - Check systemd state on pki2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:59] 
(PuppetFailure) firing: (21) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:38:42] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53080 and previous config saved to /var/cache/conftool/dbconfig/20231031-083841-arnaudb.json [08:40:42] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:19] (03CR) 10Ayounsi: "Thanks. I like the approach as it doesn't use "nerd knobs" nor adds much complexity in the policies." [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [08:53:01] (03CR) 10Brouberol: [C: 03+2] Enable the management of the skein certificate via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [08:53:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 90%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53081 and previous config saved to /var/cache/conftool/dbconfig/20231031-085346-arnaudb.json [08:56:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 config append', diff saved to https://phabricator.wikimedia.org/P53082 and previous config saved to /var/cache/conftool/dbconfig/20231031-085615-arnaudb.json [08:56:56] (03PS1) 10Hashar: puppet_compiler: always send CORS header even on 404 [puppet] - 10https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003) [08:57:11] (03CR) 
10CI reject: [V: 04-1] puppet_compiler: always send CORS header even on 404 [puppet] - 10https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003) (owner: 10Hashar) [08:57:17] (03CR) 10Brouberol: [C: 03+2] Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [08:57:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53083 and previous config saved to /var/cache/conftool/dbconfig/20231031-085740-arnaudb.json [08:57:52] (ProbeDown) firing: (80) Service pki1001:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:57:59] (PuppetFailure) firing: Puppet has failed on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:59:07] (03PS2) 10Hashar: puppet_compiler: always send CORS header even on 404 [puppet] - 10https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003) [09:00:52] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [09:00:59] (PuppetFailure) resolved: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:01:07] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [09:01:16] (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:01:59] (PuppetFailure) resolved: (9) Puppet has failed on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:02:59] (PuppetFailure) resolved: Puppet has failed on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:02:59] (PuppetFailure) firing: (21) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:03:59] (PuppetFailure) resolved: Puppet has failed on krb2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:03:59] (PuppetFailure) resolved: (2) Puppet has failed on mw1349:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:04:59] (PuppetFailure) resolved: (2) Puppet has failed on ml-cache1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:05:43] (03PS1) 10Arnaudb: mariadb: db1130 db1230 swap hosts [puppet] - 10https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) [09:05:59] (PuppetFailure) resolved: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:06:59] (PuppetFailure) resolved: (2) Puppet has failed on 
deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:07:59] (PuppetFailure) resolved: (2) Puppet has failed on restbase2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:07:59] (PuppetFailure) resolved: (21) Puppet has failed on kubemaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:08:15] (03CR) 10Volans: [C: 03+1] "LGTM, minor nit inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 (owner: 10Ayounsi) [09:10:15] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 (owner: 10Ayounsi) [09:12:00] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752 (owner: 10Ayounsi) [09:12:42] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:13:46] (03PS1) 10Stevemunene: switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/970272 (https://phabricator.wikimedia.org/T336042) [09:14:04] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:14:57] arnaudb: FYI ^^^ (diff is related to the host you're working on) [09:16:09] (03CR) 10Volans: [C: 03+1] "This change needs to be communicated to DCOps before deploying" [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/969692 (owner: 10Ayounsi) [09:16:15] (03CR) 10Volans: [C: 03+1] "This change needs to be communicated to DCOps before deploying" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 (owner: 10Ayounsi) [09:18:20] (03PS2) 10Elukey: services: update ChangeProp's eqiad Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) [09:18:50] (03CR) 10Elukey: "Updated the docker image to one with improved (debug) logging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [09:20:31] 10SRE, 10Wikimedia-Mailing-lists: New mailing list request for Project Korikath - https://phabricator.wikimedia.org/T349429 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Created, https://lists.wikimedia.org/postorius/lists/korikath.lists.wikimedia.org. I made it as a public mailing list, feel free to c... [09:22:38] (03PS1) 10Majavah: Fix cloud-public definitions [homer/public] - 10https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114) [09:23:33] (03CR) 10Ayounsi: [C: 03+1] Fix cloud-public definitions [homer/public] - 10https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114) (owner: 10Majavah) [09:23:54] (03CR) 10Majavah: [C: 03+2] Fix cloud-public definitions [homer/public] - 10https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114) (owner: 10Majavah) [09:24:35] (03Merged) 10jenkins-bot: Fix cloud-public definitions [homer/public] - 10https://gerrit.wikimedia.org/r/970274 (https://phabricator.wikimedia.org/T350114) (owner: 10Majavah) [09:29:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:32:51] (03PS1) 10Majavah: cr-cloud: Move allow-public below 
deny-to-private-subnets [homer/public] - 10https://gerrit.wikimedia.org/r/970275 [09:34:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set ', diff saved to https://phabricator.wikimedia.org/P53084 and previous config saved to /var/cache/conftool/dbconfig/20231031-093448-arnaudb.json [09:34:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53085 and previous config saved to /var/cache/conftool/dbconfig/20231031-093457-arnaudb.json [09:35:38] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:38:25] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) [09:38:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) [09:39:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'set db1230 as a depooled host', diff saved to https://phabricator.wikimedia.org/P53086 and previous config saved to /var/cache/conftool/dbconfig/20231031-093919-arnaudb.json [09:39:34] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:41:57] (03CR) 10Jbond: [C: 03+2] puppet_compiler: always send CORS header even on 404 [puppet] - 10https://gerrit.wikimedia.org/r/970268 (https://phabricator.wikimedia.org/T350003) (owner: 10Hashar) [09:45:50] (03PS1) 10Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) [09:46:06] (03PS2) 10Arnaudb: mariadb: db1130 db1230 swap hosts [puppet] - 
10https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) [09:47:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'set db1230 as a depooled host', diff saved to https://phabricator.wikimedia.org/P53087 and previous config saved to /var/cache/conftool/dbconfig/20231031-094737-arnaudb.json [09:50:32] (03PS7) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [09:50:34] (03PS4) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 [09:50:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:50:36] (03PS3) 10Ayounsi: provision_server: don't show servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752 [09:50:45] (03CR) 10Ayounsi: Ask for port # and type instead of interface name (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 (owner: 10Ayounsi) [09:50:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:50:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T343198)', diff saved to https://phabricator.wikimedia.org/P53088 and previous config saved to /var/cache/conftool/dbconfig/20231031-095054-arnaudb.json [09:50:59] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:52:14] (03PS6) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) [09:54:28] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) [09:57:53] (03CR) 10Marostegui: "Don't depool 
db1130 yet" [puppet] - 10https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:58:09] (03PS1) 10Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) [09:58:14] (03CR) 10DCausse: [C: 03+1] cirrus updater: Re-enable the .* route for mwapi [deployment-charts] - 10https://gerrit.wikimedia.org/r/969209 (owner: 10Ebernhardson) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1000) [10:00:21] (03PS3) 10Arnaudb: mariadb: db1130 db1230 swap hosts [puppet] - 10https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) [10:00:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) The last successful sign in eqiad was at 2023-10-30T21:19:14 and in codfw at 2023-10-30T23:04:02 [10:00:52] (03CR) 10Marostegui: [C: 03+1] mariadb: db1130 db1230 swap hosts [puppet] - 10https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:01:01] (03PS2) 10Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) [10:01:40] (03PS3) 10Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) [10:01:55] (03CR) 10Arnaudb: [C: 03+2] mariadb: db1130 db1230 swap hosts [puppet] - 10https://gerrit.wikimedia.org/r/969988 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:02:40] (03CR) 10Volans: "LGTM, but needs another coordinated change, one in this same repo, another one in the cookbooks" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 (owner: 10Ayounsi) [10:03:42] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) [10:04:27] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 (owner: 10Ayounsi) [10:06:55] (03CR) 10Vgutierrez: [C: 03+1] "looking good, just wondering if it's worth maintaining all the hiera apparatus introduced to be able to switch the ssl_client_certificate" [puppet] - 10https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [10:07:32] (03PS2) 10Giuseppe Lavagetto: Add weekly-update script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) [10:08:44] (03PS1) 10Slyngshede: Bump Bitu version to 0.0.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/970281 [10:10:11] (03CR) 10Giuseppe Lavagetto: Add weekly-update script (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [10:11:02] (03CR) 10Majavah: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [10:11:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) [10:12:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:13:21] (03CR) 10Majavah: [C: 03+1] "the config looks fine, let me know when the endpoints are live and this can be deployed" [puppet] - 10https://gerrit.wikimedia.org/r/967963 
(https://phabricator.wikimedia.org/T337390) (owner: 10Raymond Ndibe) [10:13:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:14:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:16:59] (PuppetFailure) firing: Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:17:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'set db1230 as a depooled host', diff saved to https://phabricator.wikimedia.org/P53089 and previous config saved to /var/cache/conftool/dbconfig/20231031-101750-arnaudb.json [10:18:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53090 and previous config saved to /var/cache/conftool/dbconfig/20231031-101829-arnaudb.json [10:19:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:19:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:19:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 5%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53091 and previous config saved to /var/cache/conftool/dbconfig/20231031-102259-arnaudb.json [10:23:34] (03CR) 10Ayounsi: [C: 03+1] Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [10:23:44] (03PS1) 10Aklapper: Correct Gerrit Privacy Policy [puppet] - 10https://gerrit.wikimedia.org/r/970283 (https://phabricator.wikimedia.org/T350124) [10:26:55] (03PS1) 10Arnaudb: mariadb: db1127 && db1227 notifications reenabling [puppet] - 10https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036) [10:27:36] (03CR) 10Volans: "approach looks good, couple of comments/questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [10:28:30] (03PS2) 10Arnaudb: mariadb: db1227 notifications reenabling, disabling on db1127 [puppet] - 10https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036) [10:28:45] (03CR) 10Hnowlan: [C: 03+1] "Looks reasonable compared to prod jobrunner config." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/968955 (https://phabricator.wikimedia.org/T349796) (owner: 10Giuseppe Lavagetto) [10:30:02] (03CR) 10Marostegui: [C: 03+1] mariadb: db1227 notifications reenabling, disabling on db1127 [puppet] - 10https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:30:12] (03CR) 10Arnaudb: [C: 03+2] mariadb: db1227 notifications reenabling, disabling on db1127 [puppet] - 10https://gerrit.wikimedia.org/r/969989 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:31:15] (03CR) 10Vgutierrez: [C: 03+1] "looking good for me in terms of NOOP for acme chief clients in our production environment." [puppet] - 10https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [10:31:18] (03CR) 10Ayounsi: "No pb, but maybe safer to do netbox-dev first." [puppet] - 10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff) [10:32:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) It seems apache reloads at 00:00 every night. i believe this is what caused the issue. the pki certificates were rotated to puppet7 at 17... 
[10:33:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53092 and previous config saved to /var/cache/conftool/dbconfig/20231031-103334-arnaudb.json [10:33:57] (03PS4) 10Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) [10:34:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [10:34:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) 05Open→03In progress p:05Triage→03Medium [10:36:35] (03CR) 10Majavah: [C: 03+1] Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [10:37:00] (03Abandoned) 10Giuseppe Lavagetto: Add weekly-update script [deployment-charts] - 10https://gerrit.wikimedia.org/r/970204 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [10:37:29] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1007.eqiad.wmnet with OS bookworm [10:38:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 10%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53093 and previous config saved to /var/cache/conftool/dbconfig/20231031-103804-arnaudb.json [10:38:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) [10:41:36] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add weekly-update script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 
(https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [10:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:42:56] (03PS1) 10Slyngshede: P:base enable ethtool data collection [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) [10:44:04] (03PS1) 10Aklapper: Correct IDP Privacy Policy [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/970330 (https://phabricator.wikimedia.org/T350129) [10:45:03] (03PS1) 10Brouberol: Generate an RSA2048-encrypted private key for Skein [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) [10:45:49] (03PS2) 10Brouberol: Generate an RSA2048-encrypted private key for Skein [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) [10:47:55] (03PS1) 10Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) [10:48:31] (03CR) 10Slyngshede: "It might be beneficial if you would take a look at the prometheus::ethtool_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [10:48:35] (03CR) 10Ayounsi: "FYI we don't need to enable it on VMs." 
[puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [10:48:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 20%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53094 and previous config saved to /var/cache/conftool/dbconfig/20231031-104839-arnaudb.json [10:48:51] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [10:49:10] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) [10:50:30] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage [10:50:49] (03PS3) 10Brouberol: Generate an RSA 4096-encrypted private key for Skein [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) [10:52:08] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/249/con" [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [10:52:30] (03CR) 10Brouberol: Generate an RSA 4096-encrypted private key for Skein [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [10:53:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 20%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53095 and previous config saved to /var/cache/conftool/dbconfig/20231031-105308-arnaudb.json [10:53:11] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage [11:00:12] (03CR) 10Cathal Mooney: [C: 03+2] Deny traffic from cloud pub ranges to WMF private IPs and tidy conf (031 comment) 
[homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [11:00:47] (03Merged) 10jenkins-bot: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf [homer/public] - 10https://gerrit.wikimedia.org/r/970279 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [11:03:37] (03PS2) 10Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) [11:03:39] (03CR) 10Volans: [C: 04-1] "I think we can avoid to hardcode them" [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [11:03:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 30%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53096 and previous config saved to /var/cache/conftool/dbconfig/20231031-110344-arnaudb.json [11:04:41] (03CR) 10Vgutierrez: "this is kinda co" [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:08:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 30%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53097 and previous config saved to /var/cache/conftool/dbconfig/20231031-110813-arnaudb.json [11:09:44] (03CR) 10Vgutierrez: Basic retry mechanism for specific kafka errors (032 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: 10Fabfur) [11:10:47] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:52] (ProbeDown) resolved: (40) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:29] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:13] (03PS1) 10Majavah: diffscan: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970335 [11:15:15] (03PS1) 10Majavah: P:diffscan: add support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) [11:15:23] (03PS1) 10Majavah: hieradata: lock down ssh and node-exporter on cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/970337 [11:15:39] (03CR) 10CI reject: [V: 04-1] diffscan: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970335 (owner: 10Majavah) [11:15:46] (03CR) 10CI reject: [V: 04-1] P:diffscan: add support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [11:16:52] (03PS1) 10Jbond: pki::multirootca: Add puppet_rsa to multirootca [puppet] - 10https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118) [11:16:58] (03PS3) 10Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) [11:17:06] (03CR) 10Fabfur: Basic retry mechanism for specific kafka errors (032 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: 10Fabfur) [11:18:05] (03PS2) 10Jbond: pki::multirootca: Add puppet_rsa to multirootca [puppet] - 10https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118) [11:18:24] (03PS2) 10Majavah: diffscan: add support for multiple 
instances [puppet] - 10https://gerrit.wikimedia.org/r/970335 [11:18:26] (03PS2) 10Majavah: P:diffscan: add support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) [11:18:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/253/con" [puppet] - 10https://gerrit.wikimedia.org/r/970337 (owner: 10Majavah) [11:18:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 40%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53098 and previous config saved to /var/cache/conftool/dbconfig/20231031-111849-arnaudb.json [11:18:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:03] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/255/con" [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [11:23:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 40%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53099 and previous config saved to /var/cache/conftool/dbconfig/20231031-112318-arnaudb.json [11:24:23] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1007.eqiad.wmnet with OS bookworm [11:24:48] (03PS3) 10Majavah: diffscan: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970335 [11:24:49] (03PS3) 10Majavah: P:diffscan: add 
support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) [11:25:09] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/256/con" [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [11:26:28] (03CR) 10Cathal Mooney: [C: 03+1] "Should be safe from what I understand nothing connects to these services apart from on the 10.x IP. Might there be connections from local" [puppet] - 10https://gerrit.wikimedia.org/r/970337 (owner: 10Majavah) [11:27:17] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: lock down ssh and node-exporter on cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/970337 (owner: 10Majavah) [11:27:42] (03CR) 10Kamila Součková: [C: 03+1] service_proxy: add rest-gateway to listeners [puppet] - 10https://gerrit.wikimedia.org/r/968617 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [11:28:12] (03PS2) 10Slyngshede: P:base enable ethtool data collection [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) [11:28:14] (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:30:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [11:31:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:40] (03PS1) 10Giuseppe Lavagetto: Review 
access change [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/970349 [11:31:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Review access change [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/970349 (owner: 10Giuseppe Lavagetto) [11:31:59] (PuppetFailure) firing: (2) Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:32:47] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/259/console" [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [11:32:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/258/con" [puppet] - 10https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond) [11:33:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::multirootca: Add puppet_rsa to multirootca [puppet] - 10https://gerrit.wikimedia.org/r/970338 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond) [11:33:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53101 and previous config saved to /var/cache/conftool/dbconfig/20231031-113353-arnaudb.json [11:36:33] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/260/console" [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [11:36:44] (03PS4) 10Majavah: P:diffscan: add support for configuring multiple instances [puppet] - 
10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) [11:38:05] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/261/console" [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [11:38:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/262/console" [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [11:38:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53102 and previous config saved to /var/cache/conftool/dbconfig/20231031-113823-arnaudb.json [11:40:10] (ProbeDown) firing: (19) Service pki2002:443 has failed probes (http_PKI_cassandra_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:41:19] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/263/con" [puppet] - 10https://gerrit.wikimedia.org/r/968617 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [11:45:10] (ProbeDown) firing: (40) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:48:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 60%: db1230 host warmup', diff saved to 
https://phabricator.wikimedia.org/P53103 and previous config saved to /var/cache/conftool/dbconfig/20231031-114858-arnaudb.json [11:50:10] (ProbeDown) firing: (42) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:51:27] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 60%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53104 and previous config saved to /var/cache/conftool/dbconfig/20231031-115328-arnaudb.json [11:55:38] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:58:34] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.4 - https://phabricator.wikimedia.org/T316421 (10LSobanski) [11:58:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:58:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:59:16] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.4 - https://phabricator.wikimedia.org/T316421 (10LSobanski) I updated the description to reflect the new Etherpad release (1.9.4). See below for a list of changes: * Compatibility changes ** Log4js has been updated t... [11:59:51] (03PS1) 10Jbond: pki::multirootca: Add parameter so pki can generate its certs [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1200) [12:00:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 7.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:32] (03PS1) 10Jbond: pki: move pki1001 back to puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/970340 [12:01:43] (03CR) 10Jbond: [C: 03+2] pki: move pki1001 back to puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/970340 (owner: 10Jbond) [12:04:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 70%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53105 and previous config saved to /var/cache/conftool/dbconfig/20231031-120403-arnaudb.json [12:05:49] (03PS1) 10Cathal Mooney: Do not NAT traffic from cloud VPS to cloud-private, and filter ports [puppet] - 10https://gerrit.wikimedia.org/r/970341 (https://phabricator.wikimedia.org/T350132) [12:06:26] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:00] (PuppetFailure) firing: (2) Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:07:06] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:08:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 70%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53106 and previous config saved to /var/cache/conftool/dbconfig/20231031-120833-arnaudb.json [12:09:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[12:10:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:54] (03PS2) 10Jbond: pki::multirootca: Add parameter so pki can generate its certs [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) [12:15:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/265/con" [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond) [12:15:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [12:15:54] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:16:36] (03CR) 10Gmodena: [C: 03+1] "Ack. Thanks for the heads up." [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [12:17:05] (03CR) 10Hnowlan: [C: 03+1] "Overall a lot neater, nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [12:17:33] (03PS3) 10Jbond: pki::multirootca: Add parameter so pki can generate its certs [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) [12:18:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.342 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond) [12:19:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1230 (re)pooling @ 80%: db1230 host warmup', diff saved to https://phabricator.wikimedia.org/P53107 and previous config saved to /var/cache/conftool/dbconfig/20231031-121908-arnaudb.json [12:20:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::multirootca: Add parameter so pki can generate its certs [puppet] - 10https://gerrit.wikimedia.org/r/970339 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond) [12:21:52] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:35] (03CR) 10Hnowlan: [C: 03+1] services: update ChangeProp's eqiad Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [12:23:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 80%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53108 and previous 
config saved to /var/cache/conftool/dbconfig/20231031-122338-arnaudb.json [12:24:53] (03PS2) 10Cathal Mooney: Do not NAT traffic from cloud VPS to cloud-private, and filter ports [puppet] - 10https://gerrit.wikimedia.org/r/970341 (https://phabricator.wikimedia.org/T350132) [12:25:10] (ProbeDown) resolved: (42) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:45] (03CR) 10Brouberol: [C: 03+2] Generate an RSA 4096-encrypted private key for Skein [puppet] - 10https://gerrit.wikimedia.org/r/970331 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [12:25:48] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: dh1227 host warmup', diff saved to https://phabricator.wikimedia.org/P53113 and previous config saved to /var/cache/conftool/dbconfig/20231031-125348-arnaudb.json [12:55:40] (ProbeDown) resolved: (24) Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:56:10] (03PS1) 10Jbond: cfssl::ocsp: use client mtls certs if present [puppet] - 10https://gerrit.wikimedia.org/r/970369 (https://phabricator.wikimedia.org/T350118) [12:56:13] (03CR) 10Elukey: changeprop: allow to specify consumer/producer kafka settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) 
[12:57:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/268/con" [puppet] - 10https://gerrit.wikimedia.org/r/970369 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond) [12:58:50] (03PS2) 10Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) [12:59:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::ocsp: use client mtls certs if present [puppet] - 10https://gerrit.wikimedia.org/r/970369 (https://phabricator.wikimedia.org/T350118) (owner: 10Jbond) [12:59:55] (03CR) 10CI reject: [V: 04-1] changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [13:04:34] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:30] (03CR) 10Paladox: [C: 03+1] Correct Gerrit Privacy Policy [puppet] - 10https://gerrit.wikimedia.org/r/970283 (https://phabricator.wikimedia.org/T350124) (owner: 10Aklapper) [13:05:52] o/ is the afternoon backport window happening? [13:06:08] (or did i get lost in time change whirlpool?) [13:06:45] jouncebot: nowandnext [13:06:46] For the next 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1300) [13:06:46] In 1 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1500) [13:06:59] (PuppetFailure) resolved: Puppet has failed on pki2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:07:15] hah. 
thanks :D [13:08:12] RoanKattouw, Lucas_WMDE, urbanecm, awight, TheresNoTime, taavi: It's time to deploy and jouncebot broke [13:08:20] ihurbain: there [13:08:27] RhinosF1: thank you kindly! :) [13:08:39] (03PS1) 10Ottomata: eventgate chart - debug mode: add some perf settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970371 (https://phabricator.wikimedia.org/T347477) [13:08:59] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) a:03jcrespo [13:09:14] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:54] (03CR) 10Ottomata: [C: 03+2] eventgate chart - debug mode: add some perf settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970371 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [13:10:51] (03Merged) 10jenkins-bot: eventgate chart - debug mode: add some perf settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970371 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [13:11:05] (03PS3) 10Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) [13:12:30] (03CR) 10Filippo Giunchedi: [C: 03+1] P:base enable ethtool data collection [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [13:13:25] (03PS1) 10Ottomata: eventgate chart - fix missing comma [deployment-charts] - 10https://gerrit.wikimedia.org/r/970372 (https://phabricator.wikimedia.org/T347477) [13:13:54] (03CR) 10Ottomata: [C: 03+2] eventgate chart - fix missing comma [deployment-charts] - 10https://gerrit.wikimedia.org/r/970372 (https://phabricator.wikimedia.org/T347477) 
(owner: 10Ottomata) [13:14:50] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:50] (03PS4) 10Elukey: changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) [13:15:05] (03Merged) 10jenkins-bot: eventgate chart - fix missing comma [deployment-charts] - 10https://gerrit.wikimedia.org/r/970372 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [13:15:11] ihurbain: I can deploy :) [13:15:21] woot! [13:15:25] i'm around & ready :) [13:15:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:16:33] (03Merged) 10jenkins-bot: Roll-out Parsoid Kartographer support for all English language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:16:39] ty TheresNoTime [13:16:57] !log samtar@deploy2002 Started scap: Backport for [[gerrit:969168|Roll-out Parsoid Kartographer support for all English language wikis (T342871)]] [13:17:02] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:17:39] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [13:18:19] !log samtar@deploy2002 ihurbain and samtar: Backport for [[gerrit:969168|Roll-out Parsoid Kartographer support for all English language wikis (T342871)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:18:34] ihurbain: live on mwdebug, can you test? 
:) [13:18:38] testing [13:19:02] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:29] (03CR) 10Elukey: changeprop: allow to specify consumer/producer kafka settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [13:22:16] 10SRE, 10Wikimedia-Mailing-lists: New mailing list request for Project Korikath - https://phabricator.wikimedia.org/T349429 (10Mrb_Rafi) Thanks a lot for the support, @Ladsgroup [13:22:19] TheresNoTime: we happy, ship it! [13:22:24] !log samtar@deploy2002 ihurbain and samtar: Continuing with sync [13:22:27] (03CR) 10Bking: [C: 03+2] Update flink-session-cluster to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969343 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:22:29] :D [13:22:40] TheresNoTime: thank you very much :) [13:23:36] you're very welcome :) it'll take a few minutes to be live, I'll ping you again just to double-check its still working okay [13:23:46] ack :) [13:27:15] (03PS1) 10Ottomata: eventgate chart - remove --prof-process flag from debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/970374 (https://phabricator.wikimedia.org/T347477) [13:27:42] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) We should discuss this a bit- as this changes not only the initial hypothesis, but also the restrictions of your proje... 
[13:27:47] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:969168|Roll-out Parsoid Kartographer support for all English language wikis (T342871)]] (duration: 10m 49s) [13:27:47] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [13:27:51] ihurbain: live on prod :) [13:27:52] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:28:02] (03CR) 10Ottomata: [C: 03+2] eventgate chart - remove --prof-process flag from debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/970374 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [13:28:04] shiny! :) [13:29:34] it still seems to be working okay. [13:29:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:29:55] (03Merged) 10jenkins-bot: eventgate chart - remove --prof-process flag from debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/970374 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [13:30:29] \o/ [13:30:50] !log close UTC afternoon backport window [13:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:27] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10herron) [13:35:59] (03PS1) 10Jbond: etcd::client::globalconfig: switch to wmf-ca-certificate [puppet] - 10https://gerrit.wikimedia.org/r/970377 (https://phabricator.wikimedia.org/T350147) [13:36:22] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [13:36:35] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [13:38:40] (03PS1) 10Ayounsi: Add MoveServersUplinks Netbox script [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) [13:40:02] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/970272 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [13:41:20] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:30] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [13:51:18] (03CR) 10Jbond: [C: 03+2] etcd::client::globalconfig: switch to wmf-ca-certificate [puppet] - 10https://gerrit.wikimedia.org/r/970377 (https://phabricator.wikimedia.org/T350147) (owner: 10Jbond) [13:52:18] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:58:16] Hey folks - are backport window deploys complete? There’s a quick sec patch update I’d like to get out now, if possible... 
[13:58:43] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [13:58:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:59:01] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [14:01:52] (03PS1) 10Arnaudb: mariadb: db1131 decomission [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) [14:02:59] (03CR) 10Arnaudb: "this is supposed to be steps 1 to 5 of https://phabricator.wikimedia.org/T350141" [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb) [14:05:24] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10herron) [14:06:50] !log Deployed updated security mitigation for T348828 [14:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:52] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:07] (03CR) 10Marostegui: "We approach this in a different way, keep in mind that the template is quite generic and might not fit our needs." 
[puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb) [14:10:12] (03PS9) 10Herron: prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) [14:11:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Volans) @cmooney thanks for the summary, couple of questions: 1) will the migration be performed rack by rack as opposed to s... [14:13:05] !log install4002:/etc/dhcp/automation/ttyS1-115200 rm cp4052.conf [14:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:23] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [14:19:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) @Volans to get the prefix ge vs xe maybe use the rack. In codfw we have only 10g servers racked in 10g rack and th... [14:20:17] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10herron) [14:21:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10ayounsi) > will the migration be performed rack by rack as opposed to server by server? yep > For multi-unit servers we pick... 
[14:23:45] (Primary inbound port utilisation over 80% #page) firing: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [14:23:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) yes we always pick the lower numbering unit for 2U host. [14:23:54] here [14:24:10] mgmt so should not be a huge issue, acking [14:24:20] here also [14:24:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:25:08] I know topranks was working on something related to cloud [14:25:20] maybe there was a spike of traffic or a reboot or something? [14:25:23] taavi, andrewbogott, anything going on in WMCS? 
[14:25:32] Making breakfast, but here if needed [14:25:46] the F4-D5 link is saturating: https://librenms.wikimedia.org/device/device=242/tab=port/port=25230/ [14:26:13] hm, andrew is doing something with Ceph which might explain it [14:26:17] XioNoX: I'm rebalancing a couple of ceph nodes but nothing that we haven't done 100 times before [14:26:32] yeah probably related [14:26:59] jynus: mgmt in the alert means that the switch monitoring data is polled via the management network, not that the alert is about management network traffic [14:27:08] I get it now [14:27:10] I'm also not 100% sure how to stop it or throttle it (and I'm in another meeting) do I need to drop everything and look at this? [14:27:25] andrewbogott: if cloud is happy we are happy [14:27:36] ok! I think we're still good. thanks [14:27:40] * andrewbogott back to meeting [14:27:46] andrewbogott: it's up to you, there is congestion on one of the links, if nothing else alerts that's probably fine to wait [14:28:45] (Primary inbound port utilisation over 80% #page) resolved: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [14:29:29] ok, then as we know the most probable root cause for that, I will not give it more thought [14:29:56] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [14:30:40] ^what's the right way to go about that [14:31:05] we just leave it there, right? 
[14:31:15] the ticket, I mean [14:35:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T343198)', diff saved to https://phabricator.wikimedia.org/P53116 and previous config saved to /var/cache/conftool/dbconfig/20231031-143545-arnaudb.json [14:35:58] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [14:36:14] jynus: I'll take a look at the ticket about spine switch discards [14:36:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:36:36] 10SRE-tools, 10Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10ayounsi) [14:36:44] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10cmooney) a:05Jclark-ctr→03cmooney [14:36:58] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) [14:37:04] 10SRE-tools, 10Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10ayounsi) [14:37:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:19] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10Jclark-ctr) Configured idrac manually and verified connection on switch [14:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:52] 10ops-eqiad: Inbound interface errors - 
https://phabricator.wikimedia.org/T342502 (10cmooney) a:05cmooney→03Jclark-ctr [14:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:42:39] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10cmooney) Yeah I think we still need to look at this, further errors on the link today. Seems somewhat related to throughput, but we are miles away from capacity (peaks under 2Gb/sec). I'd say worth trying an optic swap on one... [14:44:48] (03CR) 10Brouberol: [C: 03+1] "LGTM! I agree with @Xcollazo's remark." [puppet] - 10https://gerrit.wikimedia.org/r/970272 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [14:45:26] (03PS1) 10Giuseppe Lavagetto: docker::builder: add system to properly perform a weekly update [puppet] - 10https://gerrit.wikimedia.org/r/970391 (https://phabricator.wikimedia.org/T344478) [14:45:28] (03PS1) 10Giuseppe Lavagetto: docker::builder: switch systemd timer to our new script [puppet] - 10https://gerrit.wikimedia.org/r/970392 (https://phabricator.wikimedia.org/T344478) [14:45:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:46] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:02] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:47:20] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:43] (03PS1) 10Giuseppe 
Lavagetto: Add fake ssh private key for docker::builder [labs/private] - 10https://gerrit.wikimedia.org/r/970393 [14:47:46] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:48:06] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add fake ssh private key for docker::builder [labs/private] - 10https://gerrit.wikimedia.org/r/970393 (owner: 10Giuseppe Lavagetto) [14:48:10] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:49:50] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:50:10] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:50:36] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:50:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53117 and previous config saved to /var/cache/conftool/dbconfig/20231031-145052-arnaudb.json [14:53:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:48] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:55:12] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:55:28] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:55:46] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:56:07] (03PS1) 10Giuseppe Lavagetto: docker::builder: strings must be strings in yaml [labs/private] - 10https://gerrit.wikimedia.org/r/970395 [14:56:20] (03PS4) 10Majavah: diffscan: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970335 [14:56:24] (03PS5) 10Majavah: P:diffscan: add support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) [14:56:28] (03PS1) 10Majavah: P:diffscan: add scan for WMCS infrastructure addresses [puppet] - 10https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) [14:56:49] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] docker::builder: strings must be strings in yaml [labs/private] - 10https://gerrit.wikimedia.org/r/970395 (owner: 10Giuseppe Lavagetto) [14:57:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [14:57:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/278/con" [puppet] - 10https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [14:57:22] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) >>! In T327938#9234691, @Volans wrote: > @cmooney adding a note here to not forget. We'll need to check how it will work for Ganeti VMs, in particular the makev... 
[14:58:14] (03Abandoned) 10Jdlrobson: [Visual change] Normalize small font sizes in Vector 2022 [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968314 (https://phabricator.wikimedia.org/T346062) (owner: 10Jdlrobson) [14:58:18] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/279/con" [puppet] - 10https://gerrit.wikimedia.org/r/970391 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [14:59:02] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:59:24] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:59:40] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:59:58] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:03:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:28] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [15:04:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [15:05:34] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:05:47] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:05:49] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:05:52] !log otto@deploy2002 helmfile 
[staging] DONE helmfile.d/services/eventgate-analytics: apply [15:05:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53118 and previous config saved to /var/cache/conftool/dbconfig/20231031-150558-arnaudb.json [15:08:19] (03CR) 10Brouberol: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/970378 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:11:13] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:11:16] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:15:03] 10SRE-tools, 10Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10ayounsi) There will be special usecase, but if we can tackle all the regular servers (eg. 1 uplink, 1 IP, 1 , then we will be in a great spot. The ideal/cleanest is to go through a re... 
[15:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:33] (03PS1) 10Jforrester: wikifunctions: Bump orchestrator to image 2023-10-31-024528 [deployment-charts] - 10https://gerrit.wikimedia.org/r/970398 (https://phabricator.wikimedia.org/T350034) [15:20:55] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump orchestrator to image 2023-10-31-024528 [deployment-charts] - 10https://gerrit.wikimedia.org/r/970398 (https://phabricator.wikimedia.org/T350034) (owner: 10Jforrester) [15:21:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T343198)', diff saved to https://phabricator.wikimedia.org/P53119 and previous config saved to /var/cache/conftool/dbconfig/20231031-152105-arnaudb.json [15:21:14] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:21:46] (03Merged) 10jenkins-bot: wikifunctions: Bump orchestrator to image 2023-10-31-024528 [deployment-charts] - 10https://gerrit.wikimedia.org/r/970398 (https://phabricator.wikimedia.org/T350034) (owner: 10Jforrester) [15:22:13] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:22:53] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:23:53] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:23:58] (03CR) 10BCornwall: [C: 03+1] "Nice job!" 
[puppet] - 10https://gerrit.wikimedia.org/r/967950 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:24:02] (03CR) 10Hnowlan: [C: 03+1] changeprop: allow to specify consumer/producer kafka settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [15:24:18] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [15:25:02] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:25:07] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:25:19] (03CR) 10Elukey: [C: 03+2] changeprop: allow to specify consumer/producer kafka settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/970276 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [15:26:16] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:28:35] (03PS2) 10Arnaudb: mariadb: db1131 decomission [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) [15:28:57] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [15:29:12] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [15:29:32] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:27] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1131.eqiad.wmnet [15:35:13] (03CR) 10Marostegui: [C: 03+1] mariadb: db1131 decomission [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb) [15:37:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970378 (https://phabricator.wikimedia.org/T329398) (owner: 
10Brouberol) [15:38:32] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:41:01] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1131.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:42:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1131.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:42:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:25] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db1131.eqiad.wmnet [15:43:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [15:44:16] (03CR) 10Brouberol: [V: 03+1 C: 03+2] Convert the Skein private key to the PKCS#8 format [puppet] - 10https://gerrit.wikimedia.org/r/970378 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [15:46:54] (03PS1) 10Brouberol: Fix typo in unless condition [puppet] - 10https://gerrit.wikimedia.org/r/970401 [15:48:27] (03CR) 10Arnaudb: [C: 03+2] mariadb: db1131 decomission [puppet] - 10https://gerrit.wikimedia.org/r/969992 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb) [15:48:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:38] (03CR) 10Vgutierrez: Basic retry mechanism for specific kafka errors (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: 10Fabfur) [15:49:51] (03CR) 10Brouberol: [C: 03+2] Fix typo in unless condition [puppet] - 10https://gerrit.wikimedia.org/r/970401 (owner: 10Brouberol) 
[15:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:51:57] (03CR) 10Cathal Mooney: "LGTM overall, a few comments on the approach but good to go. The one on the interface naming I think we do need to tackle, not 100% sure " [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [15:52:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'discard db1131', diff saved to https://phabricator.wikimedia.org/P53120 and previous config saved to /var/cache/conftool/dbconfig/20231031-155253-arnaudb.json [15:53:15] (03PS1) 10Filippo Giunchedi: team-sre: ignore systemd_unit_.+_owner stale textfile [alerts] - 10https://gerrit.wikimedia.org/r/970402 (https://phabricator.wikimedia.org/T349176) [15:54:30] (03PS1) 10Brouberol: Fix puppet error by providing the openssl absolute path [puppet] - 10https://gerrit.wikimedia.org/r/970403 [15:55:15] (03CR) 10Brouberol: "Sorry about the quifix PR. This slipped through PCC." [puppet] - 10https://gerrit.wikimedia.org/r/970403 (owner: 10Brouberol) [15:57:04] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ABran-WMF) db1131 is ready to be handled (T350141) [15:58:48] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1131.eqiad.wmnet - https://phabricator.wikimedia.org/T350141 (10ABran-WMF) [16:00:05] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:02] (03PS7) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) [16:04:06] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [16:06:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [16:07:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Volans) >>! In T348129#9295072, @ayounsi wrote: >> this way there is no check to ensure that reality corresponds to what we do... [16:08:46] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [16:10:55] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1002 - taavi@cumin1001" [16:11:44] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1002 - taavi@cumin1001" [16:11:44] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:12:30] !log taavi@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED [16:15:22] !log taavi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED [16:15:42] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [16:15:56] 10SRE, 10ops-eqiad, 10Cloud-VPS, 
10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [16:18:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:42] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye [16:20:49] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye [16:22:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [16:23:11] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [16:23:26] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [16:25:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [16:27:28] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10ayounsi) > What I mean is that this way it might be harder to catch mistakes, if a host has been plugged into a different port... 
[16:27:28] !log taavi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [16:27:41] SRE, ops-eqiad, Cloud-VPS, Data-Platform-SRE, cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with... [16:28:35] (CR) Fabfur: Basic retry mechanism for specific kafka errors (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur) [16:28:51] (PS4) Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) [16:29:53] (PS1) Jbond: pki: switch to cfssl certificate [puppet] - https://gerrit.wikimedia.org/r/970407 (https://phabricator.wikimedia.org/T349619) [16:30:51] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1104.eqiad.wmnet with OS bullseye [16:30:51] (CR) Jbond: [C: +2] pki: switch to cfssl certificate [puppet] - https://gerrit.wikimedia.org/r/970407 (https://phabricator.wikimedia.org/T349619) (owner: Jbond) [16:30:57] SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye executed with errors: - cp1104 (**FAIL**) - Downtimed on Icinga/...
[16:31:11] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:25] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:43] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:57] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye [16:32:01] (03PS2) 10Brouberol: Fix puppet error by providing the openssl absolute path [puppet] - 10https://gerrit.wikimedia.org/r/970403 (https://phabricator.wikimedia.org/T329398) [16:32:02] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye [16:32:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:10] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/970403 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [16:33:23] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:55] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:43] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:46] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [16:35:00] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [16:35:24] (03CR) 10Brouberol: [C: 03+2] Fix puppet error by providing the openssl absolute path [puppet] - 10https://gerrit.wikimedia.org/r/970403 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [16:40:02] (03PS3) 10Elukey: services: update ChangeProp's eqiad Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) [16:41:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [16:43:14] (03CR) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [16:43:26] (03PS1) 10Brouberol: Hide skein private key diff in puppet logs [puppet] - 10https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398) [16:43:38] (03PS10) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [16:44:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1103.eqiad.wmnet with OS bullseye [16:45:17] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/281/con" [puppet] - 10https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [16:45:19] (03PS1) 10Jbond: pki1001: move back to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/970409 [16:46:09] (03CR) 10Jbond: [C: 03+2] pki1001: move back to 
puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/970409 (owner: 10Jbond) [16:46:26] (03CR) 10Elukey: [C: 03+2] services: update ChangeProp's eqiad Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [16:48:40] (03CR) 10CI reject: [V: 04-1] sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [16:49:20] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [16:49:57] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [16:50:14] (03PS1) 10Stevemunene: Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - 10https://gerrit.wikimedia.org/r/970360 [16:50:36] (03CR) 10CI reject: [V: 04-1] Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - 10https://gerrit.wikimedia.org/r/970360 (owner: 10Stevemunene) [16:51:41] (03PS2) 10Stevemunene: Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - 10https://gerrit.wikimedia.org/r/970360 [16:52:29] (03PS1) 10Ssingh: Release dnsdist 1.8.2-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/970413 [16:52:35] (03CR) 10Ryan Kemper: [C: 03+1] Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - 10https://gerrit.wikimedia.org/r/970360 (owner: 10Stevemunene) [16:56:28] jouncebot: nowandnext [16:56:28] For the next 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1600) [16:56:29] In 0 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1700) [16:56:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/970412 (https://phabricator.wikimedia.org/T347435) (owner: 10Samtar) [16:57:14] (03CR) 10Stevemunene: [C: 03+2] Revert "Revert "airflow-wmde: Create scap deployment source for wmde"" [puppet] - 10https://gerrit.wikimedia.org/r/970360 (owner: 10Stevemunene) [16:57:24] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970412 (https://phabricator.wikimedia.org/T347435) (owner: 10Samtar) [16:58:45] (03CR) 10Ebernhardson: [C: 03+1] "nothing seems obviously wrong, although I do wonder about the deployment process. I haven't verified if any of the names/paths here (swift" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [16:59:38] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1700) [17:00:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:01:44] that is going to be resolved soon ^ [17:04:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:05:02] (03CR) 10Vgutierrez: "looking good" [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: 10Fabfur) [17:08:01] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:47] (Device rebooted) firing: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:12:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/970402 (https://phabricator.wikimedia.org/T349176) (owner: 10Filippo Giunchedi) [17:14:47] (Device rebooted) resolved: Device ps1-a2-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:16:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [17:16:18] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (10jbond) 05In progress→03Resolved a:03jbond This is fixed now [17:17:53] (03PS1) 10Ryan Kemper: Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde""" [puppet] - 10https://gerrit.wikimedia.org/r/970361 [17:19:09] (03CR) 10Stevemunene: [C: 03+1] Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde""" [puppet] - 10https://gerrit.wikimedia.org/r/970361 (owner: 10Ryan 
Kemper) [17:19:33] (03CR) 10Ryan Kemper: [C: 03+2] Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde""" [puppet] - 10https://gerrit.wikimedia.org/r/970361 (owner: 10Ryan Kemper) [17:19:48] (03PS11) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [17:27:58] !log krinkle@deploy2002:/srv/mediawiki/private: fix untracked warning for readme.FatalErrorSettings.php [17:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:09] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:42:22] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [17:42:24] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [17:42:40] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:43:14] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [17:43:33] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [17:48:30] (03PS2) 10Ayounsi: Add MoveServersUplinks Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) [17:48:53] (03CR) 10Ayounsi: Add MoveServersUplinks Netbox script (034 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [17:49:01] (03CR) 10Ssingh: [C: 03+2] Release dnsdist 1.8.2-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/970413 (owner: 10Ssingh) [17:51:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:47] !log taavi@cumin1001 
START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1002 [17:51:49] !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt-wdqs1002 [17:52:10] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1104.eqiad.wmnet with OS bullseye [17:52:15] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye executed with errors: - cp1104 (**FAIL**) - Removed from Puppet... [17:55:13] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:20] !log taavi@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED [17:59:24] !log taavi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt-wdqs1002.mgmt.eqiad.wmnet with reboot policy FORCED [18:00:04] dduvall and dancy: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T1800). [18:00:14] o/ [18:04:20] !log reprepro -C component/dnsdist include bookworm-wikimedia dnsdist_1.8.2-1+wmf12u1_amd64.changes [18:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10ayounsi) It's live on netbox-next: https://netbox-next.wikimedia.org/extras/scripts/move_server.MoveServersUplinks/ See that... 
[18:05:48] (03PS3) 10Ayounsi: Add MoveServersUplinks Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) [18:09:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:24] (03PS1) 10Kamila Součková: Initial commit of kube-state-metrics chart from prometheus-community [deployment-charts] - 10https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625) [18:13:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:50] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970426 (https://phabricator.wikimedia.org/T348356) [18:16:52] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970426 (https://phabricator.wikimedia.org/T348356) (owner: 10TrainBranchBot) [18:17:27] dancy: o/ [18:17:48] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970426 (https://phabricator.wikimedia.org/T348356) (owner: 10TrainBranchBot) [18:22:43] (SystemdUnitFailed) firing: wmf_auto_restart_mjolnir-kafka-msearch-daemon@0.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:22:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye [18:22:54] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started 
by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye [18:23:44] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [18:24:05] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.3 refs T348356 [18:24:10] T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356 [18:25:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [18:27:01] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:08] imagines a world where monitoring knows that right now is deployment window and therefore does not check but keeps checking once it's over [18:32:57] Dzahn! Welcome back. [18:33:02] mutante: you're back! [18:33:09] thanks dancy and RhinosF1 :) [18:33:12] I have about 3 things to tell you [18:33:22] mutante: when is too soon to annoy you [18:33:25] i just made a ticket in upstream phorge :) [18:33:36] RhinosF1: ping me, it's ok [18:33:39] i mean.. 
PM [18:37:57] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage [18:39:08] 10SRE, 10Acme-chief, 10Traffic, 10Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10CodeReviewBot) brett merged https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3 Update dependencies to match Bookworm versions [18:39:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10CodeReviewBot) brett merged https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3 Update dependencies to match Bookworm versions [18:40:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage [18:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:51:44] (03PS1) 10FNegri: Add component/prometheus-openstack-exporter to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/970430 (https://phabricator.wikimedia.org/T350154) [18:55:50] (03CR) 10Majavah: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/970430 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [18:57:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [18:59:45] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1104.eqiad.wmnet with OS bullseye [18:59:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye completed: - cp1104 (**PASS**) - Remo... [19:00:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [19:01:09] (03CR) 10FNegri: [C: 03+2] Add component/prometheus-openstack-exporter to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/970430 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [19:01:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [19:01:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye [19:02:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [19:04:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:26] hello [19:10:08] oops sorry im an hour early [19:12:07] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1105.eqiad.wmnet with OS bullseye [19:12:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL*... [19:12:26] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [19:12:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye [19:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:04] (03CR) 10Gehel: [C: 04-1] "Looks mostly good, but 2 minor comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [19:49:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:55:49] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [19:56:02] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [19:57:14] (03PS1) 10Ryan Kemper: cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075) [19:57:54] (03PS2) 10Ryan Kemper: cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075) [19:59:01] (03CR) 10Bking: [C: 03+1] cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075) (owner: 10Ryan Kemper) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231031T2000). [20:00:04] kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:01] * TheresNoTime can deploy [20:01:11] hello [20:01:12] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [20:02:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544) (owner: 10Kimberly Sarabia) [20:02:13] (03CR) 10Cwhite: [C: 04-1] "The 1.0.0-2 template file will be removed from the host and a 1.0.0-3 will be added correctly. However, the logstash output is configured" [puppet] - 10https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [20:02:25] (03CR) 10Ryan Kemper: [C: 03+2] cirrus-streaming-updater: bump vers (NPE fix) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970435 (https://phabricator.wikimedia.org/T347075) (owner: 10Ryan Kemper) [20:02:43] (03Merged) 10jenkins-bot: Deploy vector 2022 to non-English Wikibooks, etc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544) (owner: 10Kimberly Sarabia) [20:03:08] !log samtar@deploy2002 Started scap: Backport for [[gerrit:969971|Deploy vector 2022 to non-English Wikibooks, etc (T349544)]] [20:03:14] T349544: Deployment of Vector 2022 to non-English Wikibooks, Wikinews, Wikiquotes, Wikiversity, and metawiki - https://phabricator.wikimedia.org/T349544 [20:04:29] !log samtar@deploy2002 samtar and ksarabia: Backport for [[gerrit:969971|Deploy vector 2022 to non-English Wikibooks, etc (T349544)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:04:43] 
kimberly_sarabia: live on mwdebug, is this something you can test? [20:05:30] Yes, one moment [20:05:35] (03CR) 10Cwhite: "This change LGTM! Please deploy I4c9c280a142aa07983bfed65158ff6c4a2aeb1e4 before this one to pre-provision the field." [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [20:05:39] !log ryankemper@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:05:55] !log ryankemper@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:08:35] TheresNoTime: LGTM [20:08:41] !log samtar@deploy2002 samtar and ksarabia: Continuing with sync [20:08:47] ack :) [20:11:36] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:43] (SystemdUnitFailed) resolved: wmf_auto_restart_mjolnir-kafka-msearch-daemon@0.service Failed on search-loader2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:13:59] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:969971|Deploy vector 2022 to non-English Wikibooks, etc (T349544)]] (duration: 10m 51s) [20:14:04] T349544: Deployment of Vector 2022 to non-English Wikibooks, Wikinews, Wikiquotes, Wikiversity, and metawiki - https://phabricator.wikimedia.org/T349544 [20:14:13] kimberly_sarabia: live on prod :) [20:14:18] Thanks! 
[20:16:26] !log close UTC late backport window [20:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:42] (CR) Cwhite: [C: +1] Switch arclamp to nftables [puppet] - https://gerrit.wikimedia.org/r/969328 (owner: Muehlenhoff) [20:32:37] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1105.eqiad.wmnet with OS bullseye [20:32:43] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL*... [20:34:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [20:40:34] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye [20:45:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:34] SRE, LDAP-Access-Requests: Grant Access to wmf for postal32 - https://phabricator.wikimedia.org/T348197 (Aklapper) Stalled→Declined Unfortunately closing this Phabricator task as no further information has been provided.
[20:52:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:42] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage [20:58:46] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage [21:11:04] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (Aklapper) Stalled→Declined Unfortunately closing this Phabricator task as no further information has been provided. @lojo_wmde / @darthmon_wmde: After you have provided t... [21:15:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:26] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1105.eqiad.wmnet with OS bullseye [21:17:33] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye completed: - cp1105 (**PASS**) - Remo...
[21:21:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye [21:21:21] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye [21:23:01] (CR) Fabfur: Basic retry mechanism for specific kafka errors (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: Fabfur) [21:28:33] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1106.eqiad.wmnet with OS bullseye [21:28:38] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye executed with errors: - cp1106 (**FAIL*...
[21:28:43] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye [21:28:50] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye [21:37:49] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1103.eqiad.wmnet [21:37:59] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp1103.eqiad.wmnet [21:38:08] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1103.eqiad.wmnet [21:38:59] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1106.eqiad.wmnet with OS bullseye [21:39:04] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye executed with errors: - cp1106 (**FAIL*...
[21:39:16] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye [21:39:21] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye [21:46:51] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1103.eqiad.wmnet [21:47:22] (PS5) Samtar: Enable LoginNotify seen subnets table [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling) [21:53:38] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [21:53:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [21:54:25] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage [21:54:52] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:12] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:22] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:30] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage [21:58:02]
SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur) [21:58:20] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:41] (CR) Jdlrobson: [C: +1] mobile: Add MobileUrlCallback [mediawiki-config] - https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [22:02:12] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye [22:02:17] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*... [22:02:31] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [22:02:38] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [22:05:34] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye [22:05:39] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [22:16:54] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1107.eqiad.wmnet with OS bullseye [22:16:59] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into
production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*... [22:17:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [22:17:35] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [22:17:39] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1108.eqiad.wmnet with OS bullseye [22:17:44] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL*... [22:17:52] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye [22:17:58] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [22:18:06] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1106.eqiad.wmnet with OS bullseye [22:18:12] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye completed: - cp1106 (**PASS**) - Remo...
[22:19:57] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [22:20:03] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye [22:21:08] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur) [22:24:11] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1108.eqiad.wmnet with OS bullseye [22:24:17] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL*... [22:24:24] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye [22:24:30] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [22:24:40] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye [22:24:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL*...
[22:24:46] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye [22:24:51] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*... [22:24:58] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [22:25:04] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye [22:25:07] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [22:25:13] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [22:28:18] SRE, Acme-chief, Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (BCornwall) a:BCornwall [22:28:50] SRE, Acme-chief, Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (BCornwall) Open→In progress [22:28:54] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall) [22:30:40] SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120 (BCornwall) Open→In progress [22:33:34] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host
cp1108.eqiad.wmnet with OS bullseye [22:33:40] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL*... [22:33:44] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye [22:33:50] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [22:33:51] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye [22:33:57] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL*... [22:34:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [22:34:20] SRE, Traffic, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BCornwall) In progress→Resolved Closing due to lack of response. Please re-open if you'd like to re-ignite discussion. Thanks!
[22:34:24] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye [22:38:35] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye [22:38:41] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL*... [22:38:45] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [22:38:50] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [22:41:34] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:45:02] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:50] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1108.eqiad.wmnet with
reason: host reimage [22:49:17] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage [22:52:05] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage [22:53:53] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [22:54:24] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage [22:57:27] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [23:01:44] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [23:01:46] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye [23:01:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye [23:01:51] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye [23:01:53] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye [23:01:57] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye [23:08:13] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97)
for host cp1110.eqiad.wmnet with OS bullseye [23:08:17] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS bullseye [23:08:18] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL*... [23:08:22] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye executed with errors: - cp1111 (**FAIL*... [23:08:28] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye [23:08:33] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL*...
[23:08:38] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [23:08:41] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye [23:08:42] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye [23:08:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye [23:08:47] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye [23:08:49] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye [23:09:49] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1108.eqiad.wmnet with OS bullseye [23:09:54] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye completed: - cp1108 (**PASS**) - Remo...
[23:12:37] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1109.eqiad.wmnet with OS bullseye [23:12:44] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye completed: - cp1109 (**PASS**) - Remo... [23:14:52] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye [23:14:58] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL*... [23:14:58] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1110.eqiad.wmnet with OS bullseye [23:15:04] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL*... [23:15:10] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS bullseye [23:15:15] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye executed with errors: - cp1111 (**FAIL*...
[23:15:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [23:15:30] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye [23:15:34] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye [23:15:37] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye [23:15:41] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye [23:15:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1107.eqiad.wmnet with OS bullseye [23:15:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye completed: - cp1107 (**PASS**) - Remo... [23:22:53] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1111.eqiad.wmnet with OS bullseye [23:22:59] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye executed with errors: - cp1111 (**FAIL*...
[23:23:00] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye [23:23:04] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye [23:23:05] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL*... [23:23:10] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye [23:23:12] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye [23:23:18] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye [23:27:01] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:03] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur) [23:30:33] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [23:30:34] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:32] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [23:38:05] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage [23:38:18] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage [23:41:00] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage [23:43:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage [23:50:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:51:31] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1110.eqiad.wmnet with OS bullseye [23:51:37] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye completed: - cp1110 (**PASS**) - Remo...
[23:53:20] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:59:13] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1111.eqiad.wmnet with OS bullseye [23:59:18] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye completed: - cp1111 (**PASS**) - Remo...