[00:15:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:30:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:38:47] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968980
[00:38:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968980 (owner: TrainBranchBot)
[00:45:54] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:24] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:56:13] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968980 (owner: TrainBranchBot)
[00:56:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:05] <jinxer-wm>	 (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:13:36] <icinga-wm>	 PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[01:13:42] <icinga-wm>	 PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:26] <icinga-wm>	 RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV
[01:27:34] <icinga-wm>	 RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:02] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:26] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:05] <jinxer-wm>	 (ProbeDown) resolved: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:51:12] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[02:31:58] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[02:38:43] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:51:16] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:51:12] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[06:31:58] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[06:42:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:46:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:51:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:03:05] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969533
[07:06:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/969533 (owner: Marostegui)
[07:07:34] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969533 (owner: Marostegui)
[07:08:05] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969533|ProductionServices.php: Promote pc1014 to pc1 master]]
[07:09:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:52] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:12:38] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:13:46] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:14:28] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969360
[07:15:02] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:16:06] <wikibugs>	 (PS1) Vgutierrez: hiera: Switch drmrs, eqsin and esams to digicert-2023 [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119)
[07:16:21] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969533|ProductionServices.php: Promote pc1014 to pc1 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:16:39] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[07:17:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:18:51] <elukey>	 !log arm keyholder on acmechief2002 and deploy1002
[07:18:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:54] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/228/con" [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez)
[07:21:42] <jinxer-wm>	 (KeyholderUnarmed) resolved: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:22:09] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969533|ProductionServices.php: Promote pc1014 to pc1 master]] (duration: 14m 04s)
[07:22:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 29.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:22:13] <marostegui>	 ^ my fault
[07:22:18] <marostegui>	 I am reverting
[07:22:21] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969360 (owner: Marostegui)
[07:22:46] <wikibugs>	 (CR) Vgutierrez: [V: +1] "chained certificates deployed on the cp servers look good:" [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez)
[07:22:47] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969360|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]]
[07:22:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:01] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969360|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:24:03] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[07:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/200: 0.8172186946249829s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:27:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:29:20] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969360|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] (duration: 06m 33s)
[07:30:23] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969667
[07:32:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:32:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/200: 0.4157913965730752s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[07:34:51] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance
[07:35:16] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance
[07:41:54] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969667 (owner: Marostegui)
[07:42:34] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969667 (owner: Marostegui)
[07:42:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:42:47] <wikibugs>	 (CR) Vgutierrez: [V: +1] "prefetched OCSP responses look healthy as well:" [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez)
[07:43:20] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969667|ProductionServices.php: Promote pc1014 to pc1 master]]
[07:44:32] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969667|ProductionServices.php: Promote pc1014 to pc1 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:44:37] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[07:46:54] <vgutierrez>	 !log disable puppet on cp hosts in esams, eqsin and drmrs before switching to the new unified digicert certificates - T341119
[07:49:57] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969667|ProductionServices.php: Promote pc1014 to pc1 master]] (duration: 06m 36s)
[07:50:36] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] hiera: Switch drmrs, eqsin and esams to digicert-2023 [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez)
[07:51:09] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969361
[07:51:16] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:52:39] <vgutierrez>	 !log depool cp5025 to perform some digicert-2023 related sanity checks - T341119
[07:52:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:59] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5025 is OK: SSL OK - OCSP staple validity for wikipedia.org has 531724 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[07:57:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: first iteration for otel-coll alerts [alerts] - https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: Filippo Giunchedi)
[07:57:30] <wikibugs>	 (PS2) WMDE-Fisch: Cleanup Kartographer Nearby flags [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785)
[07:58:45] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969361 (owner: Marostegui)
[07:59:25] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969361 (owner: Marostegui)
[07:59:45] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969361|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]]
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T0800).
[08:00:05] <jouncebot>	 WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:19] <WMDE-Fisch>	 \o
[08:00:23] <wikibugs>	 (PS1) Marostegui: pc1011: Move pc1011 to pc2 [puppet] - https://gerrit.wikimedia.org/r/969670
[08:00:36] <WMDE-Fisch>	 I can self serve though
[08:00:59] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969361|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:01:12] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[08:01:38] <taavi>	 sure, although it seems marostegui is already deploying something not scheduled for this window?
[08:01:43] <WMDE-Fisch>	 Yes
[08:01:46] <WMDE-Fisch>	 Just saw that
[08:01:58] <marostegui>	 taavi: I hoped I was finished before the window
[08:02:00] <marostegui>	 It should be finished in a bit
[08:02:17] <marostegui>	 I am deploying mediawiki_config only
[08:02:28] <marostegui>	 To be able to upgrade parsercache kernels
[08:04:03] <taavi>	 ok, just let us know when you're done
[08:04:30] <marostegui>	 yeah it is almost done
[08:06:26] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969361|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] (duration: 06m 41s)
[08:06:35] <marostegui>	 taavi WMDE-Fisch all done!
[08:06:38] <marostegui>	 sorry for the delay
[08:06:45] <vgutierrez>	 !log repool cp5025 - T341119
[08:06:45] <WMDE-Fisch>	 Nice, I'll take over then.
[08:06:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:05] <wikibugs>	 (PS3) WMDE-Fisch: Cleanup Kartographer Nearby flags [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785)
[08:08:18] <wikibugs>	 (PS2) Marostegui: pc1014: Move pc1014 to pc2 [puppet] - https://gerrit.wikimedia.org/r/969670
[08:08:46] <wikibugs>	 (CR) Marostegui: [C: +2] pc1014: Move pc1014 to pc2 [puppet] - https://gerrit.wikimedia.org/r/969670 (owner: Marostegui)
[08:09:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by wmde-fisch@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: WMDE-Fisch)
[08:10:03] <vgutierrez>	 !log triggering a puppet run on cp hosts in esams, eqsin and drmrs to switch to the new unified digicert certificates - T341119
[08:10:05] <wikibugs>	 (Merged) jenkins-bot: Cleanup Kartographer Nearby flags [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: WMDE-Fisch)
[08:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:17] <logmsgbot>	 !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:966520|Cleanup Kartographer Nearby flags (T332785)]]
[08:10:22] <stashbot>	 T332785: Remove custom old nearby functionality for Wikivoyage from Kartographer - https://phabricator.wikimedia.org/T332785
[08:11:33] <logmsgbot>	 !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:966520|Cleanup Kartographer Nearby flags (T332785)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:12:26] <logmsgbot>	 !log wmde-fisch@deploy2002 wmde-fisch: Continuing with sync
[08:13:53] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6005 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530710 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:13:57] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6003 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530707 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:13:57] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6002 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530707 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:14:11] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6004 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530692 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:14:30] <vgutierrez>	 sorry about the RECOVERY flood :)
[08:14:39] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6001 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530665 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:15:19] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6007 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530625 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:15:45] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6006 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530598 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:16:25] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6009 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530558 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:17:01] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6012 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530522 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:17:03] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6011 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530520 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:17:51] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6013 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530473 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:17:53] <logmsgbot>	 !log wmde-fisch@deploy2002 Finished scap: Backport for [[gerrit:966520|Cleanup Kartographer Nearby flags (T332785)]] (duration: 07m 35s)
[08:17:55] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6014 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530468 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:17:58] <stashbot>	 T332785: Remove custom old nearby functionality for Wikivoyage from Kartographer - https://phabricator.wikimedia.org/T332785
[08:18:49] <WMDE-Fisch>	 I'm done. :-)
[08:19:09] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5017 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530395 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:19:27] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6016 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530376 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:19:45] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5019 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530358 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:19:55] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5020 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530349 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:20:19] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530324 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:21:23] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5021 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530260 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:21:35] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5024 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530249 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:21:37] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5023 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530246 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:22:39] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6008 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530185 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:22:39] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5018 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530184 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:22:39] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5022 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530184 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:23:07] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5029 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530156 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:23:23] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5030 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530140 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:23:35] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5027 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530129 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:23:37] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5026 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530127 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:23:51] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5028 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530112 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:25:01] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3066 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530043 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:25:01] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6015 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530043 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:25:05] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3068 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530039 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:25:07] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5032 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530036 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:25:25] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5031 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530018 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:25:29] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3067 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530014 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:26:21] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3069 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529962 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:26:21] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3072 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529962 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:26:23] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3070 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529961 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:26:23] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3071 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529960 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:26:49] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3073 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529935 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:27:45] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3077 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529879 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:27:47] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3076 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529876 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:27:59] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3074 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529864 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:28:15] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3075 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529848 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:28:19] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3078 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529844 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:28:28] <wikibugs>	 (PS1) Majavah: base: puppet_alert: fix error message [puppet] - https://gerrit.wikimedia.org/r/969677
[08:28:33] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529830 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:29:13] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3079 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529790 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:29:17] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3080 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529786 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:29:51] <vgutierrez>	 !log switched to digicert-2023 in esams, eqsin and drmrs - T341119
[08:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:56] <wikibugs>	 (CR) Slyngshede: "Okay to merge this, as we fixed the monitoring in Prometheus/alertmanager?" [puppet] - https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[08:38:47] <wikibugs>	 (PS1) Majavah: aptrepo: set Auto-Submitted header on reprepro change emails [puppet] - https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835)
[08:47:30] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: set cgroup memory limits for query components [puppet] - https://gerrit.wikimedia.org/r/969683 (https://phabricator.wikimedia.org/T349999)
[08:56:12] <wikibugs>	 (PS6) Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398)
[09:03:35] <wikibugs>	 (CR) Brouberol: Enable the management of the skein certificate via Puppet (4 comments) [puppet] - https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[09:19:59] <wikibugs>	 SRE, Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (Vgutierrez) p:Triage→High We had two servers (cp1089 and cp3069) having purged issues over the weekend, after losing connection to the kafka cluster and logging: ` Oct 28 05:19:11 cp1089 pur...
[09:26:10] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/969677 (owner: Majavah)
[09:26:22] <wikibugs>	 (CR) Majavah: [C: +2] base: puppet_alert: fix error message [puppet] - https://gerrit.wikimedia.org/r/969677 (owner: Majavah)
[09:26:32] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[09:26:36] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: require replica_label to be set [puppet] - https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002)
[09:27:05] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: require replica_label to be set [puppet] - https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: Filippo Giunchedi)
[09:27:27] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835) (owner: Majavah)
[09:28:01] <wikibugs>	 (CR) Slyngshede: [C: +2] P:monitoring remove remnants of checkpuppetrun [puppet] - https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[09:28:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:28:15] <wikibugs>	 (PS2) Majavah: aptrepo: set Auto-Submitted header on reprepro change emails [puppet] - https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835)
[09:29:06] <wikibugs>	 (CR) Majavah: [C: +2] aptrepo: set Auto-Submitted header on reprepro change emails [puppet] - https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835) (owner: Majavah)
[09:29:08] <wikibugs>	 (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: Filippo Giunchedi)
[09:33:08] <wikibugs>	 (PS7) Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398)
[09:33:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:36:05] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/230/con" [puppet] - https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: Filippo Giunchedi)
[09:37:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "realm.pp needs to be loaded when site.pp does get loaded." [puppet] - https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: Jbond)
[09:38:57] <jynus>	 spiky behaviour on mw app misc eqiad
[09:39:02] <jynus>	 in terms of latency
[09:42:10] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@af33784] (releasing): (no justification provided)
[09:42:50] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@af33784] (releasing): (no justification provided) (duration: 00m 40s)
[09:51:13] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[09:54:03] <wikibugs>	 (CR) Jbond: "pcc: https://puppet-compiler.wmflabs.org/output/969373/225/" [puppet] - https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: Jbond)
[09:58:44] <wikibugs>	 (PS1) Majavah: hieradata: fix cloudinfra webproxy password location [labs/private] - https://gerrit.wikimedia.org/r/969689
[09:58:50] <wikibugs>	 (PS1) Majavah: secret: dkim: move wmcs dkim keys to correct location [labs/private] - https://gerrit.wikimedia.org/r/969690
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T1000)
[10:00:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:02:09] <wikibugs>	 (PS1) Majavah: hieradata: add fake metricsinfra grafana password [labs/private] - https://gerrit.wikimedia.org/r/969691
[10:05:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:05:53] <wikibugs>	 (PS2) Ayounsi: [POC] Split interface_automation into multiple files [software/netbox-extras] - https://gerrit.wikimedia.org/r/969319
[10:05:55] <wikibugs>	 (PS1) Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692
[10:05:59] <wikibugs>	 (PS1) Majavah: dynamicproxy: simplify redis replication code [puppet] - https://gerrit.wikimedia.org/r/969693
[10:06:31] <wikibugs>	 (CR) CI reject: [V: -1] dynamicproxy: simplify redis replication code [puppet] - https://gerrit.wikimedia.org/r/969693 (owner: Majavah)
[10:06:35] <wikibugs>	 (CR) CI reject: [V: -1] Ask for port # and type instead of interface name [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692 (owner: Ayounsi)
[10:07:39] <wikibugs>	 (PS2) Majavah: dynamicproxy: simplify redis replication code [puppet] - https://gerrit.wikimedia.org/r/969693
[10:11:34] <wikibugs>	 (PS3) Majavah: dynamicproxy: simplify redis replication code [puppet] - https://gerrit.wikimedia.org/r/969693
[10:12:29] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/236/console" [puppet] - https://gerrit.wikimedia.org/r/969693 (owner: Majavah)
[10:13:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:14:48] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (DMburugu) I approve
[10:15:13] <wikibugs>	 (PS2) Isabelle Hurbain-Palatin: Roll-out Parsoid Kartographer support for all English language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871)
[10:15:59] <wikibugs>	 SRE, Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (Fabfur) Adding, for complete information, that the list of hosts impacted with the same purged error this weekend were:  - cp1078 - cp1089 - cp6005 - cp3069
[10:18:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:20:11] <wikibugs>	 (PS2) Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692
[10:20:26] <wikibugs>	 (CR) Elukey: [C: +1] "One question - IIUC the percentage is a syntactic sugar to set the value in bytes that is some % of the main memory. If we set both units " [puppet] - https://gerrit.wikimedia.org/r/969683 (https://phabricator.wikimedia.org/T349999) (owner: Filippo Giunchedi)
[10:22:43] <wikibugs>	 (PS1) Hashar: puppet_compiler: CORS header for Gerrit [puppet] - https://gerrit.wikimedia.org/r/969694 (https://phabricator.wikimedia.org/T350003)
[10:23:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:24:29] <wikibugs>	 (CR) Hashar: "I have crafted this solely based on documentation, I haven't tested the Nginx config change." [puppet] - https://gerrit.wikimedia.org/r/969694 (https://phabricator.wikimedia.org/T350003) (owner: Hashar)
[10:24:42] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Urbanecm_WMF) Thanks Dennis!  @JMeybohm Hi Janis, I see you're on SRE clinic duty this week. This request now should have sponsorship from a WMF staff member (me) and approv...
[10:28:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:34:29] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:19] <wikibugs>	 (Abandoned) Jbond: site.pp: rename site.pp so that it is loaded first [puppet] - https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: Jbond)
[10:42:07] <wikibugs>	 (PS1) Ayounsi: Various changes [software/netbox-extras] - https://gerrit.wikimedia.org/r/969696
[10:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:43:06] <wikibugs>	 (PS1) Jbond: idp_test: update acmechief_host [puppet] - https://gerrit.wikimedia.org/r/969697 (https://phabricator.wikimedia.org/T349918)
[10:44:22] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/237/console" [puppet] - https://gerrit.wikimedia.org/r/969697 (https://phabricator.wikimedia.org/T349918) (owner: Jbond)
[10:44:31] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: CORS header for Gerrit [puppet] - https://gerrit.wikimedia.org/r/969694 (https://phabricator.wikimedia.org/T350003) (owner: Hashar)
[10:45:37] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:46:38] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] idp_test: update acmechief_host [puppet] - https://gerrit.wikimedia.org/r/969697 (https://phabricator.wikimedia.org/T349918) (owner: Jbond)
[10:48:32] <wikibugs>	 (PS3) Ayounsi: [POC] Split interface_automation into multiple files [software/netbox-extras] - https://gerrit.wikimedia.org/r/969319
[10:48:34] <wikibugs>	 (PS3) Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - https://gerrit.wikimedia.org/r/969692
[10:48:38] <wikibugs>	 (PS2) Ayounsi: Various changes [software/netbox-extras] - https://gerrit.wikimedia.org/r/969696
[10:51:12] <wikibugs>	 (PS1) Jbond: puppet_compiler: Add semicolon [puppet] - https://gerrit.wikimedia.org/r/969698 (https://phabricator.wikimedia.org/T350003)
[10:51:16] <wikibugs>	 (PS3) Ayounsi: Various changes [software/netbox-extras] - https://gerrit.wikimedia.org/r/969696
[10:52:55] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: Add semicolon [puppet] - https://gerrit.wikimedia.org/r/969698 (https://phabricator.wikimedia.org/T350003) (owner: Jbond)
[10:58:15] <wikibugs>	 SRE-OnFire, User-fgiunchedi: Deploy alerts-triage app to production - https://phabricator.wikimedia.org/T350014 (fgiunchedi)
[10:58:59] <wikibugs>	 (CR) Majavah: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: Filippo Giunchedi)
[10:59:53] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: require replica_label to be set [puppet] - https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: Filippo Giunchedi)
[11:01:20] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] realm: use puppet7 acmechief when on puppet7 [puppet] - https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:03:21] <wikibugs>	 SRE, Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (Vgutierrez) We need to work on purged Kafka consumer.  I've already spotted the issue on our codebase
[11:09:52] <wikibugs>	 (PS1) Jbond: idp-test: correct hostname [puppet] - https://gerrit.wikimedia.org/r/969700 (https://phabricator.wikimedia.org/T349915)
[11:10:13] <wikibugs>	 (CR) Jbond: [C: +2] idp-test: correct hostname [puppet] - https://gerrit.wikimedia.org/r/969700 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:10:42] <wikibugs>	 (PS1) Jbond: Revert "realm: use puppet7 acmechief when on puppet7" [puppet] - https://gerrit.wikimedia.org/r/969363
[11:10:58] <wikibugs>	 (CR) Jbond: [C: +2] Revert "realm: use puppet7 acmechief when on puppet7" [puppet] - https://gerrit.wikimedia.org/r/969363 (owner: Jbond)
[11:11:20] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Aklapper) Sign off by a WMF C-level staff
[11:15:05] <wikibugs>	 (CR) Filippo Giunchedi: Enable support for statsd_exporters on non-ops instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[11:17:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] thanos: set cgroup memory limits for query components (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969683 (https://phabricator.wikimedia.org/T349999) (owner: Filippo Giunchedi)
[11:18:01] <wikibugs>	 (PS2) Fabfur: Add version print option [software/purged] - https://gerrit.wikimedia.org/r/962670 (https://phabricator.wikimedia.org/T347839)
[11:24:31] <wikibugs>	 (PS1) Jbond: acmechief: switch back to using puppet localcacert [puppet] - https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915)
[11:25:40] <wikibugs>	 (PS2) Jbond: acmechief: switch back to using puppet localcacert [puppet] - https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915)
[11:26:15] <wikibugs>	 (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:26:48] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/238/con" [puppet] - https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:28:17] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: provisioning db1230.eqiad.wmnet - T344036
[11:28:23] <stashbot>	 T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[11:28:32] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: provisioning db1230.eqiad.wmnet - T344036
[11:28:35] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: provisioning db1230.eqiad.wmnet - T344036
[11:28:49] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: provisioning db1230.eqiad.wmnet - T344036
[11:30:01] <wikibugs>	 (PS1) Jbond: acmechief2002: switch to localca cert [puppet] - https://gerrit.wikimedia.org/r/969702 (https://phabricator.wikimedia.org/T349915)
[11:30:16] <wikibugs>	 (CR) Jbond: [C: +2] acmechief2002: switch to localca cert [puppet] - https://gerrit.wikimedia.org/r/969702 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:31:13] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) resolved: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[11:33:33] <wikibugs>	 (PS1) Filippo Giunchedi: sre: ignore pint promql/series checks for otel-coll [alerts] - https://gerrit.wikimedia.org/r/969703 (https://phabricator.wikimedia.org/T345712)
[11:34:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Adding db1230 depooled, depooling db1130', diff saved to https://phabricator.wikimedia.org/P53064 and previous config saved to /var/cache/conftool/dbconfig/20231030-113401-arnaudb.json
[11:36:10] <wikibugs>	 (PS1) Filippo Giunchedi: sre: ignore promql/series for SystemdUnitCrashLoop [alerts] - https://gerrit.wikimedia.org/r/969704 (https://phabricator.wikimedia.org/T293970)
[11:36:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: ignore pint promql/series checks for otel-coll [alerts] - https://gerrit.wikimedia.org/r/969703 (https://phabricator.wikimedia.org/T345712) (owner: Filippo Giunchedi)
[11:37:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: ignore promql/series for SystemdUnitCrashLoop [alerts] - https://gerrit.wikimedia.org/r/969704 (https://phabricator.wikimedia.org/T293970) (owner: Filippo Giunchedi)
[11:46:16] <wikibugs>	 (PS1) Arnaudb: mariadb: add a new host (db1230) [puppet] - https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036)
[11:47:48] <wikibugs>	 (PS2) Arnaudb: mariadb: add a new host (db1230) [puppet] - https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036)
[11:48:20] <wikibugs>	 (PS3) Arnaudb: mariadb: add a new host (db1230) [puppet] - https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036)
[11:48:46] <wikibugs>	 (PS4) Arnaudb: mariadb: add a new host (db1230) [puppet] - https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036)
[11:49:03] <wikibugs>	 (PS1) Jbond: acmechief_host: drop this value as a global [puppet] - https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915)
[11:49:17] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: add a new host (db1230) [puppet] - https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[11:49:34] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: add a new host (db1230) [puppet] - https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[11:49:41] <wikibugs>	 (CR) CI reject: [V: -1] acmechief_host: drop this value as a global [puppet] - https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) (owner: Jbond)
[11:51:16] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:52:20] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1130.eqiad.wmnet onto db1230.eqiad.wmnet
[11:56:11] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:08] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:28] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:08:04] <wikibugs>	 (PS1) Marostegui: db1217: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/969708 (https://phabricator.wikimedia.org/T349090)
[12:09:14] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:10:08] <wikibugs>	 (CR) Marostegui: [C: +2] db1217: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/969708 (https://phabricator.wikimedia.org/T349090) (owner: Marostegui)
[12:10:41] <wikibugs>	 (PS2) Jbond: acmechief_host: drop this value as a global [puppet] - https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915)
[12:11:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1217.eqiad.wmnet with OS bookworm
[12:13:08] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:13:12] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:13:16] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:13:24] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:18:10] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:20:36] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:24:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage
[12:25:23] <marostegui>	 ^ all those expected
[12:26:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage
[12:27:40] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:27:40] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:28:23] <wikibugs>	 (03Abandoned) 10Ayounsi: Various changes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969696 (owner: 10Ayounsi)
[12:28:25] <wikibugs>	 (03PS4) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692
[12:28:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'New host', diff saved to https://phabricator.wikimedia.org/P53065 and previous config saved to /var/cache/conftool/dbconfig/20231030-122855-marostegui.json
[12:29:56] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:31:36] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:33:10] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:34:40] <wikibugs>	 (03PS1) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915)
[12:34:42] <wikibugs>	 (03PS1) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915)
[12:35:29] <wikibugs>	 (03PS1) 10Slyngshede: P:monitoring remove remainders of check_eth. [puppet] - 10https://gerrit.wikimedia.org/r/969721
[12:37:20] <wikibugs>	 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati)
[12:37:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[12:37:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[12:38:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Traffic, and 2 others: find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond)
[12:39:43] <wikibugs>	 (03PS2) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915)
[12:39:44] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:39:44] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:39:45] <wikibugs>	 (03PS2) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915)
[12:39:50] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:39:52] <wikibugs>	 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati)
[12:40:09] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Urbanecm_WMF) >>! In T348520#9290724, @Aklapper wrote: > Sign off by a WMF C-level staff  While that is indeed currently a part of the [relevant docs](https://wikitech.wikim...
[12:42:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:42:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[12:42:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[12:42:36] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:02] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:45:17] <wikibugs>	 (03CR) 10Jbond: acme_chief::cert: remove style violation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[12:47:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1217.eqiad.wmnet with OS bookworm
[12:48:10] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:48:10] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:49:25] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1217: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/969364
[12:49:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1217: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/969364 (owner: 10Marostegui)
[12:51:41] <wikibugs>	 (03PS3) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915)
[12:51:43] <wikibugs>	 (03PS3) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915)
[12:52:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:54:05] <wikibugs>	 (03CR) 10Jgreen: [V: 03+2 C: 03+1] Add dummy secrets for community_civicrm [labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[12:54:08] <wikibugs>	 (03CR) 10Jgreen: [V: 03+2 C: 03+2] Add dummy secrets for community_civicrm [labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[12:54:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[12:54:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[12:55:12] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1130.eqiad.wmnet onto db1230.eqiad.wmnet
[12:57:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T1300).
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:15] <urbanecm>	 welcome, hour-early-window!
[13:01:05] <TheresNoTime>	 ah daylight savings ended
[13:02:08] <taavi>	 yep, we moved to daylight confusion instead
[13:14:30] <wikibugs>	 (03PS5) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692
[13:14:32] <wikibugs>	 (03PS1) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749
[13:17:31] <wikibugs>	 (03PS4) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915)
[13:17:33] <wikibugs>	 (03PS4) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915)
[13:23:24] <wikibugs>	 (03Abandoned) 10Jbond: environment: fix SC3033 [puppet] - 10https://gerrit.wikimedia.org/r/969075 (owner: 10Jbond)
[13:24:59] <wikibugs>	 (03PS10) 10Brouberol: Enable the management of the skein certificate via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398)
[13:25:01] <wikibugs>	 (03PS9) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398)
[13:26:42] <wikibugs>	 (03PS11) 10Brouberol: Enable the management of the skein certificate via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398)
[13:26:44] <wikibugs>	 (03PS10) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398)
[13:27:26] <wikibugs>	 (03PS12) 10Brouberol: Enable the management of the skein certificate via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398)
[13:27:28] <wikibugs>	 (03PS11) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398)
[13:27:58] <wikibugs>	 (03CR) 10Brouberol: Enable the management of the skein certificate via Puppet (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol)
[13:39:29] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[13:39:46] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[13:40:04] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[13:47:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol)
[13:47:53] <Lucas_WMDE>	 oh right
[13:48:04] <Lucas_WMDE>	 (re daylight confusion, that is ^^)
[13:48:13] <wikibugs>	 (03CR) 10Bking: [C: 03+2] search-loader: use default system python [puppet] - 10https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[13:48:43] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Create cookbook to migrate servers from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10jbond)
[13:49:08] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) 05Open→03In progress p:05Triage→03Medium
[13:54:27] <wikibugs>	 (03PS2) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749
[13:54:29] <wikibugs>	 (03PS1) 10Ayounsi: provision_server: don't show servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752
[13:55:38] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[13:58:24] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Switchover master from db1164 to db1119 [puppet] - 10https://gerrit.wikimedia.org/r/969753 (https://phabricator.wikimedia.org/T350022)
[13:59:11] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "Do not deploy until Manuel says so." [puppet] - 10https://gerrit.wikimedia.org/r/969753 (https://phabricator.wikimedia.org/T350022) (owner: 10Jcrespo)
[14:00:04] <wikibugs>	 (03PS4) 10Ayounsi: Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319
[14:00:06] <wikibugs>	 (03PS6) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692
[14:00:08] <wikibugs>	 (03PS3) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749
[14:00:10] <wikibugs>	 (03PS2) 10Ayounsi: provision_server: don't show servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752
[14:01:17] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[14:06:09] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[14:06:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Neat! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969721 (owner: 10Slyngshede)
[14:07:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:10:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[14:12:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:16:15] <wikibugs>	 (03PS1) 10Bking: search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969754 (https://phabricator.wikimedia.org/T346039)
[14:18:31] <wikibugs>	 (03PS1) 10Elukey: services: update the ChangeProp staging's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969757 (https://phabricator.wikimedia.org/T348950)
[14:20:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[14:20:45] <wikibugs>	 (03PS1) 10Elukey: services: update ChangeProp's eqiad Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950)
[14:26:26] <godog>	 elukey: the logstash indexing failures are from cp in staging :( 
[14:26:31] <wikibugs>	 (03PS2) 10Bking: search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969754 (https://phabricator.wikimedia.org/T346039)
[14:26:36] <elukey>	 ah snap, lovely
[14:26:41] <godog>	 i.e. "message" is json
[14:26:44] <godog>	 "message"=>{"message"=>"[thrd:GroupCoordinator]: GroupCoordinator/1001: Sent HeartbeatRequest (v1, 109 bytes @ 0, CorrId 613)", "severity"=>7, "fac"=>"SEND"}, 
[14:26:47] <godog>	 etc
[14:27:05] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: update the ChangeProp staging's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969757 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[14:27:32] <elukey>	 godog: going to shut off debug msgs for librdkafka in a sec
[14:27:56] <elukey>	 anything that we can do to make them digestible?
[14:28:01] <elukey>	 I can't control their format sadly
[14:28:13] <elukey>	 (they are generated by librdkafka via another nodejs lib)
[14:29:00] <wikibugs>	 (03PS1) 10Jbond: sre.ganeti.makevm: Add pppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739)
[14:29:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:29:10] <godog>	 mmhh good question, the first thing that comes to mind is not having json in 'message', maybe wrap it as text
[14:30:16] <wikibugs>	 (03Abandoned) 10Bking: search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969754 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[14:31:28] <wikibugs>	 (03PS1) 10Bking: search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039)
[14:31:58] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[14:32:12] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[14:32:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond)
[14:33:37] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[14:33:39] <wikibugs>	 (03CR) 10Peter Fischer: [C: 03+1] "LGTM, as far as I can tell" [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[14:34:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:34:13] <wikibugs>	 (03CR) 10Bking: [C: 03+2] search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking)
[14:34:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[14:35:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[14:36:52] <inflatador>	 !log bking@search-loader2001 disabling services as part of bullseye migration T346039
[14:36:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:59] <stashbot>	 T346039: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039
[14:37:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:37:19] <godog>	 elukey: the indexing errors are gone btw, last was at 14:32:10
[14:37:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039
[14:37:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[14:38:16] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039
[14:38:44] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:04] <elukey>	 godog: yes I removed the debug logging
[14:39:13] <elukey>	 I'll try to come up with a different solution
[14:39:26] <godog>	 ack, thanks
[14:41:13] <logmsgbot>	 !log bking@deploy2002 Started deploy [search/mjolnir/deploy@daf8c32]: T346039
[14:41:18] <logmsgbot>	 !log bking@deploy2002 Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 05s)
[14:42:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:42:22] <elukey>	 godog: in theory https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/969765 should fix
[14:42:25] <elukey>	 does it make sense?
[14:43:44] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:45:06] <godog>	 elukey: yes LGTM
[14:45:15] <elukey>	 of course I forgot a )
[14:45:16] <elukey>	 sigh
[14:46:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:46:43] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[14:50:59] <wikibugs>	 (03PS1) 10Jbond: puppet7: Add a motd to inform users a host has been migrated to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969789 (https://phabricator.wikimedia.org/T349619)
[14:51:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:52:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969789 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[14:53:44] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:32] <wikibugs>	 10SRE, 10DNS, 10Traffic: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10NMariano-WMF) The ITS System team will set this up and manage permissions for Noah Israel (@nisrael) and Danny Bu (@DBu-WMF).
[14:56:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:00:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet7: Add a motd to inform users a host has been migrated to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969789 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:01:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:01:54] <jbond>	 bking you happy for me to merge your change
[15:02:04] <jbond>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/969761
[15:04:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:04:53] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) a:05dcaro→03Andrew
[15:06:16] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron)
[15:09:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:13:14] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew)
[15:14:37] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr)
[15:14:42] <wikibugs>	 (03PS1) 10Jbond: puppet::agent: correct white space in motd [puppet] - 10https://gerrit.wikimedia.org/r/969793
[15:14:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet::agent: correct white space in motd [puppet] - 10https://gerrit.wikimedia.org/r/969793 (owner: 10Jbond)
[15:19:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on search-loader1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:20:39] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10VRiley-WMF) cloudvirt-wdqs1003 has been relocated   cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015  Side note, we had to use a 1 Gig connection sinc...
[15:21:11] <wikibugs>	 (03PS2) 10Jbond: sre.ganeti.makevm: Add puppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739)
[15:21:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.ganeti.makevm: Add puppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond)
[15:21:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1003
[15:23:20] <RhinosF1>	 jouncebot: next
[15:23:21] <jouncebot>	 In 0 hour(s) and 6 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T1530)
[15:24:06] <wikibugs>	 (03PS1) 10Jbond: builder: migrate role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969795 (https://phabricator.wikimedia.org/T349619)
[15:24:45] <wikibugs>	 (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond)
[15:25:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] builder: migrate role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969795 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:25:53] <wikibugs>	 (03Merged) 10jenkins-bot: sre.ganeti.makevm: Add puppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond)
[15:27:17] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[15:29:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt-wdqs1003
[15:29:33] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1003 - taavi@cumin1001"
[15:30:22] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1003 - taavi@cumin1001"
[15:30:23] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:33:12] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1003
[15:33:52] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiworld.sidl-corporation.fr - https://phabricator.wikimedia.org/T349985 (10Aklapper) 05Open→03Declined Hi @SIDLCorporation, thanks for taking the time to report this. The three fields above are not filled out, so for now I am going to decline this ticket.  Ple...
[15:33:57] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt-wdqs1003
[15:40:49] <wikibugs>	 (03PS1) 10Jbond: cluster::unprivmanagement: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969800 (https://phabricator.wikimedia.org/T349619)
[15:41:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cluster::unprivmanagement: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969800 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:42:53] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm
[15:43:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm
[15:43:18] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules: add job 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/969801
[15:43:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802
[15:43:27] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[15:43:48] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr)
[15:43:58] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802
[15:44:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802 (owner: 10Giuseppe Lavagetto)
[15:45:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802 (owner: 10Giuseppe Lavagetto)
[15:48:09] <wikibugs>	 (03PS1) 10Jbond: config_master: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969803 (https://phabricator.wikimedia.org/T349619)
[15:48:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] config_master: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969803 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:49:27] <jbond>	 !log move config_master to puppet7
[15:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:36] <jbond>	 !log move cluster::unprivmanagement to puppet7
[15:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:45] <jbond>	 !log move builder to puppet7
[15:49:48] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10Papaul) @cmooney cable is in place from mr1-codfw ge0/0/3 to lsw1-a2-codfw ge-0/0/47 ID 00745
[15:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:10] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[15:51:16] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:51:32] <wikibugs>	 (03PS10) 10Herron: profile::mediawiki::common: include prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377)
[15:51:39] <wikibugs>	 (03PS14) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751)
[15:51:40] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[15:51:46] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye
[15:53:44] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:55:57] <wikibugs>	 (03PS1) 10Jbond: failoid: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969806 (https://phabricator.wikimedia.org/T349619)
[15:55:59] <jbond>	 !log migrate failoid to puppet7
[15:56:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] failoid: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969806 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[15:56:40] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudvirt-wdqs1003 - taavi@cumin1001"
[15:57:40] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudvirt-wdqs1003 - taavi@cumin1001"
[15:58:22] <wikibugs>	 (03PS1) 10Majavah: hieradata: update cloudvirt-wdqs1003 network config [puppet] - 10https://gerrit.wikimedia.org/r/969807
[15:58:23] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host reimage
[15:59:11] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: update cloudvirt-wdqs1003 network config [puppet] - 10https://gerrit.wikimedia.org/r/969807 (owner: 10Majavah)
[15:59:33] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:43] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:33] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host reimage
[16:03:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:56] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host ganeti-test1002.eqiad.wmnet
[16:04:13] <jbond>	 !log migrate ganeti-test1002.eqiad.wmnet to puppet7
[16:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:16] <wikibugs>	 (03PS1) 10Jbond: ganeti-test1002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969810 (https://phabricator.wikimedia.org/T349619)
[16:05:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] ganeti-test1002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969810 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[16:07:40] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[16:07:44] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL**)   - Removed from Puppet...
[16:07:56] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[16:08:02] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye
[16:09:01] <wikibugs>	 (03PS11) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[16:09:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[16:10:00] <wikibugs>	 (03PS2) 10Majavah: openstack: nova: add a dependency on libvirt-clients [puppet] - 10https://gerrit.wikimedia.org/r/969299
[16:10:59] <wikibugs>	 (03PS4) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054)
[16:11:05] <wikibugs>	 (03PS5) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054)
[16:13:48] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] openstack: nova: add a dependency on libvirt-clients [puppet] - 10https://gerrit.wikimedia.org/r/969299 (owner: 10Majavah)
[16:14:02] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ganeti-test1002.eqiad.wmnet
[16:15:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:09] <jbond>	 !log migrate O:ganeti_test to puppet7
[16:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:03] <wikibugs>	 (03PS1) 10Jbond: ganeti_test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969812 (https://phabricator.wikimedia.org/T349619)
[16:17:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] ganeti_test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969812 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[16:19:54] <wikibugs>	 (03PS1) 10Vgutierrez: reprepro: Fix haproxy component names for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969814
[16:21:10] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001"
[16:21:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969814 (owner: 10Vgutierrez)
[16:22:02] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001"
[16:22:03] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm
[16:22:09] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] reprepro: Fix haproxy component names for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969814 (owner: 10Vgutierrez)
[16:22:15] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:19] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:23:04] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[16:23:33] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:24:49] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[16:25:54] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[16:26:17] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[16:26:44] <wikibugs>	 10SRE, 10DNS, 10Traffic: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10ssingh) Hi, this is for wikimedia.org, correct?
[16:28:27] <wikibugs>	 10SRE, 10DNS, 10Traffic: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10NMariano-WMF) Correct
[16:34:07] <wikibugs>	 (03PS1) 10Ssingh: wikimedia.org: update google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/969816 (https://phabricator.wikimedia.org/T349942)
[16:34:08] <Dreamy_Jazz>	 I'm seeing inconsistent server errors from Phabricator
[16:34:31] <Dreamy_Jazz>	 Talking about the MySQL server going away
[16:35:24] <wikibugs>	 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi)
[16:38:05] <icinga-wm>	 PROBLEM - haproxy process on cp4052 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[16:38:17] <icinga-wm>	 PROBLEM - Check systemd state on cp4052 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service,haproxy_stek_job.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:37] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[16:38:43] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[16:39:12] <sukhe>	 ^ will downtime this, host is depooled
[16:39:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp4052.ulsfo.wmnet with reason: depooled, reimaging
[16:39:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp4052.ulsfo.wmnet with reason: depooled, reimaging
[16:42:04] <wikibugs>	 (03PS12) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[16:42:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[16:43:01] <wikibugs>	 (03PS1) 10Majavah: aptrepo: cleanup haproxy update and component names [puppet] - 10https://gerrit.wikimedia.org/r/969819
[16:44:37] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:48:55] <wikibugs>	 (03PS1) 10Vgutierrez: reprepro: Fix haproxy components name for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969821
[16:49:43] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] reprepro: Fix haproxy components name for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969821 (owner: 10Vgutierrez)
[16:50:26] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] reprepro: Fix haproxy components name for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969821 (owner: 10Vgutierrez)
[16:50:34] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] wikimedia.org: update google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/969816 (https://phabricator.wikimedia.org/T349942) (owner: 10Ssingh)
[16:51:01] <sukhe>	 !log running authdns-update for CR 969816
[16:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:21] <Lucas_WMDE>	 Dreamy_Jazz: same (reload usually fixes it, but given that the baseline is “I’ve never ever seen this error before”…)
[16:52:09] <taavi>	 <del>have you filed a task about it?</del>
[16:53:01] <Lucas_WMDE>	 screenshot here https://tmp.lucaswerkmeister.de/phabricator-unhandled-exception.png
[16:53:12] <Lucas_WMDE>	 sure I’ll file a task
[16:53:15] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:32] <taavi>	 i think there's already one
[16:54:00] <Lucas_WMDE>	 https://phabricator.wikimedia.org/T349961 is from saturday apparently
[16:54:06] <Lucas_WMDE>	 guess that’s close enough to be the same, yeah
[16:54:51] <Lucas_WMDE>	 commented there
[16:56:58] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10ssingh) 05Open→03Resolved a:03ssingh wikimedia.org.  600 IN TXT "google-site-verification=uzfgD0YiIqSQgRdSQXlkA7NByyyOZDp-n0SZ3nozpDM"
[16:57:21] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:59] <wikibugs>	 (03PS13) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[16:58:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[17:04:34] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[17:05:34] <wikibugs>	 (03PS1) 10BCornwall: hiera: remove dns3003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969931 (https://phabricator.wikimedia.org/T342154)
[17:05:36] <wikibugs>	 (03PS14) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[17:05:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: remove dns3003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969931 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:06:07] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] hiera: remove dns3003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969931 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:06:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[17:09:25] <wikibugs>	 (03PS1) 10Jbond: pki::root: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969932 (https://phabricator.wikimedia.org/T349619)
[17:09:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on search-loader1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:10:06] <jbond>	 !log migrate pki::root to puppet7
[17:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki::root: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969932 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[17:10:53] <wikibugs>	 (03PS15) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[17:12:21] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bookworm
[17:12:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3003.wikimedia.org with OS bookworm
[17:14:44] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[17:15:23] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:15:43] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:16:46] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[17:16:52] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL**)   - Removed from Puppet...
[17:19:32] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:20:31] <icinga-wm>	 PROBLEM - Host 2a02:ec80:300:2:185:15:59:34 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:300:2:185:15:59:34)
[17:21:53] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:21:57] <sukhe>	 ^ expected
[17:22:48] <jbond>	 !log migrate pki2002 to puppet7
[17:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:15] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host pki2002.codfw.wmnet
[17:23:17] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:23:55] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:23:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[17:24:35] <wikibugs>	 (03PS1) 10Jbond: pki2002: switch to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969935 (https://phabricator.wikimedia.org/T349619)
[17:25:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki2002: switch to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969935 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[17:25:41] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:27:40] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host pki2002.codfw.wmnet
[17:28:19] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:28:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.662 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:28:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[17:33:44] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:38:14] <icinga-wm>	 RECOVERY - Check systemd state on cp4052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:51] <wikibugs>	 (03PS1) 10Jbond: pki::multiroot: convert to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969937 (https://phabricator.wikimedia.org/T349619)
[17:39:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:39:11] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3003.wikimedia.org with reason: host reimage
[17:39:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki::multiroot: convert to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969937 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[17:40:04] <jbond>	 !log migrate pki::multirootca to puppet7
[17:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:16] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3003.wikimedia.org with reason: host reimage
[17:44:04] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:46:06] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 393233 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-01-19 05:55:13 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS
[17:46:18] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 220421 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-01-19 05:54:59 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS
[17:46:36] <icinga-wm>	 RECOVERY - haproxy process on cp4052 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[17:47:00] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:47:08] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:15] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[17:49:07] <wikibugs>	 (03PS1) 10Jbond: test: move test role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969940 (https://phabricator.wikimedia.org/T349619)
[17:50:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bookworm
[17:50:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] test: move test role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969940 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[17:51:02] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:53:37] <wikibugs>	 10SRE, 10API Platform, 10MediaWiki-REST-API, 10Traffic, and 2 others: Use relative URLs in redirects emitted by rest.php - https://phabricator.wikimedia.org/T349001 (10daniel) 05Open→03Resolved a:03daniel
[17:54:22] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:56:18] <jbond>	 !log migrate bastionhost to puppet7
[17:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:58] <icinga-wm>	 RECOVERY - Recursive DNS on 185.15.59.34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[17:57:32] <wikibugs>	 (03PS1) 10Jbond: bastionhost: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969942 (https://phabricator.wikimedia.org/T349619)
[17:57:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] bastionhost: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969942 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[17:58:04] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:00:42] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[18:03:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:07:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (VRiley-WMF) Hi, here is a list of C 8 servers that seem to be a part of the discrepancy    cloudswift1001 - dual (one port is dark) cloudvirt1027 - dual cloudvirt1026 - dual  clou...
[18:08:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:09:20] <wikibugs>	 (PS16) Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[18:10:46] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[18:10:52] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye
[18:11:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:11:36] <wikibugs>	 (CR) Effie Mouzeli: [V: -1] ipoid: Update cronjob definition (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[18:11:45] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[18:14:37] <logmsgbot>	 !log bking@deploy2002 Started deploy [search/mjolnir/deploy@daf8c32]: T346039
[18:14:44] <logmsgbot>	 !log bking@deploy2002 Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 06s)
[18:15:00] <wikibugs>	 (CR) Herron: [C: +2] profile::mediawiki::common: include prometheus statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: Herron)
[18:15:05] <wikibugs>	 (CR) Herron: [C: +2] profile::mediawiki::common: set default histogram buckets [puppet] - https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: Herron)
[18:16:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:18:13] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader[1001-1002].eqiad.wmnet with reason: T346039
[18:18:28] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader[1001-1002].eqiad.wmnet with reason: T346039
[18:19:45] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039
[18:19:46] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:20:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039
[18:22:32] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[18:22:37] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL**)   - Downtimed on Icinga/...
[18:23:58] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:24:42] <sukhe>	 !log racadm racreset cp1103.eqiad.wmnet
[18:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:26:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[18:27:33] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ping_offload
[18:27:38] <jbond>	 !log migrate ping_offload to puppet7
[18:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:01] <wikibugs>	 (PS1) Jbond: ping_offload: switch to puppet7 [puppet] - https://gerrit.wikimedia.org/r/969945 (https://phabricator.wikimedia.org/T349619)
[18:29:41] <wikibugs>	 (CR) Jbond: [C: +2] ping_offload: switch to puppet7 [puppet] - https://gerrit.wikimedia.org/r/969945 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[18:30:01] <wikibugs>	 (CR) Herron: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: Herron)
[18:30:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:30:46] <icinga-wm>	 PROBLEM - Check systemd state on pki2002 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-aux.service,cfssl-ocsprefresh-aux_front_proxy.service,cfssl-ocsprefresh-cassandra.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-debmonitor.service,cfssl-ocsprefresh-discovery.service,cfssl-ocsprefresh-dse.service,cfssl-ocsprefresh-dse_front_proxy.
[18:30:46] <icinga-wm>	 cfssl-ocsprefresh-etcd.service,cfssl-ocsprefresh-kafka.service,cfssl-ocsprefresh-mlserve.service,cfssl-ocsprefresh-mlserve_front_proxy.service,cfssl-ocsprefresh-mlserve_staging.service,cfssl-ocsprefresh-mlserve_staging_front_proxy.service,cfssl-ocsprefresh-network_devices.service,cfssl-ocsprefresh-syslog.service,cfssl-ocsprefresh-wikikube.service,cfssl-ocsprefresh-wikikube_front_proxy.service,cfssl-ocsprefresh-wikikube_staging.service,cfs
[18:30:46] <icinga-wm>	 efresh-wikikube_staging_front_proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:28] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:31:54] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[18:33:32] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ping_offload
[18:34:40] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[18:34:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[18:35:14] <wikibugs>	 (PS1) Bking: search-loader: removed unneeded package dep [puppet] - https://gerrit.wikimedia.org/r/969947 (https://phabricator.wikimedia.org/T346039)
[18:35:51] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm
[18:36:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[18:36:29] <jinxer-wm>	 (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:36:50] <herron>	 WidespreadPuppetFailure looks like a race condition related to my recent patch, but the subsequent puppet run succeeds.  should clear on its own.  keeping an eye on it
[18:37:06] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:37:28] <wikibugs>	 (CR) Ebernhardson: [C: +1] search-loader: removed unneeded package dep [puppet] - https://gerrit.wikimedia.org/r/969947 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[18:37:50] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:38:06] <wikibugs>	 (CR) Bking: [C: +2] search-loader: removed unneeded package dep [puppet] - https://gerrit.wikimedia.org/r/969947 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[18:38:16] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3003.wikimedia.org with OS bookworm
[18:38:26] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3003.wikimedia.org with OS bookworm completed: - dns3003 (**PASS**)   - Downtimed on Icinga/Al...
[18:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:43:44] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:44:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:44:50] <wikibugs>	 (PS1) BCornwall: Revert "hiera: remove dns3003 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969768
[18:45:04] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-aux.service,cfssl-ocsprefresh-aux_front_proxy.service,cfssl-ocsprefresh-cassandra.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-debmonitor.service,cfssl-ocsprefresh-discovery.service,cfssl-ocsprefresh-dse.service,cfssl-ocsprefresh-dse_front_proxy.
[18:45:04] <icinga-wm>	 cfssl-ocsprefresh-etcd.service,cfssl-ocsprefresh-kafka.service,cfssl-ocsprefresh-mlserve.service,cfssl-ocsprefresh-mlserve_front_proxy.service,cfssl-ocsprefresh-mlserve_staging.service,cfssl-ocsprefresh-mlserve_staging_front_proxy.service,cfssl-ocsprefresh-network_devices.service,cfssl-ocsprefresh-syslog.service,cfssl-ocsprefresh-wikikube.service,cfssl-ocsprefresh-wikikube_front_proxy.service,cfssl-ocsprefresh-wikikube_staging.service,cfs
[18:45:04] <icinga-wm>	 efresh-wikikube_staging_front_proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:40] <wikibugs>	 (CR) BCornwall: [C: +2] Revert "hiera: remove dns3003 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969768 (owner: BCornwall)
[18:47:06] <wikibugs>	 (PS1) Herron: logstash: add uri_host field to w3creportingapi template [puppet] - https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807)
[18:49:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:51:03] <wikibugs>	 (PS1) BCornwall: hiera: remove dns3004 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/969949 (https://phabricator.wikimedia.org/T342154)
[18:51:37] <wikibugs>	 (CR) BCornwall: [C: +2] hiera: remove dns3004 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/969949 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[18:52:55] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm
[18:53:08] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[18:54:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[18:58:05] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[18:59:52] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:59] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm
[19:01:29] <jinxer-wm>	 (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:04:02] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:08] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[19:04:18] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:29] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:07:52] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns3004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[19:08:16] <icinga-wm>	 PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:08:20] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:15:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:05] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3004.wikimedia.org with OS bookworm
[19:21:17] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3004.wikimedia.org with OS bookworm
[19:28:44] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:30:10] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:14] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:44] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:47:54] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3004.wikimedia.org with reason: host reimage
[19:48:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (wiki_willy) Awesome, thanks for working on this @VRiley-WMF.  @nskaggs & @cmooney - since we have some discrepancies with the number of ports being used on these cloudvirts, shou...
[19:51:03] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3004.wikimedia.org with reason: host reimage
[19:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:55:01] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.59.2 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[19:55:15] <RhinosF1>	 jouncebot: next
[19:55:16] <jouncebot>	 In 0 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T2000)
[20:00:06] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T2000). Please do the needful.
[20:00:06] <jouncebot>	 RhinosF1: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:00:15] <RhinosF1>	 I'm here
[20:01:57] <TheresNoTime>	 I'm on a train, so can't deploy
[20:02:40] <RhinosF1>	 TheresNoTime: you on holiday? trains your way are awful
[20:03:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:04:54] <dancy>	 RhinosF1: I can deploy
[20:05:04] <RhinosF1>	 dancy: thanks
[20:05:07] <RhinosF1>	 ready when you are
[20:05:13] <wikibugs>	 (PS1) Ottomata: eventgate chart - disable SYS_PTRACE on wmfdebug container [deployment-charts] - https://gerrit.wikimedia.org/r/969961 (https://phabricator.wikimedia.org/T347477)
[20:05:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/969353 (https://phabricator.wikimedia.org/T349970) (owner: RhinosF1)
[20:06:00] <thcipriani>	 thanks dancy
[20:06:21] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate chart - disable SYS_PTRACE on wmfdebug container [deployment-charts] - https://gerrit.wikimedia.org/r/969961 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[20:06:51] <wikibugs>	 (Merged) jenkins-bot: namespaces:mediawiki: add Extensions/Skins as alias of Extension/Skin (+ tallk) [mediawiki-config] - https://gerrit.wikimedia.org/r/969353 (https://phabricator.wikimedia.org/T349970) (owner: RhinosF1)
[20:07:07] <logmsgbot>	 !log dancy@deploy2002 Started scap: Backport for [[gerrit:969353|namespaces:mediawiki: add Extensions/Skins as alias of Extension/Skin (+ tallk) (T349970)]]
[20:07:13] <stashbot>	 T349970: Add Extensions/Skins as an alias of Extension/Skin on Mediawikiwiki - https://phabricator.wikimedia.org/T349970
[20:07:40] <wikibugs>	 (Merged) jenkins-bot: eventgate chart - disable SYS_PTRACE on wmfdebug container [deployment-charts] - https://gerrit.wikimedia.org/r/969961 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[20:07:47] <icinga-wm>	 RECOVERY - Recursive DNS on 185.15.59.2 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:08:24] <logmsgbot>	 !log dancy@deploy2002 dancy and rhinosf1: Backport for [[gerrit:969353|namespaces:mediawiki: add Extensions/Skins as alias of Extension/Skin (+ tallk) (T349970)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:09:04] <dancy>	 RhinosF1: Lemme know when you've tested 
[20:09:51] <RhinosF1>	 dancy: lgtm but will need namespaceDupes.php
[20:10:10] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:10:33] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:55] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[20:11:15] <dancy>	 RhinosF1: Is that something that I need to run? If so, I'll need a complete command line.
[20:11:34] <RhinosF1>	 dancy: mwscript namespaceDupes.php mediawikiwiki
[20:11:47] <RhinosF1>	 after deploy
[20:11:54] <dancy>	 ok.. proceeding, then I'll run that.
[20:11:56] <logmsgbot>	 !log dancy@deploy2002 dancy and rhinosf1: Continuing with sync
[20:14:25] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:13] <wikibugs>	 (PS1) Ottomata: eventgate chart - separate config for wmfdebug container from nodejs profiler [deployment-charts] - https://gerrit.wikimedia.org/r/969963 (https://phabricator.wikimedia.org/T347477)
[20:16:19] <icinga-wm>	 RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:17:12] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate chart - separate config for wmfdebug container from nodejs profiler [deployment-charts] - https://gerrit.wikimedia.org/r/969963 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[20:17:17] <logmsgbot>	 !log dancy@deploy2002 Finished scap: Backport for [[gerrit:969353|namespaces:mediawiki: add Extensions/Skins as alias of Extension/Skin (+ tallk) (T349970)]] (duration: 10m 09s)
[20:17:21] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:17:22] <stashbot>	 T349970: Add Extensions/Skins as an alias of Extension/Skin on Mediawikiwiki - https://phabricator.wikimedia.org/T349970
[20:18:02] <dancy>	 https://www.irccloud.com/pastebin/53GhnMOJ/
[20:18:19] <wikibugs>	 (Merged) jenkins-bot: eventgate chart - separate config for wmfdebug container from nodejs profiler [deployment-charts] - https://gerrit.wikimedia.org/r/969963 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[20:18:57] <dancy>	 RhinosF1: Did that actually do anything?  Do I need to pass the `--fix` flag?
[20:19:04] <RhinosF1>	 dancy: do with --fix added please
[20:19:21] <dancy>	 https://www.irccloud.com/pastebin/AbFfW1TG/
[20:20:52] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3004.wikimedia.org with OS bookworm
[20:21:01] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3004.wikimedia.org with OS bookworm completed: - dns3004 (**PASS**)   - Downtimed on Icinga/Al...
[20:21:19] <RhinosF1>	 dancy: we can add --add-prefix=broken to fix Extension:Gadgets and then tag it for deletion, it's a redirect though anyway, i don't think it would cause harm to leave it
[20:21:44] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[20:22:37] <dancy>	 OK. I'll do whatever you recommend.
[20:23:35] <RhinosF1>	 dancy: i feel better not leaving inaccessible pages in db so I say do mwscript namespaceDupes.php --add-prefix=broken
[20:23:45] <RhinosF1>	 then --fix --add-prefix=broken
[20:23:53] <dancy>	 Ok
[20:24:19] <wikibugs>	 (PS1) Ottomata: eventgate chart - fix debug mode CLI args [deployment-charts] - https://gerrit.wikimedia.org/r/969964 (https://phabricator.wikimedia.org/T347477)
[20:24:24] <dancy>	 https://www.irccloud.com/pastebin/dnHBRTPG/
[20:25:03] <dancy>	 https://www.irccloud.com/pastebin/UfgWdGbs/
[20:25:13] <RhinosF1>	 dancy: all good
[20:25:19] <dancy>	 Awesome
[20:25:29] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate chart - fix debug mode CLI args [deployment-charts] - https://gerrit.wikimedia.org/r/969964 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[20:26:04] <RhinosF1>	 taavi: also thank you for deleting that in that 5ms so i didn't have to tag it
[20:26:08] <RhinosF1>	 dancy: have a good evening
[20:26:23] <taavi>	 :-P
[20:26:33] <wikibugs>	 (Merged) jenkins-bot: eventgate chart - fix debug mode CLI args [deployment-charts] - https://gerrit.wikimedia.org/r/969964 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[20:28:01] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[20:29:33] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[20:29:52] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[20:30:47] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:13] <wikibugs>	 (PS1) Urbanecm: Growth: Enable new Impact module on all Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/969966 (https://phabricator.wikimedia.org/T336203)
[20:34:43] <wikibugs>	 (PS1) Urbanecm: Growth: Disable new impact A/B testing on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203)
[20:34:53] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:31] <wikibugs>	 (CR) Urbanecm: [C: -2] "not yet, scheduled for Nov 01" [mediawiki-config] - https://gerrit.wikimedia.org/r/969966 (https://phabricator.wikimedia.org/T336203) (owner: Urbanecm)
[20:35:34] <wikibugs>	 (CR) Urbanecm: [C: -2] "not yet, scheduled for Nov 01" [mediawiki-config] - https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203) (owner: Urbanecm)
[20:43:27] <wikibugs>	 (PS1) BCornwall: Revert "hiera: remove dns3004 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969769
[20:43:29] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[20:43:42] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[20:43:46] <wikibugs>	 (CR) Ssingh: [C: +1] Revert "hiera: remove dns3004 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969769 (owner: BCornwall)
[20:44:09] <wikibugs>	 (PS1) Bking: kafka-jumbo: permit traffic from new search-loader VMs [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039)
[20:44:24] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[20:44:31] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[20:45:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:12] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[20:45:25] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[20:47:57] <wikibugs>	 (CR) BCornwall: [C: +2] Revert "hiera: remove dns3004 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969769 (owner: BCornwall)
[20:49:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:32] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:58:46] <wikibugs>	 (CR) Ebernhardson: [C: +1] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[20:58:49] <wikibugs>	 (CR) Ryan Kemper: [C: +1] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[20:59:02] <wikibugs>	 (CR) Ryan Kemper: [C: +1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[20:59:12] <wikibugs>	 (PS2) Bking: kafka-jumbo: permit traffic from new search-loader VMs [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T2100).
[21:00:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:57] <wikibugs>	 (CR) Bking: [C: +2] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[21:02:03] <wikibugs>	 (CR) Bking: [C: +2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[21:02:28] <wikibugs>	 (CR) Bking: kafka-jumbo: permit traffic from new search-loader VMs [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[21:04:18] <wikibugs>	 (CR) Bking: [C: +2] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[21:08:45] <sbassett>	 Hey all - have one quick update for PS.php I’d like to get out as part of the sec deploy window...
[21:19:13] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for search-loader[2001-2002].codfw.wmnet,search-loader[1001-1002].eqiad.wmnet
[21:19:14] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for search-loader[2001-2002].codfw.wmnet,search-loader[1001-1002].eqiad.wmnet
[21:19:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:05] <icinga-wm>	 PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-msearch-daemon@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:49] <sbassett>	 !log Deployed updated security mitigation for T348828
[21:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:37] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:34:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:39:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:40:56] <wikibugs>	 (PS1) Kimberly Sarabia: Deploy vector 2022 to non-English Wikibooks, etc [mediawiki-config] - https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544)
[21:48:44] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:03:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:33:51] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:36:14] <wikibugs>	 (CR) Jdlrobson: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544) (owner: Kimberly Sarabia)
[22:38:01] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:19:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:19:43] <icinga-wm>	 RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:55] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:24:52] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:29:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:33:21] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:37:35] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:48:43] <icinga-wm>	 RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:50:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[23:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:52:55] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:56:19] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye
[23:56:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye