[00:15:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:30:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968980 [00:38:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968980 (owner: TrainBranchBot) [00:45:54] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:24] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:56:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968980 (owner: TrainBranchBot) [00:56:48] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:13:05] (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) 
- https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:13:36] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [01:13:42] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:27:26] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [01:27:34] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:02] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:26] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:33:05] (ProbeDown) resolved: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:51:12] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [02:31:58] (KeyholderUnarmed) firing: (2) 1 unarmed 
Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [02:38:43] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:51:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:51:12] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [06:31:58] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:42:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:46:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:51:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:03:05] (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969533 [07:06:50] (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/969533 (owner: Marostegui) [07:07:34] (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969533 (owner: Marostegui) [07:08:05] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969533|ProductionServices.php: Promote pc1014 to pc1 master]] [07:09:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:10:52] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:12:38] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:13:46] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:14:28] (PS1) 
Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969360 [07:15:02] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:16:06] (PS1) Vgutierrez: hiera: Switch drmrs, eqsin and esams to digicert-2023 [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) [07:16:21] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969533|ProductionServices.php: Promote pc1014 to pc1 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:16:39] !log marostegui@deploy2002 marostegui: Continuing with sync [07:17:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:18:51] !log arm keyholder on acmechief2002 and deploy1002 [07:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:54] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/228/con" [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [07:21:42] (KeyholderUnarmed) resolved: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [07:22:09] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969533|ProductionServices.php: Promote pc1014 to pc1 master]] (duration: 14m 04s) [07:22:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at 
eqiad: 29.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:22:13] ^ my fault [07:22:18] I am reverting [07:22:21] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969360 (owner: Marostegui) [07:22:46] (CR) Vgutierrez: [V: +1] "chained certificates deployed on the cp servers look good:" [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [07:22:47] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969360|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] [07:22:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:24:01] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969360|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:24:03] !log marostegui@deploy2002 marostegui: Continuing with sync [07:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/200: 0.8172186946249829s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:27:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:29:20] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969360|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] (duration: 06m 33s) [07:30:23] (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969667 [07:32:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:32:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/200: 0.4157913965730752s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [07:34:51] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [07:35:16] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [07:41:54] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/969667 (owner: Marostegui) [07:42:34] (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master 
[mediawiki-config] - https://gerrit.wikimedia.org/r/969667 (owner: Marostegui) [07:42:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:42:47] (CR) Vgutierrez: [V: +1] "prefetched OCSP responses look healthy as well:" [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [07:43:20] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969667|ProductionServices.php: Promote pc1014 to pc1 master]] [07:44:32] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969667|ProductionServices.php: Promote pc1014 to pc1 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:44:37] !log marostegui@deploy2002 marostegui: Continuing with sync [07:46:54] !log disable puppet on cp hosts in esams, eqsin and drmrs before switching to the new unified digicert certificates - T341119 [07:49:57] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969667|ProductionServices.php: Promote pc1014 to pc1 master]] (duration: 06m 36s) [07:50:36] (CR) Vgutierrez: [V: +1 C: +2] hiera: Switch drmrs, eqsin and esams to digicert-2023 [puppet] - https://gerrit.wikimedia.org/r/969662 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [07:51:09] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969361 [07:51:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:52:39] !log depool cp5025 to 
perform some digicert-2023 related sanity checks - T341119 [07:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:59] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5025 is OK: SSL OK - OCSP staple validity for wikipedia.org has 531724 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [07:57:18] (CR) Filippo Giunchedi: [C: +2] sre: first iteration for otel-coll alerts [alerts] - https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: Filippo Giunchedi) [07:57:30] (PS2) WMDE-Fisch: Cleanup Kartographer Nearby flags [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) [07:58:45] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969361 (owner: Marostegui) [07:59:25] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/969361 (owner: Marostegui) [07:59:45] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:969361|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] [08:00:05] Amir1, Urbanecm, and taavi: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T0800). [08:00:05] WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:19] \o [08:00:23] (PS1) Marostegui: pc1011: Move pc1011 to pc2 [puppet] - https://gerrit.wikimedia.org/r/969670 [08:00:36] I can self serve though [08:00:59] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:969361|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:01:12] !log marostegui@deploy2002 marostegui: Continuing with sync [08:01:38] sure, although it seems marostegui is already deploying something not scheduled for this window? [08:01:43] Yes [08:01:46] Just saw that [08:01:58] taavi: I hoped I was finished before the window [08:02:00] It should be finished in a bit [08:02:17] I am deploying mediawiki_config only [08:02:28] To be able to upgrade parsercache kernels [08:04:03] ok, just let us know when you're done [08:04:30] yeah it is almost done [08:06:26] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:969361|Revert "ProductionServices.php: Promote pc1014 to pc1 master"]] (duration: 06m 41s) [08:06:35] taavi WMDE-Fisch all done! [08:06:38] sorry for the delay [08:06:45] !log repool cp5025 - T341119 [08:06:45] Nice, I'll take over then. 
[08:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:05] (PS3) WMDE-Fisch: Cleanup Kartographer Nearby flags [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) [08:08:18] (PS2) Marostegui: pc1014: Move pc1014 to pc2 [puppet] - https://gerrit.wikimedia.org/r/969670 [08:08:46] (CR) Marostegui: [C: +2] pc1014: Move pc1014 to pc2 [puppet] - https://gerrit.wikimedia.org/r/969670 (owner: Marostegui) [08:09:19] (CR) TrainBranchBot: [C: +2] "Approved by wmde-fisch@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: WMDE-Fisch) [08:10:03] !log triggering a puppet run on cp hosts in esams, eqsin and drmrs to switch to the new unified digicert certificates - T341119 [08:10:05] (Merged) jenkins-bot: Cleanup Kartographer Nearby flags [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: WMDE-Fisch) [08:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:17] !log wmde-fisch@deploy2002 Started scap: Backport for [[gerrit:966520|Cleanup Kartographer Nearby flags (T332785)]] [08:10:22] T332785: Remove custom old nearby functionality for Wikivoyage from Kartographer - https://phabricator.wikimedia.org/T332785 [08:11:33] !log wmde-fisch@deploy2002 wmde-fisch: Backport for [[gerrit:966520|Cleanup Kartographer Nearby flags (T332785)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:12:26] !log wmde-fisch@deploy2002 wmde-fisch: Continuing with sync [08:13:53] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6005 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530710 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [08:13:57] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6003 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530707 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:13:57] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6002 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530707 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:14:11] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6004 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530692 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:14:30] sorry about the RECOVERY flood :) [08:14:39] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6001 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530665 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:15:19] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6007 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530625 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:15:45] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6006 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530598 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid 
until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:16:25] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6009 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530558 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:17:01] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6012 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530522 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:17:03] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6011 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530520 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:17:51] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6013 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530473 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:17:53] !log wmde-fisch@deploy2002 Finished scap: Backport for [[gerrit:966520|Cleanup Kartographer Nearby flags (T332785)]] (duration: 07m 35s) [08:17:55] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6014 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530468 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:17:58] T332785: Remove custom old nearby functionality for Wikivoyage from 
Kartographer - https://phabricator.wikimedia.org/T332785 [08:18:49] I'm done. :-) [08:19:09] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5017 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530395 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:19:27] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6016 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530376 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:19:45] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5019 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530358 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:19:55] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5020 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530349 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:20:19] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530324 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:21:23] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5021 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530260 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 
2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:21:35] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5024 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530249 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:21:37] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5023 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530246 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:22:39] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6008 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530185 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:22:39] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5018 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530184 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:22:39] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5022 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530184 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:23:07] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5029 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530156 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid 
until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:23:23] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5030 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530140 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:23:35] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5027 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530129 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:23:37] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5026 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530127 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:23:51] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5028 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530112 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:25:01] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3066 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530043 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:25:01] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp6015 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530043 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) 
valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:25:05] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3068 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530039 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:25:07] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5032 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530036 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:25:25] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp5031 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530018 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:25:29] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3067 is OK: SSL OK - OCSP staple validity for wikipedia.org has 530014 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:26:21] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3069 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529962 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:26:21] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3072 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529962 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org 
(RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:26:23] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3070 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529961 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:26:23] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3071 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529960 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:26:49] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3073 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529935 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:27:45] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3077 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529879 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:27:47] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3076 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529876 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:27:59] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3074 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529864 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate 
*.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:28:15] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3075 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529848 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:28:19] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3078 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529844 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:28:28] (03PS1) 10Majavah: base: puppet_alert: fix error message [puppet] - 10https://gerrit.wikimedia.org/r/969677 [08:28:33] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529830 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:29:13] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3079 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529790 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:29:17] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3080 is OK: SSL OK - OCSP staple validity for wikipedia.org has 529786 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-10-16 23:59:59 +0000 (expires in 352 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:29:51] !log switched to digicert-2023 in esams, eqsin and drmrs - T341119 
[08:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:56] (03CR) 10Slyngshede: "Okay to merge this, as we fixed the monitoring in Prometheus/alertmanager?" [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [08:38:47] (03PS1) 10Majavah: aptrepo: set Auto-Submitted header on reprepro change emails [puppet] - 10https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835) [08:47:30] (03PS1) 10Filippo Giunchedi: thanos: set cgroup memory limits for query components [puppet] - 10https://gerrit.wikimedia.org/r/969683 (https://phabricator.wikimedia.org/T349999) [08:56:12] (03PS6) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398) [09:03:35] (03CR) 10Brouberol: Enable the management of the skein certificate via Puppet (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [09:19:59] 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Vgutierrez) p:05Triage→03High We had two servers (cp1089 and cp3069) having purged issues over the weekend, after losing connection to the kafka cluster and logging: ` Oct 28 05:19:11 cp1089 pur... 
[09:26:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969677 (owner: 10Majavah) [09:26:22] (03CR) 10Majavah: [C: 03+2] base: puppet_alert: fix error message [puppet] - 10https://gerrit.wikimedia.org/r/969677 (owner: 10Majavah) [09:26:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [09:26:36] (03PS1) 10Filippo Giunchedi: prometheus: require replica_label to be set [puppet] - 10https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) [09:27:05] (03CR) 10CI reject: [V: 04-1] prometheus: require replica_label to be set [puppet] - 10https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: 10Filippo Giunchedi) [09:27:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835) (owner: 10Majavah) [09:28:01] (03CR) 10Slyngshede: [C: 03+2] P:monitoring remove remnants of checkpuppetrun [puppet] - 10https://gerrit.wikimedia.org/r/966532 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [09:28:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:28:15] (03PS2) 10Majavah: aptrepo: set Auto-Submitted header on reprepro change emails [puppet] - 10https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835) [09:29:06] (03CR) 10Majavah: [C: 03+2] aptrepo: set Auto-Submitted header on reprepro change emails [puppet] - 10https://gerrit.wikimedia.org/r/969679 (https://phabricator.wikimedia.org/T347835) (owner: 10Majavah) [09:29:08] (03CR) 10Filippo Giunchedi: "recheck" [puppet] 
- 10https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: 10Filippo Giunchedi) [09:33:08] (03PS7) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398) [09:33:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:36:05] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/230/con" [puppet] - 10https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: 10Filippo Giunchedi) [09:37:18] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "realm.pp needs to be loaded when site.pp does get loaded." 
[puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: 10Jbond) [09:38:57] spiky behaviour on mw app misc eqiad [09:39:02] in terms of latency [09:42:10] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@af33784] (releasing): (no justification provided) [09:42:50] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@af33784] (releasing): (no justification provided) (duration: 00m 40s) [09:51:13] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [09:54:03] (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/output/969373/225/" [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: 10Jbond) [09:58:44] (03PS1) 10Majavah: hieradata: fix cloudinfra webproxy password location [labs/private] - 10https://gerrit.wikimedia.org/r/969689 [09:58:50] (03PS1) 10Majavah: secret: dkim: move wmcs dkim keys to correct location [labs/private] - 10https://gerrit.wikimedia.org/r/969690 [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T1000) [10:00:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:02:09] (03PS1) 10Majavah: hieradata: add fake metricsinfra grafana password [labs/private] - 10https://gerrit.wikimedia.org/r/969691 [10:05:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:05:53] (03PS2) 10Ayounsi: [POC] Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 [10:05:55] (03PS1) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [10:05:59] (03PS1) 10Majavah: dynamicproxy: simplify redis replication code [puppet] - 10https://gerrit.wikimedia.org/r/969693 [10:06:31] (03CR) 10CI reject: [V: 04-1] dynamicproxy: simplify redis replication code [puppet] - 10https://gerrit.wikimedia.org/r/969693 (owner: 10Majavah) [10:06:35] (03CR) 10CI reject: [V: 04-1] Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 (owner: 10Ayounsi) [10:07:39] (03PS2) 10Majavah: dynamicproxy: simplify redis replication code [puppet] - 10https://gerrit.wikimedia.org/r/969693 [10:11:34] (03PS3) 10Majavah: dynamicproxy: simplify redis replication code [puppet] - 10https://gerrit.wikimedia.org/r/969693 [10:12:29] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/236/console" [puppet] - 10https://gerrit.wikimedia.org/r/969693 (owner: 10Majavah) [10:13:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:14:48] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - 
https://phabricator.wikimedia.org/T348520 (10DMburugu) I approve [10:15:13] (03PS2) 10Isabelle Hurbain-Palatin: Roll-out Parsoid Kartographer support for all English language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) [10:15:59] 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Fabfur) Adding, for complete information, that the list of hosts impacted with the same purged error this weekend were: - cp1078 - cp1089 - cp6005 - cp3069 [10:18:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:20:11] (03PS2) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [10:20:26] (03CR) 10Elukey: [C: 03+1] "One question - IIUC the percentage is a syntactic sugar to set the value in bytes that is some % of the main memory. 
If we set both units " [puppet] - 10https://gerrit.wikimedia.org/r/969683 (https://phabricator.wikimedia.org/T349999) (owner: 10Filippo Giunchedi) [10:22:43] (03PS1) 10Hashar: puppet_compiler: CORS header for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/969694 (https://phabricator.wikimedia.org/T350003) [10:23:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:24:29] (03CR) 10Hashar: "I have crafted this solely based on documentation, I haven't tested the Nginx config change." [puppet] - 10https://gerrit.wikimedia.org/r/969694 (https://phabricator.wikimedia.org/T350003) (owner: 10Hashar) [10:24:42] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Urbanecm_WMF) Thanks Dennis! @JMeybohm Hi Janis, I see you're on SRE clinic duty this week. This request now should have sponsorship from a WMF staff member (me) and approv... 
[10:28:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:19] (03Abandoned) 10Jbond: site.pp: rename site.pp so that it is loaded first [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: 10Jbond) [10:42:07] (03PS1) 10Ayounsi: Various changes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969696 [10:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:43:06] (03PS1) 10Jbond: idp_test: update acmechief_host [puppet] - 10https://gerrit.wikimedia.org/r/969697 (https://phabricator.wikimedia.org/T349918) [10:44:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/237/console" [puppet] - 10https://gerrit.wikimedia.org/r/969697 (https://phabricator.wikimedia.org/T349918) (owner: 10Jbond) [10:44:31] (03CR) 10Jbond: [C: 03+2] puppet_compiler: CORS header for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/969694 (https://phabricator.wikimedia.org/T350003) (owner: 10Hashar) [10:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] idp_test: update acmechief_host [puppet] - 10https://gerrit.wikimedia.org/r/969697 (https://phabricator.wikimedia.org/T349918) (owner: 10Jbond) [10:48:32] (03PS3) 10Ayounsi: [POC] Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 [10:48:34] (03PS3) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [10:48:38] (03PS2) 10Ayounsi: Various changes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969696 [10:51:12] (03PS1) 10Jbond: puppet_compiler: Add semicolon [puppet] - 10https://gerrit.wikimedia.org/r/969698 (https://phabricator.wikimedia.org/T350003) [10:51:16] (03PS3) 10Ayounsi: Various changes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969696 [10:52:55] (03CR) 10Jbond: [C: 03+2] puppet_compiler: Add semicolon [puppet] - 10https://gerrit.wikimedia.org/r/969698 (https://phabricator.wikimedia.org/T350003) (owner: 10Jbond) [10:58:15] 10SRE-OnFire, 10User-fgiunchedi: Deploy alerts-triage app to production - https://phabricator.wikimedia.org/T350014 (10fgiunchedi) [10:58:59] (03CR) 10Majavah: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: 10Filippo Giunchedi) [10:59:53] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: require replica_label to be set [puppet] - 10https://gerrit.wikimedia.org/r/969685 (https://phabricator.wikimedia.org/T350002) (owner: 10Filippo Giunchedi) [11:01:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:03:21] 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Vgutierrez) 
We need to work on purged Kafka consumer. I've already spotted the issue on our codebase [11:09:52] (03PS1) 10Jbond: idp-test: correct hostname [puppet] - 10https://gerrit.wikimedia.org/r/969700 (https://phabricator.wikimedia.org/T349915) [11:10:13] (03CR) 10Jbond: [C: 03+2] idp-test: correct hostname [puppet] - 10https://gerrit.wikimedia.org/r/969700 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:10:42] (03PS1) 10Jbond: Revert "realm: use puppet7 acmechief when on puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/969363 [11:10:58] (03CR) 10Jbond: [C: 03+2] Revert "realm: use puppet7 acmechief when on puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/969363 (owner: 10Jbond) [11:11:20] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Aklapper) Sign off by a WMF C-level staff [11:15:05] (03CR) 10Filippo Giunchedi: Enable support for statsd_exporters on non-ops instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [11:17:28] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: set cgroup memory limits for query components (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969683 (https://phabricator.wikimedia.org/T349999) (owner: 10Filippo Giunchedi) [11:18:01] (03PS2) 10Fabfur: Add version print option [software/purged] - 10https://gerrit.wikimedia.org/r/962670 (https://phabricator.wikimedia.org/T347839) [11:24:31] (03PS1) 10Jbond: acmechief: switch back to using puppet localcacert [puppet] - 10https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) [11:25:40] (03PS2) 10Jbond: acmechief: switch back to using puppet localcacert [puppet] - 10https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) [11:26:15] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969701 
(https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:26:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/238/con" [puppet] - 10https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:28:17] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: provisionning db1230.eqiad.wmnet - T344036 [11:28:23] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [11:28:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: provisionning db1230.eqiad.wmnet - T344036 [11:28:35] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: provisionning db1230.eqiad.wmnet - T344036 [11:28:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: provisionning db1230.eqiad.wmnet - T344036 [11:30:01] (03PS1) 10Jbond: acmechief2002: switch to localca cert [puppet] - 10https://gerrit.wikimedia.org/r/969702 (https://phabricator.wikimedia.org/T349915) [11:30:16] (03CR) 10Jbond: [C: 03+2] acmechief2002: switch to localca cert [puppet] - 10https://gerrit.wikimedia.org/r/969702 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:31:13] (SwiftObjectCountSiteDisparity) resolved: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:33:33] (03PS1) 10Filippo Giunchedi: sre: ignore pint promql/series checks for otel-coll [alerts] - 10https://gerrit.wikimedia.org/r/969703 (https://phabricator.wikimedia.org/T345712) [11:34:02] !log arnaudb@cumin1001 dbctl commit 
(dc=all): 'Adding db1230 depooled, depooling db1130', diff saved to https://phabricator.wikimedia.org/P53064 and previous config saved to /var/cache/conftool/dbconfig/20231030-113401-arnaudb.json [11:36:10] (03PS1) 10Filippo Giunchedi: sre: ignore promql/series for SystemdUnitCrashLoop [alerts] - 10https://gerrit.wikimedia.org/r/969704 (https://phabricator.wikimedia.org/T293970) [11:36:19] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: ignore pint promql/series checks for otel-coll [alerts] - 10https://gerrit.wikimedia.org/r/969703 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [11:37:32] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: ignore promql/series for SystemdUnitCrashLoop [alerts] - 10https://gerrit.wikimedia.org/r/969704 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [11:46:16] (03PS1) 10Arnaudb: mariadb: add a new host (db1230) [puppet] - 10https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) [11:47:48] (03PS2) 10Arnaudb: mariadb: add a new host (db1230) [puppet] - 10https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) [11:48:20] (03PS3) 10Arnaudb: mariadb: add a new host (db1230) [puppet] - 10https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) [11:48:46] (03PS4) 10Arnaudb: mariadb: add a new host (db1230) [puppet] - 10https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) [11:49:03] (03PS1) 10Jbond: acmechief_host: drop this value as a global [puppet] - 10https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) [11:49:17] (03CR) 10Marostegui: [C: 03+1] mariadb: add a new host (db1230) [puppet] - 10https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:49:34] (03CR) 10Arnaudb: [C: 03+2] mariadb: add a new host (db1230) [puppet] - 10https://gerrit.wikimedia.org/r/968984 (https://phabricator.wikimedia.org/T344036) (owner: 
10Arnaudb) [11:49:41] (03CR) 10CI reject: [V: 04-1] acmechief_host: drop this value as a global [puppet] - 10https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:51:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:52:20] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1130.eqiad.wmnet onto db1230.eqiad.wmnet [11:56:11] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:08] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:28] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:08:04] (03PS1) 10Marostegui: db1217: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/969708 (https://phabricator.wikimedia.org/T349090) [12:09:14] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:10:08] (03CR) 10Marostegui: [C: 03+2] db1217: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/969708 (https://phabricator.wikimedia.org/T349090) (owner: 10Marostegui) [12:10:41] (03PS2) 10Jbond: acmechief_host: drop this value as a global [puppet] - 10https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) [12:11:19] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.reimage for host db1217.eqiad.wmnet with OS bookworm [12:13:08] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:13:12] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:13:16] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:13:24] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:18:10] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:20:36] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:24:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage [12:25:23] ^ all those expected [12:26:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage [12:27:40] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:27:40] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:28:23] (03Abandoned) 10Ayounsi: Various changes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969696 (owner: 10Ayounsi) [12:28:25] (03PS4) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [12:28:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'New host', diff saved to 
https://phabricator.wikimedia.org/P53065 and previous config saved to /var/cache/conftool/dbconfig/20231030-122855-marostegui.json [12:29:56] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:31:36] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:33:10] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:34:40] (03PS1) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) [12:34:42] (03PS1) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) [12:35:29] (03PS1) 10Slyngshede: P:monitoring remove remainders of check_eth. 
[puppet] - 10https://gerrit.wikimedia.org/r/969721 [12:37:20] 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati) [12:37:42] (03CR) 10CI reject: [V: 04-1] acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [12:37:58] (03CR) 10CI reject: [V: 04-1] acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [12:38:02] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Traffic, and 2 others: find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) [12:39:43] (03PS2) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) [12:39:44] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:39:44] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:39:45] (03PS2) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) [12:39:50] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:39:52] 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati) [12:40:09] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Urbanecm_WMF) >>! 
In T348520#9290724, @Aklapper wrote: > Sign off by a WMF C-level staff While that is indeed currently a part of the [relevant docs](https://wikitech.wikim... [12:42:30] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:42:33] (03CR) 10CI reject: [V: 04-1] acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [12:42:35] (03CR) 10CI reject: [V: 04-1] acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [12:42:36] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:00] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:02] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:45:17] (03CR) 10Jbond: acme_chief::cert: remove style violation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [12:47:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1217.eqiad.wmnet with OS bookworm [12:48:10] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:48:10] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:49:25] (03PS1) 10Marostegui: Revert "db1217: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/969364 [12:49:57] (03CR) 10Marostegui: [C: 03+2] Revert "db1217: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/969364 (owner: 10Marostegui) [12:51:41] (03PS3) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) [12:51:43] (03PS3) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) [12:52:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:54:05] (03CR) 10Jgreen: [V: 03+2 C: 03+1] Add dummy secrets for community_civicrm [labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [12:54:08] (03CR) 10Jgreen: [V: 03+2 C: 03+2] Add dummy secrets for community_civicrm [labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [12:54:35] (03CR) 10CI reject: [V: 04-1] acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [12:54:45] (03CR) 10CI reject: [V: 04-1] acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [12:55:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1130.eqiad.wmnet onto db1230.eqiad.wmnet [12:57:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:15] welcome, hour-early-window! [13:01:05] ah daylight savings ended [13:02:08] yep, we moved to daylight confusion instead [13:14:30] (03PS5) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [13:14:32] (03PS1) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 [13:17:31] (03PS4) 10Jbond: acme_chief::cert: remove style violation [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) [13:17:33] (03PS4) 10Jbond: acme_chief: override the acme_chief host for puppet7 nodes [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) [13:23:24] (03Abandoned) 10Jbond: environment: fix SC3033 [puppet] - 10https://gerrit.wikimedia.org/r/969075 (owner: 10Jbond) [13:24:59] (03PS10) 10Brouberol: Enable the management of the skein certificate via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) [13:25:01] (03PS9) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398) [13:26:42] (03PS11) 10Brouberol: Enable the management of the skein certificate via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/968612 
(https://phabricator.wikimedia.org/T329398) [13:26:44] (03PS10) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398) [13:27:26] (03PS12) 10Brouberol: Enable the management of the skein certificate via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) [13:27:28] (03PS11) 10Brouberol: Enable the management of the skein certificate via Puppet on one instance [puppet] - 10https://gerrit.wikimedia.org/r/968613 (https://phabricator.wikimedia.org/T329398) [13:27:58] (03CR) 10Brouberol: Enable the management of the skein certificate via Puppet (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:39:29] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969706 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [13:39:46] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969719 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [13:40:04] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969720 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [13:47:44] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:47:53] oh right [13:48:04] (re daylight confusion, that is ^^) [13:48:13] (03CR) 10Bking: [C: 03+2] search-loader: use default system python [puppet] - 10https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [13:48:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Create cookbook to migrate servers from the puppetmasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10jbond) [13:49:08] 10SRE, 10SRE-tools, 
10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) 05Open→03In progress p:05Triage→03Medium [13:54:27] (03PS2) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 [13:54:29] (03PS1) 10Ayounsi: provision_server: don't show servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752 [13:55:38] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [13:58:24] (03PS1) 10Jcrespo: dbbackups: Switchover master from db1164 to db1119 [puppet] - 10https://gerrit.wikimedia.org/r/969753 (https://phabricator.wikimedia.org/T350022) [13:59:11] (03CR) 10Jcrespo: [C: 04-1] "Do not deploy until Manuel says so." [puppet] - 10https://gerrit.wikimedia.org/r/969753 (https://phabricator.wikimedia.org/T350022) (owner: 10Jcrespo) [14:00:04] (03PS4) 10Ayounsi: Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 [14:00:06] (03PS6) 10Ayounsi: Ask for port # and type instead of interface name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969692 [14:00:08] (03PS3) 10Ayounsi: provision_server: make switch selection optional [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969749 [14:00:10] (03PS2) 10Ayounsi: provision_server: don't show servers with a primary IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969752 [14:01:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:06:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [14:06:58] (03CR) 10Filippo Giunchedi: [C: 
03+1] "Neat! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969721 (owner: 10Slyngshede) [14:07:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:10:37] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [14:12:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:16:15] (03PS1) 10Bking: search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969754 (https://phabricator.wikimedia.org/T346039) [14:18:31] (03PS1) 10Elukey: services: update the ChangeProp staging's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969757 (https://phabricator.wikimedia.org/T348950) [14:20:37] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [14:20:45] (03PS1) 10Elukey: services: update ChangeProp's eqiad Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969758 (https://phabricator.wikimedia.org/T348950) [14:26:26] elukey: the logstash indexing failures are from cp in staging :( [14:26:31] (03PS2) 10Bking: search-loader: Bring new hosts into service [puppet] - 
10https://gerrit.wikimedia.org/r/969754 (https://phabricator.wikimedia.org/T346039) [14:26:36] ah snap, lovely [14:26:41] i.e. "message" is json [14:26:44] "message"=>{"message"=>"[thrd:GroupCoordinator]: GroupCoordinator/1001: Sent HeartbeatRequest (v1, 109 bytes @ 0, CorrId 613)", "severity"=>7, "fac"=>"SEND"}, [14:26:47] etc [14:27:05] (03CR) 10Elukey: [C: 03+2] services: update the ChangeProp staging's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/969757 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [14:27:32] godog: going to shutoff debug msgs for librdkafka in a sec [14:27:56] anything that we can do to make them digestible? [14:28:01] I can't control their format sadly [14:28:13] (they are generated by librdkafka via another nodejs lib) [14:29:00] (03PS1) 10Jbond: sre.ganeti.makevm: Add pppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) [14:29:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:29:10] mmhh good question, the first thing that comes to mind is not having json in 'message', maybe wrap it as text [14:30:16] (03Abandoned) 10Bking: search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969754 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [14:31:28] (03PS1) 10Bking: search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039) [14:31:58] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [14:32:12] !log elukey@deploy2002 helmfile [staging] DONE 
helmfile.d/services/changeprop: sync [14:32:40] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [14:33:37] (03CR) 10DCausse: [C: 03+1] search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [14:33:39] (03CR) 10Peter Fischer: [C: 03+1] "LGTM, as far as I can tell" [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [14:34:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:34:13] (03CR) 10Bking: [C: 03+2] search-loader: Bring new hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/969761 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [14:34:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:35:37] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [14:36:52] !log bking@search-loader2001 disabling services as part of bullseye migration T346039 [14:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:59] T346039: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 [14:37:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:37:19] elukey: the indexing errors are gone btw, last was at 14:32:10 [14:37:52] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039 [14:37:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:38:16] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader2001.codfw.wmnet with reason: T346039 [14:38:44] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:04] godog: yes I removed the debug logging [14:39:13] I'll try to come up with a different solution [14:39:26] ack, thanks [14:41:13] !log bking@deploy2002 Started deploy [search/mjolnir/deploy@daf8c32]: T346039 [14:41:18] !log bking@deploy2002 Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 05s) [14:42:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:42:22] godog: in theory https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/969765 should fix [14:42:25] does it make sense? [14:43:44] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:06] elukey: yes LGTM [14:45:15] of course I forgot a ) [14:45:16] sigh [14:46:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:46:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:50:59] (03PS1) 10Jbond: puppet7: Add a motd to inform users a host has been migrated to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969789 (https://phabricator.wikimedia.org/T349619) [14:51:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:52:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969789 
(https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:53:44] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:32] 10SRE, 10DNS, 10Traffic: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10NMariano-WMF) The ITS System team will set this up and manage permissions for Noah Israel (@nisrael) and Danny Bu (@DBu-WMF). [14:56:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:00:56] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet7: Add a motd to inform users a host has been migrated to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969789 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:01:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:01:54] bking you happy for me to merge your change [15:02:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/969761 [15:04:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:04:53] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) a:05dcaro→03Andrew [15:06:16] (03CR) 10Krinkle: [C: 03+1] profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [15:09:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:13:14] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [15:14:37] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [15:14:42] (03PS1) 10Jbond: puppet::agent: correct white space in motd [puppet] - 10https://gerrit.wikimedia.org/r/969793 [15:14:58] (03CR) 10Jbond: [C: 03+2] puppet::agent: correct white space in motd [puppet] - 10https://gerrit.wikimedia.org/r/969793 (owner: 10Jbond) [15:19:59] (PuppetFailure) firing: Puppet has failed on search-loader1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:20:39] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10VRiley-WMF) cloudvirt-wdqs1003 has been relocated cloudvirt-wdqs1003 - C 8. U 21. 
port 18. CableID 4015 Side note, we had to use a 1 Gig connection sinc... [15:21:11] (03PS2) 10Jbond: sre.ganeti.makevm: Add puppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) [15:21:21] (03CR) 10Jbond: [C: 03+2] sre.ganeti.makevm: Add puppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [15:21:23] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1003 [15:23:20] jouncebot: next [15:23:21] In 0 hour(s) and 6 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T1530) [15:24:06] (03PS1) 10Jbond: builder: migrate role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969795 (https://phabricator.wikimedia.org/T349619) [15:24:45] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [15:25:07] (03CR) 10Jbond: [C: 03+2] builder: migrate role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969795 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:25:53] (03Merged) 10jenkins-bot: sre.ganeti.makevm: Add puppet-version arguments to makevm [cookbooks] - 10https://gerrit.wikimedia.org/r/969760 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [15:27:17] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [15:29:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt-wdqs1003 [15:29:33] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1003 - taavi@cumin1001" [15:30:22] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1003 - taavi@cumin1001" [15:30:23] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:12] !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1003 [15:33:52] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikiworld.sidl-corporation.fr - https://phabricator.wikimedia.org/T349985 (10Aklapper) 05Open→03Declined Hi @SIDLCorporation, thanks for taking the time to report this. The three fields above are not filled out, so for now I am going to decline this ticket. Ple... [15:33:57] !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt-wdqs1003 [15:40:49] (03PS1) 10Jbond: cluster::unprivmanagement: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969800 (https://phabricator.wikimedia.org/T349619) [15:41:14] (03CR) 10Jbond: [C: 03+2] cluster::unprivmanagement: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969800 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:42:53] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm [15:43:06] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm [15:43:18] (03PS1) 10Giuseppe Lavagetto: modules: add job 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/969801 [15:43:22] (03PS1) 10Giuseppe Lavagetto: modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802 [15:43:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [15:43:48] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: 
cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [15:43:58] (03PS2) 10Giuseppe Lavagetto: modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802 [15:44:31] (03CR) 10CI reject: [V: 04-1] modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802 (owner: 10Giuseppe Lavagetto) [15:45:22] (03CR) 10CI reject: [V: 04-1] modules: fix app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/969802 (owner: 10Giuseppe Lavagetto) [15:48:09] (03PS1) 10Jbond: config_master: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969803 (https://phabricator.wikimedia.org/T349619) [15:48:30] (03CR) 10Jbond: [C: 03+2] config_master: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969803 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:49:27] !log move config_master to puppet7 [15:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:36] !log move cluster::unprivmanagement to puppet7 [15:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:45] !log move builder to puppet7 [15:49:48] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10Papaul) @cmooney cable is place from mr1-codfw ge0/0/3 to lsw1-a2-codfw ge-0/0/47 ID 00745 [15:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:10] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [15:51:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:51:32] (03PS10) 10Herron: profile::mediawiki::common: include 
prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) [15:51:39] (03PS14) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) [15:51:40] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [15:51:46] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [15:53:44] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:55:57] (03PS1) 10Jbond: failoid: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969806 (https://phabricator.wikimedia.org/T349619) [15:55:59] !log migrate failoid to puppet7 [15:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:15] (03CR) 10Jbond: [C: 03+2] failoid: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969806 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:56:40] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudvirt-wdqs1003 - taavi@cumin1001" [15:57:40] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudvirt-wdqs1003 - taavi@cumin1001" [15:58:22] (03PS1) 10Majavah: hieradata: update cloudvirt-wdqs1003 network config [puppet] - 10https://gerrit.wikimedia.org/r/969807 [15:58:23] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host 
reimage [15:59:11] (03CR) 10Majavah: [C: 03+2] hieradata: update cloudvirt-wdqs1003 network config [puppet] - 10https://gerrit.wikimedia.org/r/969807 (owner: 10Majavah) [15:59:33] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:43] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:33] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host reimage [16:03:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:56] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host ganeti-test1002.eqiad.wmnet [16:04:13] !log migrate ganeti-test1002.eqiad.wmnet to puppet7 [16:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:16] (03PS1) 10Jbond: ganeti-test1002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969810 (https://phabricator.wikimedia.org/T349619) [16:05:29] (03CR) 10Jbond: [C: 03+2] ganeti-test1002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969810 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [16:07:40] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [16:07:44] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: 
- cp1103 (**FAIL**) - Removed from Puppet... [16:07:56] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [16:08:02] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [16:09:01] (03PS11) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:09:58] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:10:00] (03PS2) 10Majavah: openstack: nova: add a dependency on libvirt-clients [puppet] - 10https://gerrit.wikimedia.org/r/969299 [16:10:59] (03PS4) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) [16:11:05] (03PS5) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) [16:13:48] (03CR) 10Majavah: [C: 03+2] openstack: nova: add a dependency on libvirt-clients [puppet] - 10https://gerrit.wikimedia.org/r/969299 (owner: 10Majavah) [16:14:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ganeti-test1002.eqiad.wmnet [16:15:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:09] !log migrate O:ganeti_test to puppet7 [16:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:03] (03PS1) 10Jbond: ganeti_test: migrate to 
puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969812 (https://phabricator.wikimedia.org/T349619) [16:17:54] (03CR) 10Jbond: [C: 03+2] ganeti_test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969812 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [16:19:54] (03PS1) 10Vgutierrez: reprepro: Fix haproxy component names for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969814 [16:21:10] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001" [16:21:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969814 (owner: 10Vgutierrez) [16:22:02] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001" [16:22:03] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm [16:22:09] (03CR) 10Vgutierrez: [C: 03+2] reprepro: Fix haproxy component names for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969814 (owner: 10Vgutierrez) [16:22:15] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:19] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:23:04] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [16:23:33] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:24:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [16:25:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [16:26:17] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [16:26:44] 10SRE, 10DNS, 10Traffic: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10ssingh) Hi, this is for wikimedia.org, correct? [16:28:27] 10SRE, 10DNS, 10Traffic: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10NMariano-WMF) Correct [16:34:07] (03PS1) 10Ssingh: wikimedia.org: update google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/969816 (https://phabricator.wikimedia.org/T349942) [16:34:08] I'm seeing inconsistent server errors from phabricator [16:34:31] Talking about the MySQL server going away [16:35:24] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi) [16:38:05] PROBLEM - haproxy process on cp4052 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [16:38:17] PROBLEM - Check systemd state on cp4052 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service,haproxy_stek_job.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:37] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [16:38:43] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect 
or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [16:39:12] ^ will downtime this, host is depooled [16:39:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp4052.ulsfo.wmnet with reason: depooled, reimaging [16:39:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp4052.ulsfo.wmnet with reason: depooled, reimaging [16:42:04] (03PS12) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:42:51] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:43:01] (03PS1) 10Majavah: aptrepo: cleanup haproxy update and component names [puppet] - 10https://gerrit.wikimedia.org/r/969819 [16:44:37] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:55] (03PS1) 10Vgutierrez: reprepro: Fix haproxy components name for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969821 [16:49:43] (03CR) 10Majavah: [C: 03+1] reprepro: Fix haproxy components name for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969821 (owner: 10Vgutierrez) [16:50:26] (03CR) 10Vgutierrez: [C: 03+2] reprepro: Fix haproxy components name for bullseye & bookworm [puppet] - 10https://gerrit.wikimedia.org/r/969821 (owner: 10Vgutierrez) [16:50:34] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: update google-site-verification [dns] - 10https://gerrit.wikimedia.org/r/969816 (https://phabricator.wikimedia.org/T349942) (owner: 10Ssingh) [16:51:01] !log running authdns-update for CR 969816 [16:51:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:21] Dreamy_Jazz: same (reload usually fixes it, but given that the baseline is “I’ve never ever seen this error before”…) [16:52:09] have you filed a task about it? [16:53:01] screenshot here https://tmp.lucaswerkmeister.de/phabricator-unhandled-exception.png [16:53:12] sure I’ll file a task [16:53:15] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:32] i think there's already one [16:54:00] https://phabricator.wikimedia.org/T349961 is from saturday apparently [16:54:06] guess that’s close enough to be the same, yeah [16:54:51] commented there [16:56:58] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: DNS Update, Google Postmaster Tools - https://phabricator.wikimedia.org/T349942 (10ssingh) 05Open→03Resolved a:03ssingh wikimedia.org. 600 IN TXT "google-site-verification=uzfgD0YiIqSQgRdSQXlkA7NByyyOZDp-n0SZ3nozpDM" [16:57:21] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:59] (03PS13) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:58:53] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [17:04:34] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:05:34] (03PS1) 10BCornwall: hiera: remove dns3003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969931 (https://phabricator.wikimedia.org/T342154) [17:05:36] (03PS14) 10Effie Mouzeli: ipoid: Update cronjob definition 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [17:05:51] (03CR) 10Ssingh: [C: 03+1] hiera: remove dns3003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969931 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:06:07] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns3003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969931 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:06:44] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [17:09:25] (03PS1) 10Jbond: pki::root: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969932 (https://phabricator.wikimedia.org/T349619) [17:09:59] (PuppetFailure) firing: (3) Puppet has failed on search-loader1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:10:06] !log migrate pki::root to puppet7 [17:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:23] (03CR) 10Jbond: [C: 03+2] pki::root: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969932 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [17:10:53] (03PS15) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [17:12:21] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3003.wikimedia.org with OS bookworm [17:12:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host 
dns3003.wikimedia.org with OS bookworm [17:14:44] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [17:15:23] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:15:43] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:16:46] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [17:16:52] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL**) - Removed from Puppet... [17:19:32] (JobUnavailable) firing: (4) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:20:31] PROBLEM - Host 2a02:ec80:300:2:185:15:59:34 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:300:2:185:15:59:34) [17:21:53] PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:21:57] ^ expected [17:22:48] !log migrate pki2002 to puppet7 [17:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:15] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host pki2002.codfw.wmnet [17:23:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:23:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:23:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:24:35] (03PS1) 10Jbond: pki2002: switch to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969935 (https://phabricator.wikimedia.org/T349619) [17:25:06] (03CR) 10Jbond: [C: 03+2] pki2002: switch to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969935 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [17:25:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:27:40] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host pki2002.codfw.wmnet [17:28:19] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:28:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.662 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:28:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:33:44] (JobUnavailable) firing: (4) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:38:14] RECOVERY - Check systemd state on cp4052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:51] (03PS1) 10Jbond: pki::multiroot: convert to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969937 (https://phabricator.wikimedia.org/T349619) [17:39:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:39:11] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3003.wikimedia.org with reason: host reimage [17:39:45] (03CR) 10Jbond: [C: 03+2] pki::multiroot: convert to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969937 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [17:40:04] !log migrate pki::multirootca to puppet7 [17:40:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3003.wikimedia.org with reason: host reimage [17:44:04] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:46:06] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 393233 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2024-01-19 05:55:13 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:46:18] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 220421 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2024-01-19 05:54:59 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:46:36] RECOVERY - haproxy process on cp4052 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [17:47:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:47:08] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:15] 10SRE, 
10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [17:49:07] (03PS1) 10Jbond: test: move test role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969940 (https://phabricator.wikimedia.org/T349619) [17:50:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bookworm [17:50:49] (03CR) 10Jbond: [C: 03+2] test: move test role to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969940 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [17:51:02] PROBLEM - Recursive DNS on 185.15.59.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:53:37] 10SRE, 10API Platform, 10MediaWiki-REST-API, 10Traffic, and 2 others: Use relative URLs in redirects emitted by rest.php - https://phabricator.wikimedia.org/T349001 (10daniel) 05Open→03Resolved a:03daniel [17:54:22] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:56:18] !log migrate bastionhost to puppet7 [17:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:58] RECOVERY - Recursive DNS on 185.15.59.34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:57:32] (03PS1) 10Jbond: bastionhost: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969942 (https://phabricator.wikimedia.org/T349619) [17:57:53] (03CR) 10Jbond: [C: 03+2] bastionhost: migrate to 
puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969942 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [17:58:04] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:00:42] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [18:03:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:07:11] 10SRE, 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10VRiley-WMF) Hi, here is a list of C 8 servers that seem to be apart of the discrepancy cloudswift1001 - dual (one port is dark) cloudvirt1027 - dual cloudvirt1026 - dual clou... 
[18:08:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:09:20] (03PS16) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [18:10:46] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [18:10:52] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [18:11:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:11:36] (03CR) 10Effie Mouzeli: [V: 04-1] ipoid: Update cronjob definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [18:11:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [18:14:37] !log bking@deploy2002 Started deploy [search/mjolnir/deploy@daf8c32]: T346039 [18:14:44] !log bking@deploy2002 Finished deploy [search/mjolnir/deploy@daf8c32]: T346039 (duration: 00m 06s) [18:15:00] (03CR) 10Herron: [C: 03+2] profile::mediawiki::common: include prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 
(https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [18:15:05] (03CR) 10Herron: [C: 03+2] profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [18:16:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:18:13] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader[1001-1002].eqiad.wmnet with reason: T346039 [18:18:28] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader[1001-1002].eqiad.wmnet with reason: T346039 [18:19:45] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039 [18:19:46] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:10] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on search-loader[2001-2002].codfw.wmnet with reason: T346039 [18:22:32] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [18:22:37] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL**) - Downtimed on Icinga/... 
[18:23:58] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:42] !log racadm racreset cp1103.eqiad.wmnet [18:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:26:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [18:27:33] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ping_offload [18:27:38] !log migrate ping_offload to puppet7 [18:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:01] (03PS1) 10Jbond: ping_offload: switch to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969945 (https://phabricator.wikimedia.org/T349619) [18:29:41] (03CR) 10Jbond: [C: 03+2] ping_offload: switch to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969945 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [18:30:01] (03CR) 10Herron: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [18:30:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:30:46] PROBLEM - Check systemd state on pki2002 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-aux.service,cfssl-ocsprefresh-aux_front_proxy.service,cfssl-ocsprefresh-cassandra.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-debmonitor.service,cfssl-ocsprefresh-discovery.service,cfssl-ocsprefresh-dse.service,cfssl-ocsprefresh-dse_front_proxy. [18:30:46] cfssl-ocsprefresh-etcd.service,cfssl-ocsprefresh-kafka.service,cfssl-ocsprefresh-mlserve.service,cfssl-ocsprefresh-mlserve_front_proxy.service,cfssl-ocsprefresh-mlserve_staging.service,cfssl-ocsprefresh-mlserve_staging_front_proxy.service,cfssl-ocsprefresh-network_devices.service,cfssl-ocsprefresh-syslog.service,cfssl-ocsprefresh-wikikube.service,cfssl-ocsprefresh-wikikube_front_proxy.service,cfssl-ocsprefresh-wikikube_staging.service,cfs [18:30:46] efresh-wikikube_staging_front_proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:28] (WidespreadPuppetFailure) firing: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:31:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [18:33:32] !log jbond@cumin1001 END (PASS) - 
Cookbook sre.puppet.migrate-role (exit_code=0) for role: ping_offload [18:34:40] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [18:34:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [18:35:14] (03PS1) 10Bking: search-loader: removed unneeded package dep [puppet] - 10https://gerrit.wikimedia.org/r/969947 (https://phabricator.wikimedia.org/T346039) [18:35:51] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [18:36:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [18:36:29] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:36:50] WidespreadPuppetFailure looks like a race condition related to my recent patch, but the subsequent puppet run succeeds. should clear on its own. 
keeping an eye on it [18:37:06] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:37:28] (03CR) 10Ebernhardson: [C: 03+1] search-loader: removed unneeded package dep [puppet] - 10https://gerrit.wikimedia.org/r/969947 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [18:37:50] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:38:06] (03CR) 10Bking: [C: 03+2] search-loader: removed unneeded package dep [puppet] - 10https://gerrit.wikimedia.org/r/969947 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [18:38:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3003.wikimedia.org with OS bookworm [18:38:26] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3003.wikimedia.org with OS bookworm completed: - dns3003 (**PASS**) - Downtimed on Icinga/Al... 
[18:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:43:44] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:44:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:44:50] (03PS1) 10BCornwall: Revert "hiera: remove dns3003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/969768 [18:45:04] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-aux.service,cfssl-ocsprefresh-aux_front_proxy.service,cfssl-ocsprefresh-cassandra.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-debmonitor.service,cfssl-ocsprefresh-discovery.service,cfssl-ocsprefresh-dse.service,cfssl-ocsprefresh-dse_front_proxy. 
[18:45:04] cfssl-ocsprefresh-etcd.service,cfssl-ocsprefresh-kafka.service,cfssl-ocsprefresh-mlserve.service,cfssl-ocsprefresh-mlserve_front_proxy.service,cfssl-ocsprefresh-mlserve_staging.service,cfssl-ocsprefresh-mlserve_staging_front_proxy.service,cfssl-ocsprefresh-network_devices.service,cfssl-ocsprefresh-syslog.service,cfssl-ocsprefresh-wikikube.service,cfssl-ocsprefresh-wikikube_front_proxy.service,cfssl-ocsprefresh-wikikube_staging.service,cfs [18:45:04] efresh-wikikube_staging_front_proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:40] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns3003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/969768 (owner: 10BCornwall) [18:47:06] (03PS1) 10Herron: logstash: add uri_host field to w3creportingapi template [puppet] - 10https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) [18:49:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:51:03] (03PS1) 10BCornwall: hiera: remove dns3004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969949 (https://phabricator.wikimedia.org/T342154) [18:51:37] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns3004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969949 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:52:55] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [18:53:08] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [18:54:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for 
host cp4052.ulsfo.wmnet with OS bookworm [18:58:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [18:59:52] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:59] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [19:01:29] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:04:02] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:08] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [19:04:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:29] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:07:52] PROBLEM - Bird Internet Routing Daemon on dns3004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:08:16] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:08:20] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:15:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:05] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns3004.wikimedia.org with OS bookworm [19:21:17] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns3004.wikimedia.org with OS bookworm [19:28:44] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:30:10] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:14] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:44] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:47:54] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns3004.wikimedia.org with reason: host reimage [19:48:52] 10SRE, 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using 
Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10wiki_willy) Awesome, thanks for working on this @VRiley-WMF. @nskaggs & @cmooney - since we have some discrepancies with the number of ports being used on these cloudvirts, shou... [19:51:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns3004.wikimedia.org with reason: host reimage [19:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:55:01] PROBLEM - Recursive DNS on 185.15.59.2 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:55:15] jouncebot: next [19:55:16] In 0 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T2000) [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T2000). Please do the needful. [20:00:06] RhinosF1: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:00:15] im here [20:01:57] I'm on a train, so can't deploy [20:02:40] TheresNoTime: you on holiday? 
trains your way are awful [20:03:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:54] RhinosF1: I can deploy [20:05:04] dancy: thanks [20:05:07] ready when you are [20:05:13] (03PS1) 10Ottomata: eventgate chart - disable SYS_PTRACE on wmfdebug container [deployment-charts] - 10https://gerrit.wikimedia.org/r/969961 (https://phabricator.wikimedia.org/T347477) [20:05:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969353 (https://phabricator.wikimedia.org/T349970) (owner: 10RhinosF1) [20:06:00] thanks dancy [20:06:21] (03CR) 10Ottomata: [C: 03+2] eventgate chart - disable SYS_PTRACE on wmfdebug container [deployment-charts] - 10https://gerrit.wikimedia.org/r/969961 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [20:06:51] (03Merged) 10jenkins-bot: namespaces:mediawiki: add Extensions/Skins as alias of Extension/Skin (+ tallk) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969353 (https://phabricator.wikimedia.org/T349970) (owner: 10RhinosF1) [20:07:07] !log dancy@deploy2002 Started scap: Backport for [[gerrit:969353|namespaces:mediawiki: add Extensions/Skins as alias of Extension/Skin (+ tallk) (T349970)]] [20:07:13] T349970: Add Extensions/Skins as an alias of Extension/Skin on Mediawikiwiki - https://phabricator.wikimedia.org/T349970 [20:07:40] (03Merged) 10jenkins-bot: eventgate chart - disable SYS_PTRACE on wmfdebug container [deployment-charts] - 10https://gerrit.wikimedia.org/r/969961 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [20:07:47] RECOVERY - Recursive DNS on 185.15.59.2 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:08:24] !log dancy@deploy2002 dancy and rhinosf1: Backport for [[gerrit:969353|namespaces:mediawiki: add 
Extensions/Skins as alias of Extension/Skin (+ tallk) (T349970)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:04] RhinosF1: Lemme know when you've tested [20:09:51] dancy: lgtm but will need namespaceDupes.php [20:10:10] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-misc at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:10:33] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:55] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:11:15] RhinosF1: Is that something that I need to run? If so, I'll need a complete command line. [20:11:34] dancy: mwscript namespaceDupes.php mediawikiwiki [20:11:47] after deploy [20:11:54] ok.. proceeding, then I'll run that. 
[20:11:56] !log dancy@deploy2002 dancy and rhinosf1: Continuing with sync [20:14:25] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:13] (03PS1) 10Ottomata: eventgate chart - separate config for wmfdebug container from nodejs profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/969963 (https://phabricator.wikimedia.org/T347477) [20:16:19] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:17:12] (03CR) 10Ottomata: [C: 03+2] eventgate chart - separate config for wmfdebug container from nodejs profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/969963 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [20:17:17] !log dancy@deploy2002 Finished scap: Backport for [[gerrit:969353|namespaces:mediawiki: add Extensions/Skins as alias of Extension/Skin (+ tallk) (T349970)]] (duration: 10m 09s) [20:17:21] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:17:22] T349970: Add Extensions/Skins as an alias of Extension/Skin on Mediawikiwiki - https://phabricator.wikimedia.org/T349970 [20:18:02] https://www.irccloud.com/pastebin/53GhnMOJ/ [20:18:19] (03Merged) 10jenkins-bot: eventgate chart - separate config for wmfdebug container from nodejs profiler [deployment-charts] - 10https://gerrit.wikimedia.org/r/969963 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [20:18:57] RhinosF1: Did that actually do anything? Do I need to pass the `--fix` flag? 
[20:19:04] dancy: do with --fix added please [20:19:21] https://www.irccloud.com/pastebin/AbFfW1TG/ [20:20:52] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns3004.wikimedia.org with OS bookworm [20:21:01] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns3004.wikimedia.org with OS bookworm completed: - dns3004 (**PASS**) - Downtimed on Icinga/Al... [20:21:19] dancy: we can add --add-prefix=broken to fix Extension:Gadgets and then tag it for deletion, it's a redirect though anyway, i don't think it would cause harm to leave it [20:21:44] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:22:37] OK. I'll do whatever you recommend. [20:23:35] dancy: i feel better not leaving inaccessible pages in db so I say do mwscript namespaceDupes.php --add-prefix=broken [20:23:45] then --fix --add-prefix=broken [20:23:53] Ok [20:24:19] (03PS1) 10Ottomata: eventgate chart - fix debug mode CLI args [deployment-charts] - 10https://gerrit.wikimedia.org/r/969964 (https://phabricator.wikimedia.org/T347477) [20:24:24] https://www.irccloud.com/pastebin/dnHBRTPG/ [20:25:03] https://www.irccloud.com/pastebin/UfgWdGbs/ [20:25:13] dancy: all good [20:25:19] Awesome [20:25:29] (03CR) 10Ottomata: [C: 03+2] eventgate chart - fix debug mode CLI args [deployment-charts] - 10https://gerrit.wikimedia.org/r/969964 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [20:26:04] taavi: also thank you for deleting that in that 5ms so i didn't have to tag it [20:26:08] dancy: have a good evening [20:26:23] :-P [20:26:33] (03Merged) 10jenkins-bot: eventgate chart - fix debug mode CLI args [deployment-charts] - 10https://gerrit.wikimedia.org/r/969964 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [20:28:01] !log otto@deploy2002 
helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:29:33] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:29:52] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [20:30:47] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:13] (03PS1) 10Urbanecm: Growth: Enable new Impact module on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969966 (https://phabricator.wikimedia.org/T336203) [20:34:43] (03PS1) 10Urbanecm: Growth: Disable new impact A/B testing on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203) [20:34:53] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:31] (03CR) 10Urbanecm: [C: 04-2] "not yet, scheduled for Nov 01" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969966 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:35:34] (03CR) 10Urbanecm: [C: 04-2] "not yet, scheduled for Nov 01" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:43:27] (03PS1) 10BCornwall: Revert "hiera: remove dns3004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/969769 [20:43:29] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:43:42] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [20:43:46] (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns3004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/969769 (owner: 10BCornwall) [20:44:09] (03PS1) 10Bking: kafka-jumbo: 
permit traffic from new search-loader VMs [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) [20:44:24] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [20:44:31] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [20:45:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:12] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [20:45:25] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [20:47:57] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns3004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/969769 (owner: 10BCornwall) [20:49:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [20:58:46] (03CR) 10Ebernhardson: [C: 03+1] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [20:58:49] (03CR) 10Ryan Kemper: [C: 03+1] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [20:59:02] (03CR) 10Ryan Kemper: [C: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [20:59:12] (03PS2) 10Bking: kafka-jumbo: permit traffic from new search-loader VMs [puppet] - 10https://gerrit.wikimedia.org/r/969968 
(https://phabricator.wikimedia.org/T346039) [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231030T2100). [21:00:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:57] (03CR) 10Bking: [C: 03+2] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [21:02:03] (03CR) 10Bking: [C: 03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [21:02:28] (03CR) 10Bking: kafka-jumbo: permit traffic from new search-loader VMs [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [21:04:18] (03CR) 10Bking: [C: 03+2] kafka-jumbo: permit traffic from new search-loader VMs [puppet] - 10https://gerrit.wikimedia.org/r/969968 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [21:08:45] Hey all - have one quick update for PS.php I’d like to get out as part of the sec deploy window... 
[21:19:13] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for search-loader[2001-2002].codfw.wmnet,search-loader[1001-1002].eqiad.wmnet [21:19:14] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for search-loader[2001-2002].codfw.wmnet,search-loader[1001-1002].eqiad.wmnet [21:19:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:05] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-msearch-daemon@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:49] !log Deployed updated security mitigation for T348828 [21:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:39:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:40:56] (03PS1) 10Kimberly Sarabia: Deploy vector 2022 to non-English Wikibooks, etc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544) [21:48:44] (JobUnavailable) resolved: Reduced availability for job mjolnir in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:03:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:51] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:14] (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969971 (https://phabricator.wikimedia.org/T349544) (owner: 10Kimberly Sarabia) [22:38:01] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:19:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:19:43] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:55] PROBLEM - Check systemd state on 
config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:52] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:29:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:33:21] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:37:35] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:48:43] RECOVERY - Check systemd state on config-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [23:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure 
[23:52:55] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:56:19] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [23:56:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye