[00:05:27] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:09] <wikibugs>	 (PS1) Jdlrobson: Add option for html label in Menu template [skins/Vector] (wmf/1.41.0-wmf.16) - https://gerrit.wikimedia.org/r/936737 (https://phabricator.wikimedia.org/T340217)
[00:39:00] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/936807
[00:39:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/936807 (owner: TrainBranchBot)
[00:53:05] <wikibugs>	 (CR) Andrea Denisse: [C: +2] webperf: Set XHGUI_PDO_INITSCHEMA=false to avoid 'CREATE TABLE' fatal [puppet] - https://gerrit.wikimedia.org/r/936767 (owner: Krinkle)
[00:55:29] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/936807 (owner: TrainBranchBot)
[01:03:35] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341538 (phaultfinder)
[01:46:41] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:47:01] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:47:05] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:53:15] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:55] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:06] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0200)
[02:00:57] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (Dzahn) I ran the "check_user" script on a cumin host as described in https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Verifying_WMF_developer_accounts   ` WikiTech Users:...
[02:05:42] <mutante>	 !log LDAP - added urbanecm to wmf group, removed from nda group (conversion volunteer to staff) T341443
[02:05:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:46] <stashbot>	 T341443: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443
[02:07:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.17 [core] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/936808 (https://phabricator.wikimedia.org/T340245)
[02:07:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.17 [core] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/936808 (https://phabricator.wikimedia.org/T340245) (owner: TrainBranchBot)
[02:07:55] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (Dzahn) done.   - added to wmf group in LDAP - removed from nda group in LDAP  - added to WMF-NDA in Phab https://phabricator.wikimedia.org/project/members/61/ - no puppet changed needed sin...
[02:08:21] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:22] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (Dzahn) Open→Resolved a:Dzahn
[02:23:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.17 [core] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/936808 (https://phabricator.wikimedia.org/T340245) (owner: TrainBranchBot)
[02:29:19] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:01] <wikibugs>	 (Abandoned) Anzx: Enable tabs for non logged in users on knwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: Anzx)
[02:59:27] <icinga-wm>	 PROBLEM - Host urldownloader2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:07] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0300)
[03:00:23] <icinga-wm>	 PROBLEM - Host irc2002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:23] <icinga-wm>	 PROBLEM - Host logstash2032 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:29] <icinga-wm>	 PROBLEM - Host dragonfly-supernode2001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:29] <icinga-wm>	 PROBLEM - Host schema2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:29] <icinga-wm>	 PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:29] <icinga-wm>	 PROBLEM - Host durum2001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:35] <icinga-wm>	 PROBLEM - Host webperf2003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:57] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:03] <icinga-wm>	 PROBLEM - Host failoid2002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:01:38] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2002:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2002:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:01:45] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:01:47] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:01:53] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:04:19] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:35] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:43:43] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341538 (Papaul) Open→Resolved a:Papaul
[03:48:10] <wikibugs>	 ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T341433 (Papaul) a:Jhancock.wm
[03:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[03:59:47] <wikibugs>	 (CR) RLazarus: [C: +2] opentelemetry-collector: Vendor 0.62.2 [deployment-charts] - https://gerrit.wikimedia.org/r/936388 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[04:00:42] <wikibugs>	 (Merged) jenkins-bot: opentelemetry-collector: Vendor 0.62.2 [deployment-charts] - https://gerrit.wikimedia.org/r/936388 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[04:00:56] <wikibugs>	 (CR) RLazarus: [C: +2] opentelemetry-collector: Fix image and entry point [deployment-charts] - https://gerrit.wikimedia.org/r/936389 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[04:01:48] <wikibugs>	 (Merged) jenkins-bot: opentelemetry-collector: Fix image and entry point [deployment-charts] - https://gerrit.wikimedia.org/r/936389 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[04:34:25] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[05:18:30] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2023-07-10-051738-production [deployment-charts] - https://gerrit.wikimedia.org/r/936831 (https://phabricator.wikimedia.org/T341335)
[05:24:10] <rzl>	 !log imported otelcol-contrib 0.81.0 to buster-wikimedia and bullseye-wikimedia in component thirdparty/otelcol-contrib
[05:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:33:42] <wikibugs>	 (PS1) RLazarus: otelcol: Bump to version 0.81.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/936832
[05:34:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:40:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:45:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:46:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:51:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:52:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:57:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0600).
[06:31:37] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Jelto)
[06:36:10] <moritzm>	 !log rebalance ganeti group eqiad/B after reboots
[06:36:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:35] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructre - https://phabricator.wikimedia.org/T341497 (MoritzMuehlenhoff) If it's helpful for the rampup and/or early testing we can also go ahead and point cuminunpriv1001 to the Puppet 7...
[06:59:29] <elukey>	 !log restart kube-apiserver on ml-serve-ctrl1* as attempt to resolve spikes in latencies
[06:59:30] <wikibugs>	 (CR) Ayounsi: "Could you provide an ssh-ed25519 key instead? We're moving away from ssh-rsa https://phabricator.wikimedia.org/T336769" [homer/public] - https://gerrit.wikimedia.org/r/935479 (owner: Fabfur)
[06:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0700).
[07:00:05] <jouncebot>	 aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:38] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2002:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2002:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:04:47] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:07:57] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:08:27] <moritzm>	 !log rebalance ganeti in drmrs after reboots
[07:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:27] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (hashar)
[07:11:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:14:32] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (hashar)
[07:19:45] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:21:20] <hashar>	 good morning, we are switching over the continuous integration server in ~ 40 minutes. Jenkins/Zuul will be unavailable during that time
[07:21:36] <hashar>	 (I have updated the Deployments page)
[07:22:36] <moritzm>	 !log installing libxpm security updates
[07:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:42] <moritzm>	 !log powercycle ganeti2014
[07:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:56] <kart_>	 hashar: Can I do MinT deployment before that?
[07:32:35] <wikibugs>	 SRE, ops-codfw: ganeti2014 failed - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff)
[07:32:52] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff) p:Triage→Medium
[07:33:06] <hashar>	 kart_: yes please do :)
[07:33:26] <kart_>	 Thanks!
[07:33:50] <wikibugs>	 (CR) KartikMistry: [C: +2] Update MinT to 2023-07-10-051738-production [deployment-charts] - https://gerrit.wikimedia.org/r/936831 (https://phabricator.wikimedia.org/T341335) (owner: KartikMistry)
[07:34:34] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2023-07-10-051738-production [deployment-charts] - https://gerrit.wikimedia.org/r/936831 (https://phabricator.wikimedia.org/T341335) (owner: KartikMistry)
[07:36:09] <icinga-wm>	 RECOVERY - Host dragonfly-supernode2001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms
[07:36:13] <icinga-wm>	 RECOVERY - Host durum2001 is UP: PING WARNING - Packet loss = 60%, RTA = 33.40 ms
[07:36:24] <moritzm>	 !log failover broken ganeti2014 node
[07:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:27] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[07:37:03] <icinga-wm>	 RECOVERY - Host irc2002 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms
[07:37:37] <icinga-wm>	 RECOVERY - Host logstash2032 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms
[07:38:15] <icinga-wm>	 RECOVERY - Host urldownloader2003 is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms
[07:38:34] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:38:41] <icinga-wm>	 RECOVERY - Host webperf2003 is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms
[07:38:58] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[07:39:01] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:39:31] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:39:33] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:41:38] <jinxer-wm>	 (ProbeDown) resolved: (2) Service irc2002:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2002:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:41:55] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff) a:Papaul I've evacuated the VMs off the broken node, can you please have a look?   I realise the server is OOW, but do we have a compatible DIMM around from a decommissioned server, e.g.?
[07:42:40] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[07:43:30] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:44:12] <wikibugs>	 SRE-swift-storage, Observability-Metrics, User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (fgiunchedi) >>! In T341488#9003223, @Eevans wrote: >>>! In T341488#9001995, @fgiunchedi wrote: >> @MatthewVernon @Eevans please let me know what you thi...
[07:45:53] <jelto>	 hashar: as discussed, I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/936266 now and run puppet on contint2001 and contint2002
[07:46:20] <hashar>	 correct
[07:46:35] <wikibugs>	 (CR) Jelto: [C: +2] contint: move zuul-merger from contint2001 to contint2002 [puppet] - https://gerrit.wikimedia.org/r/936266 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[07:47:49] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[07:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[07:48:51] <hashar>	 if Puppet is behaving as expected, that should bring down zuul-merger on contint2001 and bring it up on contint2002
[07:49:24] <hashar>	 and there is another instance running on contint1002 (which actually takes most of the load since it is way faster thanks to SSD for disk io)
[07:49:29] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[07:49:40] <jelto>	 hashar: done and I can confirm that from puppet agent log output
[07:50:20] <jelto>	 ps also shows zuul-merger on contint2002 only
[07:50:56] <hashar>	 ahaha I am so happy when our Puppet manifests do the right thing
[07:53:15] <hashar>	 and I can confirm the switch happened at the application level (the zuul-mergers attach to the Zuul server over the Gearman protocol, which can be checked from the primary host: `zuul-gearman.py workers|grep merger`)
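Editor's sketch of the check hashar describes, in Python. It assumes the worker list follows the Gearman administrative text format (`FD IP-ADDRESS CLIENT-ID : FUNCTION ...`, terminated by a line containing only `.`). The sample reply, the IP addresses, and the `merger:` function names are illustrative assumptions, not output captured from the CI hosts, and the real output of `zuul-gearman.py workers` may differ.

```python
# Sketch only: parse a Gearman admin "workers" reply and keep the workers
# that registered at least one merger function.
# SAMPLE_REPLY is made-up data (documentation IP range), not real CI output.

SAMPLE_REPLY = """\
12 192.0.2.10 - : merger:merge merger:update
13 192.0.2.20 - : build:example-job
14 192.0.2.30 - : merger:merge merger:update
.
"""


def merger_workers(reply: str) -> list[str]:
    """Return the IP addresses of workers that registered a merger function."""
    found = []
    for line in reply.splitlines():
        line = line.strip()
        if not line or line == ".":  # a lone "." terminates the reply
            continue
        # Left of " : " is "FD IP CLIENT-ID", right of it the function names.
        head, _, functions = line.partition(" : ")
        fields = head.split()
        if len(fields) < 2:
            continue
        if any(name.startswith("merger:") for name in functions.split()):
            found.append(fields[1])
    return found


if __name__ == "__main__":
    print(merger_workers(SAMPLE_REPLY))
```

After the switch, only the new merger hosts would be expected in the returned list.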
[07:54:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "LGTM, some suggestions to improve the commit message." [puppet] - https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: Jbond)
[07:54:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[07:55:06] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Jelto)
[07:55:31] <kart_>	 !log Updated MinT to 2023-07-10-051738-production (T341335, T333969)
[07:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:36] <stashbot>	 T341335: MinT not working for Latvian in Content & Section Translation - https://phabricator.wikimedia.org/T341335
[07:55:36] <stashbot>	 T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[07:56:52] <hashar>	 jelto: I can confirm the new zuul-merger works fine on contint2002 and there are already CI builds using it \o/
[07:57:25] <wikibugs>	 (CR) Volans: users: add new user (fabfur) (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/935479 (owner: Fabfur)
[07:57:44] <jelto>	 great, next step is to downtime both hosts and disable puppet. I'll do that in 3 minutes
[07:58:07] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Jelto)
[07:58:08] <kart_>	 hashar: I'm done now.
[07:58:42] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (hashar)
[07:58:43] <hashar>	 kart_: awesome. Congratulations on the MinT deployment
[07:58:57] <kart_>	 :)
[08:01:03] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on contint2001.wikimedia.org with reason: Switch contint hosts for hardware replacement
[08:01:17] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2001.wikimedia.org with reason: Switch contint hosts for hardware replacement
[08:01:27] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fb9b83f1-475c-4737-a872-7868377e05ee) set by jelto@cumin1...
[08:01:29] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on contint2002.wikimedia.org with reason: Switch contint hosts for hardware replacement
[08:01:43] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2002.wikimedia.org with reason: Switch contint hosts for hardware replacement
[08:01:43] <icinga-wm>	 RECOVERY - Host failoid2002 is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms
[08:01:54] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=af763fea-db6d-494f-8c4c-8139c0ceab0c) set by jelto@cumin1...
[08:02:40] <jelto>	 hashar: contint2001 and 2002 are downtimed and puppet is disabled. Next step is to stop jenkins and zuul. Do you want to do that?
[08:03:23] <hashar>	 yes doing so now
[08:03:27] <wikibugs>	 (CR) Ayounsi: users: add new user (fabfur) (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/935479 (owner: Fabfur)
[08:03:29] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Jelto)
[08:03:34] <hashar>	 !log Stopping Jenkins and Zuul for server switch over
[08:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:57] <icinga-wm>	 RECOVERY - Host schema2003 is UP: PING OK - Packet loss = 0%, RTA = 33.50 ms
[08:04:34] <hashar>	 hmm
[08:04:47] <hashar>	 I stopped them both but https://integration.wikimedia.org/zuul/ still gives me some status updates
[08:04:51] <hashar>	 I think it is cache related
[08:04:53] <icinga-wm>	 PROBLEM - Check systemd state on schema2003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:56] <hashar>	 let me keep traces of those requests
[08:05:49] <hashar>	 yeah that is the json reply which is cached by our varnish/ats stack. I will dig into it later
[08:05:56] <icinga-wm>	 RECOVERY - Check systemd state on schema2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:04] <jelto>	 ack
[08:06:44] <hashar>	 I am doing the rsync
[08:07:19] <jelto>	 ack thanks
[08:07:21] <hashar>	 the large /srv/jenkins syncs in a minute or so
[08:07:42] <hashar>	 I triggered it yesterday and again this morning, roughly an hour ago, so the disk caches are warm
[08:10:15] <hashar>	 jelto: all rsync done
[08:10:18] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, collaboration-services, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (hashar)
[08:10:21] <hashar>	 so you can do the DNS switch
[08:10:22] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops-radar, Patch-For-Review, Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (JMeybohm)
[08:10:45] <jelto>	 hashar: let me know when I should merge and apply the dns change
[08:10:50] <hashar>	 +1
[08:10:52] <hashar>	 :)
[08:11:03] <hashar>	 I mean, you can do it 
[08:11:31] <wikibugs>	 (PS2) Jelto: switch contint.wikimedia.org from contint2001 to contint2002 [dns] - https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[08:11:41] <jelto>	 rebasing, one sec
[08:11:49] <wikibugs>	 (CR) JMeybohm: [C: +2] envoy: Limit the total number of active connections [puppet] - https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: JMeybohm)
[08:11:51] <wikibugs>	 (CR) JMeybohm: [C: +2] rake_modules/taskgen: Don't process direcories in setup_python_extensions [puppet] - https://gerrit.wikimedia.org/r/935714 (owner: JMeybohm)
[08:11:53] <wikibugs>	 (CR) JMeybohm: [C: +2] envoy: Remove tls_minimum_protocol_version [puppet] - https://gerrit.wikimedia.org/r/935683 (https://phabricator.wikimedia.org/T337453) (owner: JMeybohm)
[08:11:56] <wikibugs>	 (CR) JMeybohm: [C: +2] envoy: Refactor max_requests_per_connection [puppet] - https://gerrit.wikimedia.org/r/935678 (https://phabricator.wikimedia.org/T304124) (owner: JMeybohm)
[08:13:29] <jelto>	 hashar: ah of course there is no CI now ...
[08:15:33] <hashar>	 ah yeah
[08:15:39] <jelto>	 hashar: I'll manually set Verified +2 on https://gerrit.wikimedia.org/r/c/operations/dns/+/933196 (Jenkins had +2ed it before the rebase)
[08:16:39] <wikibugs>	 (CR) Hashar: [C: +1] switch contint.wikimedia.org from contint2001 to contint2002 [dns] - https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[08:16:59] <hashar>	 you should have the permissions in Gerrit to Verified +2 and Submit it
[08:17:36] <wikibugs>	 (03CR) 10Jelto: [V: 03+2 C: 03+2] "manually verify, because jenkins is down due to maintenance" [dns] - 10https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[08:17:40] <hashar>	 \o/
[08:17:46] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: enable submitting data to puppetdb7 [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811)
[08:17:55] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[08:18:45] <jelto>	 authdns update diff shows: -contint         5M IN CNAME contint2001.wikimedia.org.
[08:18:45] <jelto>	 +contint         5M IN CNAME contint2002.wikimedia.org. 
[08:18:49] <jelto>	 I'll continue
[08:18:59] <godog>	 !log upgrade prometheus to 2.24.1+ds-1+wmf2 on cloudmetrics*
[08:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:50] <jelto>	 OK - authdns-update successful on all nodes!
[08:20:04] <wikibugs>	 (03PS1) 10Ayounsi: users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769)
[08:20:07] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:20:12] <hashar>	 afaik that contint dns entry is only used to route the http requests made to ATS/Varnish  to the proper machine
[08:20:18] <hashar>	 the rest of the CI stack uses ip addresses
[08:20:23] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto)
[08:21:11] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on planet2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:21:11] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on schema2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:21:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:22:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup)
[08:22:25] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on prometheus6002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:22:38] <jelto>	 I guess the envoy alerts are not us but jayme's change?
[08:22:49] <hashar>	 yeah that looks unrelated
[08:23:13] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1411 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:23:23] <hashar>	 you can do the two other puppet changes to change the primary in hiera
[08:23:27] <jelto>	 hashar: Then I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/867705 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/935919 ok?
[08:23:33] <hashar>	 +1 :)
[08:23:34] <jelto>	 ack will do
[08:23:55] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] ci/zuul: switch gearman server from contint2001 to contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[08:24:02] <hashar>	 as an extra step I will run puppet on contint1002 (the other host which runs zuul-merger) in order for that service to switch to the new host as well
[08:24:11] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] ci/zuul: set contint2002 as the active ci::manager_host [puppet] - 10https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659) (owner: 10Jelto)
[08:24:13] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:24:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1487 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:24:59] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1398 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:25:05] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1467 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:25:09] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:25:21] <wikibugs>	 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (10JMeybohm) 05Open→03Resolved
[08:25:26] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:32] <wikibugs>	 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) 05Open→03Resolved
[08:25:38] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:41] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on webperf1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:25:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2323 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:25:46] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) 05Open→03Resolved
[08:25:54] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:57] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:26:11] <jelto>	 hashar: both puppet changes merged
[08:26:18] <hashar>	 Puppet has moved the zuul-merger on contint1002 to the new host (config change applied + restarted the service)
[08:26:20] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) 05Open→03Resolved This is done from our end.
[08:26:28] <wikibugs>	 (03CR) 10Volans: [C: 04-2] "That's used in the wmcs-cookbooks repository to get the CA for each server as they can be different due to project's puppetmasters" [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[08:26:42] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto)
[08:26:49] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:26:53] <hashar>	 then I guess you can enable and run the Puppet agent on contint2002
[08:26:58] <hashar>	 I will tail the logs
[08:27:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2329 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:27:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2446 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:27:07] <jelto>	 I'll do so now
[08:27:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2350 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:27:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:27:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:27:47] * hashar crosses fingers
[08:28:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:28:39] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on ores2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:28:39] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on phab2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:28:57] <jelto>	 puppet run done on contint2002: Notice: Applied catalog in 53.20 seconds
[08:29:07] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1439 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:29:07] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:29:07] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wcqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:29:09] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on chartmuseum1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:29:35] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:29:37] <hashar>	 Jenkins is starting and connecting to WMCS instances (there were some missing firewall rules which I caught on Friday)
[08:29:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on schema2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:29:46] <jayme>	 oookay...that might be me
[08:29:47] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on debmonitor2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1399 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on cloudweb1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:09] <hashar>	 the web interface is running at https://integration.wikimedia.org/ci/
[08:30:13] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on idm1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1028 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:39] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1406 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1356 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:30:57] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2419 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:31:01] <hashar>	 jelto: I am testing zuul
[08:31:09] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:31:11] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on prometheus2006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:31:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1489 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:31:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:31:30] <jelto>	 thanks! I can reach the webinterface at least. "Last reconfigured: Tue Jul 11 2023 10:28:58 " also looks promising
[08:31:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2351 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:31:45] <hashar>	 ah yeah
[08:31:49] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:31:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wdqs2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:32:06] <hashar>	 and Zuul does receive events from Gerrit
[08:32:09] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on ores2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:32:09] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:32:23] <hashar>	 it also managed to reach out to Jenkins and trigger a build which is executing on the WMCS instance
[08:32:35] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2426 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:32:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1476 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:33:02] <hashar>	 jelto: it worked on https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/937040 :-]
[08:33:39] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:33:39] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2424 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:33:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:33:47] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2386 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:33:50] <jelto>	 hashar: great. Did we verify zuul and jenkins now? Or only jenkins?
[08:33:55] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2399 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:33:55] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:34:05] <hashar>	 both 
[08:34:09] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on ores1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:34:12] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar)
[08:34:13] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on ores1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:34:24] <jelto>	 hashar: ok filling the checkboxes in the task
[08:34:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:34:34] <volans>	 jayme: need a hand?
[08:34:37] <hashar>	 zuul is the scheduler/workflow  and Jenkins is merely a library of cookbooks executed by Zuul
[08:34:37] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:34:37] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2272 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:34:37] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on ores2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:34:45] <volans>	 should we disable puppet?
[08:34:58] <jelto>	 hashar: ah, you already did that
[08:35:00] <jayme>	 volans: is there a way to downtime that one check on all hosts?
[08:35:01] <hashar>	 jelto: I have ticked the box and added an extra step I did (run puppet on contint1002 to update the zuul-merger instance running there)
[08:35:06] <wikibugs>	 (03PS1) 10Elukey: services: increase kafka batch wait time for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/937041 (https://phabricator.wikimedia.org/T338357)
[08:35:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:35:17] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2398 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:35:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1372 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:35:29] <hashar>	 jelto: so now if we enable puppet on the old host (contint2001) that should mask/disable/stop Jenkins and Zuul
[08:35:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2379 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:35:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:35:35] <jelto>	 hashar: then I'll enable and run puppet on contint2001 again
[08:35:37] <volans>	 jayme: yes and no, let me do it but it will take a few minutes
[08:35:38] <_joe_>	 jayme: ^^
[08:35:39] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1395 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:35:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on phab1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:35:59] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on ores2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:13] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:19] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:19] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2330 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:21] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1396 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2428 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:36] <wikibugs>	 (03PS4) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983)
[08:36:37] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2405 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1350 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:47] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:50] <jayme>	 volans: ack, I'll disable puppet on all envoy hosts
[08:36:53] <volans>	 jayme: disabling puppet on affected hosts is surely quicker
[08:36:57] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:36:57] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1359 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:05] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2291 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2322 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:32] <jayme>	 volans: but that does not stop existing spam, no?
[08:37:35] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1445 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:43] <volans>	 neither does the downtime
[08:37:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on idp1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2394 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on logstash2025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:50] <volans>	 recovery will spam anyway
[08:37:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1434 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1375 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on puppetmaster1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2417 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2353 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:53] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on releases1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:59] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2431 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:37:59] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:03] <volans>	 will stop any host not yet updated
[08:38:07] <volans>	 with the new config
[08:38:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2429 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:24] <jayme>	 sure, that's done
[08:38:25] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2292 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2421 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:35] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1351 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:38] <jelto>	 hashar: I see multiple "removed" and "masked" in the puppet run, looks good. jenkins slave is running on contint2001
[08:38:47] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1450 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:55] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on doc1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:55] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:38:55] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on logstash2030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1357 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1485 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2436 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:05] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1447 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:21] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wcqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1384 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on prometheus1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:23] <jayme>	 !log disabled puppet on 'P{R:Package = envoyproxy}'
[08:39:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2401 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2440 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:27] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2450 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on chartmuseum2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on releases1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2390 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:46] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42387/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[08:39:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on debmonitor1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:39:59] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on prometheus4002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on moscovium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2339 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:13] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on prometheus5002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:14] <volans>	 !log downtiming service 'Check no envoy runtime configuration is left persistent' on envoy hosts
[08:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:19] <volans>	 jayme: ^^^
[08:40:21] <hashar>	 jelto: confirmed all three services are masked/stopped on contint2001. I am doing the Jenkins config change to get rid of the jenkins-slave
[08:40:23] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1463 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:27] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:30] <jayme>	 volans: thanks!
[08:40:33] <jelto>	 hashar: ack
[08:40:41] <volans>	 I've put 2h
[08:40:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2356 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1421 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2362 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:55] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2355 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:40:59] <jayme>	 volans: nothing bad happened btw. The icinga check is "wrong" in a way
[08:41:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2395 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2361 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2365 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:07] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:41:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2404 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on releases2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:37] <godog>	 jayme: can we ditch the check altogether ?
[08:41:51] <jayme>	 godog: on it
[08:41:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1460 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2444 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:41:53] <_joe_>	 probably yes
[08:42:02] <godog>	 <3 <3 <3 thank you 
[08:42:11] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:42:21] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on miscweb2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:42:26] <volans>	 still running...
[08:42:26] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar)
[08:42:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1466 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:43:11] <volans>	 !log previous downtiming completed
[08:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:52] <volans>	 jayme: downtimed service on 576 hosts
[08:43:57] <volans>	 for 2 h
[08:44:08] <wikibugs>	 (03PS1) 10JMeybohm: envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324)
[08:44:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "It looks good to me, but let's also run the patch/approach by Bryan" [software/bitu] - 10https://gerrit.wikimedia.org/r/935376 (owner: 10Slyngshede)
[08:44:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] services: increase kafka batch wait time for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/937041 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey)
[08:44:50] <wikibugs>	 (03PS2) 10JMeybohm: envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324)
[08:45:43] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] otelcol: Bump to version 0.81.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/936832 (owner: 10RLazarus)
[08:45:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[08:46:05] <hashar>	 jelto: https://gerrit.wikimedia.org/r/c/operations/puppet/+/867712 can be deployed yes :)
[08:46:35] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[08:46:54] <jayme>	 volans: if you have another minute: https://gerrit.wikimedia.org/r/937042 - https://puppet-compiler.wmflabs.org/output/937042/42388/
[08:47:10] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] ci: make contint2002 the new rsync source, remove contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/867712 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[08:47:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: increase kafka batch wait time for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/937041 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey)
[08:48:27] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[08:48:51] <jelto>	 hashar: merged, but puppet is disabled because of the envoy config change. I'll wait until that is done. But the rsync change is not urgent I think
[08:48:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[08:49:24] <hashar>	 jelto: yes that can wait. Overall it is a success as far as I can tell
[08:49:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[08:50:25] <hashar>	 jelto: I have filed an unrelated follow-up action about the stalled data on https://integration.wikimedia.org/zuul/ which is https://phabricator.wikimedia.org/T341548 and is due to some HTTP cache header. That is unrelated to the switchover though.
[08:51:07] <jelto>	 hashar: great. The icinga downtime will expire in ~10 minutes. I think icinga needs some time to catch up with the checks because puppet is disabled. I'll check https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=contint 
[08:51:38] <jayme>	 volans: thanks. Is there a clever way to get a list of hosts that had already applied that change?
[08:52:30] <volans>	 what was the change you merged?
[08:52:33] <hashar>	 jelto: I'd expect Puppet to remove the Icinga checks for contint2001. 
[08:53:01] <jayme>	 the last in chain was https://gerrit.wikimedia.org/r/c/operations/puppet/+/935711/9
[08:53:22] <jayme>	 which is also the one that broke the check
[08:53:56] <volans>	 and which file was it writing?
[08:54:50] <volans>	 rephrasing... does this change end up writing a persistent file that the check then complains about?
[08:55:04] <volans>	 let me check the icinga check to understand what it's complaining about
[08:55:16] <jayme>	 ultimately it will write /etc/envoy/envoy.yaml
[08:55:19] <volans>	 ah the check is an http check
[08:55:40] <volans>	 yeah but all the hosts have that file, so you have 2 options
[08:55:58] <volans>	 1) target P:envoy with batch say 20 and just wait
[08:56:10] <wikibugs>	 (03PS1) 10Jbond: puppet-facts-export-puppetdb: add client auth support [puppet] - 10https://gerrit.wikimedia.org/r/937044 (https://phabricator.wikimedia.org/T341268)
[08:56:11] <jayme>	 I thought I could maybe check in puppetdb if that git commit has been applied
[08:56:21] <volans>	 2) use 2 commands, the first one of which fails where the change was not applied so cumin will not run the second command (run puppet)
[08:56:38] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto)
[08:57:01] <jayme>	 well..given you downtimed the check for 2h, simply running puppet on all envoy nodes is fine I guess
[08:57:20] <volans>	 or just re-enable it and let it run
[08:57:25] <volans>	 within 30m it will be fixed
[08:57:30] <jayme>	 indeed
[08:57:32] <volans>	 technically 1h
[08:57:44] <volans>	 because puppet has to run on the alert hosts too, after it runs on the host
[08:59:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[08:59:17] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet
[08:59:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[08:59:30] <jayme>	 ack
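The two options volans lists above (run in batches, or chain two commands so the first one gates the second) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not cumin's actual API: the functions `batches` and `run_gated`, the command strings `check-change-applied` and `run-puppet`, and the host names are all hypothetical placeholders.

```python
from typing import Callable, Dict, List


def batches(hosts: List[str], size: int) -> List[List[str]]:
    """Split hosts into consecutive groups of at most `size` (option 1)."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def run_gated(
    hosts: List[str],
    commands: List[str],
    execute: Callable[[str, str], int],
) -> Dict[str, List[str]]:
    """Run commands in order on each host; skip the rest after a failure (option 2).

    Returns, per host, the list of commands that were actually attempted.
    """
    attempted: Dict[str, List[str]] = {}
    for host in hosts:
        attempted[host] = []
        for command in commands:
            attempted[host].append(command)
            if execute(host, command) != 0:  # non-zero exit: stop here for this host
                break
    return attempted


# Hypothetical stand-ins: which hosts already have the change, and a fake executor.
HOSTS_WITH_CHANGE = {"host-a", "host-c"}


def fake_execute(host: str, command: str) -> int:
    if command == "check-change-applied":
        return 0 if host in HOSTS_WITH_CHANGE else 1
    return 0  # pretend the puppet run itself succeeds


if __name__ == "__main__":
    hosts = ["host-a", "host-b", "host-c"]
    print(batches(hosts, 2))
    print(run_gated(hosts, ["check-change-applied", "run-puppet"], fake_execute))
```

In this sketch the second command is only attempted on hosts where the gating command exits with 0, which mirrors the behaviour described in option 2.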
[09:00:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet-facts-export-puppetdb: add client auth support [puppet] - 10https://gerrit.wikimedia.org/r/937044 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond)
[09:01:06] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) a:03ayounsi
[09:01:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[09:01:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[09:02:44] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[09:03:40] <jelto>	 hashar: icinga looks good (besides the envoy runtime check), downtime expired
[09:06:02] <jayme>	 !log enabled puppet on 'P{R:Package = envoyproxy}'
[09:06:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[09:06:41] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[09:06:54] <volans>	 jayme: fyi P:envoy is the same :)
[09:07:30] <jayme>	 yeah, but not in my bash history :)
[09:07:34] <hashar>	 jelto: congratulations \o/
[09:07:37] <jayme>	 thanks for your help volans!
[09:08:25] <volans>	 no prob, anytime, we should upgrade the downtime cookbook to support this too as spicerack does support it
[09:08:29] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kafkamon1003.eqiad.wmnet
[09:08:37] <volans>	 it's just not exposed via the cookbook
[09:13:27] <wikibugs>	 (03CR) 10Volans: "some questions inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[09:13:43] <jelto>	 hashar: thanks and thanks for running the switchover
[09:13:57] <icinga-wm>	 PROBLEM - PHP opcache health on parse2010 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:15:00] <wikibugs>	 (03PS1) 10JMeybohm: prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554)
[09:16:57] <wikibugs>	 (03PS2) 10JMeybohm: prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554)
[09:18:05] <wikibugs>	 (03PS1) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[09:19:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] toolforge: Add more CORS headers to docker registry [puppet] - 10https://gerrit.wikimedia.org/r/936797 (https://phabricator.wikimedia.org/T232135) (owner: 10BryanDavis)
[09:19:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:40] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: move source scripts under the puppetserver name space [puppet] - 10https://gerrit.wikimedia.org/r/937049 (https://phabricator.wikimedia.org/T330490)
[09:21:05] <hashar>	 jelto: excellent. I guess you can reply to the email confirming the switch over is a success. And we will be able to decommission contint2001 \o/
[09:21:08] <wikibugs>	 (03CR) 10Jbond: "ahh thanks i missed the wmcs branch" [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[09:21:17] <wikibugs>	 (03Abandoned) 10Jbond: puppet: drop PuppetHosts.get_ca_servers [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[09:22:09] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42390/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[09:22:51] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42389/console" [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[09:23:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: move source scripts under the puppetserver name space [puppet] - 10https://gerrit.wikimedia.org/r/937049 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[09:24:00] <wikibugs>	 (03PS2) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[09:24:22] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10JEbe-WMF)
[09:26:21] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:35] <wikibugs>	 (03PS3) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[09:30:33] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:51] <jbond>	 !log disable puppet fleet wide to deploy 936273
[09:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:22] <wikibugs>	 (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[09:34:17] <wikibugs>	 (03PS4) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[09:35:24] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42394/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[09:36:15] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Lucas_Werkmeister_WMDE) Just because I first saw that error after CI came back from maintenance: do you think there’s any...
[09:36:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: enable submitting data to puppetdb7 [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[09:36:47] <jbond>	 !log deploy gerrit:936273 enable submitting data to puppetdb7
[09:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:18] <wikibugs>	 (03PS5) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[09:40:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[09:41:12] <wikibugs>	 10SRE, 10Phabricator, 10Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10Aklapper) 05Open→03Resolved Optimistically resolving as T253632 is resolved. Please reopen if this is still an issue - thanks!
[09:42:00] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) >>! In T324659#9004463, @Lucas_Werkmeister_WMDE wrote: > Just because I first saw that error after CI came back fr...
[09:43:22] <wikibugs>	 (03PS1) 10JMeybohm: Add warning alerts on envoy running with changes config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554)
[09:43:29] <wikibugs>	 (03PS6) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[09:43:56] <hashar>	 !log Updating Zuul configuration, which was stale at a version from March 29th after the switchover from contint2001 to contint2002 | T324659   T341556
[09:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:01] <stashbot>	 T324659: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659
[09:44:01] <stashbot>	 T341556: CentralAuthExtensionJsonTest::testHookHandler with data set #11 ('securepoll') failing in Wikidata.org CI - https://phabricator.wikimedia.org/T341556
[09:44:03] <hashar>	 Lucas_WMDE: you are a magician :)
[09:44:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[09:44:16] <Lucas_WMDE>	 :)
[09:44:28] <hashar>	 jelto: I forgot to update the integration/config repo, so the switchover caused Zuul to spin up with an outdated configuration from March 29th :-\
[09:44:38] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42396/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[09:44:48] <Lucas_WMDE>	 thanks for fixing it!
[09:45:17] <wikibugs>	 (03PS1) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554)
[09:45:53] <Amir1>	 jouncebot: nowandnext
[09:45:53] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 14 minute(s)
[09:45:53] <jouncebot>	 In 0 hour(s) and 14 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1000)
[09:45:58] <Amir1>	 cool
[09:46:05] <Lucas_WMDE>	 hashar: can I recheck already or does it need more time?
[09:46:11] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar)
[09:46:11] <urbanecm>	 hashar: and i was thinking "this CI failure is very puzzling" :-D
[09:46:34] <wikibugs>	 (03CR) 10Volans: "Did a very first pass, I'm not familiar with the commands to be executed on the network devices so I skipped those." [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[09:46:37] <wikibugs>	 (03PS2) 10JMeybohm: Add warning alerts on envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554)
[09:47:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Override liftwing hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170) (owner: 10Ladsgroup)
[09:47:05] <wikibugs>	 (03PS7) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[09:47:20] <jbond>	 !log re-enable puppet
[09:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:28] <hashar>	 Lucas_WMDE: I have deployed the update in theory. Let me check
[09:47:44] <wikibugs>	 (03Merged) 10jenkins-bot: Override liftwing hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170) (owner: 10Ladsgroup)
[09:48:12] <Lucas_WMDE>	 alright, retrying the gate-and-submit
[09:48:12] <hashar>	 Lucas_WMDE: yes zuul config should be up to date now so you can `recheck`
[09:48:16] <Lucas_WMDE>	 ok thanks!
[09:49:00] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:936796|Override liftwing hostname (T319170)]]
[09:49:03] <stashbot>	 T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170
[09:49:58] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42397/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[09:50:03] <jelto>	 hashar: thanks for finding that. Let me know if you need anything from my side
[09:50:52] <hashar>	 jelto: I think we are all set :]
[09:52:56] <jbond>	 !log disable puppet fleet wide to deploy 936273
[09:52:57] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:936796|Override liftwing hostname (T319170)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[09:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:41] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[09:54:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10cmooney) @arturo thanks for this.  The hosts can go in any rack, but we should make sure hosts of the same type go into different one...
[09:56:11] <icinga-wm>	 PROBLEM - puppet last run on kubestagemaster2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:56:13] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:47] <wikibugs>	 (03PS5) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983)
[09:57:13] <wikibugs>	 (03PS1) 10Elukey: profile::services_proxy::envoy: add inference to enabled_listeners [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170)
[09:58:13] <wikibugs>	 (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[09:58:24] <hashar>	 Lucas_WMDE: Jakob / Leszek had a few Wikibase changes rejected as well. I commented on one of them ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/933103 ) to let them know.  
[09:58:33] <Lucas_WMDE>	 cool, thanks!
[09:58:48] <wikibugs>	 (03PS1) 10Btullis: Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514)
[09:58:53] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42398/console" [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170) (owner: 10Elukey)
[09:59:19] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:59:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[09:59:42] <hashar>	 jelto: the only step I missed was to "git pull" the Zuul configuration, and I have added that to the task as a missed step. I will refresh the wikipage runbook for the next switchover.  Besides that, all seems to be working fine.  Thank you!
[10:00:01] <jelto>	 hashar: great thanks a lot :)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1000)
[10:00:36] <wikibugs>	 (03PS8) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[10:00:58] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42399/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:01:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[10:01:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[10:01:18] <Lucas_WMDE>	 <wrong lesson learned> only set up new hosts immediately before switching to them </wll>
[10:01:22] <wikibugs>	 (03PS9) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048
[10:01:23] <icinga-wm>	 RECOVERY - PHP opcache health on parse2010 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:01:41] <icinga-wm>	 RECOVERY - puppet last run on kubestagemaster2001 is OK: OK: Puppet is currently disabled (roll out 936273), not alerting. Last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:03:27] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, a few random comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede)
[10:03:34] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:936796|Override liftwing hostname (T319170)]] (duration: 14m 34s)
[10:03:38] <stashbot>	 T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170
[10:03:51] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[10:07:48] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede)
[10:08:37] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) Thanks. So if #Ops-eqiad don't have any other preference, we could do something like: * cloudcontrol1005 --> `C8` * cloudco...
[10:09:07] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno)
[10:10:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[10:11:03] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10fnegri)
[10:11:11] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[10:11:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: envoy: Remove envoy_runtime_vars nagios check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[10:11:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-aborrero: Add support for nftables in profile::firewall - https://phabricator.wikimedia.org/T336497 (10aborrero)
[10:12:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: Add warning alerts on envoy running with changed config (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[10:13:22] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[10:13:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[10:17:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[10:18:03] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[10:18:19] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) 05In progress→03Resolved
[10:18:25] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) a:03Jelto As far as I can tell, the services were successfully switched over from contint2001 to contint2002. I...
[10:19:18] <Amir1>	 jouncebot: nowandnext
[10:19:18] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1000)
[10:19:18] <jouncebot>	 In 2 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300)
[10:19:18] <jouncebot>	 In 2 hour(s) and 40 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300)
[10:19:28] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] ExternalLinks: Make oneWildcard avoid adding wildcard to domain [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936733 (https://phabricator.wikimedia.org/T326251) (owner: 10Ladsgroup)
[10:19:30] <moritzm>	 !log rebalance ganeti group codfw/C after reboots
[10:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] codesearch: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826864 (owner: 10Muehlenhoff)
[10:23:43] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[10:26:49] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10cmooney) Put cloudservices1005 in C8 if there is room there instead of F4.
[10:26:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: fix k8s selector for kubernetes-generic [alerts] - 10https://gerrit.wikimedia.org/r/937060
[10:29:57] <wikibugs>	 (03PS2) 10Btullis: Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514)
[10:31:00] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[10:32:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) >>! In T341494#9004690, @cmooney wrote: > Put cloudservices1005 in D5 if there is room there instead of F4.  Done. What sho...
[10:36:40] <wikibugs>	 (03Merged) 10jenkins-bot: ExternalLinks: Make oneWildcard avoid adding wildcard to domain [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936733 (https://phabricator.wikimedia.org/T326251) (owner: 10Ladsgroup)
[10:37:00] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 (owner: 10Slyngshede)
[10:37:35] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:936733|ExternalLinks: Make oneWildcard avoid adding wildcard to domain (T326251)]]
[10:37:39] <stashbot>	 T326251: Write code for read new fields of externallinks - https://phabricator.wikimedia.org/T326251
[10:37:48] <wikibugs>	 (03PS1) 10Hnowlan: cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916)
[10:38:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan)
[10:38:40] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:39:02] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:936733|ExternalLinks: Make oneWildcard avoid adding wildcard to domain (T326251)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[10:40:05] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2307 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:40:27] <wikibugs>	 (03PS2) 10Hnowlan: cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916)
[10:40:40] <wikibugs>	 (03CR) 10Vgutierrez: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:41:15] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on testreduce1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:42:35] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:42:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:42:45] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2374 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:42:51] <vgutierrez>	 hmmm envoy config issues?
[10:43:07] <jayme>	 not really. bad icinga check
[10:43:12] <vgutierrez>	 oh ok
[10:43:14] <jayme>	 should be fixed already...looking again
[10:43:51] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:43:51] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42400/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:43:52] <jayme>	 ah, puppet is disabled there
[10:44:15] <jbond>	 jayme: which host?
[10:44:23] <logmsgbot>	 !log ladsgroup@deploy1002 Sync cancelled.
[10:44:39] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:44:51] <jayme>	 jbond: the ones alerting
[10:44:52] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:44:58] <jayme>	 I checked mw2306
[10:45:05] <jayme>	 that's probably from you
[10:45:08] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "ExternalLinks: Make oneWildcard avoid adding wildcard to domain" [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936739
[10:45:13] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "ExternalLinks: Make oneWildcard avoid adding wildcard to domain" [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936739 (owner: 10Ladsgroup)
[10:45:21] <jbond>	 puppet is disabled by me but we can enable it if you need to deploy something
[10:45:33] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2420 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:45:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:45:55] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on miscweb1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:45:57] <jayme>	 jbond: yeah, it would be nice to not have them spam here
[10:46:09] <jbond>	 jayme: so everything with envoy?
[10:46:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2412 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:46:35] <jayme>	 jbond: yes, envoy is fine. I've disabled the icinga check in a follow-up change
[10:46:36] <moritzm>	 !log installing libx11 security updates
[10:46:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:56] <jayme>	 jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/937042
[10:47:11] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2409 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:47:27] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on puppetboard1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:47:48] <jbond>	 ack running now
[10:47:59] <jayme>	 thanks!
[10:48:01] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1497 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:48:16] <wikibugs>	 (03PS6) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983)
[10:49:57] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1360 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:50:25] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:50:25] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:50:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2373 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:50:35] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:50:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:50:42] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "```" [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan)
[10:50:52] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42401/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:50:55] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:51:03] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw2448 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:51:07] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:51:27] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1382 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:52:05] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:52:09] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on parse2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:52:25] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:52:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:53:07] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1402 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:53:07] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1409 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:53:23] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] profile::services_proxy::envoy: add inference to enabled_listeners [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170) (owner: 10Elukey)
[10:53:57] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1420 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:54:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "the patch LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:55:43] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on ores2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:56:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1478 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:56:30] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1475 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[10:57:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:57:59] <wikibugs>	 10SRE, 10Developer-Advocacy, 10Infrastructure-Foundations, 10cloud-services-team, 10LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (10fnegri)
[10:58:18] <wikibugs>	 (03PS7) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983)
[10:58:25] <wikibugs>	 (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:59:26] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:00:09] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:00:29] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[11:00:30] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[11:00:30] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on restbase2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[11:02:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10ArielGlenn) See also https://phabricator.wikimedia.org/T341045 for the context. @WDoranWMF please sign off just in case that's needed. Thanks!
[11:03:40] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ExternalLinks: Make oneWildcard avoid adding wildcard to domain" [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936739 (owner: 10Ladsgroup)
[11:06:49] <jbond>	 jayme: puppet has been enabled and run on all envoyproxy systems
[11:06:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[11:07:35] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos)
[11:07:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:09:49] <jayme>	 jbond: there are still some alerting. restbase for example - or is the puppet run still ongoing?
[11:11:28] <wikibugs>	 (03CR) 10Sergio Gimeno: [C: 03+1] Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm)
[11:11:41] <jbond>	 jayme: hmm checking
[11:15:27] <jayme>	 the restbase node I was looking at is now done
[11:16:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10WDoranWMF) Approved
[11:17:35] <jayme>	 jbond: looks good now
[11:17:39] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[11:17:43] <jbond>	 jayme: let me know if you see any others, fyi the alert is returning " NRPE: Command 'check_envoy_runtime_vars' not defined"
[11:17:57] <jbond>	 and i noticed the following when i rolled out the change 
[11:17:58] <jbond>	 Notice: /Stage[main]/Profile::Envoy/Nrpe::Monitor_service[envoy_runtime_vars]/Nrpe::Check[check_envoy_runtime_vars]/File[/etc/nagios/nrpe.d/check_envoy_runtime_vars.cfg]/ensure: removed
[11:18:03] <jayme>	 yeah, that's because puppet also did not run on alert*
[11:18:08] <jbond>	 so would seem that something is still using that check
[11:18:16] <jbond>	 ahh let me get that
[11:18:22] <jayme>	 running already
[11:18:26] <jbond>	 cool
[11:18:40] <jayme>	 thanks!
[11:18:44] <jbond>	 np
[11:27:01] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[11:31:41] <wikibugs>	 (03CR) 10Gmodena: data-engineering: add alerts flink enrichment apps (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena)
[11:35:35] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) 05Open→03Resolved a:03cmooney Still stable so I will close this for now, if it re-occurs we can engage Juniper.
[11:36:56] <wikibugs>	 (03PS13) 10Muehlenhoff: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497)
[11:37:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[11:38:23] <wikibugs>	 (03CR) 10Muehlenhoff: Add a new nftables::service define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:38:27] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[11:40:00] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Add CSP headers for restbase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos)
[11:41:40] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: Add CSP headers for restbase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos)
[11:42:28] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[11:44:20] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[11:58:58] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] knams: decom Datahop [homer/public] - 10https://gerrit.wikimedia.org/r/932236 (https://phabricator.wikimedia.org/T340049) (owner: 10Ayounsi)
[11:59:33] <wikibugs>	 (03Merged) 10jenkins-bot: knams: decom Datahop [homer/public] - 10https://gerrit.wikimedia.org/r/932236 (https://phabricator.wikimedia.org/T340049) (owner: 10Ayounsi)
[12:00:01] <XioNoX>	 !log decom datahop in knams - T340049
[12:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: fix k8s selector for kubernetes-generic [alerts] - 10https://gerrit.wikimedia.org/r/937060 (owner: 10Filippo Giunchedi)
[12:02:42] <icinga-wm>	 RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:04:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:16] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:15] <wikibugs>	 (03PS2) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554)
[12:08:17] <wikibugs>	 (03PS1) 10JMeybohm: prometheus: Condense metric_relabel_configs into one [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554)
[12:14:24] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42402/console" [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[12:16:02] <wikibugs>	 (03PS1) 10Ayounsi: users: Update mark's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937075 (https://phabricator.wikimedia.org/T336769)
[12:16:20] <icinga-wm>	 RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:59] <wikibugs>	 (03PS1) 10Fabfur: common.yaml: update fabfur key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937077
[12:24:21] <wikibugs>	 (03PS1) 10Ayounsi: users: Update robh's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937078 (https://phabricator.wikimedia.org/T336769)
[12:25:29] <wikibugs>	 (03CR) 10RobH: [C: 03+2] users: Update robh's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937078 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[12:26:57] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: refactor alerts-deploy to pick up k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/937079
[12:27:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Please add a comment next to the relabel configs too mentioning this pitfall" [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[12:32:55] <wikibugs>	 (03PS2) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231
[12:32:57] <wikibugs>	 (03PS5) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691)
[12:32:59] <wikibugs>	 (03PS2) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[12:33:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42403/console" [puppet] - 10https://gerrit.wikimedia.org/r/937079 (owner: 10Filippo Giunchedi)
[12:34:54] <wikibugs>	 (03PS3) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231
[12:34:56] <wikibugs>	 (03PS6) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691)
[12:34:58] <wikibugs>	 (03PS3) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[12:39:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[12:39:20] <icinga-wm>	 PROBLEM - Host puppetdb2003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:39:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro)
[12:39:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[12:40:44] <icinga-wm>	 RECOVERY - Host puppetdb2003 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[12:43:50] <icinga-wm>	 PROBLEM - puppet last run on kubestagemaster2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:44:20] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:45:16] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:14] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:24] <icinga-wm>	 RECOVERY - puppet last run on kubestagemaster2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:50:53] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10fnegri)
[12:51:30] <icinga-wm>	 RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:21] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:53:54] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.postgresql.postgres-init
[12:59:36] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.postgresql.postgres-init (exit_code=97)
[12:59:40] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.postgresql.postgres-init
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300).
[13:00:05] <jouncebot>	 sergi0 and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300)
[13:00:13] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[13:00:14] <sergi0>	 hello
[13:00:18] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:00:34] <taavi>	 urbanecm: I assume you're deploying those patches?
[13:00:43] <urbanecm>	 correct
[13:00:44] <urbanecm>	 hi all
[13:00:50] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:01:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno)
[13:01:35] <wikibugs>	 (03PS4) 10Urbanecm: GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno)
[13:01:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: data-engineering: add alerts flink enrichment apps (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena)
[13:01:41] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno)
[13:02:19] <aanzx>	 Urbanecm: Can I add one more now, which was scheduled for the morning backport, which didn't happen?
[13:02:21] <aanzx>	 https://gerrit.wikimedia.org/r/c/936826/
[13:02:21] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno)
[13:02:35] <urbanecm>	 aanzx: sure, can you add it to the calendar please?
[13:02:43] <aanzx>	 Ok
[13:03:20] <aanzx>	 Added
[13:03:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:03:28] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935723|GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis (T308135 T308136 T308137)]]
[13:03:34] <stashbot>	 T308137: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137
[13:03:34] <stashbot>	 T308135: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135
[13:03:35] <stashbot>	 T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136
[13:03:58] <wikibugs>	 (03PS2) 10Urbanecm: Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399)
[13:04:46] <wikibugs>	 (03CR) 10Gmodena: data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena)
[13:04:58] <logmsgbot>	 !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:935723|GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis (T308135 T308136 T308137)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:05:36] <urbanecm>	 sergi0: not 100% sure if link recommendation backend is testable at mwdebug, but if you want to test sth, go ahead :)
[13:06:41] <wikibugs>	 (03PS2) 10Urbanecm: Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx)
[13:07:08] <sergi0>	 urbanecm: I don't think we can test anything in mwdebug at this point. I'll check the dataset containers during this evening.
[13:07:20] <urbanecm>	 sounds good to me. proceeding.
[13:07:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm)
[13:08:11] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm)
[13:08:41] <wikibugs>	 (03PS3) 10Urbanecm: Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx)
[13:08:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx)
[13:09:50] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::services_proxy::envoy: add inference to enabled_listeners [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170) (owner: 10Elukey)
[13:10:27] <wikibugs>	 (03PS3) 10Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763)
[13:11:48] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 111763 bytes in 3.822 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:12:50] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 115397 bytes in 3.849 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:13:13] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:935723|GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis (T308135 T308136 T308137)]] (duration: 09m 45s)
[13:13:19] <stashbot>	 T308137: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137
[13:13:19] <stashbot>	 T308135: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135
[13:13:19] <stashbot>	 T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136
[13:13:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:14:10] <urbanecm>	 sergi0: your patch's deployed. anything else from you?
[13:14:12] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936639|Growth: Increase mentorship percentage to 25% on enwiki (T341399)]]
[13:14:14] <stashbot>	 T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399
[13:14:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx)
[13:15:29] <wikibugs>	 (03Merged) 10jenkins-bot: Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx)
[13:16:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10BTullis) Jennifer is already a member of `wmf`  https://ldap.toolforge.org/user/jebe  Double checked. ` btullis@seaborgium:~$ ldapsearch -A -x member=uid=jebe,ou=people,dc=wikimedia,dc=org dn # ex...
[13:17:04] <sergi0>	 urbanecm: nope, thanks for your assistance :)
[13:17:17] <urbanecm>	 np
[13:17:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10BTullis) 05Open→03Resolved a:03BTullis
[13:18:47] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "@Arzhel: Happy to take care of merging this, let me know." [homer/public] - 10https://gerrit.wikimedia.org/r/937077 (owner: 10Fabfur)
[13:21:27] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936639|Growth: Increase mentorship percentage to 25% on enwiki (T341399)]] (duration: 07m 15s)
[13:21:30] <stashbot>	 T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399
[13:21:42] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: changeprop: Change normal_rule_processing to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090
[13:21:52] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936826|Enable tabs for non loggedin mobile users on knwikisource (T340276)]]
[13:21:55] <stashbot>	 T340276: Enable tabs for non logged-in mobile skin users on knwikisource - https://phabricator.wikimedia.org/T340276
[13:22:31] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [restbase/deploy@930f075]: (no justification provided)
[13:22:41] <wikibugs>	 (03PS1) 10Elukey: burrow: add LimitNOFILE=8192 to systemd's units [puppet] - 10https://gerrit.wikimedia.org/r/937091 (https://phabricator.wikimedia.org/T341551)
[13:23:24] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:936826|Enable tabs for non loggedin mobile users on knwikisource (T340276)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:23:32] <aanzx>	 Testing
[13:23:44] <urbanecm>	 aanzx: was just going to ask for testing :). let me know if it works.
[13:24:15] <wikibugs>	 (03CR) 10Jbond: [WIP] Manage TLS on network devices (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[13:27:18] <aanzx>	 urbanecm: works, good to go
[13:27:43] <urbanecm>	 proceeding
[13:28:41] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[13:28:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] users: Update mark's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937075 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[13:29:13] <wikibugs>	 (03Merged) 10jenkins-bot: users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[13:29:22] <wikibugs>	 (03Merged) 10jenkins-bot: users: Update mark's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937075 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[13:29:25] <wikibugs>	 (03Merged) 10jenkins-bot: users: Update robh's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937078 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[13:29:31] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] common.yaml: update fabfur key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937077 (owner: 10Fabfur)
[13:30:06] <wikibugs>	 (03Merged) 10jenkins-bot: common.yaml: update fabfur key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937077 (owner: 10Fabfur)
[13:30:18] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Update filippo's key [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi)
[13:30:37] <wikibugs>	 (03PS2) 10JMeybohm: prometheus: Condense metric_relabel_configs into one [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554)
[13:30:39] <wikibugs>	 (03PS3) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554)
[13:30:52] <wikibugs>	 (03Merged) 10jenkins-bot: Update filippo's key [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi)
[13:33:25] <Amir1>	 James_F: <3 I don't know how to thank you
[13:33:26] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936826|Enable tabs for non loggedin mobile users on knwikisource (T340276)]] (duration: 11m 33s)
[13:33:29] <stashbot>	 T340276: Enable tabs for non logged-in mobile skin users on knwikisource - https://phabricator.wikimedia.org/T340276
[13:33:42] <urbanecm>	 aanzx: and deployed. anything else?
[13:33:46] <wikibugs>	 (03PS1) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527)
[13:33:55] <aanzx>	 Nothing, thanks 
[13:34:01] <wikibugs>	 (03PS1) 10Fabfur: admin: Update fabfur's rsa key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/937093
[13:34:03] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] prometheus: Condense metric_relabel_configs into one [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[13:34:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:35:05] <wikibugs>	 (03PS4) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231
[13:35:07] <wikibugs>	 (03PS4) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[13:35:22] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10ArielGlenn) >>! In T341557#9005233, @BTullis wrote: > Jennifer is already a member of `wmf` >  > https://ldap.toolforge.org/user/jebe >  > Double checked. > ` > btullis@seaborgium:~$ ldapsearch -A...
[13:35:36] <wikibugs>	 (03CR) 10Mabualruz: "Synthetic test files" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz)
[13:36:16] <wikibugs>	 (03CR) 10David Caro: "tricky flake8, also as it does not pin python to 3.7, the tests for replica_cnf when it´s included in the global wmcs tox entry fail for m" [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro)
[13:36:36] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] burrow: add LimitNOFILE=8192 to systemd's units [puppet] - 10https://gerrit.wikimedia.org/r/937091 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey)
[13:36:44] <James_F>	 Amir1: Keeping being awesome is thanks enough!
[13:36:59] <Amir1>	 <3
[13:37:15] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Many thanks elukey" [puppet] - 10https://gerrit.wikimedia.org/r/937091 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey)
[13:38:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[13:39:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:40:59] <wikibugs>	 (03PS2) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527)
[13:42:22] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@930f075]: (no justification provided) (duration: 19m 50s)
[13:42:30] <wikibugs>	 (03PS1) 10Jsn.sherman: log additional events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937096 (https://phabricator.wikimedia.org/T326212)
[13:44:34] <wikibugs>	 (03CR) 10Jsn.sherman: "follow-up here: I6bfb201d0b8cdd0bbe22a1cbdbc1298cf1bab2cc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman)
[13:49:45] <moritzm>	 !log rebalance ganeti group eqiad/d after reboots
[13:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "https://codesearch.wmcloud.org/search/?q=_normal_rule_processing&files=&excludeFiles=&repos= says nothing in the various repos. That leave" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris)
[13:52:22] <wikibugs>	 (03PS3) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527)
[13:52:35] <wikibugs>	 (03PS1) 10Ladsgroup: Externallinks: Keep domain wildcard if path is not specified [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937108 (https://phabricator.wikimedia.org/T326251)
[13:54:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:55:02] <wikibugs>	 (03PS1) 10Btullis: Add the option to clean datahub indices to the restore job [deployment-charts] - 10https://gerrit.wikimedia.org/r/937099 (https://phabricator.wikimedia.org/T329514)
[13:55:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "The following panels in https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job will need to be updated" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris)
[13:56:32] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add the option to clean datahub indices to the restore job [deployment-charts] - 10https://gerrit.wikimedia.org/r/937099 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:56:50] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] prometheus: refactor alerts-deploy to pick up k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/937079 (owner: 10Filippo Giunchedi)
[13:56:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "And the 2 job run panels in https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris)
[13:57:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi)
[13:57:57] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons.
[13:58:21] <wikibugs>	 (03PS3) 10JMeybohm: Add warning alerts on envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554)
[13:58:25] <wikibugs>	 (03Merged) 10jenkins-bot: Add the option to clean datahub indices to the restore job [deployment-charts] - 10https://gerrit.wikimedia.org/r/937099 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:59:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: refactor alerts-deploy to pick up k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/937079 (owner: 10Filippo Giunchedi)
[13:59:28] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[13:59:43] <moritzm>	 !log installing yajl security updates
[13:59:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[14:01:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) a:03BBlack Assigning the task to @BBlack for when he comes back.
[14:01:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for yajl [puppet] - 10https://gerrit.wikimedia.org/r/937101
[14:02:12] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:02:27] <wikibugs>	 (03PS4) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554)
[14:02:29] <wikibugs>	 (03PS1) 10JMeybohm: envoy: Absent monitor_systemd_unit_state for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/937102 (https://phabricator.wikimedia.org/T341554)
[14:03:04] <wikibugs>	 (03CR) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:04:41] <akosiaris>	 Lucas_WMDE: round 2 of the migration to histograms for jobqueue metrics: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/937090
[14:04:56] <akosiaris>	 I've identified key panels and alerts in the comments and I'll fix those after merging, but searching through all grafana dashboards/alerts isn't feasible unfortunately. So if you have any other stuff you know of, please let me know
[14:05:32] <Lucas_WMDE>	 akosiaris: I’ll try to take a look later
[14:05:52] <akosiaris>	 Lucas_WMDE: no rush, it can wait. 
[14:08:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for yajl [puppet] - 10https://gerrit.wikimedia.org/r/937101 (owner: 10Muehlenhoff)
[14:11:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10RobH)
[14:12:04] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[14:12:22] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Confirmed with Fabrizio on IRC." [puppet] - 10https://gerrit.wikimedia.org/r/937093 (owner: 10Fabfur)
[14:13:28] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] admin: Update fabfur's rsa key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/937093 (owner: 10Fabfur)
[14:13:55] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+1 C: 03+2] "Tested and works well and makes it much faster too." [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup)
[14:13:57] <Lucas_WMDE>	 “add library hint for yall” thx ;)
[14:14:09] <wikibugs>	 (03PS1) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355)
[14:14:20] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:14:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott)
[14:15:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[14:15:34] <wikibugs>	 (03PS2) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355)
[14:16:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice! Thank you" [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:16:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:16:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] envoy: Absent monitor_systemd_unit_state for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/937102 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:17:00] <wikibugs>	 (03PS4) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527)
[14:17:03] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[14:17:37] <moritzm>	 !log restarting apache on mw canaries
[14:17:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:19:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[14:20:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena)
[14:20:59] <wikibugs>	 (03PS3) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355)
[14:21:11] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[14:21:33] <wikibugs>	 (03Merged) 10jenkins-bot: sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup)
[14:22:47] <wikibugs>	 (03PS4) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355)
[14:25:15] <wikibugs>	 (03CR) 10David Caro: wmcs: enable isort and black (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro)
[14:26:42] <wikibugs>	 (03PS5) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355)
[14:36:04] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): changeprop: Change normal_rule_processing to histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris)
[14:40:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:45:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Move nftables/ferm types to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497)
[14:48:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move nftables/ferm types to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:49:17] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons.
[14:49:36] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:50:22] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:51:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jcrespo) @jbond I was out of office. Backups is a very special case, I would like to comment that...
[14:52:06] <wikibugs>	 (03PS1) 10Andrew Bogott: Magnum: allow configuration of etcd discovery service host [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355)
[14:52:44] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:53:30] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:55:50] <Lucas_WMDE>	 jouncebot: now
[14:55:50] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 4 minute(s)
[14:56:00] <wikibugs>	 (03PS1) 10Ssingh: dnsrecursor: use validate_cmd for pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/937139
[14:56:20] <Lucas_WMDE>	 I’d like to do a quick backport if that’s okay with everyone
[14:56:28] <Lucas_WMDE>	 (will go ahead in a few minutes unless I hear otherwise)
[14:56:30] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add warning alerts on envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:57:32] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] envoy: Absent monitor_systemd_unit_state for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/937102 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:57:35] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:57:37] <wikibugs>	 (03Merged) 10jenkins-bot: Add warning alerts on envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm)
[14:59:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936737 (https://phabricator.wikimedia.org/T340217) (owner: 10Jdlrobson)
[15:03:17] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[15:07:56] <wikibugs>	 (03CR) 10Andrew Bogott: "pcc results https://puppet-compiler.wmflabs.org/output/937138/42404/" [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott)
[15:09:12] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Applying JVM update - eevans@cumin1001
[15:11:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: changeprop: Change normal_rule_processing to histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris)
[15:12:47] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (10RobH) 05Open→03Resolved a:03RobH ` robh@cumin1001:~$ ping cr2-eqsin.mgmt.eqsin.wmnet PING cr2-eqsin.mgmt.eqsin.wmnet (10.132.128.6) 56(84) bytes of data. 64 bytes from cr2-eqsin.mgmt.eqsin.wmnet (10.132.128.6): icmp_seq=1 ttl=60...
[15:13:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:13:55] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341437 (10RobH) 05Open→03Resolved a:03RobH ` robh@cumin1001:~$ ping cp5023.mgmt.eqsin.wmnet PING cp5023.mgmt.eqsin.wmnet (10.132.128.19) 56(84) bytes of data. 64 bytes from cp5023.mgmt.eqsin.wmnet (10.132.128.19): icmp_seq=1 ttl=60 time=22...
[15:14:25] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) A "cron" (timer) has been created. So it could be called resolved. The only thing is that this is opt-in and not automatically fo...
[15:15:05] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10RobH) 05Open→03Resolved did this weeks ago and forgot to resolve
[15:15:25] <wikibugs>	 10SRE, 10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10RobH) 05Open→03Resolved After some discussion there isn't a lot to adjust so we've just raised our power caps.
[15:15:36] <wikibugs>	 (03PS2) 10Krinkle: Remove oversampling for Navigation Timing extension. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog)
[15:16:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by krinkle@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog)
[15:17:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "this should be needed to scap deploy the docroot on contint*" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn)
[15:17:48] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:restbase-codfw: Applying JVM update - eevans@cumin1001
[15:18:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:18:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695)
[15:19:00] <wikibugs>	 (03CR) 10Daniel Kinzler: "The following boards have versions of the "job concurrency" panel (in a collapsed row at the bottom):" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris)
[15:19:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli)
[15:19:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:20:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: quarterly_metrics.sh: Improve Bitergia instructions [puppet] - 10https://gerrit.wikimedia.org/r/935416 (https://phabricator.wikimedia.org/T341064) (owner: 10Aklapper)
[15:20:55] <wikibugs>	 (03Merged) 10jenkins-bot: Remove oversampling for Navigation Timing extension. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog)
[15:21:00] <wikibugs>	 (03Merged) 10jenkins-bot: Add option for html label in Menu template [skins/Vector] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936737 (https://phabricator.wikimedia.org/T340217) (owner: 10Jdlrobson)
[15:21:21] <logmsgbot>	 !log krinkle@deploy1002 Started scap: Backport for [[gerrit:930712|Remove oversampling for Navigation Timing extension. (T337858)]]
[15:21:24] <stashbot>	 T337858: Remove is_oversample feature in the Navigation Timing extension - https://phabricator.wikimedia.org/T337858
[15:22:24] <Krinkle>	 Lucas_WMDE: missed your message, I ran scap backport, it says it's locked, so go ahead if you haven't already
[15:22:43] <Krinkle>	 what's confusing me is that scap then continued without waiting after printing there is a lock
[15:22:48] <Lucas_WMDE>	 apparently mine failed, patch didn’t apply :S
[15:22:53] <logmsgbot>	 !log krinkle@deploy1002 phedenskog and krinkle: Backport for [[gerrit:930712|Remove oversampling for Navigation Timing extension. (T337858)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[15:23:02] <Lucas_WMDE>	 so I guess you can go ahead at the moment?
[15:23:09] <Lucas_WMDE>	 and I’ll need to figure out what I can do about my conflict
[15:23:10] <Krinkle>	 okay, I'm guessing scap takes care of not accidentally deploying yours
[15:23:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "since this isn't about to be merged and I will be out for a while, I am removing myself from open gerrit patches" [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff)
[15:23:45] <Krinkle>	 git l
[15:23:46] <Krinkle>	 * 3267c8b85 - (HEAD -> master, origin/master, origin/HEAD) Remove oversampling for Navigation Timing extension. (8 minutes ago) <Peter Hedenskog>
[15:23:46] <Krinkle>	 * 67194085c - Enable tabs for non loggedin mobile users on knwikisource (2 hours ago) <anzx>
[15:24:00] <Krinkle>	 yours is change 936737, right?
[15:24:03] <Krinkle>	 so LGTM
[15:24:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:24:38] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-27].codfw.wmnet: Applying JVM update - eevans@cumin1001
[15:24:51] <Krinkle>	 ah yours is in a different repo
[15:24:55] <Krinkle>	 let me check 
[15:25:20] <wikibugs>	 (03PS2) 10Effie Mouzeli: (WIP) thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695)
[15:26:32] <logmsgbot>	 !log krinkle@deploy1002 Sync cancelled.
[15:27:24] <Krinkle>	 this is strange, so aborted scap commands just leave it applied for a future sync to implicitly deploy?
[15:27:48] <Krinkle>	 I don't know why that surprises me since that's how it's always worked, I guess when I'm not the one +2'ing and git pull'ing, I expect it to also magically undo those
[15:28:28] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:33] <logmsgbot>	 !log krinkle@deploy1002 Locking from deployment [ALL REPOSITORIES]: pending security problem, see mediawiki_security IRC
[15:32:29] <wikibugs>	 10SRE-swift-storage, 10Commons: Server error 500 after uploading chunk - https://phabricator.wikimedia.org/T340917 (10Midleading) In fact the file key has been changed when uploadstash-file-not-found error occured. Need to go to Special:UploadStash to find the new correct key and manually recover, see https://...
[15:33:08] <wikibugs>	 (03PS2) 10TChin: Bump stream versions in mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/934719 (https://phabricator.wikimedia.org/T340746)
[15:34:46] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:12] <wikibugs>	 (03PS1) 10Elukey: burrow: use start-latest=true for the kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/937144 (https://phabricator.wikimedia.org/T341551)
[15:37:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:39:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Reasoning and fix LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/937144 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey)
[15:40:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] burrow: use start-latest=true for the kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/937144 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey)
[15:41:55] <wikibugs>	 (03PS1) 10TChin: mw-page-content-change-enrich bump docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/937145 (https://phabricator.wikimedia.org/T338169)
[15:42:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:44:48] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:00] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (10Novem_Linguae) I would be grateful if someone could fix this. I am on Windows and I cannot submit patches to the operations/puppet repo bec...
[15:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[15:48:36] <logmsgbot>	 !log krinkle@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: pending security problem, see mediawiki_security IRC (duration: 17m 03s)
[15:53:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:936737|Add option for html label in Menu template (T340217)]]
[15:54:13] <Krinkle>	 !log Deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930712 ("Remove oversampling for Navigation Timing extension.")
[15:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 jdlrobson and lucaswerkmeister-wmde: Backport for [[gerrit:936737|Add option for html label in Menu template (T340217)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:05] <jouncebot>	 cwhite: Dear deployers, time to do the Logstash DC Transition deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1600).
[16:00:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:00:59] <Lucas_WMDE>	 (I’m still deploying but that probably doesn’t affect Puppeteers)
[16:02:06] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] "LGTM. Feel free to deploy the change when ready." [deployment-charts] - 10https://gerrit.wikimedia.org/r/937145 (https://phabricator.wikimedia.org/T338169) (owner: 10TChin)
[16:02:21] <jbond>	 !oncall
[16:02:32] <jbond>	 !oncall now
[16:02:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:936737|Add option for html label in Menu template (T340217)]] (duration: 09m 15s)
[16:03:21] <Lucas_WMDE>	 !log previous backport also included [[gerrit:930712|Remove oversampling for Navigation Timing extension. (T337858)]]
[16:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:24] <stashbot>	 T337858: Remove is_oversample feature in the Navigation Timing extension - https://phabricator.wikimedia.org/T337858
[16:04:07] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase20[13-27].codfw.wmnet: Applying JVM update - eevans@cumin1001
[16:05:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:07:19] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/935502 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[16:08:13] <wikibugs>	 (03PS3) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695)
[16:08:39] <sukhe>	 !log upgrade dns1004 to gdnsd 3.99.0~alpha2
[16:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli)
[16:09:17] <wikibugs>	 (03CR) 10Hashar: "The tests pass locally under Python 3.10. I have to resetup my dev environment to reinstall the previous python and test against them loca" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar)
[16:17:17] <icinga-wm>	 PROBLEM - puppet last run on cp6012 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:17:23] <icinga-wm>	 PROBLEM - puppet last run on install6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:17:37] <icinga-wm>	 PROBLEM - puppet last run on cp6003 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:17:49] <wikibugs>	 (03CR) 10RLazarus: [V: 03+2 C: 03+2] otelcol: Bump to version 0.81.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/936832 (owner: 10RLazarus)
[16:18:11] <icinga-wm>	 PROBLEM - puppet last run on netflow6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:24] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T341433 (10Papaul) 05Open→03Resolved Power cord issue. Fixed
[16:18:33] <icinga-wm>	 PROBLEM - puppet last run on cp6009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:39] <icinga-wm>	 PROBLEM - puppet last run on doh6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:43] <icinga-wm>	 PROBLEM - puppet last run on lvs6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:43] <icinga-wm>	 PROBLEM - puppet last run on durum6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:47] <icinga-wm>	 PROBLEM - puppet last run on dns6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:51] <icinga-wm>	 PROBLEM - puppet last run on ganeti6001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:05] <icinga-wm>	 PROBLEM - puppet last run on lvs6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:17] <icinga-wm>	 PROBLEM - puppet last run on ganeti6004 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:17] <sukhe>	 ^ being discussed in -sre
[16:19:25] <icinga-wm>	 PROBLEM - puppet last run on ganeti6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:37] <icinga-wm>	 PROBLEM - puppet last run on dns6002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:57] <icinga-wm>	 PROBLEM - puppet last run on lvs6003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:57] <icinga-wm>	 PROBLEM - puppet last run on cp6008 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:57] <icinga-wm>	 PROBLEM - puppet last run on cp6004 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:20:05] <icinga-wm>	 PROBLEM - puppet last run on cp6005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:20:05] <icinga-wm>	 PROBLEM - puppet last run on cp6001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:20:17] <icinga-wm>	 PROBLEM - puppet last run on ganeti6003 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:20:17] <icinga-wm>	 PROBLEM - puppet last run on doh6001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:20:45] <wikibugs>	 (03PS4) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695)
[16:20:51] <icinga-wm>	 PROBLEM - puppet last run on cp6015 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:20:53] <icinga-wm>	 PROBLEM - puppet last run on cp6011 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:21:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli)
[16:21:27] <icinga-wm>	 PROBLEM - puppet last run on ncredir6002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:21:39] <icinga-wm>	 PROBLEM - puppet last run on cp6014 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:21:41] <icinga-wm>	 PROBLEM - puppet last run on bast6002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:21:41] <icinga-wm>	 PROBLEM - puppet last run on cp6007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:21:41] <icinga-wm>	 PROBLEM - puppet last run on cp6010 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:21:53] <icinga-wm>	 PROBLEM - puppet last run on cp6016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:22:01] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Papaul) DDR-4 slot A1 32G
[16:22:39] <icinga-wm>	 RECOVERY - puppet last run on cp6012 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:22:42] <wikibugs>	 (03PS5) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695)
[16:23:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli)
[16:23:55] <icinga-wm>	 RECOVERY - puppet last run on cp6009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:24:45] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[16:25:41] <icinga-wm>	 RECOVERY - puppet last run on doh6001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:27:07] <icinga-wm>	 RECOVERY - puppet last run on cp6007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:27:19] <icinga-wm>	 RECOVERY - puppet last run on cp6016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:28:09] <vgutierrez>	 !log reenabling puppet in cp6002
[16:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:13] <icinga-wm>	 RECOVERY - puppet last run on install6002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:28:34] <wikibugs>	 (03PS6) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695)
[16:29:29] <icinga-wm>	 RECOVERY - puppet last run on doh6002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:29:33] <icinga-wm>	 RECOVERY - puppet last run on lvs6002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:29:41] <icinga-wm>	 RECOVERY - puppet last run on ganeti6001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:30:17] <icinga-wm>	 RECOVERY - puppet last run on ganeti6002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:30:49] <icinga-wm>	 RECOVERY - puppet last run on lvs6003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:30:49] <icinga-wm>	 RECOVERY - puppet last run on cp6004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:31:43] <icinga-wm>	 RECOVERY - puppet last run on cp6015 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:31:49] <icinga-wm>	 RECOVERY - puppet last run on cp6011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:32:35] <icinga-wm>	 RECOVERY - puppet last run on cp6014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:33:57] <icinga-wm>	 RECOVERY - puppet last run on cp6003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:36:17] <icinga-wm>	 RECOVERY - puppet last run on cp6008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:36:23] <icinga-wm>	 RECOVERY - puppet last run on cp6001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:36:25] <icinga-wm>	 RECOVERY - puppet last run on cp6005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:38:05] <icinga-wm>	 RECOVERY - puppet last run on cp6010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:39:57] <icinga-wm>	 RECOVERY - puppet last run on netflow6001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:40:35] <wikibugs>	 10SRE, 10Phabricator, 10Traffic, 10SecTeam-Processed: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10sbassett)
[16:40:53] <icinga-wm>	 RECOVERY - puppet last run on lvs6001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:41:05] <icinga-wm>	 RECOVERY - puppet last run on ganeti6004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:43:21] <icinga-wm>	 RECOVERY - puppet last run on ncredir6002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:43:35] <icinga-wm>	 RECOVERY - puppet last run on bast6002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:44:44] <wikibugs>	 (03PS1) 10RLazarus: opentelemetry-collector: Bump tag to 0.81.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/937152
[16:45:45] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Bump tag to 0.81.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/937152 (owner: 10RLazarus)
[16:45:59] <icinga-wm>	 RECOVERY - puppet last run on durum6001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:46:01] <icinga-wm>	 RECOVERY - puppet last run on dns6001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:46:27] <wikibugs>	 (03Merged) 10jenkins-bot: opentelemetry-collector: Bump tag to 0.81.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/937152 (owner: 10RLazarus)
[16:46:57] <icinga-wm>	 RECOVERY - puppet last run on dns6002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:47:33] <icinga-wm>	 RECOVERY - puppet last run on ganeti6003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:52:04] <wikibugs>	 (03CR) 10Hashar: "recheck cause I could not reproduce locally?" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar)
[16:52:48] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1700)
[17:03:55] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937156 (https://phabricator.wikimedia.org/T340245)
[17:03:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937156 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot)
[17:04:38] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937156 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot)
[17:05:06] <logmsgbot>	 !log dduvall@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.17  refs T340245
[17:05:11] <stashbot>	 T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245
[17:15:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:21:25] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:22:41] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:23:21] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:24:20] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:28:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:31:37] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+1] "hope this works" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937096 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman)
[17:47:13] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "+1 for idea. It might be good to remind user in output that they have local config?" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar)
[17:50:57] <logmsgbot>	 !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.17  refs T340245 (duration: 45m 50s)
[17:51:00] <stashbot>	 T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245
[17:53:15] <logmsgbot>	 !log dduvall@deploy1002 Pruned MediaWiki: 1.41.0-wmf.15 (duration: 02m 16s)
[18:00:05] <jouncebot>	 dduvall and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1800).
[18:06:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:openstack: move magnum fw rules to haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah)
[18:08:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah)
[18:08:17] <wikibugs>	 (03PS4) 10Andrew Bogott: P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah)
[18:09:12] <wikibugs>	 (03CR) 10Hashar: "I am confused cause I clearly remember to have moving those list of hosts to use a Puppet DB query based on hosts having the relevant Scap" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn)
[18:17:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott)
[18:22:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10cmooney) @KFrancis Hi.  Would you be kind enough to follow up with @Ifrahkhanyaree and get them to sign an NDA before I grant the requested access?  Thanks.
[18:46:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:49:42] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937165 (https://phabricator.wikimedia.org/T340245)
[18:49:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937165 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot)
[18:50:27] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937165 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot)
[18:57:17] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.17  refs T340245
[18:57:20] <stashbot>	 T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245
[18:57:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "I see this also needs a rebase and I uploaded in 2022. so maybe you did and this is outdated. let me do the manual rebase and find out!, h" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn)
[18:58:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:00:19] <wikibugs>	 (03CR) 10Dzahn: "So.. you are right. You have already replaced the list with a query. It's simply that this happened after this patch was originally upload" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn)
[19:00:40] <wikibugs>	 (03Abandoned) 10Dzahn: scap: remove contint2001 from "dsh groups" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn)
[19:03:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:15:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10KFrancis) @Ifrahkhanyaree, please send the following information to my WMF email address, kfrancis@wikimedia.org:   Full legal name Mailing address Email address
[19:21:34] <wikibugs>	 (03CR) 10Hashar: scap: remove contint2001 from "dsh groups" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn)
[19:29:37] <wikibugs>	 (03PS2) 10Andrew Bogott: Magnum: allow configuration of etcd discovery service host [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355)
[19:29:39] <wikibugs>	 (03PS1) 10Andrew Bogott: etcd-discovery: restart etcd after config change [puppet] - 10https://gerrit.wikimedia.org/r/937172 (https://phabricator.wikimedia.org/T341355)
[19:32:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Magnum: allow configuration of etcd discovery service host [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott)
[19:32:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] etcd-discovery: restart etcd after config change [puppet] - 10https://gerrit.wikimedia.org/r/937172 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott)
[19:45:40] <wikibugs>	 (03PS1) 10Urbanecm: Always return the class as string from Html::getTextInputAttributes [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937113 (https://phabricator.wikimedia.org/T341566)
[19:47:53] <wikibugs>	 (03PS1) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173
[19:48:07] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[19:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[19:49:22] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:56:30] <wikibugs>	 (03CR) 10Dzahn: "yep:) thanks for doing that! it did reduce the number of places with host names, cool" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T2000).
[20:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <taavi>	 o/ I can deploy
[20:02:03] <taavi>	 Jdlrobson: ping
[20:11:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10wiki_willy) a:03Jclark-ctr Hi @Jclark-ctr - can you work with @aborrero on the timeframe and migration plan for these servers?   Th...
[20:14:38] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[20:16:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:17:53] <urbanecm>	 taavi: can i steal the window for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/937113? or at least until Jon comes.
[20:18:07] <taavi>	 urbanecm: yes, go ahead!
[20:18:13] <urbanecm>	 ty
[20:18:15] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Always return the class as string from Html::getTextInputAttributes [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937113 (https://phabricator.wikimedia.org/T341566) (owner: 10Urbanecm)
[20:18:21] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1518696 bytes in 5.657 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[20:18:21] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:18:29] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1521517 bytes in 5.341 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[20:23:01] <Jdlrobson>	 taavi: here, sorry I'm late
[20:23:12] <Jdlrobson>	 urbanecm:  back
[20:23:29] <urbanecm>	 okay, i'll do your patch too
[20:23:53] <wikibugs>	 (03PS3) 10Urbanecm: Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097 (owner: 10Jdlrobson)
[20:23:57] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097 (owner: 10Jdlrobson)
[20:24:38] <wikibugs>	 (03Merged) 10jenkins-bot: Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097 (owner: 10Jdlrobson)
[20:25:15] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]]
[20:26:47] <logmsgbot>	 !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:27:03] <urbanecm>	 Jdlrobson: your patch is at mwdebug1001. can you test?
[20:27:12] <Jdlrobson>	 checking
[20:29:03] <Jdlrobson>	 urbanecm: the grants one is good but not the wiktionary one - it's the wrong size :/
[20:29:16] <Jdlrobson>	 the SVG on commons is bad :(
[20:29:27] <urbanecm>	 :-(
[20:29:28] <Jdlrobson>	 Shall I follow up or revert and do a new patch?
[20:30:00] <urbanecm>	 Jdlrobson: depends on how long a follow-up would take. if it's a few minutes thing, upload a follow-up please.
[20:30:49] <Jdlrobson>	 1 min
[20:30:50] <Jdlrobson>	 I'll do it now
[20:31:00] <urbanecm>	 great, waiting :)
[20:31:27] <wikibugs>	 (03PS1) 10Andrew Bogott: magnum: use eqiad1-hosted etcd discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937176 (https://phabricator.wikimedia.org/T341355)
[20:32:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] magnum: use eqiad1-hosted etcd discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937176 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott)
[20:32:17] <wikibugs>	 (03PS1) 10Jdlrobson: Drop idwiktionary wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937177
[20:32:18] <Jdlrobson>	 ^ urbanecm 
[20:32:29] <wikibugs>	 (03Merged) 10jenkins-bot: Always return the class as string from Html::getTextInputAttributes [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937113 (https://phabricator.wikimedia.org/T341566) (owner: 10Urbanecm)
[20:32:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Drop idwiktionary wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937177 (owner: 10Jdlrobson)
[20:32:44] <logmsgbot>	 !log urbanecm@deploy1002 Sync cancelled.
[20:33:21] <wikibugs>	 (03Merged) 10jenkins-bot: Drop idwiktionary wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937177 (owner: 10Jdlrobson)
[20:33:55] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]], [[gerrit:937177|Drop idwiktionary wordmark]], [[gerrit:937113|Always return the class as string from Html::getTextInputAttributes (T341566)]]
[20:33:59] <stashbot>	 T341566: With $wgUseMediaWikiUIEverywhere = true, Xml::input() with class attribute causes warning or TypeError: htmlspecialchars() expects parameter 1 to be string, array given - https://phabricator.wikimedia.org/T341566
[20:35:27] <logmsgbot>	 !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]], [[gerrit:937177|Drop idwiktionary wordmark]], [[gerrit:937113|Always return the class as string from Html::getTextInputAttributes (T341566)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:35:40] <urbanecm>	 Jdlrobson: can you check mwdebug1001 again please? :)
[20:39:11] <Jdlrobson>	 urbanecm: LGTM now!
[20:39:16] <urbanecm>	 great, syncing
[20:39:27] <urbanecm>	 (together with my core backport)
[20:43:20] <Jdlrobson>	 thanks urbanecm 
[20:44:03] <urbanecm>	 np
[20:45:06] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]], [[gerrit:937177|Drop idwiktionary wordmark]], [[gerrit:937113|Always return the class as string from Html::getTextInputAttributes (T341566)]] (duration: 11m 10s)
[20:45:11] <urbanecm>	 Jdlrobson: and, deployed
[20:45:13] <urbanecm>	 anything else?
[20:45:14] <stashbot>	 T341566: With $wgUseMediaWikiUIEverywhere = true, Xml::input() with class attribute causes warning or TypeError: htmlspecialchars() expects parameter 1 to be string, array given - https://phabricator.wikimedia.org/T341566
[20:54:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] service: remove plaintext labweb service (I) [puppet] - 10https://gerrit.wikimedia.org/r/831174 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[20:54:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] service: remove plaintext labweb service (II) [puppet] - 10https://gerrit.wikimedia.org/r/831175 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[20:54:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] service: remove plaintext labweb service (III) [puppet] - 10https://gerrit.wikimedia.org/r/831176 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[20:54:55] <wikibugs>	 (03PS2) 10Andrew Bogott: service: remove plaintext labweb service (I) [puppet] - 10https://gerrit.wikimedia.org/r/831174 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[20:55:15] <wikibugs>	 (03PS2) 10Andrew Bogott: service: remove plaintext labweb service (II) [puppet] - 10https://gerrit.wikimedia.org/r/831175 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[20:55:24] <wikibugs>	 (03PS2) 10Andrew Bogott: service: remove plaintext labweb service (III) [puppet] - 10https://gerrit.wikimedia.org/r/831176 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[21:00:13] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) https://wikitech.wikimedia.org/wiki/PyBal
[21:01:55] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) https://wikitech.wikimedia.org/wiki/PyBal
[21:02:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] hieradata: labweb: update lvs pool to reference the ssl service [puppet] - 10https://gerrit.wikimedia.org/r/831173 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[21:04:46] <jinxer-wm>	 (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_labweb.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:05:54] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) Andrew Bogott this is me failing to downtime properly, sorry! https://wikitech.wikimedia.org/wiki/PyBal
[21:05:54] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) Andrew Bogott this is me failing to downtime properly, sorry! https://wikitech.wikimedia.org/wiki/PyBal
[21:14:46] <jinxer-wm>	 (ConfdResourceFailed) resolved: confd resource _srv_config-master_pybal_eqiad_labweb.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:15:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:16:43] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[21:18:25] <wikibugs>	 (03PS1) 10Superpes15: [knwiki] Reverting the temporary logo and updating logo/wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937183 (https://phabricator.wikimedia.org/T338136)
[21:18:27] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[21:24:35] <jinxer-wm>	 (Nonwrite HTTP requests with primary DB connections alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert
[21:38:25] <wikibugs>	 (03PS1) 10BCornwall: Add some petty spelling error fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/937185
[21:44:35] <jinxer-wm>	 (Nonwrite HTTP requests with primary DB connections alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert
[21:51:32] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:51:55] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:51:57] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:52:07] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:52:07] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2013 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:25] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:52:33] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:52:47] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[22:32:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:37:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:51:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:56:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:07:17] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:08:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:09:43] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:10:07] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:40:58] <wikibugs>	 (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[23:42:00] <wikibugs>	 (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[23:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[23:51:45] <wikibugs>	 (03PS1) 10Krinkle: mc: Remove mcrouter-with-onhost-tier from ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937197 (https://phabricator.wikimedia.org/T264604)
[23:54:10] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] "As discussed the after HTML should be identical to the HTML we're expecting to ship if https://gerrit.wikimedia.org/r/c/mediawiki/core/+/9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz)
[23:57:51] <icinga-wm>	 PROBLEM - PHP opcache health on mw1467 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health