[00:05:27] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:09] (03PS1) 10Jdlrobson: Add option for html label in Menu template [skins/Vector] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936737 (https://phabricator.wikimedia.org/T340217) [00:39:00] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936807 [00:39:06] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936807 (owner: 10TrainBranchBot) [00:53:05] (03CR) 10Andrea Denisse: [C: 03+2] webperf: Set XHGUI_PDO_INITSCHEMA=false to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 (owner: 10Krinkle) [00:55:29] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/936807 (owner: 10TrainBranchBot) [01:03:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341538 (10phaultfinder) [01:46:41] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:47:01] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:47:05] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:53:15] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:55] PROBLEM - Check systemd state on kafkamon1003 is 
CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:06] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0200) [02:00:57] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10Dzahn) I ran the "check_user" script on a cumin host as described in https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Verifying_WMF_developer_accounts ` WikiTech Users:... [02:05:42] !log LDAP - added urbanecm to wmf group, removed from nda group (conversion volunteer to staff) T341443 [02:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:46] T341443: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 [02:07:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.17 [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/936808 (https://phabricator.wikimedia.org/T340245) [02:07:38] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.17 [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/936808 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [02:07:55] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10Dzahn) done. - added to wmf group in LDAP - removed from nda group in LDAP - added to WMF-NDA in Phab https://phabricator.wikimedia.org/project/members/61/ - no puppet changed needed sin... 
[02:08:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:22] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10Dzahn) 05Open→03Resolved a:03Dzahn [02:23:15] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.17 [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/936808 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [02:29:19] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:01] (03Abandoned) 10Anzx: Enable tabs for non logged in users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx) [02:59:27] PROBLEM - Host urldownloader2003 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:07] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0300) [03:00:23] PROBLEM - Host irc2002 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:23] PROBLEM - Host logstash2032 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:29] PROBLEM - Host dragonfly-supernode2001 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:29] PROBLEM - Host schema2003 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:29] PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:29] PROBLEM - Host durum2001 is DOWN: PING CRITICAL - Packet loss = 100% 
[03:00:35] PROBLEM - Host webperf2003 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:57] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:03] PROBLEM - Host failoid2002 is DOWN: PING CRITICAL - Packet loss = 100% [03:01:38] (ProbeDown) firing: (2) Service irc2002:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2002:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:01:45] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:01:47] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:01:53] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:04:19] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:05:35] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:43:43] 10ops-codfw: Inbound interface errors - 
https://phabricator.wikimedia.org/T341538 (10Papaul) 05Open→03Resolved a:03Papaul [03:48:10] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T341433 (10Papaul) a:03Jhancock.wm [03:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [03:59:47] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Vendor 0.62.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/936388 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [04:00:42] (03Merged) 10jenkins-bot: opentelemetry-collector: Vendor 0.62.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/936388 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [04:00:56] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Fix image and entry point [deployment-charts] - 10https://gerrit.wikimedia.org/r/936389 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [04:01:48] (03Merged) 10jenkins-bot: opentelemetry-collector: Fix image and entry point [deployment-charts] - 10https://gerrit.wikimedia.org/r/936389 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [04:34:25] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [05:18:30] (03PS1) 10KartikMistry: Update MinT to 2023-07-10-051738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/936831 (https://phabricator.wikimedia.org/T341335) [05:24:10] !log imported otelcol-contrib 0.81.0 to buster-wikimedia and bullseye-wikimedia in component thirdparty/otelcol-contrib [05:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:33:42] (03PS1) 10RLazarus: otelcol: Bump to version 0.81.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/936832 [05:34:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:40:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:45:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:46:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:51:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:52:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:57:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0600). [06:31:37] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [06:36:10] !log rebalance ganeti group eqiad/B after reboots [06:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructre - https://phabricator.wikimedia.org/T341497 (10MoritzMuehlenhoff) If it's helpful for the rampup and/or early testing we can also go ahead and point cuminunpriv1001 to the Puppet 7... [06:59:29] !log restart kube-apiserver on ml-serve-ctrl1* as attempt to resolve spikes in latencies [06:59:30] (03CR) 10Ayounsi: "Could you provide an ssh-ed25519 key instead? 
We're moving away from ssh-rsa https://phabricator.wikimedia.org/T336769" [homer/public] - 10https://gerrit.wikimedia.org/r/935479 (owner: 10Fabfur) [06:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T0700). [07:00:05] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:38] (ProbeDown) firing: (2) Service irc2002:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2002:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:04:47] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:38] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:07:57] RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[07:08:27] !log rebalance ganeti in drmrs after reboots [07:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:27] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [07:11:38] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST csidrivers) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:14:32] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [07:19:45] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:21:20] good morning, we are switching over the continuous integration server in ~ 40 minutes. Jenkins/Zuul will be unavailable during that time [07:21:36] (I have updated the Deployments page) [07:22:36] !log installing libxpm security updates [07:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:42] !log powercycle ganeti2014 [07:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:56] hashar: Can I do MinT deployment before that? [07:32:35] 10SRE, 10ops-codfw: ganeti2014 failed - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) [07:32:52] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:33:06] kart_: yes please do :) [07:33:26] Thanks! 
[07:33:50] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-07-10-051738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/936831 (https://phabricator.wikimedia.org/T341335) (owner: 10KartikMistry) [07:34:34] (03Merged) 10jenkins-bot: Update MinT to 2023-07-10-051738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/936831 (https://phabricator.wikimedia.org/T341335) (owner: 10KartikMistry) [07:36:09] RECOVERY - Host dragonfly-supernode2001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [07:36:13] RECOVERY - Host durum2001 is UP: PING WARNING - Packet loss = 60%, RTA = 33.40 ms [07:36:24] !log failover broken ganeti2014 node [07:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:27] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [07:37:03] RECOVERY - Host irc2002 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [07:37:37] RECOVERY - Host logstash2032 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [07:38:15] RECOVERY - Host urldownloader2003 is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms [07:38:34] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:38:41] RECOVERY - Host webperf2003 is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [07:38:58] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [07:39:01] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:39:31] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:39:33] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 
AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:41:38] (ProbeDown) resolved: (2) Service irc2002:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2002:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:41:55] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) a:03Papaul I've evacuated the VMs off the broken node, can you please have a look? I realise the server is OOW, but do we have a compatible DIMM around from a decommissioned server, e.g.? [07:42:40] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:43:30] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:44:12] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) >>! In T341488#9003223, @Eevans wrote: >>>! In T341488#9001995, @fgiunchedi wrote: >> @MatthewVernon @Eevans please let me know what you thi... 
[07:45:53] hashar: as discussed, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/936266 now and run puppet on contint2001 and contint2002 [07:46:20] correct [07:46:35] (03CR) 10Jelto: [C: 03+2] contint: move zuul-merger from contint2001 to contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/936266 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [07:47:49] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [07:48:51] if Puppet is behaving as expected, that should bring down zuul-merger on contint2001 and bring it up on contint2002 [07:49:24] and there is another instance running on contint1002 (which actually takes most of the load since it is way faster thanks to SSD for disk io) [07:49:29] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [07:49:40] hashar: done and I can confirm that from puppet agent log output [07:50:20] ps also shows zuul-merger on contint2002 only [07:50:56] ahaha I am so happy when our Puppet manifests do the right thing [07:53:15] and I can confirm the switch happened at the application level (the zuul-mergers attach to the Zuul server over the Gearman protocol, which can be checked from the primary host: `zuul-gearman.py workers|grep merger`) [07:54:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, some suggestions to improve the commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [07:54:22] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:55:06] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [07:55:31] !log Updated MinT to 2023-07-10-051738-production (T341335, T333969) [07:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:36] T341335: MinT not working for Latvian in Content & Section Translation - https://phabricator.wikimedia.org/T341335 [07:55:36] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969 [07:56:52] jelto: I can confirm the new zuul-merger works fine on contint2002 and there are already CI builds using it \o/ [07:57:25] (03CR) 10Volans: users: add new user (fabfur) (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/935479 (owner: 10Fabfur) [07:57:44] great, next step is to downtime both hosts and disable puppet. I'll do that in 3 minutes [07:58:07] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [07:58:08] hashar: I'm done now. [07:58:42] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [07:58:43] kart_: awesome. 
Congratulations on the MinT deployment [07:58:57] :) [08:01:03] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on contint2001.wikimedia.org with reason: Switch contint hosts for hardware replacement [08:01:17] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2001.wikimedia.org with reason: Switch contint hosts for hardware replacement [08:01:27] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fb9b83f1-475c-4737-a872-7868377e05ee) set by jelto@cumin1... [08:01:29] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on contint2002.wikimedia.org with reason: Switch contint hosts for hardware replacement [08:01:43] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2002.wikimedia.org with reason: Switch contint hosts for hardware replacement [08:01:43] RECOVERY - Host failoid2002 is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [08:01:54] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=af763fea-db6d-494f-8c4c-8139c0ceab0c) set by jelto@cumin1... [08:02:40] hashar: contint2001 and 2002 are downtimed and puppet is disabled. Next step is to stop jenkins and zuul. Do you want to do that? 
[08:03:23] yes doing so now [08:03:27] (03CR) 10Ayounsi: users: add new user (fabfur) (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/935479 (owner: 10Fabfur) [08:03:29] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [08:03:34] !log Stopping Jenkins and Zuul for server switch over [08:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:57] RECOVERY - Host schema2003 is UP: PING OK - Packet loss = 0%, RTA = 33.50 ms [08:04:34] hmm [08:04:47] I stopped them both but https://integration.wikimedia.org/zuul/ still gives me some status updates [08:04:51] I think it is cache related [08:04:53] PROBLEM - Check systemd state on schema2003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:56] let me keep traces of those requests [08:05:49] yeah that is the json reply which is cached by our varnish/ats stack. I will dig into it later [08:05:56] RECOVERY - Check systemd state on schema2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:04] ack [08:06:44] I am doing the rsync [08:07:19] ack thanks [08:07:21] the large /srv/jenkins syncs in a minute or so [08:07:42] I triggered it yesterday and again this morning roughly an hour or so ago. 
So the disk caches are warm [08:10:15] jelto: all rsync done [08:10:18] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [08:10:21] so you can do the DNS switch [08:10:22] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10JMeybohm) [08:10:45] hashar: let me know when I should merge and apply the dns change [08:10:50] +1 [08:10:52] :) [08:11:03] I mean, you can do it [08:11:31] (03PS2) 10Jelto: switch contint.wikimedia.org from contint2001 to contint2002 [dns] - 10https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [08:11:41] rebasing, one sec [08:11:49] (03CR) 10JMeybohm: [C: 03+2] envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [08:11:51] (03CR) 10JMeybohm: [C: 03+2] rake_modules/taskgen: Don't process direcories in setup_python_extensions [puppet] - 10https://gerrit.wikimedia.org/r/935714 (owner: 10JMeybohm) [08:11:53] (03CR) 10JMeybohm: [C: 03+2] envoy: Remove tls_minimum_protocol_version [puppet] - 10https://gerrit.wikimedia.org/r/935683 (https://phabricator.wikimedia.org/T337453) (owner: 10JMeybohm) [08:11:56] (03CR) 10JMeybohm: [C: 03+2] envoy: Refactor max_requests_per_connection [puppet] - 10https://gerrit.wikimedia.org/r/935678 (https://phabricator.wikimedia.org/T304124) (owner: 10JMeybohm) [08:13:29] hashar: ah of course there is no CI now ... [08:15:33] ah yeah [08:15:39] hashar: I'll manually verify +2 https://gerrit.wikimedia.org/r/c/operations/dns/+/933196. 
Before rebase jenkins +2ed [08:16:39] (03CR) 10Hashar: [C: 03+1] switch contint.wikimedia.org from contint2001 to contint2002 [dns] - 10https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [08:16:59] you should have the permissions in Gerrit to Verified +2 and Submit it [08:17:36] (03CR) 10Jelto: [V: 03+2 C: 03+2] "manually verify, because jenkins is down due to maintenance" [dns] - 10https://gerrit.wikimedia.org/r/933196 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [08:17:40] \o/ [08:17:46] (03PS2) 10Jbond: puppetmaster: enable submitting data to puppetdb7 [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) [08:17:55] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [08:18:45] authdns update diff shows: -contint 5M IN CNAME contint2001.wikimedia.org. [08:18:45] +contint 5M IN CNAME contint2002.wikimedia.org. [08:18:49] I'll continue [08:18:59] !log upgrade prometheus to 2.24.1+ds-1+wmf2 on cloudmetrics* [08:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:50] OK - authdns-update successful on all nodes! 
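Note on the authdns diff quoted above: the two changed lines can be checked mechanically before continuing. The following is only a minimal sketch, assuming the diff consists of zone-file style CNAME records prefixed with `-` or `+` exactly as quoted in the log; the `RECORD` pattern and the `cname_changes` helper are hypothetical names made up for illustration and are not part of authdns-update.

```python
import re

# The two diff lines as quoted in the log. The rest of the authdns-update
# output is not shown there, so nothing else is assumed about its format.
DIFF = """\
-contint 5M IN CNAME contint2001.wikimedia.org.
+contint 5M IN CNAME contint2002.wikimedia.org.
"""

# Hypothetical pattern: diff sign, record name, TTL, "IN CNAME", target.
RECORD = re.compile(r"^([+-])(\S+)\s+(\S+)\s+IN\s+CNAME\s+(\S+)$")


def cname_changes(diff_text):
    """Return {name: (old_target, new_target)} for CNAME records in a diff."""
    removed, added = {}, {}
    for line in diff_text.splitlines():
        match = RECORD.match(line.strip())
        if not match:
            continue  # skip anything that is not a signed CNAME record
        sign, name, _ttl, target = match.groups()
        (added if sign == "+" else removed)[name] = target
    return {
        name: (removed.get(name), added.get(name))
        for name in sorted(set(removed) | set(added))
    }


changes = cname_changes(DIFF)
print(changes)
```

A record that appears on only one side of the diff shows up with `None` for the missing target, which makes accidental additions or deletions easy to spot.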
[08:20:04] (03PS1) 10Ayounsi: users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769) [08:20:07] PROBLEM - Check no envoy runtime configuration is left persistent on parse2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:20:12] afaik that contint dns entry is only used to route the http requests made to ATS/Varnish to the proper machine [08:20:18] the rest of the CI stack uses ip addresses [08:20:23] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [08:21:11] PROBLEM - Check no envoy runtime configuration is left persistent on planet2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:21:11] PROBLEM - Check no envoy runtime configuration is left persistent on schema2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:21:51] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:22:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup) [08:22:25] PROBLEM - Check no envoy runtime configuration is left 
persistent on prometheus6002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:22:38] I guess the envoy alerts are not us but jayme's change? [08:22:49] yeah that looks unrelated [08:23:13] PROBLEM - Check no envoy runtime configuration is left persistent on mw1411 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:23:23] you can do the two other puppet changes to change the primary in hiera [08:23:27] hashar: Then I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/867705 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/935919 ok? [08:23:33] +1 :) [08:23:34] ack will do [08:23:55] (03CR) 10Jelto: [C: 03+2] ci/zuul: switch gearman server from contint2001 to contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [08:24:02] as an extra step I will run puppet on contint1002 (the other host which runs zuul-merger) in order for that service to switch to the new host as well [08:24:11] (03CR) 10Jelto: [C: 03+2] ci/zuul: set contint2002 as the active ci::manager_host [puppet] - 10https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659) (owner: 10Jelto) [08:24:13] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:24:31] PROBLEM - Check no envoy runtime configuration is left persistent on mw1487 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} 
not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:24:59] PROBLEM - Check no envoy runtime configuration is left persistent on mw1398 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:25:05] PROBLEM - Check no envoy runtime configuration is left persistent on mw1467 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:25:09] PROBLEM - Check no envoy runtime configuration is left persistent on mw2286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:25:21] 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (10JMeybohm) 05Open→03Resolved [08:25:26] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [08:25:32] 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) 05Open→03Resolved [08:25:38] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [08:25:41] PROBLEM - Check no envoy runtime configuration is left persistent on webperf1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on 
http://localhost:9631/runtime - 433 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:25:45] PROBLEM - Check no envoy runtime configuration is left persistent on mw2323 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:25:46] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) 05Open→03Resolved [08:25:54] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [08:25:57] PROBLEM - Check no envoy runtime configuration is left persistent on parse1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:26:11] hashar: both puppet changes merged [08:26:18] Puppet has moved the zuul-merger on contint1002 to the new host (config change applied + restarted the service) [08:26:20] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) 05Open→03Resolved This is done from our end. 
[08:26:28] (03CR) 10Volans: [C: 04-2] "That's used in the wmcs-cookbooks repository to get the CA for each server as they can be different due to project's puppetmasters" [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [08:26:42] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [08:26:49] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:26:53] then I guess you can enable and run the Puppet agent on contint2002 [08:26:58] I will tail the logs [08:27:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw2329 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:27:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw2446 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:27:07] I'll do so now [08:27:29] PROBLEM - Check no envoy runtime configuration is left persistent on mw2350 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:27:33] PROBLEM - Check no envoy runtime configuration is left persistent on parse1006 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:27:45] PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:27:47] * hashar crosses fingers [08:28:29] PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:28:39] PROBLEM - Check no envoy runtime configuration is left persistent on ores2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:28:39] PROBLEM - Check no envoy runtime configuration is left persistent on phab2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:28:57] puppet run done on contint2002: Notice: Applied catalog in 53.20 seconds [08:29:07] PROBLEM - Check no envoy runtime configuration is left persistent on mw1439 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:29:07] PROBLEM - Check no envoy runtime configuration is left persistent on parse1001 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:29:07] PROBLEM - Check no envoy runtime configuration is left persistent on wcqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:29:09] PROBLEM - Check no envoy runtime configuration is left persistent on chartmuseum1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:29:35] PROBLEM - Check no envoy runtime configuration is left persistent on parse2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:29:37] Jenkins is starting and connecting to WMCS instances (there were some missing firewall rules which I have caught on friday) [08:29:43] PROBLEM - Check no envoy runtime configuration is left persistent on schema2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:29:46] oookay...that might be me [08:29:47] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:03] PROBLEM - Check no envoy runtime configuration is left 
persistent on debmonitor2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw1399 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:03] PROBLEM - Check no envoy runtime configuration is left persistent on cloudweb1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:09] the web interface is running at https://integration.wikimedia.org/ci/ [08:30:13] PROBLEM - Check no envoy runtime configuration is left persistent on idm1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:15] PROBLEM - Check no envoy runtime configuration is left persistent on mw2302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:33] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1028 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:39] PROBLEM - Check no envoy runtime configuration is left persistent on mw1406 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:45] PROBLEM - Check no envoy runtime configuration is left persistent on mw1356 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:51] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:30:57] PROBLEM - Check no envoy runtime configuration is left persistent on mw2419 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:31:01] jelto: I am testing zuul [08:31:09] PROBLEM - Check no envoy runtime configuration is left persistent on mw2273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:31:11] PROBLEM - Check no envoy runtime configuration is left persistent on prometheus2006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:31:15] PROBLEM - Check no envoy runtime configuration is left persistent on mw1489 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 
bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:31:29] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:31:30] thanks! I can reach the webinterface at least. "Last reconfigured: Tue Jul 11 2023 10:28:58 " also looks promising [08:31:45] PROBLEM - Check no envoy runtime configuration is left persistent on mw2351 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:31:45] ah yeah [08:31:49] PROBLEM - Check no envoy runtime configuration is left persistent on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:31:51] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:32:06] and Zuul does receive events from Gerrit [08:32:09] PROBLEM - Check no envoy runtime configuration is left persistent on ores2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:32:09] PROBLEM - Check no envoy runtime configuration is left persistent on restbase2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - 
string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:32:23] it also managed to reach out to Jenkins and trigger a build which is executing on the WMCS instance [08:32:35] PROBLEM - Check no envoy runtime configuration is left persistent on mw2426 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:32:43] PROBLEM - Check no envoy runtime configuration is left persistent on mw1476 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:33:02] jelto: it worked on https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/937040 :-] [08:33:39] PROBLEM - Check no envoy runtime configuration is left persistent on mw2300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:33:39] PROBLEM - Check no envoy runtime configuration is left persistent on mw2424 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:33:45] PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:33:47] PROBLEM - Check no envoy runtime configuration is left 
persistent on mw2386 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:33:50] hashar: great. Did we verify zuul and jenkins now? Or only jenkins? [08:33:55] PROBLEM - Check no envoy runtime configuration is left persistent on mw2399 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:33:55] PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:34:05] both [08:34:09] PROBLEM - Check no envoy runtime configuration is left persistent on ores1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:34:12] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [08:34:13] PROBLEM - Check no envoy runtime configuration is left persistent on ores1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:34:24] hashar: ok filling the checkboxes in the task [08:34:31] PROBLEM - Check no envoy runtime configuration is left persistent on parse2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: 
{} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:34:34] jayme: need a hand? [08:34:37] zuul is the scheduler/workflow and Jenkins is merely a library of cookbooks executed by Zuul [08:34:37] PROBLEM - Check no envoy runtime configuration is left persistent on mw2264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:34:37] PROBLEM - Check no envoy runtime configuration is left persistent on mw2272 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:34:37] PROBLEM - Check no envoy runtime configuration is left persistent on ores2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:34:45] should we disable puppet? [08:34:58] hashar: ah you already did that [08:35:00] volans: is there a way to downtime that one check on all hosts? 
[08:35:01] jelto: I have ticked the box and added an extra step I did (run puppet on contint1002 to update the zuul-merger instance running that) [08:35:06] (03PS1) 10Elukey: services: increase kafka batch wait time for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/937041 (https://phabricator.wikimedia.org/T338357) [08:35:15] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:35:17] PROBLEM - Check no envoy runtime configuration is left persistent on mw2398 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:35:23] PROBLEM - Check no envoy runtime configuration is left persistent on mw1372 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:35:29] jelto: so now if we enable puppet on the old host (contint2001) that should mask/disabled/stop Jenkins and Zuul [08:35:33] PROBLEM - Check no envoy runtime configuration is left persistent on mw2379 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:35:33] PROBLEM - Check no envoy runtime configuration is left persistent on mw2295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:35:35] hashar: then I'll enable and run puppet on contint2001 again [08:35:37] jayme: yes and no, let me do it but will take few minutes [08:35:38] <_joe_> jayme: ^^ [08:35:39] PROBLEM - Check no envoy runtime configuration is left persistent on mw1395 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:35:43] PROBLEM - Check no envoy runtime configuration is left persistent on phab1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:35:59] PROBLEM - Check no envoy runtime configuration is left persistent on ores2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:13] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:19] PROBLEM - Check no envoy runtime configuration is left persistent on parse1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:19] PROBLEM - Check no envoy runtime configuration is left persistent on mw2330 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 
433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:21] PROBLEM - Check no envoy runtime configuration is left persistent on mw1396 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:31] PROBLEM - Check no envoy runtime configuration is left persistent on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:31] PROBLEM - Check no envoy runtime configuration is left persistent on mw2428 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:36] (03PS4) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [08:36:37] PROBLEM - Check no envoy runtime configuration is left persistent on mw2405 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:43] PROBLEM - Check no envoy runtime configuration is left persistent on mw1350 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:47] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not 
found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:50] volans: ack, I'll disable puppet on all envoy hosts [08:36:53] jayme: disabling puppet on affected hosts is surely quicker [08:36:57] PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:36:57] PROBLEM - Check no envoy runtime configuration is left persistent on mw1359 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:05] PROBLEM - Check no envoy runtime configuration is left persistent on mw2282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:23] PROBLEM - Check no envoy runtime configuration is left persistent on mw2291 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:31] PROBLEM - Check no envoy runtime configuration is left persistent on mw2322 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:32] volans: but that does not stop existing spam, no? 
[08:37:35] PROBLEM - Check no envoy runtime configuration is left persistent on mw1445 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:43] neither the downtime [08:37:43] PROBLEM - Check no envoy runtime configuration is left persistent on idp1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:45] PROBLEM - Check no envoy runtime configuration is left persistent on mw2394 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:45] PROBLEM - Check no envoy runtime configuration is left persistent on logstash2025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:50] recovery will spam anyway [08:37:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw1434 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw1375 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:51] PROBLEM - Check no envoy runtime configuration is left 
persistent on puppetmaster1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw2417 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw2353 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:53] PROBLEM - Check no envoy runtime configuration is left persistent on releases1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:59] PROBLEM - Check no envoy runtime configuration is left persistent on mw2431 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:37:59] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:03] will stop any host not yet updated [08:38:07] with the new config [08:38:15] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1027 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:15] PROBLEM - Check no envoy runtime configuration is left persistent on mw2429 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:24] sure, that's done [08:38:25] PROBLEM - Check no envoy runtime configuration is left persistent on mw2292 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:29] PROBLEM - Check no envoy runtime configuration is left persistent on parse1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:29] PROBLEM - Check no envoy runtime configuration is left persistent on mw2421 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:35] PROBLEM - Check no envoy runtime configuration is left persistent on mw1351 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:38] hashar: I see multiple "removed" and "masked" in the puppet run, looks good. 
jenkins slave is running on contint2001 [08:38:47] PROBLEM - Check no envoy runtime configuration is left persistent on mw1450 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:55] PROBLEM - Check no envoy runtime configuration is left persistent on doc1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:55] PROBLEM - Check no envoy runtime configuration is left persistent on parse2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:38:55] PROBLEM - Check no envoy runtime configuration is left persistent on logstash2030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:01] PROBLEM - Check no envoy runtime configuration is left persistent on mw1357 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:01] PROBLEM - Check no envoy runtime configuration is left persistent on mw1485 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:01] PROBLEM - Check no envoy runtime configuration is left persistent on mw2308 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:01] PROBLEM - Check no envoy runtime configuration is left persistent on mw2436 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:05] PROBLEM - Check no envoy runtime configuration is left persistent on mw1447 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:21] PROBLEM - Check no envoy runtime configuration is left persistent on wcqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:23] PROBLEM - Check no envoy runtime configuration is left persistent on mw1384 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:23] PROBLEM - Check no envoy runtime configuration is left persistent on prometheus1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:23] !log disabled puppet on 'P{R:Package = envoyproxy}' [08:39:23] PROBLEM - Check no envoy runtime configuration is left persistent on mw2401 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found 
on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:23] PROBLEM - Check no envoy runtime configuration is left persistent on mw2277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:23] PROBLEM - Check no envoy runtime configuration is left persistent on mw2440 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:27] PROBLEM - Check no envoy runtime configuration is left persistent on mw2450 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:29] PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:31] PROBLEM - Check no envoy runtime configuration is left persistent on chartmuseum2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:33] PROBLEM - Check no envoy runtime configuration is left persistent on snapshot1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 
bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:33] PROBLEM - Check no envoy runtime configuration is left persistent on releases1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:33] PROBLEM - Check no envoy runtime configuration is left persistent on mw2390 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:46] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42387/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:39:51] PROBLEM - Check no envoy runtime configuration is left persistent on debmonitor1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:39:59] PROBLEM - Check no envoy runtime configuration is left persistent on prometheus4002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:01] PROBLEM - Check no envoy runtime configuration is left persistent on moscovium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:01] PROBLEM - Check no envoy runtime 
configuration is left persistent on mw2339 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:13] PROBLEM - Check no envoy runtime configuration is left persistent on prometheus5002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:14] !log downtiming service 'Check no envoy runtime configuration is left persistent' on envoy hosts [08:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:19] jayme: ^^^ [08:40:21] jelto: confirmed all three services are masked/stopped on contint2001. I am doing the Jenkins config change to get rid of the jenkins-slave [08:40:23] PROBLEM - Check no envoy runtime configuration is left persistent on mw1463 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:27] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:29] PROBLEM - Check no envoy runtime configuration is left persistent on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:30] volans: thanks! 
[08:40:33] hashar: ack [08:40:41] I've put 2h [08:40:43] PROBLEM - Check no envoy runtime configuration is left persistent on mw2299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:43] PROBLEM - Check no envoy runtime configuration is left persistent on mw2356 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw1421 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw2362 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:55] PROBLEM - Check no envoy runtime configuration is left persistent on mw2355 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:40:59] volans: nothing bad happened btw. 
The icinga check is "wrong" in a way [08:41:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw2395 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw2361 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw2365 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:07] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:41:15] PROBLEM - Check no envoy runtime configuration is left persistent on mw2404 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:15] PROBLEM - Check no envoy runtime configuration is left persistent on releases2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:37] jayme: can we ditch the check altogether? 
[08:41:51] godog: on it [08:41:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw1460 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:51] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw2444 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:41:53] <_joe_> probably yes [08:42:02] <3 <3 <3 thank you [08:42:11] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:42:21] PROBLEM - Check no envoy runtime configuration is left persistent on miscweb2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:42:26] still running... 
[08:42:26] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [08:42:33] PROBLEM - Check no envoy runtime configuration is left persistent on mw1466 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:43:11] !log previous downtiming completed [08:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:52] jayme: downtimed service on 576 hosts [08:43:57] for 2 h [08:44:08] (03PS1) 10JMeybohm: envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) [08:44:17] (03CR) 10Muehlenhoff: [C: 03+1] "It looks good to me, but let's also run the patch/approach by Bryan" [software/bitu] - 10https://gerrit.wikimedia.org/r/935376 (owner: 10Slyngshede) [08:44:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] services: increase kafka batch wait time for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/937041 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [08:44:50] (03PS2) 10JMeybohm: envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) [08:45:43] (03CR) 10Clément Goubert: [C: 03+1] otelcol: Bump to version 0.81.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/936832 (owner: 10RLazarus) [08:45:51] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [08:46:05] jelto: https://gerrit.wikimedia.org/r/c/operations/puppet/+/867712 can be deployed yes :) [08:46:35] 10SRE, 10ops-eqiad, 10Goal, 
10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [08:46:54] volans: if you have another minute: https://gerrit.wikimedia.org/r/937042 - https://puppet-compiler.wmflabs.org/output/937042/42388/ [08:47:10] (03CR) 10Jelto: [C: 03+2] ci: make contint2002 the new rsync source, remove contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/867712 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [08:47:37] (03CR) 10Elukey: [C: 03+2] services: increase kafka batch wait time for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/937041 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [08:48:27] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [08:48:51] hashar: merged, but puppet is disabled because of the envoy config change. I'll wait until that is done. But the rsync change is not urgent I think [08:48:51] (03CR) 10Filippo Giunchedi: [C: 03+1] envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [08:49:24] jelto: yes that can wait. Overall it is a success as far as I can tell [08:49:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [08:50:25] jelto: I have filed an unrelated follow-up action about the stale data on https://integration.wikimedia.org/zuul/ which is https://phabricator.wikimedia.org/T341548 and is due to some http cache header. That is unrelated to the switch over though. [08:51:07] hashar: great. The icinga downtime will expire in ~10 minutes. I think icinga needs some time to catch up with the checks because puppet is disabled. 
I'll check https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=contint [08:51:38] volans: thanks. Is there a clever way to get a list of hosts that had already applied that change? [08:52:30] what was the change you merged? [08:52:33] jelto: I'd expect Puppet to remove the Icinga checks for contint2001. [08:53:01] the last in chain was https://gerrit.wikimedia.org/r/c/operations/puppet/+/935711/9 [08:53:22] which is also the one that broke the check [08:53:56] and which file was it writing? [08:54:50] rephrasing... does this change end up writing a persistent file that the check then complains about? [08:55:04] let me check the icinga check to understand what it's complaining about [08:55:16] ultimately it will write /etc/envoy/envoy.yaml [08:55:19] ah the check is an http check [08:55:40] yeah but all the hosts have that file, so you have 2 options [08:55:58] 1) target P:envoy with batch say 20 and just wait [08:56:10] (03PS1) 10Jbond: puppet-facts-export-puppetdb: add client auth support [puppet] - 10https://gerrit.wikimedia.org/r/937044 (https://phabricator.wikimedia.org/T341268) [08:56:11] I thought I could maybe check in puppetdb if that git commit has been applied [08:56:21] 2) use 2 commands, the first one of which fails where the change was not applied so cumin will not run the second command (run puppet) [08:56:38] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) [08:57:01] well..given you downtimed the check for 2h, simply running puppet on all envoy nodes is fine I guess [08:57:20] or just re-enable it and let it run [08:57:25] within 30m it will be fixed [08:57:30] indeed [08:57:32] technically 1h [08:57:44] because puppet has to run on the alert hosts too, after it has run on the host [08:59:15] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [08:59:17] !log 
btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet [08:59:24] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [08:59:30] ack [09:00:57] (03CR) 10Jbond: [C: 03+2] puppet-facts-export-puppetdb: add client auth support [puppet] - 10https://gerrit.wikimedia.org/r/937044 (https://phabricator.wikimedia.org/T341268) (owner: 10Jbond) [09:01:06] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) a:03ayounsi [09:01:14] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [09:01:42] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [09:02:44] (03CR) 10JMeybohm: [C: 03+2] envoy: Absent check for zero runtime changes [puppet] - 10https://gerrit.wikimedia.org/r/937042 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [09:03:40] hashar: icinga looks good (beside envoy runtime check), downtime expired [09:06:02] !log enabled puppet on 'P{R:Package = envoyproxy}' [09:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:16] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [09:06:41] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [09:06:54] jayme: fyi P:envoy is the same :) [09:07:30] yeah, but not in my bash history :) [09:07:34] jelto: congratulations \o/ [09:07:37] thanks for your help volans! 
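[editor's note] The second option discussed above (a first command that fails where the change was not applied, so that the second command, the puppet run, is skipped there) can be sketched in plain Python. This is only an illustration of the gate-then-act idea with a batch size of 20 as mentioned in the chat; it does not use cumin's API, and `run_gated`, `probe` and `action` are hypothetical names introduced here, not anything from the log.

```python
from typing import Callable, Dict, Iterable


def run_gated(
    hosts: Iterable[str],
    probe: Callable[[str], bool],
    action: Callable[[str], None],
    batch_size: int = 20,
) -> Dict[str, str]:
    """Run `action` only on hosts where `probe` succeeds, batch by batch.

    `probe` stands in for the first command (True means "the change was
    applied here"); `action` stands in for the second command (for example
    a puppet run). Both are hypothetical callables supplied by the caller.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    host_list = list(hosts)
    results: Dict[str, str] = {}
    # Walk the host list in fixed-size batches, in the order given.
    for start in range(0, len(host_list), batch_size):
        for host in host_list[start:start + batch_size]:
            if probe(host):
                action(host)
                results[host] = "ran"
            else:
                # The gate failed, so the second command is never attempted.
                results[host] = "skipped"
    return results
```

A caller would pass a probe that encodes whatever marks a host as already changed; which marker is reliable (file content, check state, or a puppetdb lookup) is left open in the conversation, so the sketch leaves it to the caller as well.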
[09:08:25] no prob, anytime, we should upgrade the downtime cookbook to support this too as spicerack does support it [09:08:29] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kafkamon1003.eqiad.wmnet [09:08:37] it's just not exposed via the cookbook [09:13:27] (03CR) 10Volans: "some questions inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [09:13:43] hashar: thanks and thanks for running the switchover [09:13:57] PROBLEM - PHP opcache health on parse2010 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:15:00] (03PS1) 10JMeybohm: prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) [09:16:57] (03PS2) 10JMeybohm: prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) [09:18:05] (03PS1) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [09:19:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] toolforge: Add more CORS headers to docker registry [puppet] - 10https://gerrit.wikimedia.org/r/936797 (https://phabricator.wikimedia.org/T232135) (owner: 10BryanDavis) [09:19:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:40] (03PS1) 10Jbond: puppetmaster: move source scripts under the puppetserver name space [puppet] - 10https://gerrit.wikimedia.org/r/937049 (https://phabricator.wikimedia.org/T330490) [09:21:05] jelto: excellent. I guess you can reply to the email confirming the switch over is a success. 
And we will be able to decommission contint2001 \o/ [09:21:08] (03CR) 10Jbond: "ahh thanks i missed the wmcs branch" [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [09:21:17] (03Abandoned) 10Jbond: puppet: drop PuppetHosts.get_ca_servers [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [09:22:09] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42390/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [09:22:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42389/console" [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [09:23:29] (03CR) 10Jbond: [C: 03+2] puppetmaster: move source scripts under the puppetserver name space [puppet] - 10https://gerrit.wikimedia.org/r/937049 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [09:24:00] (03PS2) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [09:24:22] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10JEbe-WMF) [09:26:21] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:35] (03PS3) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [09:30:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:51] !log disable puppet fleet wide to deploy 936273 [09:30:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:22] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [09:34:17] (03PS4) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [09:35:24] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42394/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [09:36:15] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Lucas_Werkmeister_WMDE) Just because I first saw that error after CI came back from maintenance: do you think there’s any... [09:36:26] (03CR) 10Jbond: [C: 03+2] puppetmaster: enable submitting data to puppetdb7 [puppet] - 10https://gerrit.wikimedia.org/r/936273 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [09:36:47] !log deploy gerrit:936273 enable submitting data to puppetdb7 [09:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:18] (03PS5) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [09:40:53] (03CR) 10CI reject: [V: 04-1] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [09:41:12] 10SRE, 10Phabricator, 10Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10Aklapper) 05Open→03Resolved Optimistically resolving as T253632 is resolved. Please reopen if this is still an issue - thanks! [09:42:00] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) >>! 
In T324659#9004463, @Lucas_Werkmeister_WMDE wrote: > Just because I first saw that error after CI came back fr... [09:43:22] (03PS1) 10JMeybohm: Add warning alerts on envoy running with changes config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) [09:43:29] (03PS6) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [09:43:56] !log Updating Zuul configuration which was stalled at a version from March 29th after the switchover from contint2001 to contint2002 | T324659 T341556 [09:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:01] T324659: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 [09:44:01] T341556: CentralAuthExtensionJsonTest::testHookHandler with data set #11 ('securepoll') failing in Wikidata.org CI - https://phabricator.wikimedia.org/T341556 [09:44:03] Lucas_WMDE: you are a magician :) [09:44:05] (03CR) 10CI reject: [V: 04-1] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [09:44:16] :) [09:44:28] jelto: I forgot to update the integration/config repo so the switch over caused Zuul to spin up with an outdated configuration from March 29th :-\ [09:44:38] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42396/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [09:44:48] thanks for fixing it!
[09:45:17] (03PS1) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) [09:45:53] jouncebot: nowandnext [09:45:53] No deployments scheduled for the next 0 hour(s) and 14 minute(s) [09:45:53] In 0 hour(s) and 14 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1000) [09:45:58] cool [09:46:05] hashar: can I recheck already or does it need more time? [09:46:11] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [09:46:11] hashar: and i was thinking "this CI failure is very puzzling" :-D [09:46:34] (03CR) 10Volans: "Did a very first pass, I'm not familiar with the commands to be executed on the network devices so I skipped those." [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [09:46:37] (03PS2) 10JMeybohm: Add warning alerts on envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) [09:47:02] (03CR) 10Ladsgroup: [C: 03+2] Override liftwing hostname (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170) (owner: 10Ladsgroup) [09:47:05] (03PS7) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [09:47:20] !log re-enable puppet [09:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:28] Lucas_WMDE: I have deployed the update in theory.
Let me check [09:47:44] (03Merged) 10jenkins-bot: Override liftwing hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170) (owner: 10Ladsgroup) [09:48:12] alright, retrying the gate-and-submit [09:48:12] Lucas_WMDE: yes zuul config should be up to date now so you can `recheck` [09:48:16] ok thanks! [09:49:00] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:936796|Override liftwing hostname (T319170)]] [09:49:03] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [09:49:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42397/console" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [09:50:03] hashar: thanks for finding that. Let me know if you need anything from my side [09:50:52] jelto: I think we are all set :] [09:52:56] !log disable puppet fleet wide to deploy 936273 [09:52:57] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:936796|Override liftwing hostname (T319170)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [09:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:41] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [09:54:23] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10cmooney) @arturo thanks for this. The hosts can go in any rack, but we should make sure hosts of the same type go into different one... 
[09:56:11] PROBLEM - puppet last run on kubestagemaster2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:56:13] RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:47] (03PS5) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [09:57:13] (03PS1) 10Elukey: profile::services_proxy::envoy: add inference to enabled_listeners [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170) [09:58:13] (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:58:24] Lucas_WMDE: Jakob / Leszek had a few Wikibase changes rejected as well. I commented on one of them ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/933103 ) to let them know. [09:58:33] cool, thanks! 
[09:58:48] (03PS1) 10Btullis: Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) [09:58:53] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42398/console" [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170) (owner: 10Elukey) [09:59:19] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:33] (03CR) 10CI reject: [V: 04-1] Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:59:42] jelto: the sole step I have missed was to "git pull" the Zuul configuration and I have added that to the task as a missed step. I will refresh the wikipage runbook for the next switch over. Beside that all seems to be working fine. Thank you! 
[10:00:01] hashar: great thanks a lot :) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1000) [10:00:36] (03PS8) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [10:00:58] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42399/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:01:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:01:14] (03CR) 10CI reject: [V: 04-1] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [10:01:18] only set up new hosts immediately before switching to them [10:01:22] (03PS9) 10Slyngshede: P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 [10:01:23] RECOVERY - PHP opcache health on parse2010 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:01:41] RECOVERY - puppet last run on kubestagemaster2001 is OK: OK: Puppet is currently disabled (roll out 936273), not alerting. 
Last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:03:27] (03CR) 10Muehlenhoff: "Looks good, a few random comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [10:03:34] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:936796|Override liftwing hostname (T319170)]] (duration: 14m 34s) [10:03:38] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [10:03:51] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [10:07:48] (03CR) 10Slyngshede: [C: 03+2] P:sretest Test httppaswd function [puppet] - 10https://gerrit.wikimedia.org/r/937048 (owner: 10Slyngshede) [10:08:37] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) Thanks. So if #Ops-eqiad don't have any other preference, we could do something like: * cloudcontrol1005 --> `C8` * cloudco... 
[10:09:07] (03CR) 10Urbanecm: [C: 03+1] GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [10:10:23] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [10:11:03] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10fnegri) [10:11:11] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [10:11:18] (03CR) 10Filippo Giunchedi: envoy: Remove envoy_runtime_vars nagios check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [10:11:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-aborrero: Add support for nftables in profile::firewall - https://phabricator.wikimedia.org/T336497 (10aborrero) [10:12:34] (03CR) 10Filippo Giunchedi: Add warning alerts on envoy running with changed config (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [10:13:22] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus: Collect runtime metrics from envoy (ops and k8s) [puppet] - 10https://gerrit.wikimedia.org/r/937046 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [10:13:42] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [10:17:32] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud 
@ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [10:18:03] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [10:18:19] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) 05In progress→03Resolved [10:18:25] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) a:03Jelto As far as I can tell, the services were successfully switched over from contint2001 to contint2002. I... [10:19:18] jouncebot: nowandnext [10:19:18] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1000) [10:19:18] In 2 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300) [10:19:18] In 2 hour(s) and 40 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300) [10:19:28] (03CR) 10Ladsgroup: [C: 03+2] ExternalLinks: Make oneWildcard avoid adding wildcard to domain [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936733 (https://phabricator.wikimedia.org/T326251) (owner: 10Ladsgroup) [10:19:30] !log rebalance ganeti group codfw/C after reboots [10:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:36] (03CR) 10Muehlenhoff: [C: 03+2] codesearch: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826864 (owner: 10Muehlenhoff) [10:23:43] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware 
re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [10:26:49] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10cmooney) Put cloudservices1005 in C8 if there is room there instead of F4. [10:26:55] (03PS1) 10Filippo Giunchedi: sre: fix k8s selector for kubernetes-generic [alerts] - 10https://gerrit.wikimedia.org/r/937060 [10:29:57] (03PS2) 10Btullis: Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) [10:31:00] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [10:32:09] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) >>! In T341494#9004690, @cmooney wrote: > Put cloudservices1005 in D5 if there is room there instead of F4. Done. What sho... 
[10:36:40] (03Merged) 10jenkins-bot: ExternalLinks: Make oneWildcard avoid adding wildcard to domain [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936733 (https://phabricator.wikimedia.org/T326251) (owner: 10Ladsgroup) [10:37:00] (03CR) 10Muehlenhoff: "Looks good, few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/935462 (owner: 10Slyngshede) [10:37:35] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:936733|ExternalLinks: Make oneWildcard avoid adding wildcard to domain (T326251)]] [10:37:39] T326251: Write code for read new fields of externallinks - https://phabricator.wikimedia.org/T326251 [10:37:48] (03PS1) 10Hnowlan: cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) [10:38:13] (03CR) 10CI reject: [V: 04-1] cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [10:38:40] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:39:02] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:936733|ExternalLinks: Make oneWildcard avoid adding wildcard to domain (T326251)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [10:40:05] PROBLEM - Check no envoy runtime configuration is left persistent on mw2307 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:40:27] (03PS2) 10Hnowlan: cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) [10:40:40] (03CR) 10Vgutierrez: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:41:15] PROBLEM - Check no envoy runtime configuration is left persistent on testreduce1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:42:35] PROBLEM - Check no envoy runtime configuration is left persistent on mw2306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:42:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:42:45] PROBLEM - Check no envoy runtime configuration 
is left persistent on mw2374 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:42:51] hmmm envoy config issues? [10:43:07] not really. bad icinga check [10:43:12] oh ok [10:43:14] should be fixed already...looking again [10:43:51] PROBLEM - Check no envoy runtime configuration is left persistent on mw2276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:43:51] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42400/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:43:52] ah, puppet is disabled there [10:44:15] jayme: which host? [10:44:23] !log ladsgroup@deploy1002 Sync cancelled. [10:44:39] PROBLEM - Check no envoy runtime configuration is left persistent on mw2275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:44:51] jbond: the ones alerting [10:44:52] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:44:58] I checked mw2306 [10:45:05] that's probably from you [10:45:08] (03PS1) 10Ladsgroup: Revert "ExternalLinks: Make oneWildcard avoid adding wildcard to domain" [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936739 [10:45:13] (03CR) 10Ladsgroup: [C: 03+2] Revert "ExternalLinks: Make oneWildcard avoid adding wildcard to domain" [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936739 (owner: 10Ladsgroup) [10:45:21] puppet is disabled by me but we can enable it if you need to deploy something [10:45:33] PROBLEM - Check no envoy runtime configuration is left persistent on mw2420 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:45:38] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:45:55] PROBLEM - Check no envoy runtime configuration is left persistent on miscweb1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:45:57] jbond: yeah, it would be nice to not have them spam here [10:46:09] jayme: so everything with envoy? [10:46:31] PROBLEM - Check no envoy runtime configuration is left persistent on mw2412 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:46:35] jbond: yes, envoy is fine.
I've disabled the icinga check in a follow-up change [10:46:36] !log installing libx11 security updates [10:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:56] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/937042 [10:47:11] PROBLEM - Check no envoy runtime configuration is left persistent on mw2409 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:47:27] PROBLEM - Check no envoy runtime configuration is left persistent on puppetboard1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:47:48] ack running now [10:47:59] thanks! [10:48:01] PROBLEM - Check no envoy runtime configuration is left persistent on mw1497 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:48:16] (03PS6) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [10:49:57] PROBLEM - Check no envoy runtime configuration is left persistent on mw1360 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:50:25] PROBLEM - Check no envoy runtime configuration is left persistent on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:50:25] PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:50:31] PROBLEM - Check no envoy runtime configuration is left persistent on mw2373 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:50:35] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:50:42] (03CR) 10Vgutierrez: [C: 03+1] "```" [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [10:50:52] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42401/console" [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:50:55] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:51:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw2448 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on 
http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:51:07] PROBLEM - Check no envoy runtime configuration is left persistent on parse1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:51:27] PROBLEM - Check no envoy runtime configuration is left persistent on mw1382 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:52:05] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:52:09] PROBLEM - Check no envoy runtime configuration is left persistent on parse2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:52:25] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:52:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:53:07] PROBLEM - Check no envoy runtime configuration is left persistent on mw1402 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:53:07] 
PROBLEM - Check no envoy runtime configuration is left persistent on mw1409 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:53:23] (03CR) 10Clément Goubert: [C: 03+1] profile::services_proxy::envoy: add inference to enabled_listeners [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170) (owner: 10Elukey) [10:53:57] PROBLEM - Check no envoy runtime configuration is left persistent on mw1420 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:54:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "the patch LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:55:43] PROBLEM - Check no envoy runtime configuration is left persistent on ores2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:56:29] PROBLEM - Check no envoy runtime configuration is left persistent on mw1478 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:56:30] PROBLEM - Check no envoy runtime configuration is left persistent on mw1475 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:57:37] (LogstashKafkaConsumerLag) firing: 
(2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:57:59] 10SRE, 10Developer-Advocacy, 10Infrastructure-Foundations, 10cloud-services-team, 10LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (10fnegri) [10:58:18] (03PS7) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [10:58:25] (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:59:26] (03CR) 10Btullis: [C: 03+2] Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:00:09] (03Merged) 10jenkins-bot: Enable the required upgrade jobs for datahub in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/937057 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:00:29] PROBLEM - Check no envoy runtime configuration is left persistent on restbase2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:00:30] PROBLEM - Check no envoy runtime configuration is left persistent on restbase2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:00:30] PROBLEM - Check no envoy runtime configuration is left persistent on restbase2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 
200 OK - string entries: {} not found on http://localhost:9631/runtime - 433 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:02:12] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10ArielGlenn) See also https://phabricator.wikimedia.org/T341045 for the context. @WDoranWMF please sign off just in case that's needed. Thanks! [11:03:40] (03Merged) 10jenkins-bot: Revert "ExternalLinks: Make oneWildcard avoid adding wildcard to domain" [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936739 (owner: 10Ladsgroup) [11:06:49] jayme: puppet has been enabled and run on all envoyproxy systems [11:06:56] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [11:07:35] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [11:07:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:09:49] jbond: there are still some alerting. restbase for example - or is the puppet run still ongoing? 
[11:11:28] (03CR) 10Sergio Gimeno: [C: 03+1] Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm) [11:11:41] jayme: hmm checking [11:15:27] the restbase node I was looking at is now donw [11:16:23] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10WDoranWMF) Approved [11:17:35] jbond: looks good now [11:17:39] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [11:17:43] jayme: let me know if you see any others, fyi the alert is returning "NRPE: Command 'check_envoy_runtime_vars' not defined" [11:17:57] and i noticed the following when i rolled out the change [11:17:58] Notice: /Stage[main]/Profile::Envoy/Nrpe::Monitor_service[envoy_runtime_vars]/Nrpe::Check[check_envoy_runtime_vars]/File[/etc/nagios/nrpe.d/check_envoy_runtime_vars.cfg]/ensure: removed [11:18:03] yeah, that's because puppet also did not run on alert* [11:18:08] so would seem that something is still using that check [11:18:16] ahh let me get that [11:18:22] running already [11:18:26] cool [11:18:40] thanks! [11:18:44] np [11:27:01] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [11:31:41] (03CR) 10Gmodena: data-engineering: add alerts flink enrichment apps (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [11:35:35] 10SRE, 10ops-eqsin, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) 05Open→03Resolved a:03cmooney Still stable so I will close this for now, if it re-occurs we can engage Juniper. 
[11:36:56] (03PS13) 10Muehlenhoff: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [11:37:35] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [11:38:23] (03CR) 10Muehlenhoff: Add a new nftables::service define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:38:27] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [11:40:00] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Add CSP headers for restbase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [11:41:40] (03Merged) 10jenkins-bot: wikifeeds: Add CSP headers for restbase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/935051 (https://phabricator.wikimedia.org/T340769) (owner: 10Jgiannelos) [11:42:28] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [11:44:20] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:58:58] (03CR) 10Ayounsi: [C: 03+2] knams: decom Datahop [homer/public] - 10https://gerrit.wikimedia.org/r/932236 (https://phabricator.wikimedia.org/T340049) (owner: 10Ayounsi) [11:59:33] (03Merged) 10jenkins-bot: knams: decom Datahop [homer/public] - 10https://gerrit.wikimedia.org/r/932236 
(https://phabricator.wikimedia.org/T340049) (owner: 10Ayounsi) [12:00:01] !log decom datahop in knams - T340049 [12:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:35] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: fix k8s selector for kubernetes-generic [alerts] - 10https://gerrit.wikimedia.org/r/937060 (owner: 10Filippo Giunchedi) [12:02:42] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:16] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:15] (03PS2) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) [12:08:17] (03PS1) 10JMeybohm: prometheus: Condense metric_relabel_configs into one [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) [12:14:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42402/console" [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [12:16:02] (03PS1) 10Ayounsi: users: Update mark's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937075 (https://phabricator.wikimedia.org/T336769) [12:16:20] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:28] RECOVERY - Check systemd state on 
an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:59] (03PS1) 10Fabfur: common.yaml: update fabfur key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937077 [12:24:21] (03PS1) 10Ayounsi: users: Update robh's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937078 (https://phabricator.wikimedia.org/T336769) [12:25:29] (03CR) 10RobH: [C: 03+2] users: Update robh's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937078 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [12:26:57] (03PS1) 10Filippo Giunchedi: prometheus: refactor alerts-deploy to pick up k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/937079 [12:27:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "Please add a comment next to the relabel configs too mentioning this pitfall" [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [12:32:55] (03PS2) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 [12:32:57] (03PS5) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [12:32:59] (03PS2) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [12:33:18] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42403/console" [puppet] - 10https://gerrit.wikimedia.org/r/937079 (owner: 10Filippo Giunchedi) [12:34:54] (03PS3) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 [12:34:56] (03PS6) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 
(https://phabricator.wikimedia.org/T265691) [12:34:58] (03PS3) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [12:39:02] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:39:20] PROBLEM - Host puppetdb2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:28] (03CR) 10CI reject: [V: 04-1] wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [12:39:53] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:40:44] RECOVERY - Host puppetdb2003 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [12:43:50] PROBLEM - puppet last run on kubestagemaster2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:44:20] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:45:16] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:14] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:24] RECOVERY - puppet last run on kubestagemaster2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:50:53] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10fnegri) [12:51:30] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:53:54] !log jbond@cumin1001 START - Cookbook sre.postgresql.postgres-init [12:59:36] !log jbond@cumin1001 END (ERROR) - Cookbook sre.postgresql.postgres-init (exit_code=97) [12:59:40] !log jbond@cumin1001 START - Cookbook sre.postgresql.postgres-init [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300). [13:00:05] sergi0 and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1300) [13:00:13] !log jbond@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [13:00:14] hello [13:00:18] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:00:34] urbanecm: I assume you're deploying those patches? [13:00:43] correct [13:00:44] hi all [13:00:50] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:01:32] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [13:01:35] (03PS4) 10Urbanecm: GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [13:01:37] (03CR) 10Filippo Giunchedi: data-engineering: add alerts flink enrichment apps (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [13:01:41] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [13:02:19] Urbanecm: Can I add one more , which was scheduled for morning backport which didn't happen now [13:02:21] https://gerrit.wikimedia.org/r/c/936826/ [13:02:21] (03Merged) 10jenkins-bot: 
GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [13:02:35] aanzx: sure, can you add it to the calendar please? [13:02:43] Ok [13:03:20] Added [13:03:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:03:28] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935723|GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis (T308135 T308136 T308137)]] [13:03:34] T308137: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137 [13:03:34] T308135: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 [13:03:35] T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 [13:03:58] (03PS2) 10Urbanecm: Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) [13:04:46] (03CR) 10Gmodena: data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [13:04:58] !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:935723|GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis (T308135 T308136 T308137)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:05:36] sergi0: not 100% sure if link recommendation backend is testable at mwdebug, but if you want to test sth, go ahead :) [13:06:41] (03PS2) 10Urbanecm: 
Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx) [13:07:08] urbanecm: I don't think we can test anything in mwdebug at this point. I'll check the dataset containers during this evening. [13:07:20] sounds good to me. proceeding. [13:07:31] (03CR) 10Urbanecm: [C: 03+2] Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm) [13:08:11] (03Merged) 10jenkins-bot: Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm) [13:08:41] (03PS3) 10Urbanecm: Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx) [13:08:46] (03CR) 10Urbanecm: [C: 03+2] Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx) [13:09:50] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::services_proxy::envoy: add inference to enabled_listeners [puppet] - 10https://gerrit.wikimedia.org/r/937056 (https://phabricator.wikimedia.org/T319170) (owner: 10Elukey) [13:10:27] (03PS3) 10Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) [13:11:48] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 111763 bytes in 3.822 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:12:50] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 115397 bytes in 3.849 second response time 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:13:13] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:935723|GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis (T308135 T308136 T308137)]] (duration: 09m 45s) [13:13:19] T308137: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137 [13:13:19] T308135: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 [13:13:19] T308136: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 [13:13:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:14:10] sergi0: your patch's deployed. anything else from you? [13:14:12] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936639|Growth: Increase mentorship percentage to 25% on enwiki (T341399)]] [13:14:14] T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399 [13:14:47] (03CR) 10Urbanecm: [C: 03+2] Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx) [13:15:29] (03Merged) 10jenkins-bot: Enable tabs for non loggedin mobile users on knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936826 (https://phabricator.wikimedia.org/T340276) (owner: 10Anzx) [13:16:58] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10BTullis) Jennifer is already a member of `wmf` https://ldap.toolforge.org/user/jebe Double checked. 
` btullis@seaborgium:~$ ldapsearch -A -x member=uid=jebe,ou=people,dc=wikimedia,dc=org dn # ex... [13:17:04] urbanecm: nope, thanks for your assistance :) [13:17:17] np [13:17:17] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10BTullis) 05Open→03Resolved a:03BTullis [13:18:47] (03CR) 10Ssingh: [C: 03+1] "@Arzhel: Happy to take care of merging this, let me know." [homer/public] - 10https://gerrit.wikimedia.org/r/937077 (owner: 10Fabfur) [13:21:27] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936639|Growth: Increase mentorship percentage to 25% on enwiki (T341399)]] (duration: 07m 15s) [13:21:30] T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399 [13:21:42] (03PS1) 10Alexandros Kosiaris: changeprop: Change normal_rule_processing to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 [13:21:52] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936826|Enable tabs for non loggedin mobile users on knwikisource (T340276)]] [13:21:55] T340276: Enable tabs for non logged-in mobile skin users on knwikisource - https://phabricator.wikimedia.org/T340276 [13:22:31] !log jgiannelos@deploy1002 Started deploy [restbase/deploy@930f075]: (no justification provided) [13:22:41] (03PS1) 10Elukey: burrow: add LimitNOFILE=8192 to systemd's units [puppet] - 10https://gerrit.wikimedia.org/r/937091 (https://phabricator.wikimedia.org/T341551) [13:23:24] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:936826|Enable tabs for non loggedin mobile users on knwikisource (T340276)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:23:32] Testing [13:23:44] aanzx: was just going to ask for testing :). let me know if it works. 
[13:24:15] (03CR) 10Jbond: [WIP] Manage TLS on network devices (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [13:27:18] urbanecm: works , good to go [13:27:43] proceeding [13:28:41] (03CR) 10Ayounsi: [C: 03+2] users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [13:28:50] (03CR) 10Ayounsi: [C: 03+2] users: Update mark's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937075 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [13:29:13] (03Merged) 10jenkins-bot: users: remove older ssh-rsa key for Alex and Chris [homer/public] - 10https://gerrit.wikimedia.org/r/937039 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [13:29:22] (03Merged) 10jenkins-bot: users: Update mark's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937075 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [13:29:25] (03Merged) 10jenkins-bot: users: Update robh's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937078 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [13:29:31] (03CR) 10Ayounsi: [C: 03+2] common.yaml: update fabfur key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937077 (owner: 10Fabfur) [13:30:06] (03Merged) 10jenkins-bot: common.yaml: update fabfur key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/937077 (owner: 10Fabfur) [13:30:18] (03CR) 10Ayounsi: [C: 03+2] Update filippo's key [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [13:30:37] (03PS2) 10JMeybohm: prometheus: Condense metric_relabel_configs into one [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) [13:30:39] (03PS3) 10JMeybohm: envoy: Remove envoy_runtime_vars 
nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) [13:30:52] (03Merged) 10jenkins-bot: Update filippo's key [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [13:33:25] James_F: <3 I don't know how to thank you [13:33:26] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936826|Enable tabs for non loggedin mobile users on knwikisource (T340276)]] (duration: 11m 33s) [13:33:29] T340276: Enable tabs for non logged-in mobile skin users on knwikisource - https://phabricator.wikimedia.org/T340276 [13:33:42] aanzx: and deployed. anything else? [13:33:46] (03PS1) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) [13:33:55] Nothing, thanks [13:34:01] (03PS1) 10Fabfur: admin: Update fabfur's rsa key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/937093 [13:34:03] (03CR) 10JMeybohm: [C: 03+2] prometheus: Condense metric_relabel_configs into one [puppet] - 10https://gerrit.wikimedia.org/r/937074 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [13:34:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:05] (03PS4) 10David Caro: wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 [13:35:07] (03PS4) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [13:35:22] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jennifer Ebe - https://phabricator.wikimedia.org/T341557 (10ArielGlenn) >>! 
In T341557#9005233, @BTullis wrote: > Jennifer is already a member of `wmf` > > https://ldap.toolforge.org/user/jebe > > Double checked. > ` > btullis@seaborgium:~$ ldapsearch -A... [13:35:36] (03CR) 10Mabualruz: "Synthetic test files" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz) [13:36:16] (03CR) 10David Caro: "tricky flake8, also as it does not pin python to 3.7, the tests for replica_cnf when it´s included in the global wmcs tox entry fail for m" [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [13:36:36] (03CR) 10Elukey: [C: 03+2] burrow: add LimitNOFILE=8192 to systemd's units [puppet] - 10https://gerrit.wikimedia.org/r/937091 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey) [13:36:44] Amir1: Keeping being awesome is thanks enough! [13:36:59] <3 [13:37:15] (03CR) 10Btullis: [C: 03+1] "Many thanks elukey" [puppet] - 10https://gerrit.wikimedia.org/r/937091 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey) [13:38:02] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:39:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:40:59] (03PS2) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) [13:42:22] !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@930f075]: (no justification provided) (duration: 19m 50s) [13:42:30] (03PS1) 10Jsn.sherman: log additional events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937096 
(https://phabricator.wikimedia.org/T326212) [13:44:34] (03CR) 10Jsn.sherman: "follow-up here: I6bfb201d0b8cdd0bbe22a1cbdbc1298cf1bab2cc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [13:49:45] !log rebalance ganeti group eqiad/d after reboots [13:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:19] (03CR) 10Alexandros Kosiaris: "https://codesearch.wmcloud.org/search/?q=_normal_rule_processing&files=&excludeFiles=&repos= says nothing in the various repos. That leave" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [13:52:22] (03PS3) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) [13:52:35] (03PS1) 10Ladsgroup: Externallinks: Keep domain wildcard if path is not specified [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937108 (https://phabricator.wikimedia.org/T326251) [13:54:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:55:02] (03PS1) 10Btullis: Add the option to clean datahub indices to the restore job [deployment-charts] - 10https://gerrit.wikimedia.org/r/937099 (https://phabricator.wikimedia.org/T329514) [13:55:08] (03CR) 10Alexandros Kosiaris: "The following panels in https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job will need to be updated" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [13:56:32] (03CR) 10Btullis: [C: 03+2] Add the option to clean datahub indices to the restore job [deployment-charts] - 10https://gerrit.wikimedia.org/r/937099 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:56:50] (03CR) 10JMeybohm: [C: 03+1] prometheus: refactor alerts-deploy to pick up k8s clusters 
[puppet] - 10https://gerrit.wikimedia.org/r/937079 (owner: 10Filippo Giunchedi) [13:56:57] (03CR) 10Alexandros Kosiaris: "And the 2 job run panels in https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [13:57:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) [13:57:57] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [13:58:21] (03PS3) 10JMeybohm: Add warning alerts on envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) [13:58:25] (03Merged) 10jenkins-bot: Add the option to clean datahub indices to the restore job [deployment-charts] - 10https://gerrit.wikimedia.org/r/937099 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:59:09] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: refactor alerts-deploy to pick up k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/937079 (owner: 10Filippo Giunchedi) [13:59:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:59:43] !log installing yajl security updates [13:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:55] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [14:01:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) a:03BBlack Assigning the task to @BBlack for when he comes back. 
[14:01:57] (03PS1) 10Muehlenhoff: Add library hint for yajl [puppet] - 10https://gerrit.wikimedia.org/r/937101 [14:02:12] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:02:27] (03PS4) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) [14:02:29] (03PS1) 10JMeybohm: envoy: Absent monitor_systemd_unit_state for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/937102 (https://phabricator.wikimedia.org/T341554) [14:03:04] (03CR) 10JMeybohm: envoy: Remove envoy_runtime_vars nagios check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:04:41] Lucas_WMDE: round 2 of the migration to histograms for jobqueue metrics: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/937090 [14:04:56] I've identified key panels and alerts in the comments and I'll fix those after merging, but searching through all grafana dashboards/alerts isn't feasible unfortunately. So if you have any other stuff you know of, please let me know [14:05:32] akosiaris: I’ll try to take a look later [14:05:52] Lucas_WMDE: no rush, it can wait. 
[14:08:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:05] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for yajl [puppet] - 10https://gerrit.wikimedia.org/r/937101 (owner: 10Muehlenhoff) [14:11:22] 10SRE, 10Infrastructure-Foundations, 10netops: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10RobH) [14:12:04] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [14:12:22] (03CR) 10Ssingh: [C: 03+1] "Confirmed with Fabrizio on IRC." [puppet] - 10https://gerrit.wikimedia.org/r/937093 (owner: 10Fabfur) [14:13:28] (03CR) 10Fabfur: [C: 03+2] admin: Update fabfur's rsa key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/937093 (owner: 10Fabfur) [14:13:55] (03CR) 10Ladsgroup: [V: 03+1 C: 03+2] "Tested and works well and makes it much faster too." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup) [14:13:57] “add library hint for yall” thx ;) [14:14:09] (03PS1) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) [14:14:20] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:14:33] (03CR) 10CI reject: [V: 04-1] Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott) [14:15:30] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [14:15:34] (03PS2) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) [14:16:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice! 
Thank you" [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:16:20] (03CR) 10Filippo Giunchedi: [C: 03+1] envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:16:30] (03CR) 10Filippo Giunchedi: [C: 03+1] envoy: Absent monitor_systemd_unit_state for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/937102 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:17:00] (03PS4) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) [14:17:03] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [14:17:37] !log restarting apache on mw canaries [14:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[14:20:33] (03CR) 10Filippo Giunchedi: data-engineering: add alerts flink enrichment apps (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [14:20:59] (03PS3) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) [14:21:11] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [14:21:33] (03Merged) 10jenkins-bot: sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup) [14:22:47] (03PS4) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) [14:25:15] (03CR) 10David Caro: wmcs: enable isort and black (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [14:26:42] (03PS5) 10Andrew Bogott: Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) [14:36:04] (03CR) 10Lucas Werkmeister (WMDE): changeprop: Change normal_rule_processing to histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [14:40:02] (03CR) 10Muehlenhoff: [C: 03+2] Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:45:35] (03PS1) 10Muehlenhoff: Move nftables/ferm types to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) [14:48:38] (03CR) 10CI reject: [V: 04-1] Move nftables/ferm types to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/937135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:49:17] !log btullis@cumin1001 END (PASS) 
- Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. [14:49:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:50:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:51:43] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jcrespo) @jbond I was out of office. Backups is a very special case, I would like to comment that... [14:52:06] (03PS1) 10Andrew Bogott: Magnum: allow configuration of etcd discovery service host [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355) [14:52:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:53:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:55:50] jouncebot: now [14:55:50] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [14:56:00] (03PS1) 10Ssingh: dnsrecursor: use validate_cmd for pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/937139 [14:56:20] I’d like to do a quick backport if that’s okay with everyone [14:56:28] (will go ahead in a few minutes unless I hear otherwise) [14:56:30] (03CR) 10JMeybohm: [C: 03+2] Add warning alerts on 
envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:57:32] (03CR) 10JMeybohm: [C: 03+2] envoy: Absent monitor_systemd_unit_state for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/937102 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:57:35] (03CR) 10JMeybohm: [C: 03+2] envoy: Remove envoy_runtime_vars nagios check [puppet] - 10https://gerrit.wikimedia.org/r/937055 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:57:37] (03Merged) 10jenkins-bot: Add warning alerts on envoy running with changed config [alerts] - 10https://gerrit.wikimedia.org/r/937054 (https://phabricator.wikimedia.org/T341554) (owner: 10JMeybohm) [14:59:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936737 (https://phabricator.wikimedia.org/T340217) (owner: 10Jdlrobson) [15:03:17] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:07:56] (03CR) 10Andrew Bogott: "pcc results https://puppet-compiler.wmflabs.org/output/937138/42404/" [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott) [15:09:12] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Applying JVM update - eevans@cumin1001 [15:11:53] (03CR) 10Alexandros Kosiaris: changeprop: Change normal_rule_processing to histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [15:12:47] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (10RobH) 05Open→03Resolved a:03RobH ` robh@cumin1001:~$ ping cr2-eqsin.mgmt.eqsin.wmnet PING 
cr2-eqsin.mgmt.eqsin.wmnet (10.132.128.6) 56(84) bytes of data. 64 bytes from cr2-eqsin.mgmt.eqsin.wmnet (10.132.128.6): icmp_seq=1 ttl=60... [15:13:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:55] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341437 (10RobH) 05Open→03Resolved a:03RobH ` robh@cumin1001:~$ ping cp5023.mgmt.eqsin.wmnet PING cp5023.mgmt.eqsin.wmnet (10.132.128.19) 56(84) bytes of data. 64 bytes from cp5023.mgmt.eqsin.wmnet (10.132.128.19): icmp_seq=1 ttl=60 time=22... [15:14:25] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) A "cron" (timer) has been created. So it could be called resolved. The only thing is that this is opt-in and not automatically fo... [15:15:05] 10SRE, 10ops-eqsin, 10DC-Ops: eqsin cp501[3456] setup and secure erase - https://phabricator.wikimedia.org/T335414 (10RobH) 05Open→03Resolved did this weeks ago and forgot to resolve [15:15:25] 10SRE, 10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10RobH) 05Open→03Resolved After some discussion there isn't a lot to adjust so we've just raised our power caps. [15:15:36] (03PS2) 10Krinkle: Remove oversampling for Navigation Timing extension. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog) [15:16:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by krinkle@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog) [15:17:46] (03CR) 10Dzahn: [C: 03+1] "this should be needed to scap deploy the docroot on contint*" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn) [15:17:48] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:restbase-codfw: Applying JVM update - eevans@cumin1001 [15:18:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:18:23] (03PS1) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) [15:19:00] (03CR) 10Daniel Kinzler: "The following boards have versions of the "job concurrency" panel (in a collapsed row at the bottom):" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [15:19:02] (03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [15:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:20:39] (03CR) 10Dzahn: [C: 03+2] phabricator: quarterly_metrics.sh: Improve Bitergia instructions 
[puppet] - 10https://gerrit.wikimedia.org/r/935416 (https://phabricator.wikimedia.org/T341064) (owner: 10Aklapper) [15:20:55] (03Merged) 10jenkins-bot: Remove oversampling for Navigation Timing extension. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog) [15:21:00] (03Merged) 10jenkins-bot: Add option for html label in Menu template [skins/Vector] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936737 (https://phabricator.wikimedia.org/T340217) (owner: 10Jdlrobson) [15:21:21] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:930712|Remove oversampling for Navigation Timing extension. (T337858)]] [15:21:24] T337858: Remove is_oversample feature in the Navigation Timing extension - https://phabricator.wikimedia.org/T337858 [15:22:24] Lucas_WMDE: missed your message, I ran scap backport, it says it's locked, so go ahead fi you haven't already [15:22:43] what's confusing me is that scap then continued without waiting after printing there is a lock [15:22:48] Yapparently mine failed, patch didn’t apply :S [15:22:53] !log krinkle@deploy1002 phedenskog and krinkle: Backport for [[gerrit:930712|Remove oversampling for Navigation Timing extension. (T337858)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:23:02] so I guess you can go ahead at the moment? [15:23:09] and I’ll need to figure out what I can do about my conflict [15:23:10] okay, I'm guessing scap takes care of not accidentally deploying yours [15:23:39] (03CR) 10Dzahn: [C: 03+1] "since this isn't about to be merged and I will be out for a while, I am removing myself from open gerrit patches" [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [15:23:45] git l [15:23:46] * 3267c8b85 - (HEAD -> master, origin/master, origin/HEAD) Remove oversampling for Navigation Timing extension. 
(8 minutes ago) [15:23:46] * 67194085c - Enable tabs for non loggedin mobile users on knwikisource (2 hours ago) [15:24:00] yours is change 936737, right? [15:24:03] so LGTM [15:24:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:24:38] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-27].codfw.wmnet: Applying JVM update - eevans@cumin1001 [15:24:51] ah yours is in a different repo [15:24:55] let me check [15:25:20] (03PS2) 10Effie Mouzeli: (WIP) thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) [15:26:32] !log krinkle@deploy1002 Sync cancelled. [15:27:24] this is strange, so aborted scap commands just leave it applied for a future sync to implicitly deploy? [15:27:48] I don't know why that surprises me since that's how its' always worked, I guess when I'm not the one +2'ing and git pull'ing, I expect it to also magically undo those [15:28:28] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:33] !log krinkle@deploy1002 Locking from deployment [ALL REPOSITORIES]: pending security problem, see mediawiki_security IRC [15:32:29] 10SRE-swift-storage, 10Commons: Server error 500 after uploading chunk - https://phabricator.wikimedia.org/T340917 (10Midleading) In fact the file key has been changed when uploadstash-file-not-found error occured. Need to go to Special:UploadStash to find the new correct key and manually recover, see https://... 
[15:33:08] (03PS2) 10TChin: Bump stream versions in mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/934719 (https://phabricator.wikimedia.org/T340746) [15:34:46] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:12] (03PS1) 10Elukey: burrow: use start-latest=true for the kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/937144 (https://phabricator.wikimedia.org/T341551) [15:37:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:39:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "Reasoning and fix LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/937144 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey) [15:40:30] (03CR) 10Elukey: [C: 03+2] burrow: use start-latest=true for the kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/937144 (https://phabricator.wikimedia.org/T341551) (owner: 10Elukey) [15:41:55] (03PS1) 10TChin: mw-page-content-change-enrich bump docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/937145 (https://phabricator.wikimedia.org/T338169) [15:42:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:44:48] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:00] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - 
https://phabricator.wikimedia.org/T282308 (10Novem_Linguae) I would be grateful if someone could fix this. I am on Windows and I cannot submit patches to the operations/puppet repo bec... [15:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:48:36] !log krinkle@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: pending security problem, see mediawiki_security IRC (duration: 17m 03s) [15:53:22] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:936737|Add option for html label in Menu template (T340217)]] [15:54:13] !log Deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930712 ("Remove oversampling for Navigation Timing extension.") [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:55] !log lucaswerkmeister-wmde@deploy1002 jdlrobson and lucaswerkmeister-wmde: Backport for [[gerrit:936737|Add option for html label in Menu template (T340217)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:05] cwhite: Dear deployers, time to do the Logstash DC Transition deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1600). 
[16:00:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:59] (I’m still deploying but that probably doesn’t affect Puppeteers) [16:02:06] (03CR) 10Gmodena: [C: 03+1] "LGTM. Feel free to deploy the change when ready." [deployment-charts] - 10https://gerrit.wikimedia.org/r/937145 (https://phabricator.wikimedia.org/T338169) (owner: 10TChin) [16:02:21] !oncall [16:02:32] !oncall now [16:02:38] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:936737|Add option for html label in Menu template (T340217)]] (duration: 09m 15s) [16:03:21] !log previous backport also included [[gerrit:930712|Remove oversampling for Navigation Timing extension. (T337858)]] [16:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:24] T337858: Remove is_oversample feature in the Navigation Timing extension - https://phabricator.wikimedia.org/T337858 [16:04:07] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching restbase20[13-27].codfw.wmnet: Applying JVM update - eevans@cumin1001 [16:05:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:07:19] (03CR) 10Cwhite: [C: 03+2] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/935502 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [16:08:13] (03PS3) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) [16:08:39] !log upgrade dns1004 to gdnsd 3.99.0~alpha2 [16:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:50] 
(03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [16:09:17] (03CR) 10Hashar: "The tests pass locally under Python 3.10. I have to resetup my dev environment to reinstall the previous python and test against them loca" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [16:17:17] PROBLEM - puppet last run on cp6012 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:17:23] PROBLEM - puppet last run on install6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:17:37] PROBLEM - puppet last run on cp6003 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:17:49] (03CR) 10RLazarus: [V: 03+2 C: 03+2] otelcol: Bump to version 0.81.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/936832 (owner: 10RLazarus) [16:18:11] PROBLEM - puppet last run on netflow6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:24] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T341433 (10Papaul) 05Open→03Resolved Power cord issue. 
Fixed [16:18:33] PROBLEM - puppet last run on cp6009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:39] PROBLEM - puppet last run on doh6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:43] PROBLEM - puppet last run on lvs6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:43] PROBLEM - puppet last run on durum6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:47] PROBLEM - puppet last run on dns6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:51] PROBLEM - puppet last run on ganeti6001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:05] PROBLEM - puppet last run on lvs6001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:17] PROBLEM - puppet last run on ganeti6004 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:17] ^ being discussed in -sre [16:19:25] PROBLEM - puppet last run on ganeti6002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:37] PROBLEM - puppet last run on dns6002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:57] PROBLEM - puppet last run on lvs6003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:57] PROBLEM - puppet last run on cp6008 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:57] PROBLEM - puppet last run on cp6004 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:20:05] PROBLEM - puppet last run on cp6005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:20:05] PROBLEM - puppet last run on cp6001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:20:17] PROBLEM - puppet last run on ganeti6003 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:20:17] PROBLEM - puppet last run on doh6001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:20:45] (03PS4) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) [16:20:51] PROBLEM - puppet last run on cp6015 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:20:53] PROBLEM - puppet last run on cp6011 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:21:16] (03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [16:21:27] PROBLEM - puppet last run on ncredir6002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:21:39] PROBLEM - puppet last run on cp6014 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:21:41] PROBLEM - puppet last run on 
bast6002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:21:41] PROBLEM - puppet last run on cp6007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:21:41] PROBLEM - puppet last run on cp6010 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:21:53] PROBLEM - puppet last run on cp6016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:22:01] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Papaul) DDR-4 slot A1 32G [16:22:39] RECOVERY - puppet last run on cp6012 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:22:42] (03PS5) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) [16:23:20] (03CR) 10CI reject: [V: 04-1] thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [16:23:55] RECOVERY - puppet last run on cp6009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:24:45] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [16:25:41] RECOVERY - puppet last run on doh6001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:27:07] RECOVERY - puppet last run on cp6007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:27:19] RECOVERY - puppet last run on cp6016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:28:09] !log reenabling puppet in cp6002 [16:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:13] RECOVERY - puppet last run on install6002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:28:34] (03PS6) 10Effie Mouzeli: thumbor: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/937141 (https://phabricator.wikimedia.org/T318695) [16:29:29] RECOVERY - puppet last run on doh6002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:29:33] RECOVERY - puppet last run on lvs6002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:29:41] RECOVERY - puppet last run on ganeti6001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:30:17] RECOVERY - puppet last run on ganeti6002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:30:49] RECOVERY - puppet last run on lvs6003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:30:49] RECOVERY - puppet last run on cp6004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:31:43] RECOVERY - puppet last run on 
cp6015 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:31:49] RECOVERY - puppet last run on cp6011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:32:35] RECOVERY - puppet last run on cp6014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:33:57] RECOVERY - puppet last run on cp6003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:36:17] RECOVERY - puppet last run on cp6008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:36:23] RECOVERY - puppet last run on cp6001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:36:25] RECOVERY - puppet last run on cp6005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:38:05] RECOVERY - puppet last run on cp6010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:39:57] RECOVERY - puppet last run on netflow6001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:40:35] 10SRE, 10Phabricator, 10Traffic, 10SecTeam-Processed: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10sbassett) [16:40:53] RECOVERY - puppet last run on lvs6001 is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:41:05] RECOVERY - puppet last run on ganeti6004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:43:21] RECOVERY - puppet last run on ncredir6002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:43:35] RECOVERY - puppet last run on bast6002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:44:44] (03PS1) 10RLazarus: opentelemetry-collector: Bump tag to 0.81.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/937152 [16:45:45] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Bump tag to 0.81.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/937152 (owner: 10RLazarus) [16:45:59] RECOVERY - puppet last run on durum6001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:46:01] RECOVERY - puppet last run on dns6001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:46:27] (03Merged) 10jenkins-bot: opentelemetry-collector: Bump tag to 0.81.0-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/937152 (owner: 10RLazarus) [16:46:57] RECOVERY - puppet last run on dns6002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:47:33] RECOVERY - puppet last run on ganeti6003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:52:04] (03CR) 
10Hashar: "recheck cause I could not reproduce locally?" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [16:52:48] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1700) [17:03:55] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937156 (https://phabricator.wikimedia.org/T340245) [17:03:57] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937156 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [17:04:38] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937156 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [17:05:06] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.17 refs T340245 [17:05:11] T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245 [17:15:34] (HelmReleaseBadStatus) firing: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:21:25] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:22:41] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 0.039 second response time
https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:23:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:24:20] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:28:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:31:37] (03CR) 10Clare Ming: [C: 03+1] "hope this works" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937096 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [17:47:13] (03CR) 10Brennen Bearnes: [C: 03+1] "+1 for idea. It might be good to remind user in output that they have local config?" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [17:50:57] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.17 refs T340245 (duration: 45m 50s) [17:51:00] T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245 [17:53:15] !log dduvall@deploy1002 Pruned MediaWiki: 1.41.0-wmf.15 (duration: 02m 16s) [18:00:05] dduvall and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T1800). 
[18:06:50] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack: move magnum fw rules to haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [18:08:09] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [18:08:17] (03PS4) 10Andrew Bogott: P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [18:09:12] (03CR) 10Hashar: "I am confused cause I clearly remember to have moving those list of hosts to use a Puppet DB query based on hosts having the relevant Scap" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn) [18:17:51] (03CR) 10Andrew Bogott: [C: 03+2] Add puppet role and profile for etcd_discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937104 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott) [18:22:17] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10cmooney) @KFrancis Hi. Would you be kind enough to follow up with @Ifrahkhanyaree and get them to sign an NDA before I grant the requested access? Thanks. 
[18:46:33] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:49:42] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937165 (https://phabricator.wikimedia.org/T340245) [18:49:44] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937165 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [18:50:27] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937165 (https://phabricator.wikimedia.org/T340245) (owner: 10TrainBranchBot) [18:57:17] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.17 refs T340245 [18:57:20] T340245: 1.41.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T340245 [18:57:25] (03CR) 10Dzahn: [C: 03+1] "I see this also needs a rebase and I uploaded in 2022. so maybe you did and this is outdated. let me do the manual rebase and find out!, h" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn) [18:58:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:00:19] (03CR) 10Dzahn: "So.. you are right. You have already replaced the list with a query. 
It's simply that this happened after this patch was originally upload" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn) [19:00:40] (03Abandoned) 10Dzahn: scap: remove contint2001 from "dsh groups" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn) [19:03:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:15:50] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10KFrancis) @Ifrahkhanyaree, please send the following information to my WMF email address, kfrancis@wikimedia.org: Full legal name Mailing address Email address [19:21:34] (03CR) 10Hashar: scap: remove contint2001 from "dsh groups" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn) [19:29:37] (03PS2) 10Andrew Bogott: Magnum: allow configuration of etcd discovery service host [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355) [19:29:39] (03PS1) 10Andrew Bogott: etcd-discovery: restart etcd after config change [puppet] - 10https://gerrit.wikimedia.org/r/937172 (https://phabricator.wikimedia.org/T341355) [19:32:39] (03CR) 10Andrew Bogott: [C: 03+2] Magnum: allow configuration of etcd discovery service host [puppet] - 10https://gerrit.wikimedia.org/r/937138 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott) [19:32:43] (03CR) 10Andrew Bogott: [C: 03+2] etcd-discovery: restart etcd after config change [puppet] - 10https://gerrit.wikimedia.org/r/937172 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott) [19:45:40] (03PS1) 10Urbanecm: Always return the class as string from Html::getTextInputAttributes [core] (wmf/1.41.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/937113 (https://phabricator.wikimedia.org/T341566) [19:47:53] (03PS1) 10BCornwall: roll-restart-wikimedia-dns: Add reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 [19:48:07] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [19:49:22] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:56:30] (03CR) 10Dzahn: "yep:) thanks for doing that! it did reduce the number of places with host names, cool" [puppet] - 10https://gerrit.wikimedia.org/r/867713 (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230711T2000). [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] o/ I can deploy [20:02:03] Jdlrobson: ping [20:11:02] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10wiki_willy) a:03Jclark-ctr Hi @Jclark-ctr - can you work with @aborrero on the timeframe and migration plan for these servers? Th... [20:14:38] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:16:27] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:17:53] taavi: can i steal the window for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/937113? or at least until Jon comes. [20:18:07] urbanecm: yes, go ahead! 
[20:18:13] ty [20:18:15] (03CR) 10Urbanecm: [C: 03+2] Always return the class as string from Html::getTextInputAttributes [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937113 (https://phabricator.wikimedia.org/T341566) (owner: 10Urbanecm) [20:18:21] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1518696 bytes in 5.657 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [20:18:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:18:29] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1521517 bytes in 5.341 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [20:23:01] taavi: here sorrry im late [20:23:12] urbanecm: back [20:23:29] okay, i'll do your patch too [20:23:53] (03PS3) 10Urbanecm: Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097 (owner: 10Jdlrobson) [20:23:57] (03CR) 10Urbanecm: [C: 03+2] Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097 (owner: 10Jdlrobson) [20:24:38] (03Merged) 10jenkins-bot: Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097 (owner: 10Jdlrobson) [20:25:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]] [20:26:47] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:27:03] Jdlrobson: your patch is at 
mwdebug1001. can you test? [20:27:12] checking [20:29:03] urbanecm: the grants one is good but not the wiktionary one - it's the wrong size :/ [20:29:16] the SVG on commons is bad :( [20:29:27] :-( [20:29:28] Shall I follow up or revert and do a new patch? [20:30:00] Jdlrobson: depends on how long a follow-up would take. if it's a few minutes thing, upload a follow-up please. [20:30:49] 1 min [20:30:50] ill do it now [20:31:00] great, waiting :) [20:31:27] (03PS1) 10Andrew Bogott: magnum: use eqiad1-hosted etcd discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937176 (https://phabricator.wikimedia.org/T341355) [20:32:08] (03CR) 10Andrew Bogott: [C: 03+2] magnum: use eqiad1-hosted etcd discovery service [puppet] - 10https://gerrit.wikimedia.org/r/937176 (https://phabricator.wikimedia.org/T341355) (owner: 10Andrew Bogott) [20:32:17] (03PS1) 10Jdlrobson: Drop idwiktionary wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937177 [20:32:18] ^ urbanecm [20:32:29] (03Merged) 10jenkins-bot: Always return the class as string from Html::getTextInputAttributes [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/937113 (https://phabricator.wikimedia.org/T341566) (owner: 10Urbanecm) [20:32:40] (03CR) 10Urbanecm: [C: 03+2] Drop idwiktionary wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937177 (owner: 10Jdlrobson) [20:32:44] !log urbanecm@deploy1002 Sync cancelled. 
[20:33:21] (03Merged) 10jenkins-bot: Drop idwiktionary wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937177 (owner: 10Jdlrobson) [20:33:55] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]], [[gerrit:937177|Drop idwiktionary wordmark]], [[gerrit:937113|Always return the class as string from Html::getTextInputAttributes (T341566)]] [20:33:59] T341566: With $wgUseMediaWikiUIEverywhere = true, Xml::input() with class attribute causes warning or TypeError: htmlspecialchars() expects parameter 1 to be string, array given - https://phabricator.wikimedia.org/T341566 [20:35:27] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]], [[gerrit:937177|Drop idwiktionary wordmark]], [[gerrit:937113|Always return the class as string from Html::getTextInputAttributes (T341566)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:35:40] Jdlrobson: can you check mwdebug1001 again please? :) [20:39:11] urbanecm: LGTM now! [20:39:16] great, syncing [20:39:27] (together with my core backport) [20:43:20] thanks urbanecm [20:44:03] np [20:45:06] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936097|Logos: Fixes grantswiki and idwiktionary]], [[gerrit:937177|Drop idwiktionary wordmark]], [[gerrit:937113|Always return the class as string from Html::getTextInputAttributes (T341566)]] (duration: 11m 10s) [20:45:11] Jdlrobson: and, deployed [20:45:13] anything else? 
[20:45:14] T341566: With $wgUseMediaWikiUIEverywhere = true, Xml::input() with class attribute causes warning or TypeError: htmlspecialchars() expects parameter 1 to be string, array given - https://phabricator.wikimedia.org/T341566 [20:54:04] (03CR) 10Andrew Bogott: [C: 03+2] service: remove plaintext labweb service (I) [puppet] - 10https://gerrit.wikimedia.org/r/831174 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [20:54:08] (03CR) 10Andrew Bogott: [C: 03+2] service: remove plaintext labweb service (II) [puppet] - 10https://gerrit.wikimedia.org/r/831175 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [20:54:39] (03CR) 10Andrew Bogott: [C: 03+2] service: remove plaintext labweb service (III) [puppet] - 10https://gerrit.wikimedia.org/r/831176 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [20:54:55] (03PS2) 10Andrew Bogott: service: remove plaintext labweb service (I) [puppet] - 10https://gerrit.wikimedia.org/r/831174 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [20:55:15] (03PS2) 10Andrew Bogott: service: remove plaintext labweb service (II) [puppet] - 10https://gerrit.wikimedia.org/r/831175 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [20:55:24] (03PS2) 10Andrew Bogott: service: remove plaintext labweb service (III) [puppet] - 10https://gerrit.wikimedia.org/r/831176 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [21:00:13] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) https://wikitech.wikimedia.org/wiki/PyBal [21:01:55] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) https://wikitech.wikimedia.org/wiki/PyBal [21:02:06] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: labweb: update lvs pool to reference the ssl service [puppet] - 10https://gerrit.wikimedia.org/r/831173 (https://phabricator.wikimedia.org/T317463) 
(owner: 10Majavah) [21:04:46] (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_labweb.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:05:54] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) Andrew Bogott this is me failing to downtime properly, sorry! https://wikitech.wikimedia.org/wiki/PyBal [21:05:54] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.40:80]) Andrew Bogott this is me failing to downtime properly, sorry! https://wikitech.wikimedia.org/wiki/PyBal [21:14:46] (ConfdResourceFailed) resolved: confd resource _srv_config-master_pybal_eqiad_labweb.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:15:49] (HelmReleaseBadStatus) firing: Helm release opentelemetry-collector/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:16:43] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:18:25] (03PS1) 10Superpes15: [knwiki] Reverting the temporary logo and updating logo/wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937183 (https://phabricator.wikimedia.org/T338136) [21:18:27] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal 
[21:24:35] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [21:38:25] (03PS1) 10BCornwall: Add some petty spelling error fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/937185 [21:44:35] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [21:51:32] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:51:55] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:51:57] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:52:07] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:52:07] PROBLEM - Check systemd state on wdqs2013 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://w [21:52:07] wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:25] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:52:33] PROBLEM - WDQS SPARQL on wdqs2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:52:47] PROBLEM - Query Service HTTP Port on wdqs2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:32:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:37:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:51:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:07:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:08:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:09:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:10:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:40:58] (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [23:42:00] (03CR) 10Raymond Ndibe: replica_cnf_api: refactor to use multiple backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [23:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [23:51:45] (03PS1) 10Krinkle: mc: Remove mcrouter-with-onhost-tier from ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937197 (https://phabricator.wikimedia.org/T264604) [23:54:10] (03CR) 10Jdlrobson: [C: 04-1] "As discussed the after HTML should be identical to the HTML we're expecting to ship if https://gerrit.wikimedia.org/r/c/mediawiki/core/+/9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz) [23:57:51] PROBLEM - PHP opcache 
health on mw1467 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health