[00:02:59] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[00:03:05] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[00:27:20] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:28:08] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:32:02] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:32:54] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:38:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935144
[00:38:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935144 (owner: TrainBranchBot)
[00:55:18] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935144 (owner: TrainBranchBot)
[01:02:18] <wikibugs>	 (PS1) Jdlrobson: WIP: Update more logos with available SVGs [mediawiki-config] - https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162)
[01:02:29] <wikibugs>	 (CR) CI reject: [V: -1] WIP: Update more logos with available SVGs [mediawiki-config] - https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) (owner: Jdlrobson)
[02:00:02] <icinga-wm>	 RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:50] <wikibugs>	 (PS1) RLazarus: opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564)
[02:03:48] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:00] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:23] <wikibugs>	 (CR) RLazarus: "helmfile diff: https://phabricator.wikimedia.org/P49519" [deployment-charts] - https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[02:05:21] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[02:05:33] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[02:05:45] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[02:06:09] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[02:13:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:16:42] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[02:17:00] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[02:22:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:45:34] <wikibugs>	 (PS1) Cwhite: hiera: map logstash.wm.o to kibana7.eqiad" [puppet] - https://gerrit.wikimedia.org/r/935502
[02:45:55] <wikibugs>	 (PS2) Cwhite: hiera: map logstash.wm.o to kibana7.eqiad" [puppet] - https://gerrit.wikimedia.org/r/935502
[02:46:11] <wikibugs>	 (PS3) Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - https://gerrit.wikimedia.org/r/935502
[02:46:34] <wikibugs>	 (CR) CI reject: [V: -1] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - https://gerrit.wikimedia.org/r/935502 (owner: Cwhite)
[02:47:08] <wikibugs>	 (PS4) Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - https://gerrit.wikimedia.org/r/935502
[02:47:30] <wikibugs>	 (CR) CI reject: [V: -1] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - https://gerrit.wikimedia.org/r/935502 (owner: Cwhite)
[02:49:25] <wikibugs>	 (PS5) Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - https://gerrit.wikimedia.org/r/935502 (https://phabricator.wikimedia.org/T333732)
[03:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:18:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:19:38] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:56:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:01:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:23:20] <wikibugs>	 (CR) Legoktm: mw-cli-wrapper: fix own dc reference in Beta Cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935448 (owner: Krinkle)
[05:46:12] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2023-07-06-051402-production [deployment-charts] - https://gerrit.wikimedia.org/r/935835
[05:56:26] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341168 (phaultfinder)
[05:56:33] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341169 (phaultfinder)
[05:56:38] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341170 (phaultfinder)
[05:56:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0600).
[06:00:05] <wikibugs>	 (CR) Elukey: changeprop: increase the linger.ms value (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[06:01:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:02:03] <wikibugs>	 (CR) Elukey: changeprop: increase the linger.ms value (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[06:22:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:50:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:54:37] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: GitLab minor version upgrade
[06:55:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:07] <jouncebot>	 Amir1, apergos, and jnuche: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0700).
[07:01:52] <apergos>	 let's see what's happening today
[07:02:14] <apergos>	 no patches scheduled for the window. aaaaand
[07:02:36] <apergos>	 no trainees signed up to help with those 0 patches, whew!
[07:02:50] <apergos>	 have a nice day and see you next time!
[07:04:16] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 9 hosts with reason: Stopping puppet and hadoop-hdfs-datanode services then decommissioning the hosts
[07:04:35] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 9 hosts with reason: Stopping puppet and hadoop-hdfs-datanode services then decommissioning the hosts
[07:05:23] <kart_>	 I'll deploy MinT then :)
[07:07:15] <wikibugs>	 (CR) KartikMistry: [C: +2] Update MinT to 2023-07-06-051402-production [deployment-charts] - https://gerrit.wikimedia.org/r/935835 (owner: KartikMistry)
[07:08:08] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2023-07-06-051402-production [deployment-charts] - https://gerrit.wikimedia.org/r/935835 (owner: KartikMistry)
[07:09:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[07:12:31] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[07:17:47] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[07:21:33] <wikibugs>	 (CR) Stevemunene: [C: +2] analytics: Remove analytics1064_1069 from hdfs net_topology [puppet] - https://gerrit.wikimedia.org/r/933387 (https://phabricator.wikimedia.org/T317861) (owner: Stevemunene)
[07:23:09] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[07:25:24] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[07:26:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:27:22] <wikibugs>	 (PS1) Jelto: ci/zuul: set contint2002 as the active ci::manager_host [puppet] - https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659)
[07:29:21] <wikibugs>	 (CR) Jelto: ci/zuul: switch gearman server from contint2001 to contint2002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[07:29:32] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[07:29:43] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe
[07:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:31:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:31:40] <kart_>	 !log Updated MinT to 2023-07-06-051402-production
[07:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:20] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.77:443]) https://wikitech.wikimedia.org/wiki/PyBal
[07:34:28] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.77:443]) https://wikitech.wikimedia.org/wiki/PyBal
[07:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:35:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:38:52] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:40:02] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:40:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:41:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:46:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:49:18] <vgutierrez>	 those lvs alerts are related to the thanos-fe restarts
[07:50:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:54:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:55:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:59:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:00:05] <jouncebot>	 hashar and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0800).
[08:03:05] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: GitLab minor version upgrade
[08:04:34] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.77:443]) https://wikitech.wikimedia.org/wiki/PyBal
[08:05:42] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.77:443]) https://wikitech.wikimedia.org/wiki/PyBal
[08:16:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:17:23] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341168 (fgiunchedi)
[08:17:25] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341170 (fgiunchedi)
[08:17:27] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341169 (fgiunchedi)
[08:17:42] <fabfur>	 !log temporarily disabling puppet on cp1075.eqiad.wmnet, cp2027.codfw.wmnet, cp3050.esams.wmnet to apply 935760 (T340983)
[08:17:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:46] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[08:19:03] <wikibugs>	 (CR) Fabfur: [V: +1 C: +2] haproxy: support different actions for tls and http frontend [puppet] - https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:20:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:21:10] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:21:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:22:20] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:25:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:33:58] <wikibugs>	 (PS1) Btullis: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514)
[08:35:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:36:19] <wikibugs>	 (CR) Jelto: [C: -1] "This change allows public dockerhub images (mariadb) on Trusted Runners (production infrastructure). This is discouraged and we only allow" [puppet] - https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: Kosta Harlan)
[08:38:01] <wikibugs>	 (CR) Kosta Harlan: common/gitlab_runner: Allow mariadb:* images for allowed_docker_services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: Kosta Harlan)
[08:39:49] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe
[08:40:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:45:19] <fabfur>	 !log reenabled puppet on cp1075.eqiad.wmnet, cp2027.codfw.wmnet, cp3050.esams.wmnet
[08:45:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:10] <hashar>	 I forgot to run the train sorry
[08:49:14] <hashar>	 going to run it now
[08:49:31] <_joe_>	 kart_: around?
[08:49:46] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[08:49:54] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[08:50:33] <wikibugs>	 SRE, Observability-Alerting, Traffic, serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (fgiunchedi) I have extracted the `maniphest.edit` event duration from phab1004 access log, and on the 29th the operation started to take a whole lot longer:  ` 2...
[08:50:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:50:42] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[08:51:07] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[08:51:37] <wikibugs>	 SRE, Observability-Alerting, Traffic, serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (fgiunchedi) @brennen I saw your updates to phab in SAL, does the above (`maniphest.edit` taking a lot longer to create tasks) ring a bell?
[08:54:21] <wikibugs>	 SRE, Infrastructure-Foundations, Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (Volans) For context there has already been a larger effort in the past towards moving the irc server to a newer and re-written server that serve only the re...
[08:55:02] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[08:55:23] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[08:55:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:58:49] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: Arturo Borrero Gonzalez)
[08:59:32] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: Arturo Borrero Gonzalez)
[09:02:01] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.41.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/935985 (https://phabricator.wikimedia.org/T340244)
[09:02:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.41.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/935985 (https://phabricator.wikimedia.org/T340244) (owner: TrainBranchBot)
[09:02:53] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.41.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/935985 (https://phabricator.wikimedia.org/T340244) (owner: TrainBranchBot)
[09:04:26] <kart_>	 _joe_: now. Tell me.
[09:05:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:06:30] <elukey>	 kart_: o/ Joe deployed cx to remove the extra key, I think he wanted to ping you about it
[09:08:08] <kart_>	 cool. Thanks a lot, _joe_ 
[09:08:32] <kart_>	 I was looking at the graphs to see if something had exploded in cxserver/MinT :D
[09:10:11] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.16  refs T340244
[09:10:14] <stashbot>	 T340244: 1.41.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T340244
[09:10:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:10:57] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Jelto) Thanks @hashar for the detailed summary!  Regarding rsync the following commands //should// be needed (executed on `cont...
[09:11:52] <elukey>	 !log restart kube-apiserver on ml-serve-ctrl2* as an attempt to fix LIST-related latency issues
[09:11:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:13:55] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:15:14] <wikibugs>	 (CR) Arturo Borrero Gonzalez: Add a new nftables::service define (10 comments) [puppet] - https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:15:32] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:04] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST endpointslices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:20:29] <wikibugs>	 (CR) Clément Goubert: [C: +1] opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[09:22:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST endpointslices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:28:38] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[09:30:23] <wikibugs>	 (CR) Kamila Součková: [C: +1] changeprop: increase the linger.ms value [deployment-charts] - https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[09:33:51] <wikibugs>	 (CR) Jbond: "some comments inline" [puppet] - https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:33:59] <wikibugs>	 (PS1) Fabfur: haproxy: fix variable type and better naming [puppet] - https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983)
[09:35:50] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe
[09:39:09] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[09:39:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: add page routes for traffic and netops [puppet] - 10https://gerrit.wikimedia.org/r/935990
[09:42:51] <wikibugs>	 (03PS3) 10Jbond: pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377)
[09:43:59] <wikibugs>	 (03CR) 10Jbond: pybal: update check to conform to the nagios plugin api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond)
[09:44:01] <wikibugs>	 (03PS4) 10Jbond: pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377)
[09:46:07] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan)
[09:50:17] <wikibugs>	 (03PS1) 10Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991
[09:52:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar)
[09:54:57] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan)
[09:55:05] <wikibugs>	 (03CR) 10Urbanecm: Enable global abuse filters on almost all projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm)
[09:55:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan)
[09:56:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: add page routes for traffic and netops [puppet] - 10https://gerrit.wikimedia.org/r/935990 (owner: 10Filippo Giunchedi)
[09:58:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:38] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1061.eqiad.wmnet
[10:00:05] <jouncebot>	 mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000)
[10:03:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:05:50] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[10:07:09] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan)
[10:08:42] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1061.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[10:10:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:10:42] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1061.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[10:10:43] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:10:43] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1061.eqiad.wmnet
[10:11:12] <wikibugs>	 (03PS2) 10Fabfur: haproxy: fix variable type and better naming [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983)
[10:13:44] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) Thanks for the rsync commands!  Some adjustements: * delete files on the destination with: `--delete-delay` * swap the...
[10:15:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "This change will need adjusting of CirrusSearchJobQueueLagTooHigh alert, 'pint' reported this error (AlertLintProblem alert)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert)
[10:15:30] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: add native AQS1-style routes for AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/935457 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan)
[10:15:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:16:25] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: add native AQS1-style routes for AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/935457 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan)
[10:18:27] <hashar>	 I am off for lunch
[10:18:47] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42296/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:22:50] <wikibugs>	 (03PS3) 10Fabfur: haproxy: fix variable type and better naming [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983)
[10:22:51] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:24:42] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42299/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:27:26] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42300/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:29:51] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42301/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:30:23] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:30:50] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:33:02] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: fix variable type and better naming [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:35:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:37:49] <wikibugs>	 (03PS2) 10Btullis: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514)
[10:40:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:41:05] <wikibugs>	 (03PS3) 10Btullis: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514)
[10:41:19] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1062.eqiad.wmnet
[10:42:02] <taavi>	 jouncebot: nowandnext
[10:42:03] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000)
[10:42:03] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000)
[10:42:03] <jouncebot>	 In 2 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300)
[10:42:03] <jouncebot>	 In 2 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300)
[10:44:00] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[10:44:15] <wikibugs>	 (03PS1) 10Majavah: extdist: REL1_40 is stable, REL1_38 is EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997
[10:45:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:46:28] <taavi>	 urbanecm: (or someone else) if you could quickly double-check https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/935997/ is correct I'd appreciate it
[10:47:15] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[10:47:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] extdist: REL1_40 is stable, REL1_38 is EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah)
[10:47:50] <urbanecm>	 sounds correct to me taavi 
[10:48:02] <taavi>	 thanks
[10:48:16] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[10:48:16] <taavi>	 looks like the mw infra window is unused, so I'll push that out now
[10:48:24] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] extdist: REL1_40 is stable, REL1_38 is EOL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah)
[10:48:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah)
[10:49:16] <wikibugs>	 (03Merged) 10jenkins-bot: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[10:49:25] <wikibugs>	 (03Merged) 10jenkins-bot: extdist: REL1_40 is stable, REL1_38 is EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah)
[10:49:46] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:935997|extdist: REL1_40 is stable, REL1_38 is EOL]]
[10:51:10] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:935997|extdist: REL1_40 is stable, REL1_38 is EOL]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[10:51:16] <Lucas_WMDE>	 I have two other config changes I could deploy afterwards if no one else is doing anything
[10:51:29] <Lucas_WMDE>	 (or even three)
[10:51:43] <Lucas_WMDE>	 (I won’t be around for the backport window later unfortunately)
[10:52:28] <taavi>	 syncing
[10:53:46] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:54:12] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1062.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[10:55:04] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Beta-Wikidata: Always show mul on desktop Termbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große)
[10:55:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:58:07] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:935997|extdist: REL1_40 is stable, REL1_38 is EOL]] (duration: 08m 21s)
[10:58:14] * taavi done
[11:00:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:01:26] <wikibugs>	 (03PS8) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811)
[11:01:28] <wikibugs>	 (03PS5) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811)
[11:02:25] <wikibugs>	 (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[11:02:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42302/console" [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[11:03:50] <Lucas_WMDE>	 alright, I’ll deploy some config changes then
[11:03:56] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe
[11:03:58] <Lucas_WMDE>	 (none of them are urgent, feel free to ping me if you want to do something in between)
[11:04:20] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:05:13] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1062.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[11:05:14] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:05:14] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1062.eqiad.wmnet
[11:05:19] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): outreachwiki: Set wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935455
[11:05:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935455 (owner: 10Lucas Werkmeister (WMDE))
[11:06:19] <wikibugs>	 (03Merged) 10jenkins-bot: outreachwiki: Set wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935455 (owner: 10Lucas Werkmeister (WMDE))
[11:06:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:935455|outreachwiki: Set wmgWikibaseSiteGroup]]
[11:07:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:935455|outreachwiki: Set wmgWikibaseSiteGroup]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[11:08:31] <Lucas_WMDE>	 I checked in `mwscript shell outreachwiki` that `wbc::getSiteGroup()` returns the same result before and after the change, as expected. syncing
[11:10:28] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudswift1001.eqiad.wmnet
[11:12:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[11:12:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[11:14:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:935455|outreachwiki: Set wmgWikibaseSiteGroup]] (duration: 07m 35s)
[11:14:14] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[11:14:16] <wikibugs>	 (03PS9) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811)
[11:14:18] <wikibugs>	 (03PS6) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811)
[11:14:33] <wikibugs>	 (03PS9) 10Lucas Werkmeister (WMDE): foundationwiki: Enable WikibaseClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent)
[11:14:48] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[11:15:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent)
[11:15:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42303/console" [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[11:15:47] <wikibugs>	 (03Merged) 10jenkins-bot: foundationwiki: Enable WikibaseClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent)
[11:16:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:850547|foundationwiki: Enable WikibaseClient (T321967)]]
[11:16:06] <stashbot>	 T321967: Enable Wikibase client on Wikimedia Foundation Governance Wiki - https://phabricator.wikimedia.org/T321967
[11:17:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 varnent and lucaswerkmeister-wmde: Backport for [[gerrit:850547|foundationwiki: Enable WikibaseClient (T321967)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[11:17:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[11:18:27] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[11:19:02] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1063.eqiad.wmnet
[11:19:12] <Lucas_WMDE>	 I linked foundationwiki’s Wikimedia:Sandbox to https://www.wikidata.org/wiki/Q3938?debug=2
[11:19:16] <Lucas_WMDE>	 and sitelinks appeared on https://foundation.wikimedia.org/wiki/Wikimedia:Sandbox
[11:19:22] <Lucas_WMDE>	 I think that’s a success. syncing
[11:19:51] <wikibugs>	 (03PS1) 10ArielGlenn: Give Dan Andreescu and Jennifer Ebe root on dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/936003 (https://phabricator.wikimedia.org/T341045)
[11:22:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[11:22:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[11:22:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[11:22:55] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:23:05] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudswift1002.eqiad.wmnet
[11:24:01] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Beta-Wikidata: Always show mul on desktop Termbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große)
[11:24:08] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:24:09] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudswift1001.eqiad.wmnet
[11:24:38] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[11:25:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:850547|foundationwiki: Enable WikibaseClient (T321967)]] (duration: 08m 58s)
[11:25:06] <stashbot>	 T321967: Enable Wikibase client on Wikimedia Foundation Governance Wiki - https://phabricator.wikimedia.org/T321967
[11:25:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große)
[11:25:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:26:17] <wikibugs>	 (03Merged) 10jenkins-bot: Beta-Wikidata: Always show mul on desktop Termbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große)
[11:26:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:935770|Beta-Wikidata: Always show mul on desktop Termbox (T339104)]]
[11:26:36] <stashbot>	 T339104: Create feature flag to always show `mul` in “in more languages” section of desktop termbox - https://phabricator.wikimedia.org/T339104
[11:26:53] <wikibugs>	 (03PS1) 10Jbond: puppetdb::site: secret needs to be content not source [puppet] - 10https://gerrit.wikimedia.org/r/936007 (https://phabricator.wikimedia.org/T338811)
[11:26:59] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935868
[11:27:21] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:27:32] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons.
[11:27:52] <wikibugs>	 (03CR) 10Vgutierrez: trafficserver: add gateway routing script, route device-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935464 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[11:27:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 migr and lucaswerkmeister-wmde: Backport for [[gerrit:935770|Beta-Wikidata: Always show mul on desktop Termbox (T339104)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[11:28:17] <Lucas_WMDE>	 not much to test here, it’s a Beta-only change
[11:28:24] <Lucas_WMDE>	 it just touches Wikibase.php, but should have no effect
[11:28:32] <Lucas_WMDE>	 syncing after confirming that the site didn’t blow up on mwdebug
[11:29:31] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudswift1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001"
[11:30:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb::site: secret needs to be content not source [puppet] - 10https://gerrit.wikimedia.org/r/936007 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[11:30:27] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudswift1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001"
[11:30:28] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:30:28] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudswift1002.eqiad.wmnet
[11:30:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:34:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:935770|Beta-Wikidata: Always show mul on desktop Termbox (T339104)]] (duration: 07m 37s)
[11:34:14] <stashbot>	 T339104: Create feature flag to always show `mul` in “in more languages” section of desktop termbox - https://phabricator.wikimedia.org/T339104
[11:34:30] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[11:34:42] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:48] * Lucas_WMDE done
[11:35:01] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:35:45] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::prometheus: add pod_name label [puppet] - 10https://gerrit.wikimedia.org/r/936014
[11:36:25] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] common/gitlab_runner: Allow mariadb:* images for allowed_docker_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan)
[11:36:46] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert "Add tag when reference added to the page" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202)
[11:38:49] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001"
[11:39:10] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42304/console" [puppet] - 10https://gerrit.wikimedia.org/r/936014 (owner: 10Majavah)
[11:39:33] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001"
[11:39:33] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:41:14] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:41:15] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts analytics1063.eqiad.wmnet
[11:41:20] <MatmaRex>	 hi, anyone would like to deploy a revert for me? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/935854
[11:41:26] <MatmaRex>	 seems like a bad train regression
[11:41:44] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1063.eqiad.wmnet
[11:41:55] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BTullis)
[11:42:53] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001
[11:43:11] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb1001
[11:43:14] <TheresNoTime>	 MatmaRex: can do
[11:43:19] <TheresNoTime>	 jouncebot: nowandnext
[11:43:19] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 16 minute(s)
[11:43:19] <jouncebot>	 In 1 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300)
[11:43:19] <jouncebot>	 In 1 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300)
[11:44:14] <wikibugs>	 (03PS1) 10Btullis: Bump the version of the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936015 (https://phabricator.wikimedia.org/T329514)
[11:45:45] * TheresNoTime waiting for 935854's CI to finish
[11:46:27] <wikibugs>	 (03PS1) 10Jbond: puppetdb::site: fix nginx syntax error [puppet] - 10https://gerrit.wikimedia.org/r/936016
[11:46:30] <wikibugs>	 (03PS1) 10Jbond: nginx: manage nginx directory [puppet] - 10https://gerrit.wikimedia.org/r/936017
[11:46:32] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[11:47:45] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:47:45] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts analytics1063.eqiad.wmnet
[11:48:15] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001
[11:48:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[11:48:23] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[11:48:31] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1001
[11:48:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[11:49:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[11:49:17] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[11:49:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[11:50:11] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudlb1001.eqiad.wmnet on all recursors
[11:50:13] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb1001.eqiad.wmnet on all recursors
[11:50:25] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1063.eqiad.wmnet
[11:50:36] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Bump the version of the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936015 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:50:45] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:51:21] <wikibugs>	 (03Merged) 10jenkins-bot: Bump the version of the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936015 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:52:04] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[11:52:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202) (owner: 10Bartosz Dziewoński)
[11:52:21] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:52:43] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001"
[11:53:27] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001"
[11:53:27] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:53:44] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:54:24] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudlb1001.eqiad.wmnet on all recursors
[11:54:27] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb1001.eqiad.wmnet on all recursors
[11:55:21] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[11:55:41] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001"
[11:55:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 58): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42305/console" [puppet] - 10https://gerrit.wikimedia.org/r/936017 (owner: 10Jbond)
[11:56:18] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001"
[11:56:18] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:56:26] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1002
[11:56:41] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1002
[11:56:43] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:56:44] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts analytics1063.eqiad.wmnet
[11:56:46] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:56:57] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS bullseye
[11:58:40] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudlb1001/1002: add role [puppet] - 10https://gerrit.wikimedia.org/r/936019 (https://phabricator.wikimedia.org/T341200)
[11:59:58] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudlb1001/1002: add role [puppet] - 10https://gerrit.wikimedia.org/r/936019 (https://phabricator.wikimedia.org/T341200)
[12:00:46] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb1001/1002: add role [puppet] - 10https://gerrit.wikimedia.org/r/936019 (https://phabricator.wikimedia.org/T341200) (owner: 10Arturo Borrero Gonzalez)
[12:01:59] <wikibugs>	 (03PS2) 10Jbond: nginx: manage nginx directory [puppet] - 10https://gerrit.wikimedia.org/r/936017
[12:02:17] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "cinder-backups: consolidate backup jobs on one host" [puppet] - 10https://gerrit.wikimedia.org/r/936020
[12:06:21] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Revert "Add tag when reference added to the page" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202) (owner: 10Bartosz Dziewoński)
[12:08:18] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Add tag when reference added to the page" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202) (owner: 10Bartosz Dziewoński)
[12:08:28] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200)
[12:08:34] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:935854|Revert "Add tag when reference added to the page" (T341202)]]
[12:08:37] <stashbot>	 T341202: Unable to edit an article on mobile (JavaScript error) - https://phabricator.wikimedia.org/T341202
[12:11:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb::site: fix nginx syntax error [puppet] - 10https://gerrit.wikimedia.org/r/936016 (owner: 10Jbond)
[12:12:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] nginx: manage nginx directory [puppet] - 10https://gerrit.wikimedia.org/r/936017 (owner: 10Jbond)
[12:15:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host zookeeper-test1002.eqiad.wmnet with OS bookworm
[12:15:31] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host zookeeper-test1002.eqiad.wmnet with OS bookworm
[12:16:22] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 10835 bytes in 0.477 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[12:17:22] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 10868 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[12:17:50] <icinga-wm>	 RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:03] <logmsgbot>	 !log samtar@deploy1002 matmarex and samtar: Backport for [[gerrit:935854|Revert "Add tag when reference added to the page" (T341202)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[12:21:06] <stashbot>	 T341202: Unable to edit an article on mobile (JavaScript error) - https://phabricator.wikimedia.org/T341202
[12:21:24] <MatmaRex>	 will test
[12:21:27] <TheresNoTime>	 MatmaRex: ack
[12:22:19] <MatmaRex>	 TheresNoTime: looks good, no console errors
[12:22:25] <TheresNoTime>	 syncing
[12:23:29] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936014 (owner: 10Majavah)
[12:32:38] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:935854|Revert "Add tag when reference added to the page" (T341202)]] (duration: 24m 04s)
[12:32:41] <stashbot>	 T341202: Unable to edit an article on mobile (JavaScript error) - https://phabricator.wikimedia.org/T341202
[12:32:58] <wikibugs>	 (03CR) 10Kosta Harlan: common/gitlab_runner: Allow mariadb:* images for allowed_docker_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan)
[12:33:02] <wikibugs>	 (03Abandoned) 10Kosta Harlan: common/gitlab_runner: Allow mariadb:* images for allowed_docker_services [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan)
[12:34:00] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) Thanks for reporting the issue @Arnoldokoth !  I grepped a bit in `/var/log/cas/cas-2023-07-05.log ` on `idp-test1002` and found...
[12:34:58] <TheresNoTime>	 MatmaRex: live :)
[12:35:04] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200)
[12:35:20] <MatmaRex>	 thanks TheresNoTime
[12:35:32] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1064.eqiad.wmnet
[12:36:53] <wikibugs>	 (03PS1) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811)
[12:40:52] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[12:41:14] <icinga-wm>	 RECOVERY - Ganeti memory on ganeti1013 is OK: OK Memory 86% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[12:41:29] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: add gateway routing script, route device-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935464 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[12:42:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host zookeeper-test1002.eqiad.wmnet with OS bookworm
[12:43:00] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1064.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[12:47:08] <TheresNoTime>	 MatmaRex: I just +2'd the master (935853) patch for that backport too... realised I probably should have asked before doing so
[12:47:45] <MatmaRex>	 TheresNoTime: oh, thanks, i think that's just a formality
[12:48:54] <MatmaRex>	 i think my team is mostly asleep now, and i don't want to ping them when the issue is mitigated already
[12:49:47] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) >>! In T320390#8993613, @Jelto wrote: > Thanks for reporting the issue @Arnoldokoth ! >  > I grepped a bit in `/var/log/cas/cas-...
[12:51:44] <wikibugs>	 (03PS2) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811)
[12:51:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney) p:05Triage→03Medium
[12:52:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42307/console" [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[12:53:09] <wikibugs>	 (03PS3) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811)
[12:54:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney)
[12:55:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney)
[12:56:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on zookeeper-test1002.eqiad.wmnet with reason: host reimage
[12:56:29] <wikibugs>	 (03PS4) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811)
[12:58:15] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1064.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[12:58:15] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:58:16] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1064.eqiad.wmnet
[12:58:53] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zookeeper-test1002.eqiad.wmnet with reason: host reimage
[12:58:54] <wikibugs>	 (03CR) 10Majavah: Enable global abuse filters on almost all projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:36] <urbanecm>	 nothing to do indeed
[13:00:38] <wikibugs>	 (03PS5) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811)
[13:00:43] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb1001.eqiad.wmnet with OS bullseye
[13:01:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42310/console" [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[13:02:33] <wikibugs>	 (03CR) 10Urbanecm: Enable global abuse filters on almost all projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm)
[13:02:38] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1065.eqiad.wmnet
[13:02:45] <urbanecm>	 taavi: if you have a while, maybe we can finish the discussion synchronously here and deploy?
[13:02:52] <taavi>	 sure
[13:03:17] <urbanecm>	 TLDR CI requires me to remove it (at least) from `MWMultiVersion::DB_LIST`. We can work around that if we want to though.
[13:03:18] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[13:03:47] <wikibugs>	 (03PS1) 10Elukey: java::version: add support for openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/936032
[13:04:07] <taavi>	 why? because it's not used anywhere?
[13:04:11] <urbanecm>	 yes
[13:04:21] <taavi>	 ah, I see
[13:04:21] <urbanecm>	 anywhere in operations/mediawiki-config at least
[13:04:42] <urbanecm>	 it might be used in regular maintenance jobs or one-off `foreachwikiindblist` tasks
[13:04:55] <taavi>	 oh right, I was just about to ask if .dblists are used anywhere else
[13:05:22] <urbanecm>	 those are the two places that come to mind. it might be used for a lot of things, and it is nearly impossible to identify where it is (not) used
[13:05:48] <taavi>	 my original concern was that leaving it there but in a way it's not visible to the config might be confusing, but then I didn't realize other places also use the dblist files
[13:05:49] <urbanecm>	 so my suggestion is to follow what CI wants and write a task to decide whether it should be removed for good or left as is
[13:06:15] <urbanecm>	 we have other dblists that are only available from outside of the config repo (growthexperiments.dblist is one, and there are probably others)
[13:06:18] <taavi>	 that sounds ok to me
[13:06:31] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm)
[13:06:40] <urbanecm>	 thanks! using the window to sync it out then.
[13:06:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm)
[13:07:06] <wikibugs>	 (03PS4) 10Urbanecm: Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159)
[13:07:08] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm)
[13:08:16] <wikibugs>	 (03Merged) 10jenkins-bot: Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm)
[13:08:34] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[13:08:36] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935815|Enable global abuse filters on almost all projects (T341159)]]
[13:08:39] <stashbot>	 T341159: Enable global abuse filters for all Wikimedia projects - https://phabricator.wikimedia.org/T341159
[13:10:02] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:935815|Enable global abuse filters on almost all projects (T341159)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[13:10:39] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1065.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[13:10:51] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons.
[13:11:00] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1010 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:11:06] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[13:11:32] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:12:07] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1065.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[13:12:07] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:12:08] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1065.eqiad.wmnet
[13:12:52] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] java::version: add support for openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/936032 (owner: 10Elukey)
[13:13:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] java::version: add support for openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/936032 (owner: 10Elukey)
[13:14:18] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1066.eqiad.wmnet
[13:14:30] <elukey>	 kafka test is my fault :)
[13:16:29] <wikibugs>	 (03PS1) 10Btullis: Deploy a new image for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/936035 (https://phabricator.wikimedia.org/T329514)
[13:17:26] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-worker1095.eqiad.wmnet with reason: Replacing RAID controller battery
[13:17:40] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1095.eqiad.wmnet with reason: Replacing RAID controller battery
[13:17:46] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6f84de2d-a493-4b54-92d4-cefed7da6f97) set by btullis@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their s...
[13:18:43] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:935815|Enable global abuse filters on almost all projects (T341159)]] (duration: 10m 07s)
[13:18:47] <stashbot>	 T341159: Enable global abuse filters for all Wikimedia projects - https://phabricator.wikimedia.org/T341159
[13:18:58] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10BTullis) @Jclark-ctr - I've shut down the machine and downtimed it. Feel free to boot it again normally after changing the battery. Many thanks.
[13:18:59] <urbanecm>	 deployed.
[13:19:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:20:08] <wikibugs>	 (03PS1) 10Cathal Mooney: Enable DHCP relay function for vlan 1023 (analytics1-d-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936036
[13:20:36] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Deploy a new image for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/936035 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:21:27] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy a new image for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/936035 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:22:24] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Enable DHCP relay function for vlan 1023 (analytics1-d-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936036 (owner: 10Cathal Mooney)
[13:22:54] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[13:23:56] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Enable DHCP relay function for vlan 1023 (analytics1-d-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936036 (owner: 10Cathal Mooney)
[13:24:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:24:57] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1066.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[13:25:09] <wikibugs>	 (03PS1) 10Cathal Mooney: Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037
[13:26:45] <wikibugs>	 (03CR) 10Papaul: [V: 03+1] Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037 (owner: 10Cathal Mooney)
[13:26:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037 (owner: 10Cathal Mooney)
[13:27:30] <wikibugs>	 (03Merged) 10jenkins-bot: Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037 (owner: 10Cathal Mooney)
[13:29:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-test-worker1003.eqiad.wmnet
[13:29:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host an-test-worker1003.eqiad.wmnet
[13:29:47] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:29:52] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[13:30:23] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1066.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[13:30:23] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:30:23] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1066.eqiad.wmnet
[13:32:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr)
[13:32:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-test-worker1003.eqiad.wmnet
[13:32:56] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10Jclark-ctr) 05Open→03Resolved @BTullis  replaced failed battery. server is booting up now
[13:33:11] <wikibugs>	 (03PS1) 10Ladsgroup: ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935856 (https://phabricator.wikimedia.org/T341000)
[13:33:47] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zookeeper-test1002.eqiad.wmnet with OS bookworm
[13:34:12] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[13:34:57] <wikibugs>	 (03PS1) 10Ladsgroup: ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/935857 (https://phabricator.wikimedia.org/T341000)
[13:35:51] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:55] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1010 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[13:37:06] <wikibugs>	 (03PS1) 10JMeybohm: calico::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291)
[13:37:08] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936041 (https://phabricator.wikimedia.org/T328291)
[13:37:49] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:38:01] <wikibugs>	 (03PS1) 10Btullis: Enable the datahub systemupdate job [deployment-charts] - 10https://gerrit.wikimedia.org/r/936042 (https://phabricator.wikimedia.org/T329514)
[13:38:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudlb1001.eqiad.wmnet
[13:38:39] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2024-04-04 08:08:00 +0000 (expires in 272 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:39:41] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:40:51] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2024-04-04 09:53:00 +0000 (expires in 272 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:41:02] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Enable the datahub systemupdate job [deployment-charts] - 10https://gerrit.wikimedia.org/r/936042 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:41:31] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:41:46] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the datahub systemupdate job [deployment-charts] - 10https://gerrit.wikimedia.org/r/936042 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:42:03] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2024-04-04 14:13:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:42:14] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[13:43:21] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:44:47] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2024-04-04 14:53:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:45:13] <wikibugs>	 (PS1) JMeybohm: kubernetes::node: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936043 (https://phabricator.wikimedia.org/T328291)
[13:46:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:47:11] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1010 is OK: SSL OK - Certificate kafka-test1010.eqiad.wmnet valid until 2024-04-04 15:44:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:47:45] <wikibugs>	 (PS1) JMeybohm: rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291)
[13:50:08] <wikibugs>	 (CR) CI reject: [V: -1] rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[13:50:10] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations: Ripe atlas eqiad reported down in Icinga since 2023-06-27 - https://phabricator.wikimedia.org/T341108 (Jclark-ctr) Replaced SFP-T looks like link returned will close ticket if alert has cleared
[13:51:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:51:40] <wikibugs>	 (PS2) Ssingh: dns1004: provision new DNS host in eqiad (hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/933918 (https://phabricator.wikimedia.org/T326685)
[13:51:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudlb1001.eqiad.wmnet
[13:53:01] <wikibugs>	 (CR) Ssingh: [C: +2] dns1004: provision new DNS host in eqiad (hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/933918 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[13:53:40] <wikibugs>	 (PS2) JMeybohm: rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291)
[13:54:00] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[13:54:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:54:37] <wikibugs>	 (PS1) Giuseppe Lavagetto: deployment_server: add REPL for mw-debug [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197)
[13:55:37] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[13:56:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye
[13:56:13] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[13:56:57] <wikibugs>	 (CR) CI reject: [V: -1] deployment_server: add REPL for mw-debug [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto)
[13:57:53] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:27] <sukhe>	 ^ 198.35.26.207            Down      xe-0/1/2.0
[13:59:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:00:45] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:23] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye
[14:02:39] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1004.wikimedia.org with OS bullseye
[14:02:49] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**)   - Removed fro...
[14:02:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye
[14:02:59] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[14:05:00] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet
[14:05:56] <hnowlan>	 !log disabling puppet on A:cp-text to test 935464 
[14:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:36] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1067.eqiad.wmnet
[14:08:20] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:26] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[14:09:28] <wikibugs>	 (CR) Hnowlan: [C: +2] trafficserver: add gateway routing script, route device-analytics [puppet] - https://gerrit.wikimedia.org/r/935464 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan)
[14:11:14] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42311/console" [puppet] - https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[14:12:23] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[14:13:29] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[14:13:45] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1004.wikimedia.org with OS bullseye
[14:13:56] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**)   - Removed fro...
[14:14:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye
[14:14:13] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[14:14:31] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[14:15:33] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42312/console" [puppet] - https://gerrit.wikimedia.org/r/936043 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[14:15:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[14:16:18] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[14:16:40] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42313/console" [puppet] - https://gerrit.wikimedia.org/r/936041 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[14:18:20] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:44] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[14:18:44] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:18:45] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1067.eqiad.wmnet
[14:19:23] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001
[14:19:44] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1001
[14:20:27] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1068.eqiad.wmnet
[14:22:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[14:22:25] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[14:25:24] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:25:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:25:53] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[14:26:51] <wikibugs>	 (PS1) Ssingh: Revert "dns1004: provision new DNS host in eqiad (hardware refresh)" [puppet] - https://gerrit.wikimedia.org/r/935858
[14:27:26] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye
[14:27:58] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1068.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[14:28:19] <wikibugs>	 (PS1) Jbond: nftable::service: address comments [puppet] - https://gerrit.wikimedia.org/r/936049
[14:28:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye
[14:29:06] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1068.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[14:29:07] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:29:07] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1068.eqiad.wmnet
[14:29:24] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (aborrero)
[14:30:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops, User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (aborrero)
[14:30:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:30:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops, User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (cmooney)
[14:31:24] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1069.eqiad.wmnet
[14:31:31] <wikibugs>	 (PS1) Hnowlan: Revert "trafficserver: add gateway routing script, route device-analytics" [puppet] - https://gerrit.wikimedia.org/r/935859
[14:31:52] <wikibugs>	 (CR) Vgutierrez: [C: +1] Revert "trafficserver: add gateway routing script, route device-analytics" [puppet] - https://gerrit.wikimedia.org/r/935859 (owner: Hnowlan)
[14:32:59] <wikibugs>	 (CR) Jbond: [C: -1] "-1: see inline, i also created a Cr with all theses comments applied[1] if we agree on this approach we can squash that into this" [puppet] - https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[14:34:06] <wikibugs>	 (CR) Hnowlan: [C: +2] Revert "trafficserver: add gateway routing script, route device-analytics" [puppet] - https://gerrit.wikimedia.org/r/935859 (owner: Hnowlan)
[14:35:27] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms
[14:35:39] <wikibugs>	 (CR) Jbond: [C: -1] Add a new nftables::service define (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[14:35:44] <hnowlan>	 !log reenabling puppet on A:cp
[14:35:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:17] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet
[14:37:04] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[14:37:29] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 670 probes of 761 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:37:39] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms
[14:37:45] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 537 probes of 694 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:42:01] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1069.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[14:42:59] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 61 probes of 695 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:44:24] <wikibugs>	 (PS1) Ssingh: P:ntp: do not use global variables [puppet] - https://gerrit.wikimedia.org/r/936050
[14:45:22] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42314/console" [puppet] - https://gerrit.wikimedia.org/r/936050 (owner: Ssingh)
[14:45:37] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1004.wikimedia.org with OS bullseye
[14:45:47] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**)   - Removed fro...
[14:45:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye
[14:46:01] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye
[14:46:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: host reimage
[14:46:17] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1069.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001"
[14:46:17] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:46:18] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1069.eqiad.wmnet
[14:47:58] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[14:47:59] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 11 probes of 762 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:48:11] <wikibugs>	 (PS1) Stevemunene: analytics: remove puppet references for analytics[1058-1069] [puppet] - https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861)
[14:49:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: host reimage
[14:51:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:53:24] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42315/console" [puppet] - https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[14:54:06] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[14:55:17] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:56:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: nftable::service: address comments (3 comments) [puppet] - https://gerrit.wikimedia.org/r/936049 (owner: Jbond)
[14:57:09] <wikibugs>	 (CR) Arturo Borrero Gonzalez: Add a new nftables::service define (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[14:57:29] <wikibugs>	 (PS2) Stevemunene: analytics: remove puppet references for analytics[1058-1069] [puppet] - https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861)
[14:57:54] <wikibugs>	 (PS2) Ssingh: sites.yaml: add new dns host dns1004 (eqiad hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/933917 (https://phabricator.wikimedia.org/T326685)
[14:58:09] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1004.wikimedia.org with reason: host reimage
[14:58:28] <wikibugs>	 (PS1) Cathal Mooney: Add Eqiad cloud VIP range to prefix list filtering inbound from hosts [homer/public] - https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223)
[15:00:37] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, thanks!" [homer/public] - https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223) (owner: Cathal Mooney)
[15:00:39] <wikibugs>	 (CR) Stevemunene: [V: +1 C: +2] Create spark3 local directory [puppet] - https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: Stevemunene)
[15:00:49] <wikibugs>	 (CR) Btullis: analytics: remove puppet references for analytics[1058-1069] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: Stevemunene)
[15:01:17] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add Eqiad cloud VIP range to prefix list filtering inbound from hosts [homer/public] - https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223) (owner: Cathal Mooney)
[15:02:05] <wikibugs>	 (Merged) jenkins-bot: Add Eqiad cloud VIP range to prefix list filtering inbound from hosts [homer/public] - https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223) (owner: Cathal Mooney)
[15:02:07] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:02:17] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1004.wikimedia.org with reason: host reimage
[15:04:03] <wikibugs>	 (PS2) Giuseppe Lavagetto: deployment_server: add REPL for mw-debug [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197)
[15:05:44] <wikibugs>	 (CR) JMeybohm: [C: -1] deployment_server: add REPL for mw-debug (3 comments) [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto)
[15:06:44] <wikibugs>	 (PS3) Stevemunene: analytics: remove puppet references for analytics[1058-1069] [puppet] - https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861)
[15:07:21] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2924325.79s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:07:35] <sukhe>	 hm ok
[15:07:37] <sukhe>	 expected
[15:08:12] <wikibugs>	 (CR) Stevemunene: analytics: remove puppet references for analytics[1058-1069] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: Stevemunene)
[15:10:44] <wikibugs>	 (CR) JMeybohm: [C: -1] deployment_server: add REPL for mw-debug (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto)
[15:12:32] <wikibugs>	 (PS1) Ssingh: P:ntp: increase interval for checking stale ntp.conf file [puppet] - https://gerrit.wikimedia.org/r/936054
[15:13:34] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42316/console" [puppet] - https://gerrit.wikimedia.org/r/936054 (owner: Ssingh)
[15:13:54] <wikibugs>	 (PS1) Dreamy Jazz: Disable purging of old client hint data by default [mediawiki-config] - https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959)
[15:15:55] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-worker1003.eqiad.wmnet with OS bullseye
[15:16:25] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[15:16:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] calico::kubernetes: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:17:25] <wikibugs>	 (PS2) Elukey: changeprop: increase the linger.ms value [deployment-charts] - https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357)
[15:17:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] kubernetes::master: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936041 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:18:17] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] kubernetes::node: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936043 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:18:37] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:18:37] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2922987.70s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:18:56] <sukhe>	 ^ this is expected, first time reimaging with the automation, so will tune the check intervals
[15:19:01] <sukhe>	 the patch is above, merging later
[15:19:27] <wikibugs>	 (PS3) Giuseppe Lavagetto: deployment_server: add REPL for mw-debug [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197)
[15:19:33] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2920992.35s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:19:34] <wikibugs>	 (CR) Giuseppe Lavagetto: deployment_server: add REPL for mw-debug (3 comments) [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto)
[15:20:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:21:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:21:51] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1004.wikimedia.org with OS bullseye
[15:22:02] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye completed: - dns1004 (**PASS**)   - Removed from Puppet an...
[15:22:10] <wikibugs>	 (CR) Elukey: [C: +2] changeprop: increase the linger.ms value [deployment-charts] - https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[15:23:19] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:24:19] <wikibugs>	 SRE, SRE-Access-Requests, serviceops-radar, Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (akosiaris)
[15:25:52] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[15:26:07] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2925119.33s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:27:04] <wikibugs>	 (CR) Giuseppe Lavagetto: deployment_server: add REPL for mw-debug (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto)
[15:27:30] <wikibugs>	 (PS2) Jdlrobson: Update more logos with available SVGs [mediawiki-config] - https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162)
[15:27:41] <wikibugs>	 (CR) CI reject: [V: -1] Update more logos with available SVGs [mediawiki-config] - https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) (owner: Jdlrobson)
[15:27:53] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2921399.04s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:28:22] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[15:28:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:29:32] <sukhe>	 !log restart ntp.service on A:dns-rec
[15:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:38] <wikibugs>	 (03PS1) 10Elukey: changeprop: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936057
[15:30:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Forgot to bump the chart's version https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936057" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey)
[15:31:10] <wikibugs>	 10SRE, 10observability, 10serviceops: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10akosiaris) Any objections to switching "svc.%{::site}.wmnet" to "discovery.wmnet" ?
[15:31:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] changeprop: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936057 (owner: 10Elukey)
[15:33:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:33:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936003 (https://phabricator.wikimedia.org/T341045) (owner: 10ArielGlenn)
[15:34:06] <wikibugs>	 (03PS2) 10Ssingh: P:ntp: increase interval for checking stale ntp.conf file [puppet] - 10https://gerrit.wikimedia.org/r/936054
[15:35:04] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42317/console" [puppet] - 10https://gerrit.wikimedia.org/r/936054 (owner: 10Ssingh)
[15:35:28] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[15:36:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[15:36:27] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2923205.83s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:36:39] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2920896.90s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:36:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[15:36:53] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:ntp: increase interval for checking stale ntp.conf file [puppet] - 10https://gerrit.wikimedia.org/r/936054 (owner: 10Ssingh)
[15:36:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[15:36:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/936050 (owner: 10Ssingh)
[15:37:29] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) @MSantos, change deployed today. e.g. https://en.wikipedia.org/api/rest_v1/page/mobile-sections now returns a 403 wi...
[15:39:56] <wikibugs>	 (03PS2) 10Jbond: nftable::service: address comments [puppet] - 10https://gerrit.wikimedia.org/r/936049
[15:40:21] <wikibugs>	 10SRE, 10Observability-Alerting, 10Traffic, 10collaboration-services, 10serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10akosiaris)
[15:41:02] <wikibugs>	 (03PS3) 10Jbond: nftable::service: address comments [puppet] - 10https://gerrit.wikimedia.org/r/936049
[15:41:14] <wikibugs>	 (03CR) 10Jbond: nftable::service: address comments (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936049 (owner: 10Jbond)
[15:41:59] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2923120.30s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:42:37] <wikibugs>	 (03PS1) 10Effie Mouzeli: ipoid: add APP_CONFIG_PATH for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/936059
[15:43:12] <wikibugs>	 10SRE, 10observability, 10serviceops: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10Joe) Is this still relevant? I think we moved all LVS alerts off of icinga by now. But yeah no objection apart from what I stated above.
[15:44:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm)
[15:45:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: add APP_CONFIG_PATH for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/936059 (owner: 10Effie Mouzeli)
[15:45:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[15:45:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[15:45:50] <wikibugs>	 10SRE, 10observability, 10serviceops: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10akosiaris) No, it's not relevant to icinga so much any more (and it's going to be less and less). It's still an interesting informational thing though and the replaceme...
[15:46:00] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: add APP_CONFIG_PATH for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/936059 (owner: 10Effie Mouzeli)
[15:47:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney)
[15:47:13] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[15:47:30] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[15:49:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:49:36] <wikibugs>	 (03CR) 10Btullis: analytics: remove puppet references for analytics[1058-1069] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene)
[15:50:13] <wikibugs>	 (03PS2) 10Milimetric: replicas: redact revdeleted, oversighted information [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF))
[15:51:06] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: service: Replace svc.%{::site} with discovery [puppet] - 10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697)
[15:51:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] service: Replace svc.%{::site} with discovery [puppet] - 10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697) (owner: 10Alexandros Kosiaris)
[15:53:13] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[15:53:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[15:53:27] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: service: Replace svc.%{::site} with discovery [puppet] - 10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697)
[15:54:24] <elukey>	 !log changeprop's kafka linger.ms set to 20s - T338357 (was 5ms, now changeprop waits a bit more to batch messages to send to kafka in one go)
[15:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:27] <stashbot>	 T338357: Pushing jobs to jobqueue is slow again - https://phabricator.wikimedia.org/T338357
[15:54:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:56:23] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus)
[15:57:07] <wikibugs>	 (03Merged) 10jenkins-bot: opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus)
[15:57:33] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2924662.70s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:57:33] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:57:33] <icinga-wm>	 PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2920932.00s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:57:35] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:57:49] <sukhe>	 ^ expected, spacing out restarts should resolve soon
[15:57:58] <sukhe>	 increased the check interval here so we don't start spamming early
[15:59:01] <mutante>	 👍
[15:59:59] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:59:59] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:59:59] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[15:59:59] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[16:00:04] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:29] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[16:02:29] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[16:02:29] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[16:03:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "I would say merge this to the main CR?" [puppet] - 10https://gerrit.wikimedia.org/r/936049 (owner: 10Jbond)
[16:06:33] <wikibugs>	 (03PS4) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497)
[16:07:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[16:09:29] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[16:10:10] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) 05Open→03Resolved a:03akosiaris >>! In T340036#8994407, @akosiaris wrote: > @MSantos, change deployed today. e.g...
[16:10:39] <wikibugs>	 (03PS8) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[16:11:33] <wikibugs>	 (03Abandoned) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[16:11:50] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[16:12:00] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[16:13:36] <wikibugs>	 (03PS9) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[16:13:41] <wikibugs>	 (03Abandoned) 10Ssingh: Revert "dns1004: provision new DNS host in eqiad (hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/935858 (owner: 10Ssingh)
[16:13:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new dns host dns1004 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/933917 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh)
[16:15:23] <wikibugs>	 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Brycehughes) 05Open→03Resolved
[16:15:53] <wikibugs>	 (03PS1) 10Urbanecm: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191)
[16:16:19] <wikibugs>	 (03CR) 10Jbond: "I have squashed my changes into this one, closed my comments. Overall I'm not sure if I have a strong preference for this or epp.  I felt" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[16:16:19] <sukhe>	 !log homer "cr*-eqiad*" commit "Gerrit: 933917 add new DNS host dns1004"
[16:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:05] <wikibugs>	 (03PS10) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[16:17:50] <wikibugs>	 (03PS2) 10Urbanecm: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191)
[16:21:59] <wikibugs>	 (03PS1) 10Effie Mouzeli: ipoid: updated app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067
[16:22:44] <wikibugs>	 (03CR) 10Jbond: "a few more comments" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[16:23:30] <wikibugs>	 (03PS2) 10Effie Mouzeli: ipoid: update app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067
[16:25:06] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: update app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067 (owner: 10Effie Mouzeli)
[16:25:23] <wikibugs>	 10SRE, 10Observability-Alerting, 10Traffic, 10collaboration-services, 10serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10brennen) > @brennen I saw your updates to phab in SAL, does the above (maniphest.edit taking a lot longer to create tasks) ring...
[16:25:52] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: update app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067 (owner: 10Effie Mouzeli)
[16:29:08] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - 10https://gerrit.wikimedia.org/r/936070
[16:30:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: Change normal_rule_processing_delay to histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert)
[16:30:15] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[16:30:38] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[16:31:33] <sukhe>	 !log ns0: set routing-options static route 208.80.154.238/32 next-hop [ 208.80.154.6 208.80.155.108 208.80.154.134 ]
[16:31:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris)
[16:33:08] <wikibugs>	 (03CR) 10Milimetric: [C: 04-1] "I addressed my own comments but I shouldn't (and don't have rights to) self-merge.  We're currently testing this on the DE cloud replica, " [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF))
[16:36:03] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove dns1001 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/936071 (https://phabricator.wikimedia.org/T326685)
[16:40:46] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns1001 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/936071 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh)
[16:42:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[16:44:30] <wikibugs>	 (03CR) 10Jbond: Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond)
[16:44:47] <sukhe>	 !log homer "cr*-eqiad*" commit "decommission DNS host dns1001 (replaced by dns1004)"
[16:44:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:42] <wikibugs>	 (03PS1) 10Ssingh: hiera: decommission dns host dns1001 (eqiad hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936072 (https://phabricator.wikimedia.org/T326685)
[16:47:05] <wikibugs>	 (03PS2) 10Jbond: puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935733 (https://phabricator.wikimedia.org/T338811)
[16:47:25] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns host dns1001 (eqiad hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936072 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh)
[16:47:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935733 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[16:49:23] <sukhe>	 !log sudo cumin A:netbox 'run-puppet-agent': removing dns1001 before decomm cookbook
[16:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:56] <wikibugs>	 (03PS1) 10Ssingh: common.yaml: add dns1004, remove dns1001 [homer/public] - 10https://gerrit.wikimedia.org/r/936074
[16:54:59] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns1001.wikimedia.org
[16:56:50] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129)
[16:57:01] <wikibugs>	 (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129)
[16:57:18] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129) (owner: 10Kosta Harlan)
[16:58:00] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129) (owner: 10Kosta Harlan)
[16:58:41] <logmsgbot>	 !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[16:58:59] <logmsgbot>	 !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[17:00:06] <jouncebot>	 bd808: #bothumor My software never has bugs. It just develops random features. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1700).
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1700)
[17:00:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[17:00:27] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:00:59] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:01:03] <sukhe>	 hmmm
[17:01:33] <jbond>	 sukhe: i'll check that, i think it's me that broke it
[17:01:36] <sukhe>	 jbond: <3
[17:01:37] <jbond>	 also they are not live
[17:01:43] <sukhe>	 bookworm hosts?
[17:01:48] <jbond>	 yes
[17:01:51] <sukhe>	 ok thanks!
[17:01:55] <jbond>	 new puppet7 stuff
[17:02:08] <sukhe>	 +profile::puppetdb::ssl_verify_client: 'on'
[17:02:11] <sukhe>	 probably this then
[17:02:28] <jbond>	 yes exactly, i'm guessing i need to configure puppetboard to send its client certs
[17:02:30] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:04:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:04:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:04:20] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns1001.wikimedia.org
[17:04:29] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns1001.wikimedia.org` - dns1001.wikimedia.org (**WARN**)   - Downtimed host on Icinga/Alertmanag...
[17:07:16] <wikibugs>	 (03PS1) 10Jbond: puppetboard: Add additional site to proxy puppet7 config [puppet] - 10https://gerrit.wikimedia.org/r/936076 (https://phabricator.wikimedia.org/T338811)
[17:07:46] <mbsantos>	 akosiaris: 
[17:07:52] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[17:08:00] <mbsantos>	 apparently the MCS decom is affecting PCS https://phabricator.wikimedia.org/T341248
[17:09:12] <akosiaris>	 mbsantos: not sure what the issue is though
[17:09:31] <wikibugs>	 (03PS1) 10Jbond: Revert "puppedb::bookworm: Force client auth" [puppet] - 10https://gerrit.wikimedia.org/r/935862
[17:09:33] <mbsantos>	 mobile-html endpoints are receiving 403
[17:09:41] <akosiaris>	 weren't they meant to ? 
[17:09:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppedb::bookworm: Force client auth" [puppet] - 10https://gerrit.wikimedia.org/r/935862 (owner: 10Jbond)
[17:10:01] <wikibugs>	 (03PS1) 10Jbond: puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935863 (https://phabricator.wikimedia.org/T338811)
[17:10:18] <akosiaris>	 mbsantos: I had pasted the regex in https://phabricator.wikimedia.org/T340036#8956205
[17:10:25] <akosiaris>	 if (req.url ~ "^/api/rest_v1/page/mobile-"
[17:10:31] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "need to configure puppetboard with client auth first" [puppet] - 10https://gerrit.wikimedia.org/r/935863 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[17:10:36] <akosiaris>	 if that's wrong, I can change it, but let me know to what
[17:10:48] <mbsantos>	 yeah that's my bad it should be mobile-sections only
[17:10:58] <akosiaris>	 ok, easy to fix, gimme a sec
[17:11:08] <mbsantos>	 thanks
[17:13:16] <akosiaris>	 done
[17:13:22] <akosiaris>	 ok now the match is if (req.url ~ "^/api/rest_v1/page/mobile-sections"
[17:13:40] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] common.yaml: add dns1004, remove dns1001 [homer/public] - 10https://gerrit.wikimedia.org/r/936074 (owner: 10Ssingh)
[17:13:42] <akosiaris>	 which should also match /page/mobile-sections-remaining and /page/mobile-sections-lead
[17:13:51] <akosiaris>	 from the https://en.wikipedia.org/api/rest_v1/#/Mobile stuff at least
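A minimal sketch of the difference between the two prefix matches quoted above, using Python's `re` module as a stand-in for the `req.url ~ "..."` test (the exact matching engine on the edge is not stated in the log, so that equivalence is an assumption). The helper name `blocked` and the page title `Foo` are hypothetical, chosen only for illustration.

```python
import re

# The two patterns quoted in the conversation. Both are anchored at the
# start of the URL and otherwise behave as plain prefix matches.
BROAD = re.compile(r"^/api/rest_v1/page/mobile-")
NARROW = re.compile(r"^/api/rest_v1/page/mobile-sections")


def blocked(pattern, url):
    """Return True if the URL matches the pattern (hypothetical helper)."""
    return pattern.search(url) is not None


# "Foo" is a placeholder page title, not taken from the log.
paths = [
    "/api/rest_v1/page/mobile-sections/Foo",
    "/api/rest_v1/page/mobile-sections-lead/Foo",
    "/api/rest_v1/page/mobile-sections-remaining/Foo",
    "/api/rest_v1/page/mobile-html/Foo",
]

for path in paths:
    print(path, "broad:", blocked(BROAD, path), "narrow:", blocked(NARROW, path))
```

Under these assumptions the broad pattern also catches mobile-html, while the narrow one only catches the mobile-sections family, which is consistent with the 403s reported on mobile-html before the change.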
[17:13:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Puppetboard: configure client auth - https://phabricator.wikimedia.org/T341268 (10jbond)
[17:15:06] <akosiaris>	 mbsantos: I've responded on the task too
[17:15:40] <sukhe>	 !log  homer "mr*" commit "update ntp_servers (add dns1004, remove dns1001)"
[17:15:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:01] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 11910 bytes in 0.354 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:16:11] <mbsantos>	 akosiaris: thank you very much!
[17:16:35] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 12836 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:17:08] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:20:42] <wikibugs>	 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10akosiaris) >>! In T335770#8988938, @Brycehughes wrote: > @akosiaris Yep all clear now from Georgia (the country). However, this lasted much m...
[17:24:30] <wikibugs>	 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Brycehughes) @akosiaris Fair enough. Ah, the joys of caching. Thanks.
[17:24:40] <sukhe>	 !log sudo cumin -b1 -s300 'A:dns-rec' 'systemctl restart ntp.service'
[17:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10Dzahn) It seemed a bit much to link every single change to this ticket, but then also, I wanted to somehow link them.  So here it goes as a si...
[17:36:19] <wikibugs>	 (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934640 (owner: 10Dzahn)
[17:36:23] <wikibugs>	 (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934637 (owner: 10Dzahn)
[17:36:27] <wikibugs>	 (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934641 (owner: 10Dzahn)
[17:36:32] <wikibugs>	 (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934642 (owner: 10Dzahn)
[17:36:36] <wikibugs>	 (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934638 (owner: 10Dzahn)
[17:36:40] <wikibugs>	 (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934639 (owner: 10Dzahn)
[17:37:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "no difference in compiler, just style fixes: https://puppet-compiler.wmflabs.org/output/934639/42318/" [puppet] - 10https://gerrit.wikimedia.org/r/934639 (owner: 10Dzahn)
[17:38:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wikistats: fix quoting for ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934640 (owner: 10Dzahn)
[17:40:43] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934637/42319/" [puppet] - 10https://gerrit.wikimedia.org/r/934637 (owner: 10Dzahn)
[17:46:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:23] <wikibugs>	 (03CR) 10Tchanders: [C: 03+1] Disable purging of old client hint data by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959) (owner: 10Dreamy Jazz)
[18:00:06] <jouncebot>	 hashar and brennen: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1800).
[18:01:53] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934638/42320/" [puppet] - 10https://gerrit.wikimedia.org/r/934638 (owner: 10Dzahn)
[18:01:55] <wikibugs>	 10SRE, 10Observability-Alerting, 10Traffic, 10collaboration-services, 10serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10Aklapper) Hmm. The problem //could// be related to deploying the bug fix (see non-public T338611#8965304 for details) in 6b59a3...
[18:10:16] <wikibugs>	 (03PS2) 10Dzahn: vrts: fix quoting of ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934641
[18:10:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] ""If a string is a value from an enumerable set of options, such as" [puppet] - 10https://gerrit.wikimedia.org/r/934639 (owner: 10Dzahn)
[18:10:52] <wikibugs>	 (03PS2) 10Dzahn: releases: fix quoting of ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934642
[18:11:06] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934639/42323/" [puppet] - 10https://gerrit.wikimedia.org/r/934641 (owner: 10Dzahn)
[18:12:53] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "this was about https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934641 (owner: 10Dzahn)
[18:13:15] <wikibugs>	 (03CR) 10Dzahn: "this is about https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934642 (owner: 10Dzahn)
[18:18:20] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:19:07] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934639/42325/" [puppet] - 10https://gerrit.wikimedia.org/r/934642 (owner: 10Dzahn)
[18:25:53] <wikibugs>	 (03CR) 10Dzahn: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[18:28:59] <wikibugs>	 (03CR) 10Dzahn: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[18:32:30] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:33:50] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:40:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10jhsoby) The spammers have now moved on from promoting that one IRC network to posting links and ASCII art depicting lemon party and goatse (if you're lucky e...
[18:49:41] <urbanecm>	 jouncebot: nowandnext
[18:49:41] <jouncebot>	 For the next 1 hour(s) and 10 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1800)
[18:49:41] <jouncebot>	 In 1 hour(s) and 10 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T2000)
[18:51:45] <urbanecm>	 seems like we're on .16 already, and the window's unused?
[18:51:56] <wikibugs>	 (03PS3) 10Urbanecm: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191)
[18:54:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm)
[18:55:48] <wikibugs>	 (03Merged) 10jenkins-bot: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm)
[18:56:03] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936065|PageView: Route requests through restbase service proxy (T341191)]]
[18:56:06] <stashbot>	 T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191
[18:57:32] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:936065|PageView: Route requests through restbase service proxy (T341191)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[19:01:34] <wikibugs>	 (03Abandoned) 10Stang: Update logo/wordmark/tagline for Serbian project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892955 (https://phabricator.wikimedia.org/T324545) (owner: 10Stang)
[19:03:30] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936065|PageView: Route requests through restbase service proxy (T341191)]] (duration: 07m 27s)
[19:03:33] <stashbot>	 T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191
[19:04:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:06:59] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[19:07:04] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:07:10] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2019 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:07:22] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[19:07:42] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:07:48] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:08:20] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2019 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:08:24] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:08:30] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2019 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service,wdqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categor
[19:08:30] <icinga-wm>	 ice https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:09:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:15:53] <wikibugs>	 (03PS1) 10Urbanecm: PageView: Fix base URL when using service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936083 (https://phabricator.wikimedia.org/T341191)
[19:16:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936083 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm)
[19:17:11] <wikibugs>	 (03Merged) 10jenkins-bot: PageView: Fix base URL when using service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936083 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm)
[19:17:28] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936083|PageView: Fix base URL when using service proxy (T341191)]]
[19:17:31] <stashbot>	 T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191
[19:22:55] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:ntp: do not use global variables [puppet] - 10https://gerrit.wikimedia.org/r/936050 (owner: 10Ssingh)
[19:24:44] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936083|PageView: Fix base URL when using service proxy (T341191)]] (duration: 07m 16s)
[19:24:48] <stashbot>	 T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191
[19:32:09] <wikibugs>	 (03PS1) 10Stang: pawikibooks: Install Quiz extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613)
[19:37:34] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:38:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto)
[19:43:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "Yes, this does indeed control whether jenkins and zuul services are enabled. So I expect this should be merged after all rsyncing is done " [puppet] - 10https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659) (owner: 10Jelto)
[19:44:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "this should also be merged at some point between rsyncing and before re-enabling puppet I think. Not sure if before or after zuul and jenk" [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[19:46:31] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) I have done a first initial transfer of `/srv/jenkins` since I wanted to have a rough estimate of how long it took...
[19:56:20] <hashar>	 mutante: the Jenkins build rsync takes a little more than a minute once warmed up :]
[19:56:27] <hashar>	 thanks again for the magic `rsync` commands
[19:57:06] <hashar>	 I will dig into the sequence of actions for the migration tomorrow, write the puppet patches to stop the services and enable them, then sync up with jelto
[20:00:06] <jouncebot>	 brennen and TheresNoTime: OwO what's this, a deployment window?? UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T2000). nyaa~
[20:00:06] <jouncebot>	 Dreamy_Jazz, Jdlrobson, and koi: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <Dreamy_Jazz>	 \o
[20:00:27] <koi>	 o/
[20:01:05] <Dreamy_Jazz>	 My change should need no testing as it's adding a config that isn't used by any code yet.
[20:02:22] <TheresNoTime>	 I'll be around in 15ish if no one else appears
[20:02:27] <thcipriani>	 I can deploy
[20:03:02] <Dreamy_Jazz>	 I can be around for the entire backport window if needed.
[20:03:38] <TheresNoTime>	 thcipriani: please do ^^
[20:04:05] * thcipriani does :)
[20:04:14] <thcipriani>	 alright Dreamy_Jazz you're up first
[20:04:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959) (owner: 10Dreamy Jazz)
[20:05:28] <thcipriani>	 Jdlrobson: around for backport window?
[20:05:41] <hashar>	 TIL about the Quiz extension ( https://www.mediawiki.org/wiki/Extension:Quiz )
[20:06:14] <wikibugs>	 (03Merged) 10jenkins-bot: Disable purging of old client hint data by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959) (owner: 10Dreamy Jazz)
[20:06:31] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]]
[20:06:35] <stashbot>	 T340959: Update CheckUser prune job to remove client hint data - https://phabricator.wikimedia.org/T340959
[20:06:36] <stashbot>	 T341076: Creation of database tables cu_useragent_clienthints and cu_useragent_clienthints_map - https://phabricator.wikimedia.org/T341076
[20:07:08] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "The extension does not need anything on the database side so that looks fine. I had never heard of that https://www.mediawiki.org/wiki/Ext" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang)
[20:07:54] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and dreamyjazz: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:08:49] <thcipriani>	 ^ Dreamy_Jazz you mentioned this needs no testing? Unused?
[20:09:12] <Dreamy_Jazz>	 No testing as the config will be used by a patch that depends on this (needs to be a different value to the default on WMF wikis)
[20:09:24] <Dreamy_Jazz>	 So there is no code that uses this config yet on WMF wikis
[20:09:31] <thcipriani>	 got it, thank you
[20:09:37] <Dreamy_Jazz>	 Thanks!
[20:10:14] <wikibugs>	 (03PS2) 10Hashar: pawikibooks: Install Quiz extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang)
[20:10:47] <hashar>	 koi: I clicked rebase since wmf-config/InitialiseSettings.php got touched by the other change
[20:11:23] <thcipriani>	 (syncing now)
[20:12:17] <Jdlrobson>	 hey there im late sorry
[20:12:22] <Jdlrobson>	 was in a meeting
[20:13:03] <wikibugs>	 (03PS3) 10Jdlrobson: Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162)
[20:13:52] <thcipriani>	 no worries, I'm almost ready for yours
[20:14:01] <Jdlrobson>	 thcipriani: great timing then :)
[20:15:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:16:40] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]] (duration: 10m 08s)
[20:16:46] <stashbot>	 T340959: Update CheckUser prune job to remove client hint data - https://phabricator.wikimedia.org/T340959
[20:16:46] <stashbot>	 T341076: Creation of database tables cu_useragent_clienthints and cu_useragent_clienthints_map - https://phabricator.wikimedia.org/T341076
[20:16:55] <hashar>	 stuff is syncing still
[20:17:59] <thcipriani>	 Dreamy_Jazz: should be synced everywhere now
[20:18:07] <Dreamy_Jazz>	 Thanks.
[20:20:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:21:17] <hashar>	 koi: we are doing another private deploy in between
[20:21:47] <koi>	 got it, I'm OK with it
[20:26:17] <thcipriani>	 (sorry for delay, small privatesettings update)
[20:28:25] <wikibugs>	 (03PS1) 10Hashar: Restore fonts submodule whose removal has not been deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936088
[20:28:54] <wikibugs>	 (03CR) 10Hashar: "/srv/mediawiki-staging/fonts is still on the deployment server and thus its removal has NOT been deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm)
[20:31:38] <Jdlrobson>	 thcipriani: are we still good for my logos deploy?
[20:31:45] <hashar>	 yes
[20:31:49] <hashar>	 in the queue we are doing another change
[20:31:58] <hashar>	 then I guess we do koi's change, because it must be late for them
[20:32:04] <Jdlrobson>	 👍
[20:34:24] <thcipriani>	 Jdlrobson: going ahead with yours now
[20:34:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson)
[20:34:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:35:14] <Jdlrobson>	 great
[20:35:28] <wikibugs>	 (03Merged) 10jenkins-bot: Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson)
[20:35:42] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:935824|Update more logos with available SVGs (T338162)]]
[20:35:45] <stashbot>	 T338162: Track which Vector 2022 logos are in production vs Google Drive - https://phabricator.wikimedia.org/T338162
[20:37:11] <logmsgbot>	 !log thcipriani@deploy1002 jdlrobson and thcipriani: Backport for [[gerrit:935824|Update more logos with available SVGs (T338162)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:37:24] <thcipriani>	 ^ Jdlrobson check please
[20:39:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:40:28] <wikibugs>	 (03PS1) 10Bking: scap: add new WDQS hosts as valid targets [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290)
[20:42:03] <Jdlrobson>	 LGTM please sync
[20:42:28] <thcipriani>	 cool, thanks for checking, going live
[20:43:37] <wikibugs>	 (03CR) 10Reedy: "Why not just delete the folder and the next full scap should deploy it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936088 (owner: 10Hashar)
[20:46:53] <wikibugs>	 (03PS2) 10Bking: scap: add new WDQS hosts as valid targets [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290)
[20:47:33] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] "should do what we need" [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290) (owner: 10Bking)
[20:47:45] <wikibugs>	 (03CR) 10Bking: [C: 03+2] scap: add new WDQS hosts as valid targets [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290) (owner: 10Bking)
[20:48:23] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:935824|Update more logos with available SVGs (T338162)]] (duration: 12m 41s)
[20:48:26] <stashbot>	 T338162: Track which Vector 2022 logos are in production vs Google Drive - https://phabricator.wikimedia.org/T338162
[20:48:31] <thcipriani>	 ^ Jdlrobson should be live now
[20:48:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:48:51] <thcipriani>	 alright koi you're up next! Sorry for the delay :)
[20:49:55] <koi>	 :)
[20:50:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang)
[20:50:41] <wikibugs>	 (03Abandoned) 10Hashar: Restore fonts submodule whose removal has not been deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936088 (owner: 10Hashar)
[20:51:34] <wikibugs>	 (03Merged) 10jenkins-bot: pawikibooks: Install Quiz extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang)
[20:51:50] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:936084|pawikibooks: Install Quiz extension (T340613)]]
[20:51:52] <stashbot>	 T340613: Install Quiz extension to Punjabi Wikibooks - https://phabricator.wikimedia.org/T340613
[20:53:03] <hashar>	 Reedy: thanks, we will remove /srv/mediawiki-config/fonts and sync the removal
[20:53:14] <logmsgbot>	 !log thcipriani@deploy1002 stang and thcipriani: Backport for [[gerrit:936084|pawikibooks: Install Quiz extension (T340613)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[20:53:26] <thcipriani>	 ^ koi should be live on mwdebug, check please
[20:53:32] <koi>	 looking
[20:53:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:54:06] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[20:54:10] <wikibugs>	 (03CR) 10Hashar: "I hadn't realized the workload got moved to Shellbox and that it is no longer used on the app servers. We will remove /srv/mediawiki-config/fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm)
[20:54:12] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s)
[20:55:43] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[20:57:37] <Jdlrobson>	 thanks thcipriani 
[20:57:39] <koi>	 thcipriani, I tested at https://pa.wikibooks.org/wiki/Wikibooks:Sandbox, and it works fine
[20:57:52] <thcipriani>	 no problem Jdlrobson, thanks for the updates :)
[20:58:04] <thcipriani>	 koi: cool, syncing everywhere now
[21:01:26] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:01:52] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-categories on wdqs2019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:02:08] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2019 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.651 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[21:02:24] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2019 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:02:34] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:04:09] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:936084|pawikibooks: Install Quiz extension (T340613)]] (duration: 12m 19s)
[21:04:13] <stashbot>	 T340613: Install Quiz extension to Punjabi Wikibooks - https://phabricator.wikimedia.org/T340613
[21:04:20] <thcipriani>	 ^ koi should be live now
[21:04:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:05:52] <hashar>	 koi: it is live https://pa.wikibooks.org/wiki/Wikibooks:Sandbox !:)
[21:06:05] <koi>	 thx!
[21:06:12] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Clean up font directory [[gerrit:723652]]
[21:06:21] <hashar>	 happy Quiz!
[21:06:31] <hashar>	 Reedy: ^ :]
[21:09:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:10:39] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 14m 56s)
[21:12:46] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Clean up font directory [[gerrit:723652]] (duration: 06m 33s)
[21:16:03] <thcipriani>	 done! And easytimeline still works :)
[21:17:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:22:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:38:20] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:42] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:11:42] <wikibugs>	 (03PS1) 10Bking: wdqs: Don't start services until host is ready [puppet] - 10https://gerrit.wikimedia.org/r/936095 (https://phabricator.wikimedia.org/T341290)
[22:12:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: Don't start services until host is ready [puppet] - 10https://gerrit.wikimedia.org/r/936095 (https://phabricator.wikimedia.org/T341290) (owner: 10Bking)
[22:14:02] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:18:20] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:18:32] <wikibugs>	 (03PS1) 10Gmodena: data-engineering: add alerts for mw-page-content-change-enrich. [alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666)
[22:28:46] <wikibugs>	 (03PS1) 10Jdlrobson: Logos: Fixes grantswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097
[22:47:33] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/932317/42327/mx1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn)
[23:01:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/932316/42328/mx1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn)
[23:08:41] <mutante>	 !log mx2001 - rm /usr/local/bin/otrs_aliases ; rm /lib/systemd/system/generate_otrs_aliases.*  after deploying gerrit:932316 which renamed script and timer without absenting them
[23:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:01] <mutante>	 !log mx1001 - rm /usr/local/bin/otrs_aliases ; rm /lib/systemd/system/generate_otrs_aliases.*  after deploying gerrit:932316 which renamed script and timer without absenting them
[23:14:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "after merging this I did:" [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn)
[23:22:36] <wikibugs>	 (03PS4) 10Dzahn: vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392)
[23:26:10] <wikibugs>	 (03Abandoned) 10Dzahn: vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn)
[23:27:04] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "as we learned from a similar change the other day, <RequireAll> can't be repeated" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[23:31:15] <wikibugs>	 (03PS3) 10Dzahn: contint: replace Apache 2.2 access control syntax for Jenkins proxy [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071)
[23:31:53] <wikibugs>	 (03CR) 10Dzahn: "now compare to https://gerrit.wikimedia.org/r/c/operations/puppet/+/935417" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[23:33:50] <wikibugs>	 (03Abandoned) 10Dzahn: mediawiki: replace Apache 2.2 syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932447 (owner: 10Dzahn)
[23:35:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10nshahquinn-wmf)
[23:37:36] <wikibugs>	 (03CR) 10Dzahn: "commented here for attention: https://phabricator.wikimedia.org/T124657#8996134" [puppet] - 10https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix)
[23:39:35] <wikibugs>	 (03CR) 10Dzahn: "I would know how to replace the "check_http" with alertmanager blackbox checks, but how would you replace cert_expiry?" [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff)
[23:42:39] <wikibugs>	 10SRE, 10MediaWiki-Documentation, 10serviceops-radar, 10Documentation, and 2 others: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Dzahn) @Aklapper asked the same on Gerrit, it seems to me this is #serviceops rather than traffic because it's a...
[23:43:09] <wikibugs>	 (03CR) 10Dzahn: "pinged / tagged via linked phab ticket for attention" [puppet] - 10https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: 10Dereckson)