[00:02:59] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [00:03:05] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [00:27:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:28:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:32:02] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:32:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:38:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935144 [00:38:27] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935144 (owner: 10TrainBranchBot) [00:55:18] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/935144 (owner: 10TrainBranchBot) [01:02:18] (03PS1) 10Jdlrobson: WIP: Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) [01:02:29] (03CR) 10CI reject: [V: 04-1] WIP: Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 
(https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [02:00:02] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:50] (03PS1) 10RLazarus: opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) [02:03:48] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:00] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:23] (03CR) 10RLazarus: "helmfile diff: https://phabricator.wikimedia.org/P49519" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [02:05:21] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [02:05:33] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [02:05:45] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [02:06:09] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [02:13:02] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:16:42] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [02:17:00] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [02:22:36] 
(JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:45:34] (03PS1) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/935502 [02:45:55] (03PS2) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/935502 [02:46:11] (03PS3) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/935502 [02:46:34] (03CR) 10CI reject: [V: 04-1] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/935502 (owner: 10Cwhite) [02:47:08] (03PS4) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/935502 [02:47:30] (03CR) 10CI reject: [V: 04-1] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/935502 (owner: 10Cwhite) [02:49:25] (03PS5) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/935502 (https://phabricator.wikimedia.org/T333732) [03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:18:26] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:19:38] RECOVERY - Router interfaces 
on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:23:20] (03CR) 10Legoktm: mw-cli-wrapper: fix own dc reference in Beta Cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935448 (owner: 10Krinkle) [05:46:12] (03PS1) 10KartikMistry: Update MinT to 2023-07-06-051402-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/935835 [05:56:26] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341168 (10phaultfinder) [05:56:33] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341169 (10phaultfinder) [05:56:38] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341170 (10phaultfinder) [05:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on 
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0600) [06:00:05] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0600). [06:00:05] (03CR) 10Elukey: changeprop: increase the linger.ms value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [06:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:02:03] (03CR) 10Elukey: changeprop: increase the linger.ms value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [06:22:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:50:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:54:37] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: 
GitLab minor version upgrade [06:55:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:07] Amir1, apergos, and jnuche: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0700). [07:01:52] let's see what's happening today [07:02:14] no patches scheduled for the window. aaaaand [07:02:36] no trainees signed up to help with those 0 patches, whew! [07:02:50] have a nice day and see you next time! [07:04:16] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 9 hosts with reason: Stopping puppet and hadoop-hdfs-datanode services then decommissioning the hosts [07:04:35] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 9 hosts with reason: Stopping puppet and hadoop-hdfs-datanode services then decommissioning the hosts [07:05:23] I'll deploy MinT then :) [07:07:15] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-07-06-051402-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/935835 (owner: 10KartikMistry) [07:08:08] (03Merged) 10jenkins-bot: Update MinT to 2023-07-06-051402-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/935835 (owner: 10KartikMistry) [07:09:59] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [07:12:31] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [07:17:47] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [07:21:33] (03CR) 10Stevemunene: [C: 03+2] analytics: Remove analytics1064_1069 from hdfs 
net_topology [puppet] - 10https://gerrit.wikimedia.org/r/933387 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [07:23:09] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:25:24] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:27:22] (03PS1) 10Jelto: ci/zuul: set contint2002 as the active ci::manager_host [puppet] - 10https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659) [07:29:21] (03CR) 10Jelto: ci/zuul: switch gearman server from contint2001 to contint2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [07:29:32] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:29:43] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe [07:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:31:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:31:40] !log Updated MinT to 2023-07-06-051402-production [07:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:20] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.77:443]) https://wikitech.wikimedia.org/wiki/PyBal [07:34:28] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.77:443]) https://wikitech.wikimedia.org/wiki/PyBal [07:35:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:35:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:38:52] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:40:02] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:40:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:41:03] (ProbeDown) firing: (2) Service 
centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:46:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:49:18] those lvs alerts are related to the thanos-fe restarts [07:50:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:54:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:55:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:59:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[08:00:05] hashar and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T0800). [08:03:05] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: GitLab minor version upgrade [08:04:34] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.77:443]) https://wikitech.wikimedia.org/wiki/PyBal [08:05:42] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.77:443]) https://wikitech.wikimedia.org/wiki/PyBal [08:16:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:17:23] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341168 (10fgiunchedi) [08:17:25] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341170 (10fgiunchedi) [08:17:27] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T341169 (10fgiunchedi) [08:17:42] !log disabling puppet temporary on cp1075.eqiad.wmnet, cp2027.codfw.wmnet, cp3050.esams.wmnet to apply 935760 (T340983) [08:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:46] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 [08:19:03] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) 
(owner: 10Fabfur) [08:20:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:21:10] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:21:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:22:20] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:25:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:33:58] (03PS1) 10Btullis: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) [08:35:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:36:19] (03CR) 10Jelto: [C: 04-1] "This change allows public dockerhub images (mariadb) on Trusted Runners (production infrastructure). 
This is discouraged and we only allow" [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [08:38:01] (03CR) 10Kosta Harlan: common/gitlab_runner: Allow mariadb:* images for allowed_docker_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [08:39:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe [08:40:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:45:19] !log reenabled puppet on cp1075.eqiad.wmnet, cp2027.codfw.wmnet, cp3050.esams.wmnet [08:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:10] I forgot to run the train sorry [08:49:14] going to run it now [08:49:31] <_joe_> kart_: around? [08:49:46] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:49:54] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:50:33] 10SRE, 10Observability-Alerting, 10Traffic, 10serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10fgiunchedi) I have extracted the `maniphest.edit` event duration from phab1004 access log, and on the 29th the operation started to take a whole lot longer: ` 2... 
[08:50:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:50:42] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:51:07] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:51:37] 10SRE, 10Observability-Alerting, 10Traffic, 10serviceops: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10fgiunchedi) @brennen I saw your updates to phab in SAL, does the above (`maniphest.edit` taking a lot longer to create tasks) ring a bell? [08:54:21] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10Volans) For context there have been already a larger effort in the past towards moving the irc server to a newer and re-written server that serve only the re... 
[08:55:02] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:55:23] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:55:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:58:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [08:59:32] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [09:02:01] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935985 (https://phabricator.wikimedia.org/T340244) [09:02:03] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935985 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [09:02:53] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935985 (https://phabricator.wikimedia.org/T340244) (owner: 10TrainBranchBot) [09:04:26] _joe_: now. Tell me. [09:05:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:06:30] kart_: o/ Joe deployed cx to remove the extra key, I think he wanted to ping you about it [09:08:08] cool. 
Thanks a lot, _joe_ [09:08:32] I was looking at graphs if something has exploded in cxserver/MinT :D [09:10:11] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.16 refs T340244 [09:10:14] T340244: 1.41.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T340244 [09:10:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:10:57] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Jelto) Thanks @hashar for the detailed summary! Regarding rsync the following commands //should// be needed (executed on `cont... [09:11:52] !log restart kube-apiserver on ml-serve-ctrl2* as attempt to fix LIST-related latency issues [09:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:13:55] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:15:14] (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:15:32] PROBLEM - Check systemd state on ml-serve-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:04] 
RECOVERY - Check systemd state on ml-serve-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:38] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST endpointslices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:20:29] (03CR) 10Clément Goubert: [C: 03+1] opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [09:22:38] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST endpointslices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:28:38] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:30:23] (03CR) 10Kamila Součková: [C: 03+1] changeprop: increase the linger.ms value [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [09:33:51] (03CR) 10Jbond: "some comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:33:59] (03PS1) 10Fabfur: haproxy: fix variable type and better naming [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) [09:35:50] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe [09:39:09] (03CR) 10Jelto: [C: 03+1] miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 
(https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [09:39:53] (03PS1) 10Filippo Giunchedi: alertmanager: add page routes for traffic and netops [puppet] - 10https://gerrit.wikimedia.org/r/935990 [09:42:51] (03PS3) 10Jbond: pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) [09:43:59] (03CR) 10Jbond: pybal: update check to conform to the nagios plugin api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [09:44:01] (03PS4) 10Jbond: pybal: update check to conform to the nagios plugin api [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) [09:46:07] (03CR) 10Kamila Součková: [C: 03+1] requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [09:50:17] (03PS1) 10Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 [09:52:34] (03CR) 10CI reject: [V: 04-1] Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [09:54:57] (03CR) 10Hnowlan: [C: 03+2] requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [09:55:05] (03CR) 10Urbanecm: Enable global abuse filters on almost all projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm) [09:55:27] (03CR) 10CI reject: [V: 04-1] requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [09:56:59] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: add page routes for traffic and netops [puppet] - 10https://gerrit.wikimedia.org/r/935990 (owner: 10Filippo Giunchedi) [09:58:03] (ProbeDown) firing: (2) Service 
centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:58:38] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1061.eqiad.wmnet [10:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000) [10:03:18] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:50] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [10:07:09] (03CR) 10Kamila Součková: [C: 03+1] "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [10:08:42] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1061.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [10:10:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:10:42] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1061.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [10:10:43] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:10:43] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1061.eqiad.wmnet [10:11:12] (03PS2) 10Fabfur: haproxy: fix variable type and better naming [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) [10:13:44] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) Thanks for the rsync commands! Some adjustements: * delete files on the destination with: `--delete-delay` * swap the... [10:15:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "This change will need adjusting of CirrusSearchJobQueueLagTooHigh alert, 'pint' reported this error (AlertLintProblem alert)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert) [10:15:30] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add native AQS1-style routes for AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/935457 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [10:15:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:16:25] (03Merged) 10jenkins-bot: api-gateway: add native AQS1-style routes for AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/935457 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [10:18:27] I am off for lunch [10:18:47] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 7): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42296/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:22:50] (03PS3) 10Fabfur: haproxy: fix variable type and better naming [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) [10:22:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:42] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42299/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:27:26] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42300/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:29:51] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42301/console" [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:30:23] (03PS7) 10Arturo Borrero Gonzalez: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:30:50] (03CR) 10Vgutierrez: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:33:02] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: fix variable type and better naming [puppet] - 10https://gerrit.wikimedia.org/r/935988 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:35:31] (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:37:49] (03PS2) 10Btullis: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) [10:40:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:41:05] (03PS3) 10Btullis: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) [10:41:19] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1062.eqiad.wmnet [10:42:02] jouncebot: nowandnext [10:42:03] For the next 0 hour(s) and 17 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000) [10:42:03] For the next 0 hour(s) and 17 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1000) [10:42:03] In 2 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300) [10:42:03] In 2 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300) 
[10:44:00] (03CR) 10Stevemunene: [C: 03+1] Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:44:15] (03PS1) 10Majavah: extdist: REL1_40 is stable, REL1_38 is EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 [10:45:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:46:28] urbanecm: (or someone else) if you could quickly double-check https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/935997/ is correct I'd appreciate it [10:47:15] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [10:47:47] (03CR) 10Urbanecm: [C: 03+1] extdist: REL1_40 is stable, REL1_38 is EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah) [10:47:50] sounds correct to me taavi [10:48:02] thanks [10:48:16] (03CR) 10Btullis: [C: 03+2] Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:48:16] looks like the mw infra window is unused, so I'll push that out now [10:48:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] extdist: REL1_40 is stable, REL1_38 is EOL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah) [10:48:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah) [10:49:16] (03Merged) 10jenkins-bot: Bump the image of datahub to the new 0.10.4 containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/935982 (https://phabricator.wikimedia.org/T329514) 
(owner: 10Btullis) [10:49:25] (03Merged) 10jenkins-bot: extdist: REL1_40 is stable, REL1_38 is EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935997 (owner: 10Majavah) [10:49:46] !log taavi@deploy1002 Started scap: Backport for [[gerrit:935997|extdist: REL1_40 is stable, REL1_38 is EOL]] [10:51:10] !log taavi@deploy1002 taavi: Backport for [[gerrit:935997|extdist: REL1_40 is stable, REL1_38 is EOL]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:51:16] I have two other config changes I could deploy afterwards if no one else is doing anything [10:51:29] (or even three) [10:51:43] (I won’t be around for the backport window later unfortunately) [10:52:28] syncing [10:53:46] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:54:12] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1062.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [10:55:04] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Beta-Wikidata: Always show mul on desktop Termbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große) [10:55:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:07] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:935997|extdist: REL1_40 is stable, REL1_38 is EOL]] (duration: 08m 21s) [10:58:14] * taavi done [11:00:38] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT replicasets) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:01:26] (03PS8) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) [11:01:28] (03PS5) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) [11:02:25] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:02:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42302/console" [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:03:50] alright, I’ll deploy some config changes then [11:03:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe [11:03:58] (none of them are urgent, feel free to ping me if you want to do something in between) [11:04:20] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:05:13] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1062.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [11:05:14] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:05:14] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1062.eqiad.wmnet [11:05:19] (03PS2) 10Lucas Werkmeister (WMDE): outreachwiki: Set wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935455 [11:05:34] (03CR) 10TrainBranchBot: [C: 03+2] 
"Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935455 (owner: 10Lucas Werkmeister (WMDE)) [11:06:19] (03Merged) 10jenkins-bot: outreachwiki: Set wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935455 (owner: 10Lucas Werkmeister (WMDE)) [11:06:37] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:935455|outreachwiki: Set wmgWikibaseSiteGroup]] [11:07:58] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:935455|outreachwiki: Set wmgWikibaseSiteGroup]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [11:08:31] I checked in `mwscript shell outreachwiki` that `wbc::getSiteGroup()` returns the same result before and after the change, as expected. syncing [11:10:28] !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudswift1001.eqiad.wmnet [11:12:29] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:12:49] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:14:12] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:935455|outreachwiki: Set wmgWikibaseSiteGroup]] (duration: 07m 35s) [11:14:14] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [11:14:16] (03PS9) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) [11:14:18] (03PS6) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) [11:14:33] (03PS9) 10Lucas Werkmeister (WMDE): foundationwiki: Enable WikibaseClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 
(https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [11:14:48] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:15:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [11:15:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42303/console" [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:15:47] (03Merged) 10jenkins-bot: foundationwiki: Enable WikibaseClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [11:16:03] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:850547|foundationwiki: Enable WikibaseClient (T321967)]] [11:16:06] T321967: Enable Wikibase client on Wikimedia Foundation Governance Wiki - https://phabricator.wikimedia.org/T321967 [11:17:23] !log lucaswerkmeister-wmde@deploy1002 varnent and lucaswerkmeister-wmde: Backport for [[gerrit:850547|foundationwiki: Enable WikibaseClient (T321967)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [11:17:59] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:18:27] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:19:02] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1063.eqiad.wmnet [11:19:12] I linked foundationwiki’s Wikimedia:Sandbox to https://www.wikidata.org/wiki/Q3938?debug=2 [11:19:16] and sitelinks appeared on https://foundation.wikimedia.org/wiki/Wikimedia:Sandbox [11:19:22] I think that’s a success. 
syncing [11:19:51] (03PS1) 10ArielGlenn: Give Dan Andreescu and Jennifer Ebe root on dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/936003 (https://phabricator.wikimedia.org/T341045) [11:22:34] (03CR) 10Jbond: [C: 03+2] puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:22:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:22:43] (03CR) 10Jbond: [C: 03+2] puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:22:55] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:23:05] !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudswift1002.eqiad.wmnet [11:24:01] (03PS2) 10Lucas Werkmeister (WMDE): Beta-Wikidata: Always show mul on desktop Termbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große) [11:24:08] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:24:09] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudswift1001.eqiad.wmnet [11:24:38] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [11:25:02] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:850547|foundationwiki: Enable WikibaseClient (T321967)]] (duration: 08m 58s) [11:25:06] T321967: Enable Wikibase client on Wikimedia Foundation Governance Wiki - https://phabricator.wikimedia.org/T321967 [11:25:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 
(https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große) [11:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:26:17] (03Merged) 10jenkins-bot: Beta-Wikidata: Always show mul on desktop Termbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große) [11:26:33] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:935770|Beta-Wikidata: Always show mul on desktop Termbox (T339104)]] [11:26:36] T339104: Create feature flag to always show `mul` in “in more languages” section of desktop termbox - https://phabricator.wikimedia.org/T339104 [11:26:53] (03PS1) 10Jbond: puppetdb::site: secret needs to be content not source [puppet] - 10https://gerrit.wikimedia.org/r/936007 (https://phabricator.wikimedia.org/T338811) [11:26:59] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935868 [11:27:21] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:27:32] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. 
[11:27:52] (03CR) 10Vgutierrez: trafficserver: add gateway routing script, route device-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935464 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [11:27:54] !log lucaswerkmeister-wmde@deploy1002 migr and lucaswerkmeister-wmde: Backport for [[gerrit:935770|Beta-Wikidata: Always show mul on desktop Termbox (T339104)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [11:28:17] not much to test here, it’s a Beta-only change [11:28:24] it just touches Wikibase.php, but should have no effect [11:28:32] syncing after confirming that the site didn’t blow up on mwdebug [11:29:31] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudswift1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001" [11:30:01] (03CR) 10Jbond: [C: 03+2] puppetdb::site: secret needs to be content not source [puppet] - 10https://gerrit.wikimedia.org/r/936007 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [11:30:27] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudswift1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001" [11:30:28] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:30:28] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudswift1002.eqiad.wmnet [11:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:34:11] !log 
lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:935770|Beta-Wikidata: Always show mul on desktop Termbox (T339104)]] (duration: 07m 37s) [11:34:14] T339104: Create feature flag to always show `mul` in “in more languages” section of desktop termbox - https://phabricator.wikimedia.org/T339104 [11:34:30] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:34:42] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:48] * Lucas_WMDE done [11:35:01] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:35:45] (03PS1) 10Majavah: P:toolforge::prometheus: add pod_name label [puppet] - 10https://gerrit.wikimedia.org/r/936014 [11:36:25] (03CR) 10Jelto: [C: 04-1] common/gitlab_runner: Allow mariadb:* images for allowed_docker_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [11:36:46] (03PS1) 10Bartosz Dziewoński: Revert "Add tag when reference added to the page" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202) [11:38:49] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" [11:39:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42304/console" [puppet] - 10https://gerrit.wikimedia.org/r/936014 (owner: 10Majavah) [11:39:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" [11:39:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:41:14] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:41:15] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts analytics1063.eqiad.wmnet [11:41:20] hi, anyone would like to deploy a revert for me? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/935854 [11:41:26] seems like a bad train regression [11:41:44] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1063.eqiad.wmnet [11:41:55] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BTullis) [11:42:53] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001 [11:43:11] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb1001 [11:43:14] MatmaRex: can do [11:43:19] jouncebot: nowandnext [11:43:19] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [11:43:19] In 1 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300) [11:43:19] In 1 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300) [11:44:14] (03PS1) 10Btullis: Bump the version of the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936015 (https://phabricator.wikimedia.org/T329514) [11:45:45] * TheresNoTime waiting for 935854's CI to finish [11:46:27] (03PS1) 10Jbond: puppetdb::site: fix nginx syntax error [puppet] - 10https://gerrit.wikimedia.org/r/936016 [11:46:30] (03PS1) 10Jbond: nginx: manage nginx directory [puppet] - 10https://gerrit.wikimedia.org/r/936017 [11:46:32] 
!log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [11:47:45] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:47:45] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts analytics1063.eqiad.wmnet [11:48:15] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001 [11:48:19] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:48:23] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:48:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1001 [11:48:36] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:49:00] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:49:17] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [11:49:50] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:50:11] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudlb1001.eqiad.wmnet on all recursors [11:50:13] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb1001.eqiad.wmnet on all recursors [11:50:25] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1063.eqiad.wmnet [11:50:36] (03CR) 10Btullis: [C: 03+2] Bump the version of the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936015 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:50:45] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:51:21] (03Merged) 10jenkins-bot: Bump the version of the datahub image [deployment-charts] - 10https://gerrit.wikimedia.org/r/936015 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:52:04] PROBLEM - uWSGI puppetboard -http via nrpe- on 
puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [11:52:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202) (owner: 10Bartosz Dziewoński) [11:52:21] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:52:43] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" [11:53:27] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" [11:53:27] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:53:44] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:54:24] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudlb1001.eqiad.wmnet on all recursors [11:54:27] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb1001.eqiad.wmnet on all recursors [11:55:21] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [11:55:41] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" [11:55:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 58): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42305/console" [puppet] - 10https://gerrit.wikimedia.org/r/936017 (owner: 10Jbond) [11:56:18] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb - aborrero@cumin1001" 
[11:56:18] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:56:26] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1002 [11:56:41] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1002 [11:56:43] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:56:44] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts analytics1063.eqiad.wmnet [11:56:46] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:56:57] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS bullseye [11:58:40] (03PS1) 10Arturo Borrero Gonzalez: cloudlb1001/1002: add role [puppet] - 10https://gerrit.wikimedia.org/r/936019 (https://phabricator.wikimedia.org/T341200) [11:59:58] (03PS2) 10Arturo Borrero Gonzalez: cloudlb1001/1002: add role [puppet] - 10https://gerrit.wikimedia.org/r/936019 (https://phabricator.wikimedia.org/T341200) [12:00:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb1001/1002: add role [puppet] - 10https://gerrit.wikimedia.org/r/936019 (https://phabricator.wikimedia.org/T341200) (owner: 10Arturo Borrero Gonzalez) [12:01:59] (03PS2) 10Jbond: nginx: manage nginx directory [puppet] - 10https://gerrit.wikimedia.org/r/936017 [12:02:17] (03PS1) 10Andrew Bogott: Revert "cinder-backups: consolidate backup jobs on one host" [puppet] - 10https://gerrit.wikimedia.org/r/936020 [12:06:21] (03CR) 10Hashar: [C: 03+1] Revert "Add tag when reference added to the page" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202) (owner: 10Bartosz Dziewoński) [12:08:18] (03Merged) 10jenkins-bot: Revert "Add tag when reference added to the page" [extensions/VisualEditor] (wmf/1.41.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/935854 (https://phabricator.wikimedia.org/T341202) (owner: 10Bartosz Dziewoński) [12:08:28] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) [12:08:34] !log samtar@deploy1002 Started scap: Backport for [[gerrit:935854|Revert "Add tag when reference added to the page" (T341202)]] [12:08:37] T341202: Unable to edit an article on mobile (JavaScript error) - https://phabricator.wikimedia.org/T341202 [12:11:57] (03CR) 10Jbond: [C: 03+2] puppetdb::site: fix nginx syntax error [puppet] - 10https://gerrit.wikimedia.org/r/936016 (owner: 10Jbond) [12:12:01] (03CR) 10Jbond: [C: 03+2] nginx: manage nginx directory [puppet] - 10https://gerrit.wikimedia.org/r/936017 (owner: 10Jbond) [12:15:22] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host zookeeper-test1002.eqiad.wmnet with OS bookworm [12:15:31] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host zookeeper-test1002.eqiad.wmnet with OS bookworm [12:16:22] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 10835 bytes in 0.477 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [12:17:22] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 10868 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [12:17:50] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:03] !log samtar@deploy1002 matmarex and samtar: Backport for [[gerrit:935854|Revert "Add tag when reference added to the page" (T341202)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [12:21:06] 
T341202: Unable to edit an article on mobile (JavaScript error) - https://phabricator.wikimedia.org/T341202 [12:21:24] will test [12:21:27] MatmaRex: ack [12:22:19] TheresNoTime: looks good, no console errors [12:22:25] syncing [12:23:29] (03CR) 10David Caro: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936014 (owner: 10Majavah) [12:32:38] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:935854|Revert "Add tag when reference added to the page" (T341202)]] (duration: 24m 04s) [12:32:41] T341202: Unable to edit an article on mobile (JavaScript error) - https://phabricator.wikimedia.org/T341202 [12:32:58] (03CR) 10Kosta Harlan: common/gitlab_runner: Allow mariadb:* images for allowed_docker_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [12:33:02] (03Abandoned) 10Kosta Harlan: common/gitlab_runner: Allow mariadb:* images for allowed_docker_services [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [12:34:00] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) Thanks for reporting the issue @Arnoldokoth ! I grepped a bit in `/var/log/cas/cas-2023-07-05.log ` on `idp-test1002` and found... 
[12:34:58] MatmaRex: live :) [12:35:04] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) [12:35:20] thanks TheresNoTime [12:35:32] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1064.eqiad.wmnet [12:36:53] (03PS1) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) [12:40:52] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [12:41:14] RECOVERY - Ganeti memory on ganeti1013 is OK: OK Memory 86% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [12:41:29] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: add gateway routing script, route device-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935464 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [12:42:58] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host zookeeper-test1002.eqiad.wmnet with OS bookworm [12:43:00] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1064.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [12:47:08] MatmaRex: I just +2'd the master (935853) patch for that backport too... realised I probably should have asked before doing so [12:47:45] TheresNoTime: oh, thanks, i think that's just a formality [12:48:54] i think my team is mostly asleep now, and i don't want to ping them when the issue is mitigated already [12:49:47] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) >>! In T320390#8993613, @Jelto wrote: > Thanks for reporting the issue @Arnoldokoth ! 
> > I grepped a bit in `/var/log/cas/cas-... [12:51:44] (03PS2) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) [12:51:53] 10SRE, 10Infrastructure-Foundations, 10netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney) p:05Triage→03Medium [12:52:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42307/console" [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:53:09] (03PS3) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) [12:54:27] 10SRE, 10Infrastructure-Foundations, 10netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney) [12:55:11] 10SRE, 10Infrastructure-Foundations, 10netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney) [12:56:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on zookeeper-test1002.eqiad.wmnet with reason: host reimage [12:56:29] (03PS4) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) [12:58:15] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1064.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [12:58:15] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:58:16] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission 
(exit_code=0) for hosts analytics1064.eqiad.wmnet [12:58:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zookeeper-test1002.eqiad.wmnet with reason: host reimage [12:58:54] (03CR) 10Majavah: Enable global abuse filters on almost all projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:36] nothing to do indeed [13:00:38] (03PS5) 10Jbond: puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) [13:00:43] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb1001.eqiad.wmnet with OS bullseye [13:01:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42310/console" [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [13:02:33] (03CR) 10Urbanecm: Enable global abuse filters on almost all projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm) [13:02:38] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1065.eqiad.wmnet [13:02:45] taavi: if you have a while, maybe we can finish the discussion synchronously here and deploy? 
[13:02:52] sure [13:03:17] TLDR CI requires me to remove it (at least) from `MWMultiVersion::DB_LIST`. We can workaround that if we want to though. [13:03:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::puppetdb::cilent: updatre submit only port to 8443 [puppet] - 10https://gerrit.wikimedia.org/r/936026 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [13:03:47] (03PS1) 10Elukey: java::version: add support for openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/936032 [13:04:07] why? because it's not used anywhere? [13:04:11] yes [13:04:21] ah, I see [13:04:21] anywhere in operations/mediawiki-config at least [13:04:42] it might be used in regular maintenance jobs or one-off `foreachwikiindblist` tasks [13:04:55] oh right, I was just about to ask if .dblists are used anywhere else [13:05:22] those are the two places that come to mind. it might be used for a lot of things, and it is nearly impossible to identify where it is (not) used [13:05:48] my original concern was that leaving it there but in a way it's not visible to the config might be confusing, but then I didn't realize other places also use the dblist files [13:05:49] so my suggestion is to follow what CI wants and write a task to decide whether it should be removed for good or left as is [13:06:15] we have other dblists that are only available from outside of the config repo (growthexperiments.dblist is one, and there are probably others) [13:06:18] that sounds ok to me [13:06:31] (03CR) 10Majavah: [C: 03+1] Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm) [13:06:40] thanks! using the window to sync it out then. 
[13:06:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm) [13:07:06] (03PS4) 10Urbanecm: Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) [13:07:08] (03CR) 10Urbanecm: [C: 03+2] Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm) [13:08:16] (03Merged) 10jenkins-bot: Enable global abuse filters on almost all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: 10Urbanecm) [13:08:34] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [13:08:36] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935815|Enable global abuse filters on almost all projects (T341159)]] [13:08:39] T341159: Enable global abuse filters for all Wikimedia projects - https://phabricator.wikimedia.org/T341159 [13:10:02] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:935815|Enable global abuse filters on almost all projects (T341159)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:10:39] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1065.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [13:10:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. 
[13:11:00] PROBLEM - Check systemd state on kafka-test1010 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:06] PROBLEM - Kafka Broker Server on kafka-test1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [13:11:32] PROBLEM - Kafka broker TLS certificate validity on kafka-test1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:12:07] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1065.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [13:12:07] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:12:08] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1065.eqiad.wmnet [13:12:52] (03CR) 10Btullis: [C: 03+1] java::version: add support for openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/936032 (owner: 10Elukey) [13:13:07] (03CR) 10Elukey: [C: 03+2] java::version: add support for openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/936032 (owner: 10Elukey) [13:14:18] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1066.eqiad.wmnet [13:14:30] kafka test is my fault :) [13:16:29] (03PS1) 10Btullis: Deploy a new image for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/936035 (https://phabricator.wikimedia.org/T329514) [13:17:26] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-worker1095.eqiad.wmnet with reason: Replacing RAID controller battery [13:17:40] !log btullis@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1095.eqiad.wmnet with reason: Replacing RAID controller battery [13:17:46] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6f84de2d-a493-4b54-92d4-cefed7da6f97) set by btullis@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their s... [13:18:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:935815|Enable global abuse filters on almost all projects (T341159)]] (duration: 10m 07s) [13:18:47] T341159: Enable global abuse filters for all Wikimedia projects - https://phabricator.wikimedia.org/T341159 [13:18:58] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10BTullis) @Jclark-ctr - I've shut down the machine and downtimed it. Feel free to boot it again normally after changing the battery. Many thanks. [13:18:59] deployed. 
[13:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:20:08] (03PS1) 10Cathal Mooney: Enable DHCP relay function for vlan 1023 (analytics1-d-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936036 [13:20:36] (03CR) 10Btullis: [C: 03+2] Deploy a new image for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/936035 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:21:27] (03Merged) 10jenkins-bot: Deploy a new image for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/936035 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:22:24] (03CR) 10Papaul: [C: 03+2] Enable DHCP relay function for vlan 1023 (analytics1-d-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936036 (owner: 10Cathal Mooney) [13:22:54] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [13:23:56] (03CR) 10Papaul: [C: 03+2] Enable DHCP relay function for vlan 1023 (analytics1-d-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936036 (owner: 10Cathal Mooney) [13:24:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:57] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1066.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [13:25:09] (03PS1) 10Cathal Mooney: Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037 
[13:26:45] (03CR) 10Papaul: [V: 03+1] Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037 (owner: 10Cathal Mooney) [13:26:57] (03CR) 10Cathal Mooney: [C: 03+2] Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037 (owner: 10Cathal Mooney) [13:27:30] (03Merged) 10jenkins-bot: Enable DHCP relay function for vlan 1030 (analytics1-a-eqiad) [homer/public] - 10https://gerrit.wikimedia.org/r/936037 (owner: 10Cathal Mooney) [13:29:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-test-worker1003.eqiad.wmnet [13:29:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host an-test-worker1003.eqiad.wmnet [13:29:47] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:29:52] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:30:23] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1066.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [13:30:23] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:30:23] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1066.eqiad.wmnet [13:32:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [13:32:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-test-worker1003.eqiad.wmnet [13:32:56] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10Jclark-ctr) 
05Open→03Resolved @BTullis replaced failed battery. server is booting up now [13:33:11] (03PS1) 10Ladsgroup: ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935856 (https://phabricator.wikimedia.org/T341000) [13:33:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zookeeper-test1002.eqiad.wmnet with OS bookworm [13:34:12] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:34:57] (03PS1) 10Ladsgroup: ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/935857 (https://phabricator.wikimedia.org/T341000) [13:35:51] RECOVERY - Check systemd state on kafka-test1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:55] RECOVERY - Kafka Broker Server on kafka-test1010 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [13:37:06] (03PS1) 10JMeybohm: calico::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) [13:37:08] (03PS1) 10JMeybohm: kubernetes::master: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936041 (https://phabricator.wikimedia.org/T328291) [13:37:49] PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:38:01] (03PS1) 10Btullis: Enable the datahub systemupdate job [deployment-charts] - 10https://gerrit.wikimedia.org/r/936042 (https://phabricator.wikimedia.org/T329514) [13:38:17] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.dhcp for host cloudlb1001.eqiad.wmnet [13:38:39] RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2024-04-04 08:08:00 +0000 (expires in 272 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:39:41] PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:40:51] RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2024-04-04 09:53:00 +0000 (expires in 272 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:41:02] (03CR) 10Btullis: [C: 03+2] Enable the datahub systemupdate job [deployment-charts] - 10https://gerrit.wikimedia.org/r/936042 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:41:31] PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:41:46] (03Merged) 10jenkins-bot: Enable the datahub systemupdate job [deployment-charts] - 10https://gerrit.wikimedia.org/r/936042 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:42:03] RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2024-04-04 14:13:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:42:14] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:43:21] PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL 
handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:44:47] RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2024-04-04 14:53:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:45:13] (03PS1) 10JMeybohm: kubernetes::node: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936043 (https://phabricator.wikimedia.org/T328291) [13:46:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:47:11] RECOVERY - Kafka broker TLS certificate validity on kafka-test1010 is OK: SSL OK - Certificate kafka-test1010.eqiad.wmnet valid until 2024-04-04 15:44:00 +0000 (expires in 273 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:47:45] (03PS1) 10JMeybohm: rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) [13:50:08] (03CR) 10CI reject: [V: 04-1] rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [13:50:10] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Ripe atlas eqiad reported down in Icinga since 2023-06-27 - https://phabricator.wikimedia.org/T341108 (10Jclark-ctr) Replaced SFP-T looks like link returned will close ticket if alert has cleared [13:51:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:40] (03PS2) 10Ssingh: dns1004: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/933918 (https://phabricator.wikimedia.org/T326685) [13:51:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudlb1001.eqiad.wmnet [13:53:01] (03CR) 10Ssingh: [C: 03+2] dns1004: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/933918 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:53:40] (03PS2) 10JMeybohm: rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) [13:54:00] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:54:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:54:37] (03PS1) 10Giuseppe Lavagetto: deployment_server: add REPL for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) [13:55:37] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:56:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye [13:56:13] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye [13:56:57] (03CR) 10CI reject: [V: 04-1] 
deployment_server: add REPL for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [13:57:53] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:59:27] ^ 198.35.26.207 Down xe-0/1/2.0 [13:59:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:00:45] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:23] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:02:39] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1004.wikimedia.org with OS bullseye [14:02:49] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**) - Removed fro... 
[14:02:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye [14:02:59] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye [14:05:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet [14:05:56] !log disabling puppet on A:cp-text to test 935464 [14:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:36] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1067.eqiad.wmnet [14:08:20] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:26] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:09:28] (03CR) 10Hnowlan: [C: 03+2] trafficserver: add gateway routing script, route device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/935464 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [14:11:14] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42311/console" [puppet] - 10https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:12:23] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [14:13:29] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:13:45] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1004.wikimedia.org with OS bullseye [14:13:56] 10SRE, 10Traffic, 
10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**) - Removed fro... [14:14:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye [14:14:13] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye [14:14:31] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [14:15:33] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42312/console" [puppet] - 10https://gerrit.wikimedia.org/r/936043 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:15:44] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:16:18] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:16:40] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42313/console" [puppet] - 10https://gerrit.wikimedia.org/r/936041 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:18:20] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:44] !log 
stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [14:18:44] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:18:45] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1067.eqiad.wmnet [14:19:23] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001 [14:19:44] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1001 [14:20:27] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1068.eqiad.wmnet [14:22:01] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:22:25] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:25:24] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:25:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:25:53] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [14:26:51] (03PS1) 10Ssingh: Revert "dns1004: provision new DNS host in eqiad (hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/935858 [14:27:26] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:27:58] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: analytics1068.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [14:28:19] (03PS1) 10Jbond: nftable::service: address comments [puppet] - 10https://gerrit.wikimedia.org/r/936049 [14:28:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:29:06] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1068.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [14:29:07] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:07] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1068.eqiad.wmnet [14:29:24] 10SRE, 10Infrastructure-Foundations, 10netops: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10aborrero) [14:30:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10aborrero) [14:30:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:30:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney) [14:31:24] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics1069.eqiad.wmnet [14:31:31] (03PS1) 10Hnowlan: Revert "trafficserver: add gateway 
routing script, route device-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/935859 [14:31:52] (03CR) 10Vgutierrez: [C: 03+1] Revert "trafficserver: add gateway routing script, route device-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/935859 (owner: 10Hnowlan) [14:32:59] (03CR) 10Jbond: [C: 04-1] "-1: see inline, i also created a Cr with all theses comments applied[1] if we agree on this approach we can squash that into this" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:34:06] (03CR) 10Hnowlan: [C: 03+2] Revert "trafficserver: add gateway routing script, route device-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/935859 (owner: 10Hnowlan) [14:35:27] RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [14:35:39] (03CR) 10Jbond: [C: 04-1] Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:35:44] !log reenabling puppet on A:cp [14:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet [14:37:04] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [14:37:29] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 670 probes of 761 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:37:39] RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [14:37:45] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 537 probes of 694 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:42:01] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1069.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [14:42:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 61 probes of 695 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:44:24] (03PS1) 10Ssingh: P:ntp: do not use global variables [puppet] - 10https://gerrit.wikimedia.org/r/936050 [14:45:22] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42314/console" [puppet] - 10https://gerrit.wikimedia.org/r/936050 (owner: 10Ssingh) [14:45:37] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1004.wikimedia.org with OS bullseye [14:45:47] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye executed with errors: - dns1004 (**FAIL**) - Removed fro... 
[14:45:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1004.wikimedia.org with OS bullseye [14:46:01] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye [14:46:05] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: host reimage [14:46:17] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics1069.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1001" [14:46:17] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:46:18] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1069.eqiad.wmnet [14:47:58] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:47:59] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 11 probes of 762 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:48:11] (03PS1) 10Stevemunene: analytics: remove puppet references for analytics[1058-1069] [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) [14:49:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: host reimage [14:51:30] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:53:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42315/console" [puppet] - 10https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:54:06] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [14:55:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:30] (03CR) 10Arturo Borrero Gonzalez: nftable::service: address comments (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936049 (owner: 10Jbond) [14:57:09] (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:57:29] (03PS2) 10Stevemunene: analytics: remove puppet references for analytics[1058-1069] [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) [14:57:54] (03PS2) 10Ssingh: sites.yaml: add new dns host dns1004 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/933917 (https://phabricator.wikimedia.org/T326685) [14:58:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1004.wikimedia.org with reason: host reimage [14:58:28] (03PS1) 10Cathal Mooney: Add Eqiad cloud VIP range to prefix list filtering inbound from hosts [homer/public] - 10https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223) [15:00:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223) (owner: 10Cathal Mooney) [15:00:39] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Create spark3 local directory [puppet] - 10https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: 10Stevemunene) [15:00:49] (03CR) 10Btullis: analytics: remove puppet references for analytics[1058-1069] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [15:01:17] (03CR) 10Cathal Mooney: [C: 03+2] Add Eqiad cloud VIP range to prefix list filtering inbound from hosts [homer/public] - 10https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223) (owner: 10Cathal Mooney) [15:02:05] (03Merged) 10jenkins-bot: Add Eqiad cloud VIP range to prefix list filtering inbound from hosts [homer/public] - 10https://gerrit.wikimedia.org/r/936053 (https://phabricator.wikimedia.org/T341223) (owner: 10Cathal Mooney) [15:02:07] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1004.wikimedia.org with reason: host reimage [15:04:03] (03PS2) 10Giuseppe Lavagetto: deployment_server: add REPL for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) [15:05:44] (03CR) 10JMeybohm: [C: 04-1] deployment_server: add REPL for mw-debug (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [15:06:44] (03PS3) 10Stevemunene: analytics: remove puppet references for analytics[1058-1069] [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) [15:07:21] PROBLEM - Check if ntp.service has been 
restarted after /etc/ntp.conf was changed on dns2005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2924325.79s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:07:35] hm ok [15:07:37] expected [15:08:12] (03CR) 10Stevemunene: analytics: remove puppet references for analytics[1058-1069] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [15:10:44] (03CR) 10JMeybohm: [C: 04-1] deployment_server: add REPL for mw-debug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [15:12:32] (03PS1) 10Ssingh: P:ntp: increase interval for checking stale ntp.conf file [puppet] - 10https://gerrit.wikimedia.org/r/936054 [15:13:34] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42316/console" [puppet] - 10https://gerrit.wikimedia.org/r/936054 (owner: 10Ssingh) [15:13:54] (03PS1) 10Dreamy Jazz: Disable purging of old client hint data by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959) [15:15:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-worker1003.eqiad.wmnet with OS bullseye [15:16:25] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:16:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] calico::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:17:25] (03PS2) 10Elukey: changeprop: increase the linger.ms value [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) [15:17:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes::master: Drop 
variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936041 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:18:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes::node: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936043 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:18:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] rsyslog::kubernetes: Drop variable assigments used during migration [puppet] - 10https://gerrit.wikimedia.org/r/936044 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:18:37] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2922987.70s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:18:56] ^ this is expected, first time reimaging with the automation, so will tune the check intervals [15:19:01] the patch is above, merging later [15:19:27] (03PS3) 10Giuseppe Lavagetto: deployment_server: add REPL for mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) [15:19:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2920992.35s). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:19:34] (03CR) 10Giuseppe Lavagetto: deployment_server: add REPL for mw-debug (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [15:20:53] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:21:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:21:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1004.wikimedia.org with OS bullseye [15:22:02] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1004.wikimedia.org with OS bullseye completed: - dns1004 (**PASS**) - Removed from Puppet an... [15:22:10] (03CR) 10Elukey: [C: 03+2] changeprop: increase the linger.ms value [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [15:23:19] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:19] 10SRE, 10SRE-Access-Requests, 10serviceops-radar, 10Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? 
- https://phabricator.wikimedia.org/T340165 (10akosiaris) [15:25:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [15:26:07] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2925119.33s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:27:04] (03CR) 10Giuseppe Lavagetto: deployment_server: add REPL for mw-debug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936046 (https://phabricator.wikimedia.org/T341197) (owner: 10Giuseppe Lavagetto) [15:27:30] (03PS2) 10Jdlrobson: Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) [15:27:41] (03CR) 10CI reject: [V: 04-1] Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [15:27:53] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2921399.04s). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:28:22] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:28:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:29:32] !log restart ntp.service on A:dns-rec [15:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:38] (03PS1) 10Elukey: changeprop: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936057 [15:30:57] (03CR) 10Elukey: [C: 03+2] "Forgot to bump the chart's version https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936057" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [15:31:10] 10SRE, 10observability, 10serviceops: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10akosiaris) Any objections to switching "svc.%{::site}.wmnet" to "discovery.wmnet" ? 
[15:31:26] (03CR) 10Elukey: [C: 03+2] changeprop: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936057 (owner: 10Elukey) [15:33:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:33:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936003 (https://phabricator.wikimedia.org/T341045) (owner: 10ArielGlenn) [15:34:06] (03PS2) 10Ssingh: P:ntp: increase interval for checking stale ntp.conf file [puppet] - 10https://gerrit.wikimedia.org/r/936054 [15:35:04] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42317/console" [puppet] - 10https://gerrit.wikimedia.org/r/936054 (owner: 10Ssingh) [15:35:28] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:36:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/936040 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:36:27] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2923205.83s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:36:39] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2920896.90s). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:36:43] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [15:36:53] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:ntp: increase interval for checking stale ntp.conf file [puppet] - 10https://gerrit.wikimedia.org/r/936054 (owner: 10Ssingh) [15:36:55] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [15:36:56] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/936050 (owner: 10Ssingh) [15:37:29] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) @MSantos, change deployed today. e.g. https://en.wikipedia.org/api/rest_v1/page/mobile-sections now returns a 403 wi... [15:39:56] (03PS2) 10Jbond: nftable::service: address comments [puppet] - 10https://gerrit.wikimedia.org/r/936049 [15:40:21] 10SRE, 10Observability-Alerting, 10Traffic, 10collaboration-services, 10serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10akosiaris) [15:41:02] (03PS3) 10Jbond: nftable::service: address comments [puppet] - 10https://gerrit.wikimedia.org/r/936049 [15:41:14] (03CR) 10Jbond: nftable::service: address comments (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936049 (owner: 10Jbond) [15:41:59] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2923120.30s). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:42:37] (03PS1) 10Effie Mouzeli: ipoid: add APP_CONFIG_PATH for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/936059 [15:43:12] 10SRE, 10observability, 10serviceops: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10Joe) Is this still relevant? I think we moved all LVS alerts off of icinga by now. But yeah no objection apart from what I stated above. [15:44:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [15:45:16] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: add APP_CONFIG_PATH for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/936059 (owner: 10Effie Mouzeli) [15:45:30] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [15:45:45] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [15:45:50] 10SRE, 10observability, 10serviceops: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10akosiaris) No, it's not relevant to icinga so much any more (and it's going to be less and less). It's still an interesting informational thing though and the replaceme... 
[15:46:00] (03Merged) 10jenkins-bot: ipoid: add APP_CONFIG_PATH for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/936059 (owner: 10Effie Mouzeli) [15:47:04] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney) [15:47:13] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:47:30] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [15:49:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:49:36] (03CR) 10Btullis: analytics: remove puppet references for analytics[1058-1069] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/936051 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [15:50:13] (03PS2) 10Milimetric: replicas: redact revdeleted, oversighted information [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF)) [15:51:06] (03PS1) 10Alexandros Kosiaris: service: Replace svc.%{::site} with discovery [puppet] - 10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697) [15:51:29] (03CR) 10CI reject: [V: 04-1] service: Replace svc.%{::site} with discovery [puppet] - 10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697) (owner: 10Alexandros Kosiaris) [15:53:13] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [15:53:27] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [15:53:27] (03PS2) 10Alexandros Kosiaris: service: Replace svc.%{::site} with 
discovery [puppet] - 10https://gerrit.wikimedia.org/r/936062 (https://phabricator.wikimedia.org/T258697) [15:54:24] !log changeprop's kafka linger.ms set to 20s - T338357 (was 5ms, now changeprop waits a bit more to batch messages to send to kafka in one go) [15:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:27] T338357: Pushing jobs to jobqueue is slow again - https://phabricator.wikimedia.org/T338357 [15:54:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:56:23] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [15:57:07] (03Merged) 10jenkins-bot: opentelemetry-collector: Use a NodePort service instead of a hostPort. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935826 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [15:57:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2924662.70s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:57:33] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:57:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (stale by 2920932.00s). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:57:35] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:57:49] ^ expected, spacing out restarts should resolve soon [15:57:58] increased the check interval here so we don't start spamming early [15:59:01] 👍 [15:59:59] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:59:59] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:59:59] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:59:59] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:02:29] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:02:29] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:02:29] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:03:19] (03CR) 10Arturo Borrero Gonzalez: "I would say merge this to the main CR?" [puppet] - 10https://gerrit.wikimedia.org/r/936049 (owner: 10Jbond) [16:06:33] (03PS4) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497) [16:07:03] (03CR) 10CI reject: [V: 04-1] Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [16:09:29] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed (within 3600 seconds). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:10:10] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) 05Open→03Resolved a:03akosiaris >>! In T340036#8994407, @akosiaris wrote: > @MSantos, change deployed today. e.g... 
[16:10:39] (03PS8) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:11:33] (03Abandoned) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [16:11:50] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [16:12:00] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [16:13:36] (03PS9) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:13:41] (03Abandoned) 10Ssingh: Revert "dns1004: provision new DNS host in eqiad (hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/935858 (owner: 10Ssingh) [16:13:51] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new dns host dns1004 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/933917 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [16:15:23] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Brycehughes) 05Open→03Resolved [16:15:53] (03PS1) 10Urbanecm: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191) [16:16:19] (03CR) 10Jbond: "I have squashed my changes into this one, closed my comments. Overall i im not sure if i have a strong preference for this or epp. 
I felt" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:16:19] !log homer "cr*-eqiad*" commit "Gerrit: 933917 add new DNS host dns1004" [16:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:05] (03PS10) 10Jbond: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:17:50] (03PS2) 10Urbanecm: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191) [16:21:59] (03PS1) 10Effie Mouzeli: ipoid: updated app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067 [16:22:44] (03CR) 10Jbond: "a few more comments" [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:23:30] (03PS2) 10Effie Mouzeli: ipoid: update app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067 [16:25:06] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: update app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067 (owner: 10Effie Mouzeli) [16:25:23] 10SRE, 10Observability-Alerting, 10Traffic, 10collaboration-services, 10serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10brennen) > @brennen I saw your updates to phab in SAL, does the above (maniphest.edit taking a lot longer to create tasks) ring... 
[16:25:52] (03Merged) 10jenkins-bot: ipoid: update app.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/936067 (owner: 10Effie Mouzeli) [16:29:08] (03PS1) 10Alexandros Kosiaris: Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - 10https://gerrit.wikimedia.org/r/936070 [16:30:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: Change normal_rule_processing_delay to histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935089 (owner: 10Clément Goubert) [16:30:15] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [16:30:38] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [16:31:33] !log ns0: set routing-options static route 208.80.154.238/32 next-hop [ 208.80.154.6 208.80.155.108 208.80.154.134 ] [16:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:39] (03CR) 10CI reject: [V: 04-1] Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [16:33:08] (03CR) 10Milimetric: [C: 04-1] "I addressed my own comments but I shouldn't (and don't have rights to) self-merge. 
We're currently testing this on the DE cloud replica, " [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF)) [16:36:03] (03PS1) 10Ssingh: sites.yaml: remove dns1001 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/936071 (https://phabricator.wikimedia.org/T326685) [16:40:46] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns1001 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/936071 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [16:42:21] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [16:44:30] (03CR) 10Jbond: Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936049 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [16:44:47] !log homer "cr*-eqiad*" commit "decommission DNS host dns1001 (replaced by dns1004)" [16:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:42] (03PS1) 10Ssingh: hiera: decommission dns host dns1001 (eqiad hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936072 (https://phabricator.wikimedia.org/T326685) [16:47:05] (03PS2) 10Jbond: puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935733 (https://phabricator.wikimedia.org/T338811) [16:47:25] (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns host dns1001 (eqiad hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936072 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [16:47:51] (03CR) 10Jbond: [C: 03+2] puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935733 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [16:49:23] !log sudo cumin A:netbox 
'run-puppet-agent': removing dns1001 before decomm cookbook [16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:56] (03PS1) 10Ssingh: common.yaml: add dns1004, remove dns1001 [homer/public] - 10https://gerrit.wikimedia.org/r/936074 [16:54:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns1001.wikimedia.org [16:56:50] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129) [16:57:01] (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129) [16:57:18] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129) (owner: 10Kosta Harlan) [16:58:00] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/936075 (https://phabricator.wikimedia.org/T341129) (owner: 10Kosta Harlan) [16:58:41] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [16:58:59] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [17:00:06] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1700). 
[17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1700) [17:00:25] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:00:27] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:00:59] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:01:03] hmmm [17:01:33] sukhe: ill check thats i think its me that broke it [17:01:36] jbond: <3 [17:01:37] also they are not live [17:01:43] bookworm hosts? [17:01:48] yes [17:01:51] ok thanks! [17:01:55] new puppet7 stuff [17:02:08] +profile::puppetdb::ssl_verify_client: 'on' [17:02:11] probably this then [17:02:28] yes exactly im gussing i need to configure puppet board to sends its client certs [17:02:30] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:04:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:04:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns1001.wikimedia.org [17:04:29] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns1001.wikimedia.org` - dns1001.wikimedia.org (**WARN**) 
- Downtimed host on Icinga/Alertmanag... [17:07:16] (03PS1) 10Jbond: puppetboard: Add additional site to proxy puppet7 config [puppet] - 10https://gerrit.wikimedia.org/r/936076 (https://phabricator.wikimedia.org/T338811) [17:07:46] akosiaris: [17:07:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:08:00] apparently the MCS decom is affecting PCS https://phabricator.wikimedia.org/T341248 [17:09:12] mbsantos: not sure what the issue is though [17:09:31] (03PS1) 10Jbond: Revert "puppedb::bookworm: Force client auth" [puppet] - 10https://gerrit.wikimedia.org/r/935862 [17:09:33] mobile-html endpoints are receiving 403 [17:09:41] weren't they meant to? [17:09:44] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppedb::bookworm: Force client auth" [puppet] - 10https://gerrit.wikimedia.org/r/935862 (owner: 10Jbond) [17:10:01] (03PS1) 10Jbond: puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935863 (https://phabricator.wikimedia.org/T338811) [17:10:18] mbsantos: I had pasted the regex in https://phabricator.wikimedia.org/T340036#8956205 [17:10:25] if (req.url ~ "^/api/rest_v1/page/mobile-" [17:10:31] (03CR) 10Jbond: [C: 04-1] "need to configure puppetboard with client auth first" [puppet] - 10https://gerrit.wikimedia.org/r/935863 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [17:10:36] if that's wrong, I can change it, but let me know to what [17:10:48] yeah that's my bad, it should be mobile-sections only [17:10:58] ok, easy to fix, gimme a sec [17:11:08] thanks [17:13:16] done [17:13:22] ok now the match is if (req.url ~ "^/api/rest_v1/page/mobile-sections" [17:13:40] (03CR) 10Ssingh: [C: 03+2] common.yaml: add dns1004, remove dns1001 [homer/public] - 10https://gerrit.wikimedia.org/r/936074 (owner: 10Ssingh) [17:13:42] which should also match /page/mobile-sections-remaining and /page/mobile-sections-lead [17:13:51] from the https://en.wikipedia.org/api/rest_v1/#/Mobile stuff at least 
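(editor's note) The regex change discussed above can be sanity-checked offline. The following is only a rough sketch in Python, not the actual VCL: the two patterns are the ones quoted in the channel, while the sample paths, the article title "Earth", and the helper name `blocked` are hypothetical and exist purely for illustration. It assumes the VCL `~` operator behaves like an unanchored regex search, which is why `re.search` is used here.

```python
import re

# Patterns as quoted in the channel; everything else below is illustrative.
BROAD = re.compile(r"^/api/rest_v1/page/mobile-")            # original match
NARROW = re.compile(r"^/api/rest_v1/page/mobile-sections")   # corrected match

# Hypothetical request paths (the title "Earth" is made up for the example).
PATHS = [
    "/api/rest_v1/page/mobile-html/Earth",
    "/api/rest_v1/page/mobile-sections/Earth",
    "/api/rest_v1/page/mobile-sections-lead/Earth",
    "/api/rest_v1/page/mobile-sections-remaining/Earth",
    "/api/rest_v1/page/mobile/sections-lead/Earth",
]


def blocked(pattern, path):
    """Return True if the path would be caught by the given pattern."""
    return pattern.search(path) is not None


if __name__ == "__main__":
    for path in PATHS:
        print(f"{path}: broad={blocked(BROAD, path)} narrow={blocked(NARROW, path)}")
```

Under these assumptions the broad pattern also catches mobile-html, which would be consistent with the 403s reported above, while the narrow pattern only catches the hyphenated mobile-sections family. A slash-separated path such as /page/mobile/sections-lead is matched by neither pattern.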
[17:13:57] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Puppetboard: configure client auth - https://phabricator.wikimedia.org/T341268 (10jbond) [17:15:06] mbsantos: I 've responded on the task too [17:15:40] !log homer "mr*" commit "update ntp_servers (add dns1004, remove dns1001)" [17:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:01] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 11910 bytes in 0.354 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:16:11] akosiaris: thank you very much! [17:16:35] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 12836 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:17:08] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:20:42] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10akosiaris) >>! In T335770#8988938, @Brycehughes wrote: > @akosiaris Yep all clear now from Georgia (the country). However, this lasted much m... [17:24:30] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Brycehughes) @akosiaris Fair enough. Ah, the joys of caching. Thanks. 
[17:24:40] !log sudo cumin -b1 -s300 'A:dns-rec' 'systemctl restart ntp.service' [17:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:12] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10Dzahn) It seemed a bit much to link every single change to this ticket, but then also,, I wanted to somehow link them. So here it goes as a si... [17:36:19] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934640 (owner: 10Dzahn) [17:36:23] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934637 (owner: 10Dzahn) [17:36:27] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934641 (owner: 10Dzahn) [17:36:32] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934642 (owner: 10Dzahn) [17:36:36] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934638 (owner: 10Dzahn) [17:36:40] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934639 (owner: 10Dzahn) [17:37:34] (03CR) 10Dzahn: [C: 03+2] "no difference in compiler, just style fixes: https://puppet-compiler.wmflabs.org/output/934639/42318/" [puppet] - 10https://gerrit.wikimedia.org/r/934639 (owner: 10Dzahn) [17:38:28] (03CR) 10Dzahn: [C: 03+2] wikistats: fix quoting for ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934640 (owner: 10Dzahn) [17:40:43] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934637/42319/" 
[puppet] - 10https://gerrit.wikimedia.org/r/934637 (owner: 10Dzahn) [17:46:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:23] (03CR) 10Tchanders: [C: 03+1] Disable purging of old client hint data by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959) (owner: 10Dreamy Jazz) [18:00:06] hashar and brennen: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1800). [18:01:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934638/42320/" [puppet] - 10https://gerrit.wikimedia.org/r/934638 (owner: 10Dzahn) [18:01:55] 10SRE, 10Observability-Alerting, 10Traffic, 10collaboration-services, 10serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10Aklapper) Hmm. The problem //could// be related to deploying the bug fix (see non-public T338611#8965304 for details) in 6b59a3... 
[18:10:16] (03PS2) 10Dzahn: vrts: fix quoting of ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934641 [18:10:37] (03CR) 10Dzahn: [C: 03+2] ""If a string is a value from an enumerable set of options, such as" [puppet] - 10https://gerrit.wikimedia.org/r/934639 (owner: 10Dzahn) [18:10:52] (03PS2) 10Dzahn: releases: fix quoting of ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934642 [18:11:06] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934639/42323/" [puppet] - 10https://gerrit.wikimedia.org/r/934641 (owner: 10Dzahn) [18:12:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "this was about https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934641 (owner: 10Dzahn) [18:13:15] (03CR) 10Dzahn: "this is about https://phabricator.wikimedia.org/T91908" [puppet] - 10https://gerrit.wikimedia.org/r/934642 (owner: 10Dzahn) [18:18:20] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:19:07] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/934639/42325/" [puppet] - 10https://gerrit.wikimedia.org/r/934642 (owner: 10Dzahn) [18:25:53] (03CR) 10Dzahn: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [18:28:59] (03CR) 10Dzahn: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [18:32:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:33:50] RECOVERY - 
mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:40:24] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10jhsoby) The spammers have now moved on from promoting that one IRC network to posting links and ASCII art depicting lemon party and goatse (if you're lucky e... [18:49:41] jouncebot: nowandnext [18:49:41] For the next 1 hour(s) and 10 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T1800) [18:49:41] In 1 hour(s) and 10 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T2000) [18:51:45] seems like we're on .16 already, and the window's unused? [18:51:56] (03PS3) 10Urbanecm: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191) [18:54:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm) [18:55:48] (03Merged) 10jenkins-bot: PageView: Route requests through restbase service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936065 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm) [18:56:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936065|PageView: Route requests through restbase service proxy (T341191)]] [18:56:06] T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191 [18:57:32] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:936065|PageView: Route requests through 
restbase service proxy (T341191)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [19:01:34] (03Abandoned) 10Stang: Update logo/wordmark/tagline for Serbian project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892955 (https://phabricator.wikimedia.org/T324545) (owner: 10Stang) [19:03:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936065|PageView: Route requests through restbase service proxy (T341191)]] (duration: 07m 27s) [19:03:33] T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191 [19:04:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:06:59] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [19:07:04] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:10] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2019 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:07:22] PROBLEM - Query Service HTTP Port on wdqs2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [19:07:42] PROBLEM - WDQS SPARQL on wdqs2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:07:48] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:08:20] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2019 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:08:24] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:08:30] PROBLEM - Check systemd state on wdqs2019 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service,wdqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categor [19:08:30] ice https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:15:53] (03PS1) 10Urbanecm: PageView: Fix base URL when using service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936083 
(https://phabricator.wikimedia.org/T341191) [19:16:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936083 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm) [19:17:11] (03Merged) 10jenkins-bot: PageView: Fix base URL when using service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936083 (https://phabricator.wikimedia.org/T341191) (owner: 10Urbanecm) [19:17:28] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:936083|PageView: Fix base URL when using service proxy (T341191)]] [19:17:31] T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191 [19:22:55] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:ntp: do not use global variables [puppet] - 10https://gerrit.wikimedia.org/r/936050 (owner: 10Ssingh) [19:24:44] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:936083|PageView: Fix base URL when using service proxy (T341191)]] (duration: 07m 16s) [19:24:48] T341191: Failed fetching https://wikimedia.org/api/rest_v1/metrics/unique-devices/{parameters}: Connection timed out - https://phabricator.wikimedia.org/T341191 [19:32:09] (03PS1) 10Stang: pawikibooks: Install Quiz extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) [19:37:34] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:38:23] (03CR) 10Dzahn: [C: 03+1] sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [19:43:25] (03CR) 10Dzahn: [C: 03+1] "Yes, this does indeed control whether jenkins and zuul services are enabled. 
So I expect this should be merged after all rsyncing is done " [puppet] - 10https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659) (owner: 10Jelto) [19:44:42] (03CR) 10Dzahn: [C: 03+1] "this should also be merged at some point between rsyncing and before re-enabling puppet I think. Not sure if before or after zuul and jenk" [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [19:46:31] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) I have done a first initial transfer of `/srv/jenkins` since I wanted to have a rough estimate of how long it took... [19:56:20] mutante: the Jenkins build rsync takes a little more than a minute once warmed up :] [19:56:27] thanks again for the magic `rsync` commands [19:57:06] I will dig tomorrow in the sequence of the actions to do the migration and write the puppet patches to stop the services and enable them then sync up with jelto [20:00:06] brennen and TheresNoTime: OwO what's this, a deployment window?? UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230706T2000). nyaa~ [20:00:06] Dreamy_Jazz, Jdlrobson, and koi: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] \o [20:00:27] o/ [20:01:05] My change should need no testing as it's adding a config that isn't used by any code yet. [20:02:22] I'll be around in 15ish if no one else appears [20:02:27] I can deploy [20:03:02] I can be around for the entire backport window if needed. 
[20:03:38] thcipriani: please do ^^ [20:04:05] * thcipriani does :) [20:04:14] alright Dreamy_Jazz you're up first [20:04:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959) (owner: 10Dreamy Jazz) [20:05:28] Jdlrobson: around for backport window? [20:05:41] TIL about the Quiz extension ( https://www.mediawiki.org/wiki/Extension:Quiz ) [20:06:14] (03Merged) 10jenkins-bot: Disable purging of old client hint data by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936055 (https://phabricator.wikimedia.org/T340959) (owner: 10Dreamy Jazz) [20:06:31] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]] [20:06:35] T340959: Update CheckUser prune job to remove client hint data - https://phabricator.wikimedia.org/T340959 [20:06:36] T341076: Creation of database tables cu_useragent_clienthints and cu_useragent_clienthints_map - https://phabricator.wikimedia.org/T341076 [20:07:08] (03CR) 10Hashar: [C: 03+1] "The extension does need anything on the database side so that looks fine. I never heard before of that https://www.mediawiki.org/wiki/Ext" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang) [20:07:54] !log thcipriani@deploy1002 thcipriani and dreamyjazz: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:08:49] ^ Dreamy_Jazz you mentioned this needs no testing? Unused? 
[20:09:12] No testing as the config will be used by a patch that depends on this (needs to be a different value to the default on WMF wikis) [20:09:24] So there is no code that uses this config yet on WMF wikis [20:09:31] got it, thank you [20:09:37] Thanks! [20:10:14] (03PS2) 10Hashar: pawikibooks: Install Quiz extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang) [20:10:47] koi: I click the rebase since wmf-config/InitialiseSettings.php got touched by the other change [20:11:23] (syncing now) [20:12:17] hey there im late sorry [20:12:22] was in a meeting [20:13:03] (03PS3) 10Jdlrobson: Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) [20:13:52] no worries, I'm almost ready for yours [20:14:01] thcipriani: great timing then :) [20:15:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:16:40] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]] (duration: 10m 08s) [20:16:46] T340959: Update CheckUser prune job to remove client hint data - https://phabricator.wikimedia.org/T340959 [20:16:46] T341076: Creation of database tables cu_useragent_clienthints and cu_useragent_clienthints_map - https://phabricator.wikimedia.org/T341076 [20:16:55] stuff is syncing still [20:17:59] Dreamy_Jazz: should be synced everywhere now [20:18:07] Thanks. 
[20:20:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:21:17] koi: we are doing another private deploy in between [20:21:47] got it, i'm ok of it [20:26:17] (sorry for delay, small privatesettings update) [20:28:25] (03PS1) 10Hashar: Restore fonts submodule whose removal has not been deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936088 [20:28:54] (03CR) 10Hashar: "/srv/mediawiki-staging/fonts is still on the deployment server and thus its removal has NOT been deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [20:31:38] thcipriani: are we still good for my logos deploy? [20:31:45] yes [20:31:49] in the queue we are doing another change [20:31:58] then I guess do koi change cause it must be late for them [20:32:04] 👍 [20:34:24] Jdlrobson: going ahead with yours now [20:34:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:34:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:35:14] great [20:35:28] (03Merged) 10jenkins-bot: Update more logos with available SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935824 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:35:42] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:935824|Update more logos with available SVGs (T338162)]] [20:35:45] T338162: Track 
which Vector 2022 logos are in production vs Google Drive - https://phabricator.wikimedia.org/T338162 [20:37:11] !log thcipriani@deploy1002 jdlrobson and thcipriani: Backport for [[gerrit:935824|Update more logos with available SVGs (T338162)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:37:24] ^ Jdlrobson check please [20:39:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:40:28] (03PS1) 10Bking: scap: add new WDQS hosts as valid targets [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290) [20:42:03] LGTM please sync [20:42:28] cool, thanks for checking, going live [20:43:37] (03CR) 10Reedy: "Why not just delete the folder and the next full scap should deploy it?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/936088 (owner: 10Hashar) [20:46:53] (03PS2) 10Bking: scap: add new WDQS hosts as valid targets [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290) [20:47:33] (03CR) 10Ebernhardson: [C: 03+1] "should do what we need" [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290) (owner: 10Bking) [20:47:45] (03CR) 10Bking: [C: 03+2] scap: add new WDQS hosts as valid targets [puppet] - 10https://gerrit.wikimedia.org/r/936089 (https://phabricator.wikimedia.org/T341290) (owner: 10Bking) [20:48:23] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:935824|Update more logos with available SVGs (T338162)]] (duration: 12m 41s) [20:48:26] T338162: Track which Vector 2022 logos are in production vs Google Drive - https://phabricator.wikimedia.org/T338162 [20:48:31] ^ Jdlrobson should be live now [20:48:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:48:51] alright koi you're up next! 
Sorry for the delay :) [20:49:55] :) [20:50:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang) [20:50:41] (03Abandoned) 10Hashar: Restore fonts submodule whose removal has not been deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936088 (owner: 10Hashar) [20:51:34] (03Merged) 10jenkins-bot: pawikibooks: Install Quiz extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936084 (https://phabricator.wikimedia.org/T340613) (owner: 10Stang) [20:51:50] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:936084|pawikibooks: Install Quiz extension (T340613)]] [20:51:52] T340613: Install Quiz extension to Punjabi Wikibooks - https://phabricator.wikimedia.org/T340613 [20:53:03] Reedy: thanks, we will remove /srv/mediawiki-config/fonts and sync the removal [20:53:14] !log thcipriani@deploy1002 stang and thcipriani: Backport for [[gerrit:936084|pawikibooks: Install Quiz extension (T340613)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:53:26] ^ koi should be live on mwdebug, check please [20:53:32] looking [20:53:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:54:06] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [20:54:10] (03CR) 10Hashar: "I haven't realized the workload got moved to Shellbox and that is no more used on the app servers. 
We will remove /srv/mediawiki-config/fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [20:54:12] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [20:55:43] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [20:57:37] thanks thcipriani [20:57:39] thcipriani, I tested at https://pa.wikibooks.org/wiki/Wikibooks:Sandbox, and it works fine [20:57:52] no problem Jdlrobson thanks for the updates :) [20:58:04] koi: cool, syncing everywhere now [21:01:26] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:01:52] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:02:08] RECOVERY - Query Service HTTP Port on wdqs2019 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.651 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:02:24] RECOVERY - WDQS SPARQL on wdqs2019 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:02:34] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:04:09] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:936084|pawikibooks: Install Quiz extension (T340613)]] (duration: 12m 19s) [21:04:13] T340613: Install Quiz extension to Punjabi Wikibooks - https://phabricator.wikimedia.org/T340613 [21:04:20] ^ koi should be live now [21:04:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:52] koi: it is live https://pa.wikibooks.org/wiki/Wikibooks:Sandbox !:) [21:06:05] thx! [21:06:12] !log thcipriani@deploy1002 Started scap: Clean up font directory [[gerrit:723652]] [21:06:21] happy Quiz! [21:06:31] Reedy: ^ :] [21:09:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:10:39] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 14m 56s) [21:12:46] !log thcipriani@deploy1002 Finished scap: Clean up font directory [[gerrit:723652]] (duration: 06m 33s) [21:16:03] done! And easytimeline still works :) [21:17:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:22:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:38:20] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:50:42] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following 
units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:42] (03PS1) 10Bking: wdqs: Don't start services until host is ready [puppet] - 10https://gerrit.wikimedia.org/r/936095 (https://phabricator.wikimedia.org/T341290) [22:12:36] (03CR) 10CI reject: [V: 04-1] wdqs: Don't start services until host is ready [puppet] - 10https://gerrit.wikimedia.org/r/936095 (https://phabricator.wikimedia.org/T341290) (owner: 10Bking) [22:14:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:20] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:18:32] (03PS1) 10Gmodena: data-engineering: add alerts for mw-page-content-change-enrich. 
[alerts] - 10https://gerrit.wikimedia.org/r/936096 (https://phabricator.wikimedia.org/T340666) [22:28:46] (03PS1) 10Jdlrobson: Logos: Fixes grantswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936097 [22:47:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/932317/42327/mx1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [23:01:25] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/932316/42328/mx1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [23:08:41] !log mx2001 - rm /usr/local/bin/otrs_aliases ; rm /lib/systemd/system/generate_otrs_aliases.* after deploying gerrit:932316 which renamed script and timer without absenting them [23:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:01] !log mx1001 - rm /usr/local/bin/otrs_aliases ; rm /lib/systemd/system/generate_otrs_aliases.* after deploying gerrit:932316 which renamed script and timer without absenting them [23:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:07] (03CR) 10Dzahn: [C: 03+2] "after merging this I did:" [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [23:22:36] (03PS4) 10Dzahn: vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) [23:26:10] (03Abandoned) 10Dzahn: vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [23:27:04] (03CR) 10Dzahn: [C: 04-1] "as we learned from a similar change the other day, can't be repeated" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [23:31:15] (03PS3) 
10Dzahn: contint: replace Apache 2.2 access control syntax for Jenkins proxy [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) [23:31:53] (03CR) 10Dzahn: "now compare to https://gerrit.wikimedia.org/r/c/operations/puppet/+/935417" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [23:33:50] (03Abandoned) 10Dzahn: mediawiki: replace Apache 2.2 syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932447 (owner: 10Dzahn) [23:35:27] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10nshahquinn-wmf) [23:37:36] (03CR) 10Dzahn: "commented here for attention: https://phabricator.wikimedia.org/T124657#8996134" [puppet] - 10https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix) [23:39:35] (03CR) 10Dzahn: "I would know how to replace the "check_http" with alertmanager blackbox checks, but how would you replace cert_expiry?" [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [23:42:39] 10SRE, 10MediaWiki-Documentation, 10serviceops-radar, 10Documentation, and 2 others: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Dzahn) @Aklapper asked the same on Gerrit, it seems to me this is #serviceops rather than traffic because it's a... [23:43:09] (03CR) 10Dzahn: "pinged / tagged via linked phab ticket for attention" [puppet] - 10https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: 10Dereckson)