[00:07:35] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [00:20:03] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:28:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:31:23] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:57] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:57] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:21:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:36:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:58:07] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:06:45] (JobUnavailable) 
firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:51] PROBLEM - Check systemd state on thanos-be2004 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:41] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:02:37] RECOVERY - Check systemd state on thanos-be2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:11:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:16:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:19:57] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:21:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:22:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:27:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:54:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:59:45] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:14:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:24:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:29:55] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 234, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:30:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:30:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:02:13] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:59:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:59:11] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:03:29] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:09:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1189.eqiad.wmnet with reason: on site maintenance [06:09:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1189.eqiad.wmnet with reason: on site maintenance [06:10:17] !log Shutdown db1189 T317662 [06:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:20] T317662: db1189 broken memory - https://phabricator.wikimedia.org/T317662 [06:12:40] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) @Jclark-ctr the host is now off. Proceed as needed, thank you. [06:13:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (10Marostegui) >>! In T313978#8253254, @Jclark-ctr wrote: > @jcrespo @Marostegui Those host names have been used I have entered into netbox db1204 , db1205. Please confirm... 
[06:22:55] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220923T0700) [07:08:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:13:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:23:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:32:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:33:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:36:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: Grants fixing [07:36:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: Grants fixing [07:37:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:39:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:47:40] (03CR) 10Ladsgroup: "You can start by enabling it in beta cluster (probably in all of it), it's in wmf-config/initialiseSettings-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [07:49:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:01:45] (JobUnavailable) firing: Reduced availability for job 
atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:06:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:11:53] 10SRE, 10Traffic: Text cluster is being hit with an average of 1.8k PURGE requests per second per host - https://phabricator.wikimedia.org/T318349 (10Vgutierrez) [08:13:59] 10SRE, 10Performance-Team, 10RESTBase-API, 10Traffic: Text cluster is being hit with an average of 1.8k PURGE requests per second per host - https://phabricator.wikimedia.org/T318349 (10Ladsgroup) This seems to be mostly rest base (FYI perf) [08:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:20:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) After double checking on netbox it looks like we have some standing issues: lvs1017 and lvs1020 get connectivity to rows B and C... [08:45:20] (03PS1) 10Aqu: Deploy Spark 3 on the whole production cluster [puppet] - 10https://gerrit.wikimedia.org/r/834500 (https://phabricator.wikimedia.org/T312882) [08:46:15] (03Abandoned) 10Aqu: Deploy Spark 3 to production [puppet] - 10https://gerrit.wikimedia.org/r/833412 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [08:47:07] (03PS2) 10Aqu: Deploy Spark 3 on the whole production cluster [puppet] - 10https://gerrit.wikimedia.org/r/834500 (https://phabricator.wikimedia.org/T312882) [08:49:35] PROBLEM - SSH on db1116.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:50:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:17] !log rebalance ms-eqiad swift rings T294550 [08:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:21] T294550: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 [08:51:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:52:30] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37337/console" [puppet] - 10https://gerrit.wikimedia.org/r/834500 
(https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [08:56:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:58:53] (03PS1) 10MVernon: swift: remove ms-be10[28-39] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/834503 (https://phabricator.wikimedia.org/T294550) [08:59:03] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37338/console" [puppet] - 10https://gerrit.wikimedia.org/r/833416 (https://phabricator.wikimedia.org/T318019) (owner: 10Dduvall) [09:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:13:44] (03CR) 10Jelto: [C: 03+1] "lgtm. I run build-production-images on build host and buildkitd image was updated. Following https://wikitech.wikimedia.org/wiki/Kubernete" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/830909 (owner: 10Dduvall) [09:17:10] (03PS1) 10Marostegui: production-m3.sql.erb: Add phab1004 missing grants [puppet] - 10https://gerrit.wikimedia.org/r/834508 (https://phabricator.wikimedia.org/T315713) [09:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:19:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:19:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (10jcrespo) > Those names are ok. 
+1 [09:20:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10jcrespo) [09:20:55] (03CR) 10Marostegui: [C: 03+2] production-m3.sql.erb: Add phab1004 missing grants [puppet] - 10https://gerrit.wikimedia.org/r/834508 (https://phabricator.wikimedia.org/T315713) (owner: 10Marostegui) [09:21:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10jcrespo) [09:21:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10jcrespo) [09:21:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10jcrespo) [09:24:05] (03PS1) 10AOkoth: vrts: enable vrts-daemon on WMCS instance [puppet] - 10https://gerrit.wikimedia.org/r/834510 (https://phabricator.wikimedia.org/T317059) [09:24:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:25:36] (03CR) 10Jelto: [V: 03+1] "I left one comment in-line, otherwise looks good" [puppet] - 10https://gerrit.wikimedia.org/r/833416 (https://phabricator.wikimedia.org/T318019) (owner: 10Dduvall) [09:26:52] !log stopping db1117:s3 for maintenance T315713 [09:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:57] T315713: sort out mysql privileges for phab1004/phab2002 - https://phabricator.wikimedia.org/T315713 [09:29:13] 10SRE, 10Performance-Team, 10RESTBase-API, 10Traffic: Text cluster is being hit with an average of 1.8k PURGE requests per second per host - https://phabricator.wikimedia.org/T318349 (10Vgutierrez) The schema https://schema.wikimedia.org/repositories//primary/jsonschema/resource_change/1.0.0.json allows to... 
[09:29:28] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:31:20] (03PS1) 10JMeybohm: Add tanuja to ldap_only_users for wmde access [puppet] - 10https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) [09:31:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:31:35] (03CR) 10Jelto: [C: 03+2] "lgtm" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/833067 (https://phabricator.wikimedia.org/T318019) (owner: 10Dduvall) [09:32:25] (03CR) 10CI reject: [V: 04-1] Add tanuja to ldap_only_users for wmde access [puppet] - 10https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) (owner: 10JMeybohm) [09:33:57] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:34:14] (03CR) 10Jelto: [V: 03+2 C: 03+2] buildkitd: Install wmf-certificates for registry CA [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/833067 (https://phabricator.wikimedia.org/T318019) (owner: 10Dduvall) [09:34:31] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:36:28] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:36:46] haproxy is me, it is expected [09:37:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10JMeybohm) 05Open→03Resolved AIUI this is done. Please reopen if that's not the case. [09:37:06] (no impact on active services) [09:38:29] (03CR) 10Jelto: [V: 03+2 C: 03+2] "I run build-production-images on build host and buildkitd image was updated to buildkitd:0.10.4-2. Following https://wikitech.wikimedia.or" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/833067 (https://phabricator.wikimedia.org/T318019) (owner: 10Dduvall) [09:40:15] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: Bump buildkitd version to 0.10.4-2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833416 (https://phabricator.wikimedia.org/T318019) (owner: 10Dduvall) [09:40:27] (03PS1) 10Jbond: admin: ryankemper update shell to zsh [puppet] - 10https://gerrit.wikimedia.org/r/834515 [09:41:20] (03CR) 10Jbond: "i noticed the following and wondered if you perhaps didn't realise you could change your shell?" 
[puppet] - 10https://gerrit.wikimedia.org/r/834515 (owner: 10Jbond) [09:41:28] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:41:52] (03CR) 10Jbond: ryankemper: add tmux, vim, zsh conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834369 (owner: 10Ryan Kemper) [09:43:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [09:44:45] (03CR) 10Jbond: "feel free to abandon or merge this to your preference" [puppet] - 10https://gerrit.wikimedia.org/r/834515 (owner: 10Jbond) [09:46:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:48:59] 10SRE, 10Traffic: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (10Vgutierrez) 05Stalled→03In progress actually there is some stuff that we can implement to avoid the issue described on the task description. the SLI should focus only on clie... [09:50:22] (03Abandoned) 10Ladsgroup: rdbms: Allow SubQuery objects in SelectQueryBuilder as table [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832322 (https://phabricator.wikimedia.org/T314189) (owner: 10Ladsgroup) [09:50:49] RECOVERY - SSH on db1116.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:55:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:08] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10JMeybohm) @Devnull I think you need a sponsor for this as well as approval from @Ottomata or @odimitrijevic [10:00:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:10:17] (03CR) 10Jelto: [C: 03+1] "lgtm now, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:10:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:12:08] (03PS1) 10Majavah: labstore: Remove UTRS NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/834522 [10:12:40] (03PS2) 10Majavah: labstore: Remove UTRS NFS volumes [puppet] - 10https://gerrit.wikimedia.org/r/834522 (https://phabricator.wikimedia.org/T301295) [10:15:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:18:43] (03PS1) 10Vgutierrez: mtail:varnishsli: Track client sided requests only [puppet] - 10https://gerrit.wikimedia.org/r/834525 [10:20:28] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:26:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:00] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Allow to pin calico chart versions per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:28:03] (03CR) 10JMeybohm: [C: 03+2] calico-crd: Split crds.yaml into multiple files [deployment-charts] - 10https://gerrit.wikimedia.org/r/826269 (owner: 10JMeybohm) [10:31:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST jobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:31:57] (03Merged) 10jenkins-bot: admin_ng: Allow to pin calico chart versions per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:32:07] (03Merged) 10jenkins-bot: calico-crd: Split crds.yaml into multiple files [deployment-charts] - 10https://gerrit.wikimedia.org/r/826269 (owner: 10JMeybohm) [10:32:54] (03CR) 10MVernon: [C: 03+1] "Thanks! 
I didn't know you could do that and not have it conflict with the use of rsync::server in rsync::server::module" [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) (owner: 10Jbond) [10:34:03] (03CR) 10MVernon: [C: 03+1] "Thanks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/832630 (https://phabricator.wikimedia.org/T311066) (owner: 10Jbond) [10:36:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:55] (03CR) 10MVernon: "So this looks sensible to me, but I don't think I know enough puppetry to really provide an expert review..." [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [10:46:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:49:16] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Jelto) >>! In T316296#8206081, @Jelto wrote: > [...] > Out of curiosity I looked at the httpd logs for the Pods on Kubernetes and found only one "va... [10:55:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:12:20] (03PS1) 10Jbond: P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 
(https://phabricator.wikimedia.org/T264132) [11:14:26] (03CR) 10CI reject: [V: 04-1] P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:20:17] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:21:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:26:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:26:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET endpoints) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:31:13] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET namespaces) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:42:43] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:43:35] (03PS2) 10Jbond: P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) [11:44:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37341/console" [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:44:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:45:42] (03CR) 10CI reject: [V: 04-1] P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:47:26] (03CR) 10Jbond: [C: 03+1] "a few optional nits or follow up but looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [11:47:43] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:48:12] 
(03CR) 10Jbond: "removing vote see additional comment" [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [11:49:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PUT cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:50:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 20 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37342/console" [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:51:58] (03PS3) 10Jbond: P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) [11:52:43] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:53:51] (03CR) 10CI reject: [V: 04-1] P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) (owner: 10Jbond) [11:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:04:43] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:05:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:09:53] (03PS1) 10Elukey: knative: backport patch to tune pod DNS settings from version 1.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/834553 (https://phabricator.wikimedia.org/T313915) [12:10:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:11:29] (03PS2) 10Elukey: knative: backport patch to tune pod DNS settings from version 1.5 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/834553 (https://phabricator.wikimedia.org/T313915) [12:11:58] (03PS4) 10Jbond: P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) [12:12:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:13:57] (03CR) 10Jbond: [C: 03+2] P:swift::proxy: initiate the 
rsync::server explicitly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) (owner: 10Jbond) [12:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:16:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:thanos::swift::frontend: initiate the rsync::server explicitly [puppet] - 10https://gerrit.wikimedia.org/r/832630 (https://phabricator.wikimedia.org/T311066) (owner: 10Jbond) [12:16:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:17:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET endpoints) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:20:30] (03PS5) 10Jbond: P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) [12:20:59] (03PS6) 10Jbond: P:lvs::configueration: move classification to hiera and add error checks [puppet] - 10https://gerrit.wikimedia.org/r/834549 (https://phabricator.wikimedia.org/T264132) [12:21:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:21:33] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:33] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:31:17] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:35:28] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:42:43] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:52:43] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:55:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:43] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:02:43] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:05:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:07:43] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:10:28] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:44] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:59] joal: hello! Would you have time today to look at the unique_devices changes? I have the hope that I can merge and deploy that today, together with a couple other Airflow patches :] If not, no worries! It's friday anyway... 
[13:22:43] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:41] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): wmf-proxy-dashboard improved error handling [13:24:53] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): wmf-proxy-dashboard improved error handling (duration: 01m 11s) [13:25:28] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:10] 10SRE, 10TimedMediaHandler, 10serviceops: Upgrade Wikimedia production's ffmpeg to 4.4 or later so we can use the fpsmax flag - https://phabricator.wikimedia.org/T318419 (10Jdforrester-WMF) p:05Triage→03Low [13:26:45] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: wmf-proxy-dashboard improved error handling [13:27:52] hi, we have a huge spike of PHP errors coming from some timed dump process running on host snapshot1008 [13:28:01] the errors started an hour ago and I don't see any recent changes around the relevant code. Could this be an ops issue somewhere? https://logstash.wikimedia.org/goto/8b63707e9db235abb2f4ce70800dc186 [13:29:08] (03PS1) 10Hashar: Stop using Elastica::Type and set the target indices [extensions/ApiFeatureUsage] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834531 (https://phabricator.wikimedia.org/T318356) [13:29:51] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: wmf-proxy-dashboard improved error handling (duration: 03m 06s) [13:29:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [extensions/ApiFeatureUsage] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834531 (https://phabricator.wikimedia.org/T318356) (owner: 10Hashar) [13:30:28] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:43] (03Merged) 10jenkins-bot: Stop using Elastica::Type and set the target indices [extensions/ApiFeatureUsage] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834531 (https://phabricator.wikimedia.org/T318356) (owner: 10Hashar) [13:31:58] !log hashar@deploy1002 Started scap: Backport for [[gerrit:834531|Stop using Elastica::Type and set the target indices (T318356)]] [13:32:01] T318356: ApiFeatureUsage Error: Call to undefined method Elastica\Search::addType() - https://phabricator.wikimedia.org/T318356 [13:32:20] !log hashar@deploy1002 hashar and hashar: Backport for [[gerrit:834531|Stop using Elastica::Type and set the target indices (T318356)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:32:35] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:32:43] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH nodes) on 
k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:36:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:36:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:37:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:17] that hotfix was to unbreak `Special:ApiFeatureUsage` [13:38:41] jnuche: from /srv/mediawiki/php-1.40.0-wmf.2/extensions/ContentTranslation/includes/TmxDumpFormatter.php(66) [13:38:52] so I guess it is more or less related to that extension [13:39:08] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:834531|Stop using Elastica::Type and set the target indices (T318356)]] (duration: 07m 10s) [13:39:12] T318356: ApiFeatureUsage Error: Call to undefined method Elastica\Search::addType() - https://phabricator.wikimedia.org/T318356 [13:39:27] apparently triggered by a maintenance script `/srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki enwiki -q --split-at 500 --outputdir /mnt/dumpsdata/otherdumps/contenttranslation/20220923 --compression gzip --format tmx --plaintext` [13:40:15] I guess that can be filed to phabricator and poke apergos since that looks related to dumps and lagnuage team which are mananging contenttranslation [13:40:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET endpoints) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:40:34] 10SRE, 10Performance-Team, 10RESTBase-API, 10Traffic: Text cluster is being hit with an average of 1.8k PURGE requests per second per host - https://phabricator.wikimedia.org/T318349 (10Krinkle) > non-PURGE requests VS PURGE requests hitting ats@cp3050 during the last 30 days: > {F35528462} Which dashboar... [13:40:49] you can cc me on the task, it should go to the language team folks [13:42:35] hashar, apergos: thanks, I'll create a ticket [13:42:49] note that all dumps jobs were moved to php7.4 at the beginning of the week, I expect that could be related [13:43:11] PROBLEM - Host db1189.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:43:48] the message is ` PHP Notice: Trying to access array offset on value of type int` and there some changes made to ContentTranslation this week [13:43:56] 10SRE-swift-storage, 10ops-codfw: /dev/sdg failed in thanos-be2004 - https://phabricator.wikimedia.org/T318422 (10MatthewVernon) [13:44:23] hashar: yes, but they were only JS files [13:44:27] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: /dev/sdg failed in thanos-be2004 - https://phabricator.wikimedia.org/T318422 (10MatthewVernon) [13:45:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: /dev/sdg failed in thanos-be2004 - https://phabricator.wikimedia.org/T318422 (10MatthewVernon) p:05Triage→03High [set to high because this blocks further changes to the replication number, which blocks draining 2 nodes from thanos, which blocks MOSS Ce... 
[13:48:11] 10SRE, 10Performance-Team, 10RESTBase-API, 10Traffic: Text cluster is being hit with an average of 1.8k PURGE requests per second per host - https://phabricator.wikimedia.org/T318349 (10Vgutierrez) The dashboard is https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-site=esams&var-layer=... [13:49:37] RECOVERY - Host db1189.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [14:01:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson Thank you [14:02:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:12:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:18:14] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:18:28] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:23:14] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST cronjobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:14] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:33:14] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:14] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:48:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:50:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 
100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:53:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:00:55] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) @Marostegui swapped DIMM A10 and A5, performed a hardware diagnostic on memory, and pulled the TSR report; no errors at this time. We will need to put the server back into service to see if any errors return.... [15:03:14] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:10:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:03] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:14:00] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (Jclark-ctr) logstash1036 E1 U26 Port 26 Cableid 20220234 logstash1037 F1 U26 Port 26 Cableid 20220233 [15:14:17] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (Jclark-ctr) [15:14:40] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [15:15:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST jobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (GET namespaces) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:24:28] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET namespaces) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:29:28] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (GET namespaces) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:34:28] 
(KubernetesAPILatency) resolved: (5) High Kubernetes API latency (GET namespaces) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:38:31] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 109 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:41:07] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:44:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:44:55] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 49 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:49:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:54:28] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST cronjobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:45] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (Jclark-ctr) ganeti1033 D2 U34 Port 34 Cableid 20220010 ganeti1034 D4 U30 Port 38 Cableid 20220038 [15:55:52] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (Jclark-ctr) [15:56:41] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [15:56:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:15:55] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:17:28]  [16:19:29] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:26:17] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:26:31] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:30:59] SRE, ops-eqiad, DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (Jclark-ctr) It has been 2 weeks without any alerts; closing ticket. Nothing else will be added to this rack until we can decom some hosts from it. [16:31:06] SRE, ops-eqiad, DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (Jclark-ctr) Open→Resolved [16:31:09] SRE, ops-eqiad, Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (Jclark-ctr) [16:36:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:46:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:50:25] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:52:19] (PS3) BCornwall: lvs: Convert ::lvs::configuration to a profile [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) [16:52:26] (CR) BCornwall: lvs: Convert ::lvs::configuration to a profile (5 comments) [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall) [17:14:41] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:27:37] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:27:41] 
(Abandoned) Dduvall: scap: Allow deploy user to keep scap3 environment variables with sudo [puppet] - https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall) [17:31:18] (CR) Dduvall: "Just resolving commits so this stops showing in Gerrit under Your Turn" [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall) [17:33:11] (CR) Dduvall: buildkitd: Support configuration of OCI executor nameservers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall) [17:45:16] (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/833020 (owner: PipelineBot) [17:48:54] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@7620b25]: (no justification provided) [17:49:04] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@7620b25]: (no justification provided) (duration: 00m 10s) [17:50:36] Puppet, SRE, Infrastructure-Foundations, Traffic-Icebox, and 2 others: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BCornwall) Open→In progress [17:50:41] Puppet, SRE, Cloud-Services, Infrastructure-Foundations, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (BCornwall) [17:52:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:54:14] (PS1) BryanDavis: bd808: Add new production ssh key [puppet] - https://gerrit.wikimedia.org/r/834604 [17:56:47] (CR) BryanDavis: "I'm more than happy to do a video call with someone to verify that I'm me." [puppet] - https://gerrit.wikimedia.org/r/834604 (owner: BryanDavis) [17:57:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:01:15] (PS1) BryanDavis: bd808: Add new root ssh key [labs/private] - https://gerrit.wikimedia.org/r/834605 [18:01:43] (CR) BryanDavis: "I'm more than happy to do a video call with someone to verify that I'm me." 
[labs/private] - https://gerrit.wikimedia.org/r/834605 (owner: BryanDavis) [18:14:05] (Abandoned) Jdlrobson: EXPECTED VISUAL CHANGES FOR 1.40.0-wmf.1 [skins/Vector] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/832551 (https://phabricator.wikimedia.org/T316056) (owner: Jdlrobson) [18:27:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:30:57] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:32:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:52:19] (CR) Andrew Bogott: [V: +2 C: +2] bd808: Add new root ssh key [labs/private] - https://gerrit.wikimedia.org/r/834605 (owner: BryanDavis) [18:52:37] (CR) Andrew Bogott: [C: +2] bd808: Add new production ssh key [puppet] - https://gerrit.wikimedia.org/r/834604 (owner: BryanDavis) [19:10:28] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@4c973d6]: (no justification provided) [19:10:41] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@4c973d6]: (no justification provided) (duration: 00m 12s) [19:23:35] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:48:17] (PS1) Jbond: wmflib::service::lvs_ipblock: remove unused function [puppet] - https://gerrit.wikimedia.org/r/834609 [19:48:50] (CR) CI reject: [V: -1] wmflib::service::lvs_ipblock: remove unused function [puppet] - https://gerrit.wikimedia.org/r/834609 (owner: Jbond) [19:49:51] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall) [20:15:55] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:17:05] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:59] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:29:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:21] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:55] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:39] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - 
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:33:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:38:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency