[00:19:28] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:02] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:42] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:30:22] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:48] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:45:20] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:58:14] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Reedy) [02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:26:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:26:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:27:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:28:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:57:29] 10SRE, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review, 10Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10Legoktm) @taavi suggested adding the header in core's pre-existing `images/.htaccess`; I'll submit a patch for that. [06:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:23:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:38:49] !log delete knative pods on ml-serve-codfw to clear latency alerts [09:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:02:45] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on dse-k8s-worker[1001-1004,1007].eqiad.wmnet with reason: Downtime DSE workers for cluster upgrade [11:03:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on dse-k8s-worker[1001-1004,1007].eqiad.wmnet with reason: Downtime DSE workers for cluster upgrade [14:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:28:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ArielGlenn) Hey @RobH I am preemptively assuming that dumpsdata1007 is good to go for us to use, since it's got the role ("insetup::core_platform") and everything in the... [15:30:48] !log resized lvm and filesystem for /data on dumpsdata1004,5,7; was <100G, now is 38T usable (left some room for growth later) [15:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:26:57] (03PS1) 10Reedy: filebackend: Replace stringified class names with ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891962 [20:37:50] (03PS1) 10Reedy: filebackend: Opinionated reformatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 [21:58:47] (03CR) 10Jforrester: filebackend: Opinionated reformatting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy) [21:59:44] (03PS2) 10Reedy: filebackend: Opinionated reformatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 [22:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown