[00:00:12] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945830 [00:38:39] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945830 (owner: 10TrainBranchBot) [00:54:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945830 (owner: 10TrainBranchBot) [01:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:23:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:34] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:28] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:38] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:08] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:54:32] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:30] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230806T0700) [07:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:04:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:24] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 130 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:37:02] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [11:48:54] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:16] (03PS1) 10Anzx: Update knwiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945939 (https://phabricator.wikimedia.org/T343662) [15:09:23] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:23] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:50:26] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:39:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:44:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:53:11] (03CR) 10Esanders: IS-labs: Enable edit recovery on en.wikipedia.beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942419 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [17:48:10] PROBLEM - Disk space on alert1001 is CRITICAL: DISK CRITICAL - free space: / 2789 MB (3% inode=55%): /var/lib/docker/overlay2/57e08c3653f262bbb951aac3d11b741696919e0a9d276d3047d65ad41ce8c432/merged 2789 MB (3% inode=55%): /tmp 2789 MB (3% inode=55%): /var/tmp 2789 MB (3% inode=55%): /var/lib/docker/overlay2/89ba38deccc406acb93addb77bb4a80894d4b2b5b7fe7911b6586106ff3217d5/merged 2789 MB (3% inode=55%): https://wikitech.wikimedia.org/wiki/M [17:48:10] g/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=alert1001&var-datasource=eqiad+prometheus/ops [20:24:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:47:29] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10Aklapper) >>! In T318695#9019793, @akosiaris wrote: > @hnowlan @jijiki. nutcracker removal merged and deployed. I am gonna let you have the pleasure of res... [22:12:34] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Krinkle) [22:12:42] 10SRE, 10serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10Krinkle) [22:15:48] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Krinkle) [22:17:22] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10Krinkle) [22:18:39] 10SRE, 10RESTBase-API, 10Traffic-Icebox: Thumb API: Varnish / CDN questions - https://phabricator.wikimedia.org/T150673 (10Krinkle) [22:20:05] 10SRE, 10Traffic-Icebox, 10MediaWiki-Platform-Team (Radar), 10Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187 (10Krinkle) [22:23:59] 10SRE, 10Traffic-Icebox, 10User-CDanis: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [22:25:22] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10Krinkle) [22:26:02] 10SRE, 10Infrastructure-Foundations, 10Traffic: Mapping Client IPs to Resolver IPs - https://phabricator.wikimedia.org/T336947 (10Krinkle) [22:27:00] 10SRE, 10Infrastructure-Foundations: Consider OS level tracking/configuration of performance/powersaving settings - https://phabricator.wikimedia.org/T338944 (10Krinkle) [22:30:32] 10SRE, 10MediaWiki-Platform-Team, 10Traffic-Icebox, 10WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Krinkle) 05Open→03Declined [22:35:35] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 3 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10Krinkle) [22:35:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Krinkle) [22:38:06] 10SRE, 10Traffic, 10serviceops: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (10Krinkle) [22:41:43] 10SRE, 10SRE-swift-storage, 10serviceops: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10Krinkle)