[00:01:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P50142 and previous config saved to /var/cache/conftool/dbconfig/20230805-000143-ladsgroup.json [00:04:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:16:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P50143 and previous config saved to /var/cache/conftool/dbconfig/20230805-001649-ladsgroup.json [00:31:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:31:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T342617)', diff saved to https://phabricator.wikimedia.org/P50144 and previous config saved to /var/cache/conftool/dbconfig/20230805-003155-ladsgroup.json [00:31:59] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:38:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945828 [00:38:52] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945828 (owner: 10TrainBranchBot) [00:48:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [00:53:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T342617)', diff saved to https://phabricator.wikimedia.org/P50145 and previous config saved to /var/cache/conftool/dbconfig/20230805-005312-ladsgroup.json [00:53:18] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:53:55] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945828 (owner: 10TrainBranchBot) [01:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P50146 and previous config saved to /var/cache/conftool/dbconfig/20230805-010819-ladsgroup.json [01:23:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P50147 and previous config saved to /var/cache/conftool/dbconfig/20230805-012325-ladsgroup.json [01:38:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T342617)', diff saved to https://phabricator.wikimedia.org/P50148 and previous config saved to /var/cache/conftool/dbconfig/20230805-013831-ladsgroup.json [01:38:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [01:38:35] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:38:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [02:02:00] (03CR) 10RLazarus: [C: 03+1] profile::cache::base: add netmapper file for proxies [puppet] - 10https://gerrit.wikimedia.org/r/945818 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [02:03:16] (03CR) 10RLazarus: [C: 03+1] cache: load ip reputation data and add request header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [02:06:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:38] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:40] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:42] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:42] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:40:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:03:24] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:03:38] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:26:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:33:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:23:11] (03PS7) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 [05:53:33] <_joe_> !log creating logical volume "dataimport" on the puppetmaster frontends [05:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:54] <_joe_> !log mounting the volume under /srv/dataimport on both puppetmaster frontends [05:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:47] (03PS1) 10Giuseppe Lavagetto: ip_reputation_vendors: use dedicated data directory [puppet] - 10https://gerrit.wikimedia.org/r/945896 [06:27:00] (03CR) 10Brennen Bearnes: [C: 03+1] "No objection to tweaking this if it's helpful, and especially if these should be rare or nonexistent now it makes sense to stop filtering " [puppet] - 10https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: 10Bartosz DziewoƄski) [06:28:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ip_reputation_vendors: use dedicated data directory [puppet] - 10https://gerrit.wikimedia.org/r/945896 (owner: 10Giuseppe Lavagetto) [06:40:00] (03PS1) 10Giuseppe Lavagetto: ip_reputation_vendors: re-enable timer [puppet] - 10https://gerrit.wikimedia.org/r/945897 [06:41:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ip_reputation_vendors: re-enable timer [puppet] - 10https://gerrit.wikimedia.org/r/945897 (owner: 10Giuseppe Lavagetto) [06:50:01] (03PS1) 10Giuseppe Lavagetto: ip_reputation_vendors: add datadir [puppet] - 10https://gerrit.wikimedia.org/r/945898 [06:51:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ip_reputation_vendors: add datadir [puppet] - 10https://gerrit.wikimedia.org/r/945898 (owner: 10Giuseppe Lavagetto) [06:56:37] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) I have started mariadb for now (but the host is not in production), so it doesn't get behind so many days. [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230805T0700) [07:07:14] (03PS1) 10ArielGlenn: retry getting the max rev id for wikibase dumps on failure [puppet] - 10https://gerrit.wikimedia.org/r/945899 (https://phabricator.wikimedia.org/T343621) [07:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:17:08] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:33:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:59:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:17:07] 10SRE, 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10cloud-services-team: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10Marostegui) This host belongs to #cloud-services-team (or #data-engineering ?) The proxy needed a reload, because clouddb1018 was showing down, but it... [09:17:14] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:29:47] (03PS9) 10Kaleem Bhatti: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) [09:43:47] (03CR) 10Kaleem Bhatti: [C: 03+1] "now how to add jenkins-bot to code review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [09:56:14] 10SRE, 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10cloud-services-team: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10Peachey88) [09:56:53] (03CR) 10Aklapper: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [10:46:30] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [12:33:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:15:34] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:35] (03PS2) 10Esanders: Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 [14:03:13] (03CR) 10CI reject: [V: 04-1] Explicitly set DiscussionToolsAutoTopicSubEditor to discussiontoolsapi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943558 (owner: 10Esanders) [14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:34] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:34] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:30] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:47:36] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:47:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:10:06] (03CR) 10Vgutierrez: "not a big deal and definitely can be fixed afterwards but this CR breaks varnish tests" [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [17:29:59] (03PS14) 10Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) [17:55:21] (03CR) 10Ladsgroup: [C: 04-2] "Needs community consensus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [19:31:12] PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [19:31:50] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 2232 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:36:30] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [19:52:52] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26069 bytes in 9.785 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:53:26] RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 26068 bytes in 0.545 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:56:30] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:04:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT cronjobs) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:09:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT cronjobs) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:51:42] (03CR) 10Klausman: [C: 03+2] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/945916 (owner: 10Klausman) [21:36:46] (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:41:46] (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:43:20] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (10Ladsgroup) We will be moving to dedicated hardware and bookworm and new version of mailman3 soonTM. That would automatically solve this. [22:01:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:06:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency