[00:02:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:03:58] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10688103 (phaultfinder) [00:07:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:27:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:38:55] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1132041 [00:38:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1132041 (owner: TrainBranchBot) [00:40:24] (PS10) Krinkle: search-grafana-dashboards: format results as markdown, and add --json [software] - https://gerrit.wikimedia.org/r/1129242 (owner: Filippo Giunchedi) [00:40:30] (CR) Krinkle: [C:+1] search-grafana-dashboards: format results as markdown, and add --json [software] - https://gerrit.wikimedia.org/r/1129242 (owner: Filippo Giunchedi) [00:51:09] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1132041 (owner: TrainBranchBot) [00:55:01] ops-codfw, SRE, DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10688151 (phaultfinder) [01:09:06] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132042 [01:09:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132042 (owner: TrainBranchBot) [01:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 6.653% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:27:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:28:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:28:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 4.353s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:28:38] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132042 (owner: TrainBranchBot) [01:28:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 22.5% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:33:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 4.643% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:33:21] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.724s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:33:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688166 (phaultfinder) [01:35:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:35:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.549s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:37:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 15.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:38:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:40:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:40:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.946s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:49:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688180 (phaultfinder) [02:33:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688226 (phaultfinder) [03:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:12:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [04:32:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688260 (phaultfinder) [05:45:54] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132075 [05:52:58] ops-codfw, SRE, 
DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10688262 (phaultfinder) [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:57:58] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10688263 (phaultfinder) [06:59:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688279 (phaultfinder) [07:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:45:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688310 (phaultfinder) [08:17:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [08:37:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [08:44:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688319 (phaultfinder) [09:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688385 (phaultfinder) [10:44:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688434 (phaultfinder) [11:15:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688505 (phaultfinder) [11:16:33] FIRING: 
KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:30:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688537 (phaultfinder) [12:22:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:42:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:19:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10688921 (phaultfinder) [14:39:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10689227 (phaultfinder) [14:53:57] ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848#10689271 (phaultfinder) [14:55:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10689273 (phaultfinder) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:33] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:30:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [15:30:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:32:26] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:32] I'm about to switch trains unfortunately [15:33:13] <_joe_> swift is down in eqiad I think [15:33:15] <_joe_> https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&viewPanel=37&from=now-1h&to=now-1m [15:33:57] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:34:02] <_joe_> if anyone else is around, can you please update the status page? [15:34:26] On it [15:36:27] "Partial media unavailability" ? [15:36:32] <_joe_> yes [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:43] <_joe_> I'll depool swift in eqiad from the cdn [15:36:58] <_joe_> claime: anything against it? ^ [15:37:28] +1 [15:37:52] https://wikimedia.statuspage.io/incidents/tzb72slhy9cz [15:37:52] I acked the alerts [15:37:54] !log oblivian@cumin1002 START - Cookbook sre.discovery.service-route depool swift in eqiad: maintenance [15:37:55] _joe_: go [15:38:03] <_joe_> done [15:38:59] <_joe_> ok next thing, I'd roll-restart the swift proxies in eqiad [15:39:14] <_joe_> I restarted it on ms-fe1009 and it seemed to recover [15:39:16] yep I was about to propose the same [15:39:27] let's do it [15:39:41] roll restart ms-f... ok too slow [15:39:57] * Emperor just got in from a walk. Where are we? 
[15:40:37] <_joe_> Emperor: swift in eqiad is borked, I depooled eqiad first thing [15:41:00] <_joe_> I'm looking at it now but I can't see a problem on the proxy [15:41:01] Bunch of 503s [15:41:09] <_joe_> if you can check the backends, that would help :) [15:41:15] will do [15:42:07] backends look OK on my usual checks (swift-recon -r ; swift-dispersion-report) [15:42:18] network IO abnormally high [15:42:24] I just tried to check superset and turnilo, there are some issues with both that I believe are related to the recent switch of the webrequest sampled data [15:42:26] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:42:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:42:53] <_joe_> claime: where? [15:42:54] backend transmit is ~ 3.5+GB/s [15:42:56] !log oblivian@cumin1002 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool swift in eqiad: maintenance [15:43:02] has someone done a roll-restart of the frontends yet? 
[15:43:07] <_joe_> Emperor: nope [15:43:09] swift eqiad [15:43:23] rps aren't particularly high [15:43:27] I'll do that then, it won't hurt and sometimes does clear up oddities [15:43:35] <_joe_> yeah and without turnilo we will fly blind [15:43:37] https://grafana.wikimedia.org/goto/JU1pkfTNg?orgId=1 [15:43:42] <_joe_> Emperor: +1 [15:43:57] FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:02] so turnilo works, but I realized that webrequest_source is now "frontend" [15:44:05] !incidents [15:44:06] 5899 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [15:44:06] 5900 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [15:44:06] 5901 (UNACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [15:44:10] !ack 5901 [15:44:11] 5901 (ACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [15:44:12] and not text/upload [15:44:24] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [15:44:27] superset seems to have a different issue, the dashboard fails to load for me [15:44:32] various errors [15:44:49] <_joe_> yeah I think we need to repool eqiad sadly [15:44:54] <_joe_> errors are skyrocketing [15:45:08] <_joe_> elukey: well turnilo is ok, we can select upload.w.o [15:45:25] <_joe_> Emperor: lmk when you've finished your restarts [15:45:29] roll-restart will take O(minute) [15:45:32] _joe_: ack [15:45:36] <_joe_> claime: is thumbor ok? 
[15:45:40] no [15:45:45] <_joe_> heh [15:45:48] <_joe_> as I feared [15:45:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:01] _joe_ okok perfect [15:46:10] !log oblivian@cumin1002 START - Cookbook sre.discovery.service-route pool swift in eqiad: maintenance [15:46:16] thumbor unhappiness will show up as swift errors [15:46:22] !incidents [15:46:22] 5899 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [15:46:23] 5900 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [15:46:23] 5901 (ACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [15:46:23] 5902 (UNACKED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [15:46:26] !ack 5902 [15:46:27] 5902 (ACKED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [15:46:27] <_joe_> yeah but these are too many tbh [15:46:37] https://grafana.wikimedia.org/goto/m7R-zfoNg?orgId=1 [15:46:39] thumbor sad [15:47:10] <_joe_> ah yeah we sent all traffic to one thumbor [15:47:14] <_joe_> and that was too much [15:47:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [15:47:26] FIRING: [4x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:30] _joe_: eqiad frontend roll-restart done [15:47:39] <_joe_> ok [15:49:38] so I think if you want to try repooling eqiad to spread the load better...? 
[15:50:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:12] !log oblivian@cumin1002 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) pool swift in eqiad: maintenance [15:52:26] RESOLVED: [4x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:51] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:53:06] <_joe_> so the roll restart was probably the thing to do first, heh [15:55:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [15:55:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:57:51] " I'm about to switch trains unfortunately" - I was like, why is jelto doing mw deployments ;p [15:58:29] so thumbor was unhappy? 
[16:02:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [16:02:46] this one is unrelated --^ [16:07:17] question_mark: o/ we are discussing it in #security, mostly swift not happy [16:20:37] !incidents [16:20:38] 5900 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [16:20:38] 5899 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [16:20:38] 5901 (RESOLVED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [16:20:38] 5902 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:27:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:32:36] RESOLVED: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [16:47:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:52:57] (PS1) Ilias Sarantopoulos: ml-services: increase ref-need memory limits/requests [deployment-charts] - https://gerrit.wikimedia.org/r/1132111 (https://phabricator.wikimedia.org/T387019) [16:56:00] o/ re: Gateway 5xx errors lw_inference_reference_need_cluster : I noticed some memory spikes in the pods so for now I suggest we increase the memory limits/requests until we 
further investigate this [16:56:08] https://grafana.wikimedia.org/goto/DczjGfTHR?orgId=1 [16:56:08] https://grafana.wikimedia.org/goto/Z5kZnfoNR?orgId=1 [16:57:44] (CR) Ilias Sarantopoulos: [C:+2] ml-services: increase ref-need memory limits/requests [deployment-charts] - https://gerrit.wikimedia.org/r/1132111 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos) [16:59:11] (Merged) jenkins-bot: ml-services: increase ref-need memory limits/requests [deployment-charts] - https://gerrit.wikimedia.org/r/1132111 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos) [17:00:24] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [18:23:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10689672 (phaultfinder) [19:11:57] ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848#10689707 (phaultfinder) [19:16:33] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:20:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10689715 (phaultfinder) [19:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) 
{#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:26:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:26:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:26:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:28:57] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:29:46] another big spike in swift frontend network usage [20:31:02] !incidents [20:31:03] 5903 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [20:31:03] 5904 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [20:31:04] 5900 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [20:31:04] 5899 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [20:31:04] 5901 (RESOLVED) 
[4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [20:31:04] 5902 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [20:31:19] !ack 5904 [20:31:19] 5904 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [20:31:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:31:25] !ack 5903 [20:31:26] 5903 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [20:32:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:32:26] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:36:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:36:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:37:13] !incidents [20:37:14] 5904 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [20:37:14] 5903 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [20:37:15] 5900 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [20:37:15] 5899 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [20:37:15] 5901 (RESOLVED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [20:37:15] 5902 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [20:52:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:12:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:13:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:13:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance 
elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:15:40] SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10689851 (Mike_Peel) "The MediaWiki error backend-fail-internal occured: A... [21:17:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:18:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:22:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [22:04:02] ops-esams, SRE, DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389874#10689970 (phaultfinder) [22:29:59]
ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848#10689979 (phaultfinder) [22:34:50] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10689981 (phaultfinder) [23:04:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:16:33] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown