[09:24:20] something seems to be degraded with kafka-main since yesterday 20:23/21:52
[09:25:01] https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw%20prometheus%2Fops&var-lag_datasource=000000026&var-mirror_name=main-eqiad_to_main-codfw&orgId=1&from=now-24h&to=now&timezone=utc&refresh=5m
[09:25:39] https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw%20prometheus%2Fops&var-lag_datasource=000000026&var-mirror_name=main-eqiad_to_main-codfw&orgId=1&from=now-7d&to=now&timezone=utc&refresh=5m&viewPanel=panel-5
[09:28:35] Looks to me like cache invalidations, but I don't know too much about the topic
[09:44:37] things got better, let's see if latency stays low
[09:45:09] s/latency/lag/
[09:46:39] yeah, there was a significant bump overnight (unusual for that job); the partitioner went a bit wild. Looks like it's on the way down now
[09:57:20] GitLab needs a short maintenance break in one hour
[10:14:29] <_joe_> latency in mirrormaker is not operationally critical
[10:14:56] <_joe_> if that results in lag in purged, then that's an issue
[10:15:46] I think it was a spike, not systemic, so we are good
[10:17:46] https://grafana.wikimedia.org/d/be841652-cc1d-47d3-827f-065768a111dc/purged?orgId=1&from=now-7d&to=now&timezone=utc&var-site=$__all&var-cluster=$__all&viewPanel=panel-6-clone-0%2Fgrid-item-1%2Fpanel-1-clone-0
[11:07:04] GitLab maintenance finished
[11:07:29] thanks
[17:43:06] There is some weirdness with eventgate-analytics on wikikube. It can't reach the eventstreams API through envoy, so it's crashlooping. Have there been any related changes to k8s and/or networking today?
[17:47:37] Oh, it's in staging. eqiad and codfw are still ok. Phew.
[17:47:43] https://www.irccloud.com/pastebin/1wjnDTUF/
[17:47:51] was just about to ask :)
[17:48:12] https://sal.toolforge.org/log/hxsM-ZYBffdvpiTrUfmU ?
[17:48:37] btullis: eqiad and codfw deployments have not been restarted though
[17:49:22] gmodena: Was there a diff shown when you deployed?
[17:49:26] the staging one is failing to access MW APIs (EventStreamConfig) at startup
[17:49:31] btullis: no
[17:49:59] it was a restart I had to do to fetch new stream configs that were deployed earlier today
[17:50:37] the service fails with a "no healthy upstream" error
[17:50:40] Failed fetching stream configs from http://127.0.0.1:6500/w/api.php?format=json&action=streamconfigs&constraints=destination_event_service=eventgate-analytics on try number 2 out of 3
[17:50:41] Ack.
[17:52:54] it queries the endpoint only at startup btw
[17:53:18] so eqiad and codfw being up does not exclude issues
[17:53:39] Let's not restart them right now, though.
[17:53:53] Unless you want to do a canary?
[17:58:13] btullis: gmodena: I'm seeing calico-related alerts in staging, so networking might not be working as expected there
[17:58:22] swfrench-wmf: ack
[17:58:47] we could do a canary on the non-active DC?
[17:59:53] Are there any calico-related alerts affecting dse-k8s as well? I have another incident that appears network-related, but this is on dse-k8s-eqiad. T395057
[17:59:54] T395057: datahub.wikimedia.org is down - https://phabricator.wikimedia.org/T395057
[18:03:41] btullis: I'm not seeing anything there, no - just (wikikube) staging-eqiad
[18:03:57] OK, thanks.
[18:05:05] eventgate-analytics-external fetches configs (same endpoint) every 60s, and it looks like it's working fine
[18:07:14] swfrench-wmf: just to be sure, you are not seeing any calico issue on wikikube eqiad and codfw? If ok, I'll try to roll-restart the codfw deployment
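
(Note: the startup fetch discussed above can be approximated with a short script, for anyone reproducing the failure by hand. This is only a sketch: the URL, the local envoy port 6500, and the three-attempt retry come from the log messages above; the helper name fetch_stream_configs is made up here and is not part of eventgate's actual code.)

# Sketch of the stream config fetch eventgate-analytics performs at startup,
# going through the local envoy listener on 127.0.0.1:6500 as in the log above.
import time
import requests

STREAM_CONFIG_URL = (
    "http://127.0.0.1:6500/w/api.php"
    "?format=json&action=streamconfigs"
    "&constraints=destination_event_service=eventgate-analytics"
)

def fetch_stream_configs(retries: int = 3, backoff_s: float = 2.0) -> dict:
    """Fetch stream configs, retrying on errors such as 'no healthy upstream'."""
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(STREAM_CONFIG_URL, timeout=5)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_exc = exc
            print(f"Failed fetching stream configs on try {attempt} of {retries}: {exc}")
            time.sleep(backoff_s)
    raise RuntimeError("stream config fetch failed") from last_exc
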
[18:08:25] gmodena: correct, yeah - the production 'eqiad' and 'codfw' clusters are fine (this would be pretty bad if it were happening there).
[18:09:32] alright
[18:09:56] eventgate-analytics on codfw has restarted
[18:12:07] eventgate-analytics on eqiad has restarted
[18:12:28] staging being down is no issue, so no worries. I'll restart it once the networking issues are resolved
[18:12:57] swfrench-wmf: for my own education, is there a dashboard I could track for calico?
[18:14:38] the component that's not functioning appears to be the controller. there are some details in this section of the runbook: https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers
[18:17:47] swfrench-wmf: thanks for the pointer
[18:49:50] following up: to summarize, there was a change applied around 15:30 UTC in staging-eqiad, which seems to have taken down the calico controller. as a result, you may see networking issues for newly created containers there, for example.
[18:49:50] unfortunately, it's unclear what the change was, and why it did not revert cleanly when it failed, so we may need to wait for folks to be online again on Friday in Europe.
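
(Note: a rough way to check the kube-controllers state that the runbook section linked above covers manually. This is a sketch using the official Kubernetes Python client; the kube-system namespace and the calico-kube-controllers deployment name are upstream Calico defaults and are assumptions here, since the Wikimedia clusters may name or locate the component differently.)

# Sketch: report whether the calico kube-controllers deployment has all of its
# desired replicas ready in the cluster the current kubeconfig context points at.
from kubernetes import client, config

def calico_controllers_ready(namespace: str = "kube-system",
                             name: str = "calico-kube-controllers") -> bool:
    """Return True if the kube-controllers deployment has its desired replicas ready."""
    config.load_kube_config()  # use config.load_incluster_config() when run inside a pod
    apps = client.AppsV1Api()
    status = apps.read_namespaced_deployment_status(name, namespace).status
    return (status.ready_replicas or 0) >= (status.replicas or 1)

if __name__ == "__main__":
    print("calico kube-controllers ready:", calico_controllers_ready())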