[09:49:35] Services and traffic are being switched to eqiad today, right?
[11:15:14] marostegui: yes
[11:15:42] good thanks
[11:17:17] Sorry I didn't see your message earlier, happens at 1400 UTC
[11:18:47] no worries :)
[13:52:19] o/ everyone. Services+Traffic DC Switchover in 8m
[13:53:17] good luck
[13:54:13] good luck everyone!
[13:59:31] thank you all, the relevant tmux is march-dcso and you can tail the cookbook exec at /var/log/spicerack/sre/discovery/datacenter-extended.log
[14:02:01] good luck
[14:02:33] effie: which cumin?
[14:03:02] cumin1002
[14:03:36] ty
[14:06:24] (run sudo -i authdns-update on dns1004)
[14:26:05] <_joe_> uhm
[14:26:19] <_joe_> did we depool something from eqiad somehow?
[14:26:31] <_joe_> because the requests to mw-api-int dropped in eqiad
[14:26:45] was just wondering the same
[14:27:02] <_joe_> how can it be?
[14:27:11] <_joe_> restbase-async or something?
[14:27:12] wasn't there something that runs in the secondary?
[14:27:13] but it coincides with a jump of apcu entries somehow?
[14:27:50] <_joe_> the drop happened around 14:21-14:21:30
[14:27:58] <_joe_> look at what was depooled around that time
[14:28:02] <_joe_> maybe it's a hint
[14:28:04] <_joe_> maybe not
[14:29:28] eventstreams
[14:29:34] <_joe_> so not a hint
[14:29:57] eventgate just before
[14:30:00] <_joe_> probably unrelated
[14:30:38] <_joe_> now if grafana would load
[14:30:41] And then it's the ingress
[14:30:43] So idk
[14:31:11] <_joe_> no
[14:31:13] <_joe_> as I said
[14:31:18] <_joe_> probably not a hint
[14:31:27] <_joe_> now looking at grafana's envoy telemetry
[14:31:31] <_joe_> it's pretty clear it's flink
[14:31:42] <_joe_> so it should be back
[14:31:48] <_joe_> https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=All&var-kubernetes_namespace=All&var-destination=mw-api-int-async-ro
[14:32:00] it is back
[14:56:08] kartotherian is not very happy, I am pooling back codfw until we figure out what is up
[14:56:14] !incidents
[14:56:14] 4526 (UNACKED) [3x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet)
[14:56:19] !ack 4526
[14:56:19] 4526 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet)
[15:00:09] It is still very unhappy
[15:00:46] yeah, I think it's in this awful state where it is shooting itself in the foot
[15:00:55] cause geoshapes talks back to maps
[15:01:03] it's an arch mess, don't ask more
[15:01:25] * claime backs off from the lovecraftian servers
[15:06:14] _joe_ just catching up, what happened with flink?
[15:06:34] <_joe_> nothing, it just stopped updating for a bit
[15:06:44] <_joe_> which made us wonder what caused a traffic drop
[15:07:46] _joe_ ACK, we didn't get any alerts, so I guess all is well? Will keep checking dashboards
[15:08:03] <_joe_> yeah all well :)
[15:56:32] oncall handoff from EMEA is that the traffic and services switchover went fine; we had a minor hiccup with kartotherian and repooled codfw to add capacity to it. We don't have an incident doc (it was pretty short as an incident), but a description and graphs are available at: https://phabricator.wikimedia.org/T357547#9642861
[15:58:36] ty akosiaris
[15:58:59] moritzm: around? I don't want to press you, but I want to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012684 as it is blocking the DC ops
[16:00:43] already +1d a minute ago :-)
[16:00:52] thank you!
[16:01:18] I am assuming it fixed the bug you detected?
[16:01:30] it does, thanks
[16:01:48] sorry about that, sometimes I forget to move hosts there
[16:02:10] as having no recipe and doing a manual install end up almost the same for what I want
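
For reference, a minimal sketch of the kind of commands behind the operational steps mentioned in the log: tailing the cookbook output, checking where a discovery record currently points, and repooling codfw for kartotherian. The log path, authdns-update, dns1004, cumin1002 and kartotherian.discovery.wmnet all come from the log itself; the dns1004 FQDN and the confctl object type/selector are assumptions about the DNS and conftool setup, not something confirmed above.

    # Follow the switchover cookbook output (path quoted in the log above),
    # from the cumin host running the tmux session (cumin1002 in this case)
    tail -f /var/log/spicerack/sre/discovery/datacenter-extended.log

    # Check which datacenter a discovery record currently resolves to;
    # dns1004 is the authoritative host named above, the FQDN is an assumption
    dig +short kartotherian.discovery.wmnet @dns1004.wikimedia.org

    # Repool codfw for kartotherian, as done during the hiccup above;
    # the --object-type and selector are assumptions about the conftool schema
    sudo confctl --object-type discovery select 'dnsdisc=kartotherian,name=codfw' set/pooled=true

In practice the switchover cookbook drives the depools and repools itself; manual commands like these would only be for verification or for a targeted repool such as the kartotherian one described above.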