[10:32:10] <_joe_> fabfur: the change is deploying, we should get an alert soon-ish
[10:32:22] ideal time for a coffee
[10:47:25] <_joe_> fabfur: still no alert, uhm I might have done something wrong
[10:50:54] the alert is there, though not firing, I checked here
[10:50:55] https://prometheus-eqiad.wikimedia.org/k8s/alerts?search=gatewaybackend
[10:51:17] wikifeeds is 4/s currently
[10:51:32] _joe_: ^
[10:52:03] to check I clicked the expression and then removed the > 5 to see all results
[10:54:56] TIL about the alerts UI. Looks really useful. How long has this been available?
[10:55:11] <_joe_> godog: yeah I was in another convo so I couldn't check
[10:55:18] <_joe_> it seemed higher when I merged the change
[10:55:42] btullis: I can't remember offhand, maybe a couple of quarters
[10:57:01] godog: Thanks. I must have missed a meeting or two :-) Will bookmark it and will probably have some questions in due course.
[10:57:22] to be clear, that's the prometheus UI, from which you can see alerts too, and it is linked from alerts.w.o from the date dropdown in "alert source links"
[10:57:24] <_joe_> godog: can I silence it from the UI?
[10:57:31] btullis: sure!
[10:57:37] <_joe_> even if it's not firing?
[10:57:49] <_joe_> I'm not finding the obvious way to make it appear in karma
[10:59:10] _joe_: yes you can silence at any time from karma, top right bell icon and you can input the labels you need, e.g. alertname GatewayBackendErrorsHigh
[10:59:23] it is probably going to fire soon btw, I see it is now pending
[10:59:43] <_joe_> godog: yeah I did that and nothing was found
[11:00:31] indeed even if no alerts are matched you can still create the silence ahead of time
[11:00:47] this isn't icinga :P
[11:03:46] <_joe_> godog: yeah but clearly I'm failing at it, it's 100% me
[11:07:37] godog: I've tried but seems no alerts matches
[11:07:53] (alertname -> GatewayBackendERrorsHigh)
[11:13:36] fabfur: yes, even if no alerts are matched you can still create the silence ahead of time
[11:44:02] <_joe_> !incidents
[11:44:02] 4538 (UNACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[11:44:07] <_joe_> !ack 4538
[11:44:07] 4538 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[12:01:10] <_joe_> !incidents
[12:01:10] 4538 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[13:47:40] <_joe_> oncall people: I'm trying to silence the alert
[13:47:44] <_joe_> on karma
[13:50:53] the wikifeeds one?
[13:56:33] <_joe_> yeah
[13:56:39] <_joe_> but I can't use that UI
[13:58:41] you can use the command line too via amtool on alert hosts, there's an example here https://wikitech.wikimedia.org/wiki/Alertmanager#Add_a_silence_via_CLI
[14:23:34] Got a page.
[14:23:36] !incidents
[14:23:36] 4540 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[14:23:36] 4539 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[14:23:37] 4538 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
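Creating a silence ahead of time, as discussed above, is what karma's bell icon and `amtool silence add` both do under the hood: a single POST to the Alertmanager v2 API. A minimal Node.js sketch, where the Alertmanager URL and author are placeholder assumptions rather than the actual WMF endpoint:

    // Sketch only: pre-create a 4h silence for GatewayBackendErrorsHigh before it
    // fires. ALERTMANAGER_URL and createdBy are illustrative placeholders.
    // Requires Node.js 18+ (global fetch), run as an ES module.
    const ALERTMANAGER_URL = 'http://alertmanager.example.org:9093';

    const res = await fetch(`${ALERTMANAGER_URL}/api/v2/silences`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        matchers: [{ name: 'alertname', value: 'GatewayBackendErrorsHigh', isRegex: false }],
        startsAt: new Date().toISOString(),
        endsAt: new Date(Date.now() + 4 * 3600 * 1000).toISOString(), // silence for 4 hours
        createdBy: 'example-user',
        comment: 'Expected errors while the wikifeeds change rolls out',
      }),
    });
    console.log(await res.json()); // e.g. { silenceID: '...' }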
[14:23:47] What's the context for wikifeeds?
[14:24:14] for the alerts, I mean
[14:26:35] <_joe_> jynus: T360597
[14:26:35] T360597: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597
[14:26:58] !incidents
[14:26:58] 4540 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[14:26:58] 4539 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[14:26:58] 4538 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[14:28:01] _joe_: thanks
[15:33:34] nemo-yiannis: Since we can correlate the when of the wikifeeds timeouts to the new restbase hosts, since the hosts in question are running different deployments (even if they should be normatively identical), and perhaps most importantly, the hosts in question are exhibiting different behavior... should we just depool those and see what happens?
[15:33:47] ok
[15:34:03] would it be a problem in the load on restbase?
[15:34:11] <_joe_> I assumed that already happened :)
[15:34:20] it's a good amount of our currently pooled nodes
[15:34:33] <_joe_> urandom: how many nodes are we talking about?
[15:34:41] 9
[15:34:56] restbase is still running on the decomm'd cassandra nodes though
[15:35:08] i.e. we are net +9 since adding these
[15:35:27] also... we don't need all these restbase nodes, we just do that to keep everything symmetrical
[15:35:44] granted, I don't know how many we *do* need, but it's probably not very many
[15:36:31] oh, actually.... the old cassandra hosts are depooled, but we can repool them
[15:37:15] if we depooled restbase10[34-42] now, we'd still have 7
[15:37:51] if we depooled restbase10[34-42] now, we'd still have 7 (which I suspect is plenty), but there are 9 others we can pool if anyone is concerned
[15:40:22] ftr, I'd rather we not pool any of the older nodes
[15:41:17] (in the interest of controlling the number of variables at play)
[15:42:18] can we enable puppet on the hosts just so we know we have a more consistent state?
[15:42:39] what hosts? puppet shouldn't be disabled on any of the restbase machines
[15:42:41] I doubt it's involved but it would be nice to have it eliminated
[15:42:52] oh damn
[15:42:57] * urandom just remembered
[15:42:59] it's disabled on all of the eqiad restbase hosts
[15:45:05] hnowlan: I did that in order to do a rolling restart (otherwise puppet can race and cancel a restart in progress), and forgot to reenable
[15:45:12] it's reenabled now
[15:45:28] we need to make that a part of the cookbook...
[15:52:08] Ok, I'm going to depool
[15:52:34] can we do a scap deploy to get everything onto the same version?
[15:53:38] hnowlan: would you rather try that first?
[15:55:50] Anyone have bright ideas on why swift's proxy-server might have got bored of accepting new connections? T360913 is what I've found, but anything else it'd be worth doing before restarting the proxy server? I presume this is the cause of the occasional rise in 5xxs we've seen with swift, but this time only 1 node failed, which meant I could keep it depooled in the broken state for a bit
[15:55:51] T360913: Swift server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[15:56:12] I dunno, what are we hoping a depool will achieve?
rb hosts other than the ones you're proposing have incorrect response behaviour
[15:57:28] hnowlan: I had understood it was the opposite
[15:57:46] yeah i think that the newly pooled nodes are serving the wrong output
[15:58:50] so by depooling we would learn if that's the issue or not
[16:05:16] based on https://phabricator.wikimedia.org/P58908 it's more widespread than just those nodes
[16:06:22] hnowlan: for eqiad, it's exactly that set, no?
[16:07:31] or 31-33 doesn't fit
[16:07:33] 1031,1032,1033 also serve the public url
[16:07:58] those are recent additions too though, even if not pooled as part of that batch
[16:09:18] I'm happy to see if a scap deploy will help, that does seem like *a* problem, even if that difference in commits should be a no-op. If the timeouts correlate with the pooling of 34-42 though, it's an easy enough test to depool them
[16:09:50] seems like the first thing we'd have done if we had noticed this on the 10th right after it started
[16:09:57] the diff between the scap changes is empty (the commit that was not deployed is noop)
[16:10:40] oh heh :(
[16:10:50] seems like the depool is the next sensible route
[16:11:39] the other thing i would try is to remove the cassandra rows from 1034 (that we already deployed with scap to latest commit) and see what is a fresh response to page summary
[16:12:54] but +1 on depooling
[16:14:49] *remove the cassandra rows from the example request that is failing
[16:17:50] 1034? restbase1034?
[16:18:25] yeah
[16:19:18] data is placed in cassandra using consistent hashing, a network-aware replication strategy, there is no affinity between a restbase returning a result, and where the data is in the database
[16:19:37] s/a network-aware/and a network-aware/g
[16:20:33] yeah but if after deleting the row (i understand the clustering part) a new request to 1034 was successful then it could be a hint
[16:20:47] hint as in that scap deploy did something
[16:21:11] either way depool is the most sensible thing to do atm
[16:28:35] nemo-yiannis: so to h.nowlan's point, 31-33 have been around, and as for deployments, they are in the same state as the ones I just depooled
[16:29:34] i.e. them being out of sync isn't related to any recent deployments, and that correlation (that the offending nodes are at a different sha), still holds
[16:30:44] to clarify there are (at least) two different things going on - the different responses, and the latency
[16:31:01] right
[16:31:02] the errors might have been going for longer
[16:31:07] seems likely
[16:31:27] but it also seems like there has been an increase in them since the 10th
[16:31:38] based on some unscientific logstashery
[16:33:36] Ok, so the errors dipped some and then seem to have leveled off. The timing could be coincidental I guess, but I wonder what happens if we depooled 31-33 (the ratio kind of works out, doesn't it?)
[16:34:11] I could pool a few of the ones depooled early to avoid creating a capacity crisis
[16:34:17] earlier
[16:34:41] Concretely: pool 19-21, depool 31-33
[16:34:49] sgtm
[16:36:42] (I'll space those out by a few minutes)
[16:37:20] 19-21 are confirmed to have the older sha, the one that correlates with "correct" responses
[16:45:59] oh that's interesting, the rate dropped again: coincidence, or did adding 19-21 increase the ratio of requests being answered by... err, "good" nodes?
[16:47:41] which metric are you following urandom ?
[16:48:03] https://grafana-rw.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&from=1711381672179&to=1711385272179&var-datasource=eqiad+prometheus%2Fk8s&var-instance=rest-gateway
[16:53:13] nemo-yiannis: here is the list of deployed shas
[16:53:17] https://www.irccloud.com/pastebin/rxZt7fNI/
[16:53:40] 3 different in total, I guess 34 is the canary?
[16:54:53] no i manually ran scap just to see if it makes a difference
[16:55:13] the problem is that if you do a git diff on those, the diffs are almost empty
[16:55:51] ok, so maybe the shas are different, but it also speaks to the last deployment, which also speaks to the last restart
[16:55:56] true
[16:57:38] what would that mean, that if we restarted the cluster we could make everything worse? :)
[16:58:13] btw, the error rate is lower than it has been the past week, and is still dropping, so that has to be us
[16:58:58] yeah i am following that, errors go down indeed
[17:00:00] .063/sec
[17:00:15] could we try to restart one just to see if behaviour changes? that'd at least be a clear indication
[17:01:50] I was thinking we could restart restbase1024, and then rerun the command used to generate that paste
[17:01:58] since it is not pooled,
[17:03:57] Latency also went down on the service level metrics
[17:04:02] yes, it did
[17:04:19] yay? :)
[17:04:55] https://grafana.wikimedia.org/goto/1wzpzqJSk?orgId=1
[17:05:31] Ok, let me bounce restbase1024
[17:07:31] Ok, it's response stayed the same (aka Good)
[17:08:06] let's trying restbase1031
[17:08:10] can somebody send a recap of which restbase nodes are currently pooled ?
[17:08:17] s/trying/try/
[17:08:23] i am a bit lost
[17:08:23] https://config-master.wikimedia.org/pybal/eqiad/restbase-https
[17:08:26] pl
[17:08:27] ok
[17:09:45] and restbase1031's response is still "Bad"
[17:10:02] so... it would seem to be the differing shas
[17:10:46] nemo-yiannis: can you force a deploy to *only* restbase1024?
[17:10:51] yes
[17:15:20] done
[17:18:41] weird data point - `curl -s -H "Cache-control: no-cache" restbase1032.eqiad.wmnet:7233/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg` consistently gives a 503
[17:18:55] 503 is better than 400
[17:18:57] restbase103[123] do the same
[17:19:00] let me see if I can find the error
[17:19:01] but they're the only ones
[17:19:12] so one way or another they're outliers
[17:20:04] and restbase1024 (still) seems OK
[17:20:19] nemo-yiannis: can you do restbase1031 now?
[17:21:28] based on hnowlan's happy news, I have a feeling it won't change anything, but...
[17:22:04] https://logstash.wikimedia.org/goto/df528d25babf5bcd849955e6a2c1ea6a
[17:23:23] urandom: done
[17:23:51] no it still returns the absolute url to commons.wikimedia.org
[17:24:03] 1031
[17:24:09] ugggh
[17:24:10] greeeat
[17:24:52] ok, do we know which cassandra nodes each restbase instance connects to ?
[17:25:02] yes, they all connect to all of them
[17:25:47] since all restbase instances have the same config (after the same scap deploy/restarts) it should be the persistency that gives misleading results
[17:26:02] so we must have some split brain situation here
[17:26:15] 1031 now responds with a 404 instead of a 503 fwiw
[17:26:28] wait hang on
[17:27:11] the failure path in those errors is for another service
[17:27:14] mobileapps is 6012
[17:27:52] oh I see, nm
[17:28:18] let me see what mobileapps returns internally
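A rough sketch of the kind of per-host spot check being run here (the actual comparison is in the paste mentioned above, not shown in the log); the host list is illustrative and the request mirrors the curl just quoted:

    // Hypothetical helper: request the same summary from several restbase hosts
    // and report the status, plus whether the body carries the absolute
    // commons.wikimedia.org URL ("Bad") or not ("Good"). Node.js 18+, ES module.
    const hosts = ['restbase1024', 'restbase1031', 'restbase1032', 'restbase1033'];
    const path = '/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg';

    for (const host of hosts) {
      try {
        const res = await fetch(`http://${host}.eqiad.wmnet:7233${path}`, {
          headers: { 'Cache-control': 'no-cache' },
        });
        const body = await res.text();
        const absolute = body.includes('commons.wikimedia.org');
        console.log(`${host}: HTTP ${res.status}${absolute ? ' (absolute commons URL)' : ''}`);
      } catch (err) {
        console.log(`${host}: request failed (${err.message})`);
      }
    }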
[17:28:43] what *should* we be seeing for a summary of an image anyway?
a json file
[17:31:09] So mobileapps internally fails with 404 for the path: fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg
[17:31:38] but works (as expected) for the redirected version: commons.wikimedia.org/v1/page/summary/File%3ACleopatra_poster.jpg
[17:32:12] the 404 is also expected because restbase handles all the title resolution
[17:33:03] can we enable debug level logs for one of the depooled ones ?
[17:33:30] debug logging for restbase?
[17:33:34] yes
[17:33:49] for a Good node, or a Bad one?
[17:33:54] a bad depooled node
[17:34:23] so we can see what it does (if it gives any useful logs)
[17:35:00] ok, done —I think— for restbase1031
[17:35:18] ok
[17:35:35] locally hacked, so don't expect that to stick
[17:39:01] can you change it to `level: trace`
[17:39:02] ?
[17:39:09] logstash doesn't show up anything interesting
[17:39:35] nemo-yiannis: done
[17:40:49] message not supplied, thank you restbase :/
[17:41:34] not useful
[17:46:44] nemo-yiannis: ok to turn the log level back down?
[17:46:50] yeah
[17:49:35] ok since we have stabilized production, i will try to see how we can reach that if-case that decides whether it gives you an absolute (external url) or a relative (internal)
[17:49:52] umm... yeah, about that
[17:50:55] wait...we've pooled some hosts
[17:51:01] did the deploy do that?
[17:51:23] we've deployed 1031 & 1024
[17:52:20] oh, i think scap repools after deployment
[17:52:30] forgot about that
[17:52:36] yes
[17:53:15] ok, I've re-depooled, I expect the error rate will drop back down
[18:13:25] so apparently the 404 is expected behaviour for "cache-control: no-cache" when we have a file like the one on the example request
[18:13:47] in order to get the internal path (aka relative) you need to be an internal request
[18:14:13] i am curious if the bad nodes have a different config such that the routing doesn't look like "internal"
[18:16:17] Here is the logic: https://gerrit.wikimedia.org/g/mediawiki/services/change-propagation/deploy/+/464d3bcc59e68764be3e3765e5511e2e3eff4559/node_modules/hyperswitch/lib/server.js#287
[18:18:39] For example a trace request from wikifeeds is tagged as external here: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.13?id=tbezdo4BWjhRzdxngVqZ
[18:19:24] The client ip is: `2620:0:861:101:10:64:0:165`
[18:21:42] which doesn't match the regex restbase uses to detect the local ip thus all the failures
[18:21:56] Are the new (bad) nodes having something different in their network setup ?
[18:28:52] This is an example of a request that worked properly (aka internal): https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.13?id=uVvVdo4B1Aouzw__lGAA
[18:29:48] the client ip matches the regex to detect for local ip
[18:40:46] hnowlan urandom ^
[18:41:26] maybe the new nodes bind nodejs server on a different interface ? but overall the root cause should be around that
[18:42:23] nemo-yiannis: so, is restbase matching only ipv4?
[18:43:10] topranks: are you around?
No, it matches this: `/^::1|(?:::ffff:)?(?:10|127)\./`
[18:43:46] check the 2 logstash entries i shared and specifically `x-client-ip`
[18:44:35] I did, yeah, the example of "working" is ipv4, the one that doesn't is ipv6, and...I'm trying to grok that pattern matching
[18:45:47] I'm not sure what `^::1` was expecting to match, but the rest of that seems to assume ipv4
[18:47:13] nemo-yiannis: I'd say you found the problem, and that this relates to a recent network change
[18:48:15] nemo-yiannis: we can seek clarity from networking, but I suspect the answer will be that restbase needs to match against internal ipv6 addresses as well
[18:49:12] I think that's an option, although the less we change in restbase codebase the better
[18:49:21] but we can definitely change the rules
[18:51:34] If you want to wait for an answer from topran.ks or XioNo.X , I think we have time for that
[18:56:59] i will be around
[18:57:46] nemo-yiannis: fwiw, the nodes don't have a different setup, but racks those nodes are in might
[18:58:42] ok
[19:33:17] urandom: just catching up now
[19:34:05] if there is a tl;dr let me know, checking the scrollback
[19:38:27] reading above - 2620:0:861:101:10:64:0:165 is on the regular private1-a-eqiad vlan (2620:0:861:101::/64)
[19:38:46] which has been around for many years
[19:41:08] It seems the regex above is matching "IPv4 mapped addresses" in an IPv6 address ?
[19:42:21] I'm not overly familiar with that, I would have expected our apps to use AAAA records and the IPv6 global unicast addressing if that was present, but I think some software might bind to the mapped address format
[19:44:20] I can see the restbase node process on an example node is bound to those IPs alright
[19:44:33] i.e.
[19:44:34] tcp ESTAB 0 0 [::ffff:10.64.0.165]:7233 [::ffff:10.64.1.17]:33192 users:(("node",pid=1220923,fd=128))
[19:44:53] which node was that?
[19:45:50] restbase1031 - randomly checking on that one as it was mentioned in the logs
[19:47:06] topranks: ok, yeah, that's a node in the set doing the Wrong Thing, and it's the same as from the set doing the Right Thing™
[19:47:18] The key difference I see here comparing the two log links above
[19:47:28] basically, we're trying to a) establish what changed, and b) what to do about it
[19:47:29] restbase1031 has a DNS entry for IPv6
[19:47:39] oh
[19:47:50] which points to 2620:0:861:101:10:64:0:165/64
[19:47:51] aha
[19:47:54] there we go
[19:48:08] whereas on the other one, restbase1028, it has an IPv6 address assigned, but no DNS entry
[19:48:17] yes, that's it
[19:48:25] so normal linux behaviour here is the lack of v6 DNS record makes it use IPv4
[19:48:31] right
[19:48:47] this is a reminder to me to improve our classification of nodes so that this is set based on their type at provision stage
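The mismatch is easy to reproduce in isolation. A minimal Node.js check of the pattern quoted above against the addresses visible in the log (the `::ffff:10.64.1.17` peer from the ss output, and the `2620:0:861:...` client IP from the logstash entry):

    // The "local client" pattern restbase/hyperswitch uses (quoted above) matches
    // ::1, plain 10.x / 127.x, and IPv4-mapped forms like ::ffff:10.x —
    // but not a global IPv6 address, which therefore gets treated as external.
    const isLocal = /^::1|(?:::ffff:)?(?:10|127)\./;

    console.log(isLocal.test('::ffff:10.64.1.17'));          // true  -> internal (IPv4-mapped)
    console.log(isLocal.test('10.64.0.165'));                 // true  -> internal (plain IPv4)
    console.log(isLocal.test('2620:0:861:101:10:64:0:165'));  // false -> external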
[19:49:04] Ok, so what changed is...
I guess, provisioning
[19:49:16] right now to avoid getting a v6 DNS record dc-ops need to remember (or it be requested in ticket) at provision stage, and tick the box
[19:49:28] but at least 3 of the affected nodes have been like this for a while, so we were probably getting some errors there, but it didn't rise to the level we noticed
[19:49:44] gimme a sec, I should be able to get a list of any restbase* nodes with v6 records and we can remove them if you think that makes sense
[19:50:12] Ok, either that, or we fix restbase to match internal ipv6, I don't know which makes more sense
[19:50:36] the latter future-proofs us, the former may be easier
[19:51:11] https://phabricator.wikimedia.org/P58915
[19:51:17] allegedly restbase is going away, but we've been alleging that for a very long time
[19:51:22] hmm ok
[19:51:40] our internal ranges shouldn't be too hard to match if your only criterion is whether it is within WMF space
[19:51:50] topranks: yep, I could have calculated that set another way! :)
[19:52:05] nemo-yiannis: ^^
[19:53:22] topranks: what's involved in ditching those dns records?
[19:53:40] just need to delete the "dns name" field for those IPs in netbox and run the dns cookbook
[19:53:56] I can get that done fairly quick if you want
[19:54:10] topranks: sure, if you don't mind
[19:54:14] ok
[19:54:22] happy to sort it out myself too
[19:54:37] fwiw the regex for our US ranges would be something like '2620:0000:086[0-3]+.*'
[19:55:17] although that may be wrong as my regex is terrible forgive me
[20:09:01] Ok all those records are gone and I've cleared the cache on all our dns recursors for them too
[20:09:46] sweet, I'll repool some of those hosts and see what happens
[20:09:52] topranks: thank you btw!
[20:09:57] ok cool
[20:09:58] np!
[20:10:14] ping me if it's still acting weird I can have another look, probably that's it though
[20:10:36] I think that will do it, yeah; It makes sense
[21:02:33] Hello everyone, in order to reduce email fatigue I created T360902 to coordinate, track, and document progress on ensuring our autogenerated emails are compliant with RFC 3834 to reduce automated responses to automatically generated emails, notably the Gmail vacation responder's unnecessary replies to automated emails.
[21:02:33] T360902: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902
[21:03:24] I'd greatly appreciate contributions both to the inventory of autogenerated emails that do not include the proper headers as well as code contributions. Please take a look at it if you can. Thanks. :)
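On the sending side, RFC 3834 compliance mostly comes down to one header. A sketch of what a compliant automated notification could look like, using nodemailer and a localhost relay purely as illustrative assumptions rather than what any particular WMF service actually uses:

    // Sketch: mark an automated notification per RFC 3834 so auto-responders
    // (e.g. Gmail's vacation responder) know not to reply to it.
    // nodemailer and the localhost relay are assumptions for illustration only.
    import nodemailer from 'nodemailer';

    const transport = nodemailer.createTransport({ host: 'localhost', port: 25 });

    await transport.sendMail({
      from: 'alerts@example.org',
      to: 'oncall@example.org',
      subject: '[FIRING] GatewayBackendErrorsHigh',
      text: 'Automated alert notification.',
      headers: {
        'Auto-Submitted': 'auto-generated', // RFC 3834: machine-generated, not a reply
        'Precedence': 'bulk',               // non-standard, but commonly checked by auto-responders
      },
    });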