[08:32:49] good morning :)
[08:33:02] Going to restart the kartotherian work with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121309 (more memory for kartotherian pods)
[08:41:57] going to pool only one server for each DC and observe metrics
[09:01:12] so far no 50x and the metrics look decent
[09:01:19] https://grafana.wikimedia.org/d/d821ac19-02c5-49ac-bf18-58d2e27fdf19/kartotherian?orgId=1&var-dc=thanos&var-site=eqiad&var-service=kartotherian&var-prometheus=k8s&var-container_name=All
[09:01:53] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121315 to tune a bit cpu requests/limits
[09:41:44] Any idea why https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@1be52a8&_a=h@0eb7d9a (Istio Ingress GW logs) shows me requests/s and UA breakdown, but no actual events?
[09:43:18] virt.cloudgw.eqiad1.wikimediacloud.org has stopped resolving and it is making puppet on alert hosts fail, what is replacing it? cc arturo
[09:43:44] godog: mmm that's unexpected
[09:44:46] lol interesting
[09:45:32] godog: I just created a ticket T386907
[09:45:32] T386907: FQDN virt.cloudgw.eqiad1.wikimediacloud.org is missing - https://phabricator.wikimedia.org/T386907
[09:46:52] arturo: ok thank you, do you mind if I comment out the check temporarily in puppet? or is it sth you are looking at now?
[09:47:14] arturo: apparently removed in https://netbox.wikimedia.org/extras/changelog/210721/, should be easily fixable by adding it back to netbox
[09:47:16] I'm looking now, but also you should feel free to do the puppet thing
[09:47:47] taavi: yeah, that was my theory, that somehow the reimage script resulted in the VIP being removed
[09:48:08] arturo: ok will do
[09:48:10] klausman: somebody messed up with the visualizations sigh
[09:48:27] Anything I can do to help fix it?
[09:50:15] klausman: I don't know exactly how to fix them, looking into it, but it seems to me that the broken vis don't get the same traffic as the working ones
[09:50:21] so they don't filter anything
[09:50:43] I see "ChartRendererErrors" that is a bit weird
[09:51:47] in the working ones I see "logstash-*" instead, that makes more sense
[10:24:03] on-callers: I pooled two kartotherian wikikube workers for maps eqiad, all good so far (their weight is 5, compared to the 10 for bare metals)
[10:24:25] godog: the FQDN is back online, please revert the puppet patch with a reference to T386907
[10:24:25] T386907: FQDN virt.cloudgw.eqiad1.wikimediacloud.org is missing - https://phabricator.wikimedia.org/T386907
[10:24:38] I need to go afk (pick up my kid from kindergarten), if anything happens it should be sufficient to depool them from kartotherian.discovery.wmnet
[10:24:41] and/or call me :)
[10:24:54] arturo: sweet! will do
[10:25:51] thanks!
[11:27:49] hmm it looks like socket(7) is deprecated in bookworm.. it fails to mention that SO_MARK can be used with CAP_NET_RAW capability since kernel 5.17 as https://man7.org/linux/man-pages/man7/socket.7.html does
[12:12:41] volans: I sent T386915 your way
[12:12:41] T386915: cookbook: decomission workflow may remove VIPs from netbox - https://phabricator.wikimedia.org/T386915
[12:14:22] arturo: ack, could you please add which run of the decom cookbook was it? hostname and from where you run it
[12:14:44] volans: sure
[15:24:51] godog OK if I puppet-merge your "Add fundraising-analytic" change?
[15:31:20] effie: did you set up the wikitech->wikitech-static syncing after moving wikitech to the main wiki cluster? Or did someone else do that?
[15:31:30] inflatador: gah, yes! sorry I forgot
[15:32:07] effie: I ask because the syncing is broken, wondering if I should make you a task to investigate.
[15:32:29] godog np, merging now
[15:36:56] <_joe_> andrewbogott: I assumed wikitech-static wasn't included in the package with wikitech itself in terms of long-term support from serviceops
[15:37:07] <_joe_> as it notably doesn't run on k8s
[15:37:32] <_joe_> So while I assume effie or someone else would be able to help investigate, ownership of the sync hasn't changed.
[15:37:50] The sync includes a dump on the wikitech-static side which must've changed in the migration
[15:37:54] but I don't know how
[15:38:12] (syncing worked properly for quite a while after the migration, it's only just now broken)
[15:38:45] <_joe_> so in general opening a task is a good first step, and then I think we'll gladly help you figure out what's broken
[15:38:54] ok
[15:39:49] andrewbogott: we sorted it at the time making the absolute minimum changes from the wikitech-static side as we do not own the server, as joe noted
[15:40:10] <_joe_> it's also possible that what has changed is the dump you're consuming, which I guess DSE might help with
[15:42:21] andrewbogott: I am off till monday, I may take a look, however, no promises
[15:42:37] looks like the latest dump is still present so the issue may be on the host itself
[15:42:54] I'll dig a bit and then make you a bug if it's not an obvious fix
[15:51:16] andrewbogott: as serviceops owns neither -static nor dumps, best we can offer is to have such task on our radar
[16:15:24] on-callers: o/ I've summarized the current status of maps.wikimedia.org in https://phabricator.wikimedia.org/T386926
[16:16:00] atm we have two k8s workers pooled in, at half weight, and the metrics are good (was three but we were getting a bit closer to the limits)
[16:16:20] I'll not move anything forward until we get more pods/capacity
[16:18:41] I added a rollback step to the task in case of fire
[16:21:10] o/ - anyone around who could rotate some accidentally leaked phabricator bot credentials for me? T386949
[16:24:40] tarrow: wrong channel (-releng would be the right one), but done
[16:25:08] taavi: Thanks! I was kinda wondering where to go
[20:16:33] This might be a dumb question, but is anyone using envoyproxy config to restrict HTTP methods in non-k8s prod? We have a couple of services still using nginx for minor stuff like that and I was wondering if I could get rid of it
[22:33:20] inflatador: I'm not aware of anyone currently doing it, but you certainly can -- the only trick is when you're setting up your HTTP routing, Envoy treats the method as if it were an HTTP header with the magic name `:method`
[22:36:21] so in your VirtualHost you'd have e.g. this (all the other fields missing, obvs) https://www.irccloud.com/pastebin/agQpaYWf/
[22:37:11] rzl ah, thanks for the context! I haven't used envoy much, but its config seems similar to traefik which I've used in the past
[22:38:35] I haven't tried traefik but I believe you :)
[22:40:17] (updated that pastebin to fix a typo, s/exact_match/exact/)
[22:54:33] thanks again, I popped T386983 for future discussion
[22:54:34] T386983: Explore migrating wdqs' nginx config into envoy - https://phabricator.wikimedia.org/T386983
[23:12:48] 👍
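
Note on the Envoy suggestion at 22:33:20: the pastebin linked at 22:36:21 is not reproduced in this log, so its exact contents are unknown. As a rough illustration of the approach described (matching on the `:method` pseudo-header inside a route), a minimal sketch could look like the fragment below. The virtual host name, domain list, cluster name, the choice of GET as the only allowed method, and the 405 fallback route are all assumptions made for illustration, not taken from the paste. `string_match.exact` is used here on the assumption that it is what the `s/exact_match/exact/` correction refers to.

```yaml
# Hypothetical fragment of an Envoy route configuration (v3 API).
# All names below are placeholders, not values from the original pastebin.
virtual_hosts:
  - name: example_service            # assumed name
    domains: ["*"]
    routes:
      # Route taken only when the request method is GET.
      - match:
          prefix: "/"
          headers:
            - name: ":method"        # Envoy exposes the HTTP method as this pseudo-header
              string_match:
                exact: "GET"
        route:
          cluster: example_backend   # assumed cluster name
      # Catch-all for every other method: answer directly with 405.
      - match:
          prefix: "/"
        direct_response:
          status: 405
          body:
            inline_string: "Method Not Allowed\n"
```

Envoy checks routes in order and uses the first one that matches, so a GET request is forwarded to the cluster while any other method falls through to the 405 response.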