[10:48:41] o/, FYI, my Phab username is now @Raine so there's... less? more? a different amount? of confusion when tagging me :D (thanks t.aavi for changing it <3)
[12:45:11] who is taking care of wdqs-internal-main and wdqs-internal-scholarly?
[12:45:16] we got some pybal related alerts
[12:45:35] it looks like it went through some migration between port 80 and port 443?
[12:47:34] ryankemper ?
[12:52:26] ryankemper, inflatador do you need help with this?
[13:18:24] vgutierrez: do you have a link to an alert?
[13:18:52] vgutierrez I'm here
[13:19:11] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DPyBal%20IPVS%20diff%20check
[13:19:35] from what I'm seeing service got migrated to port 443 but port 80 config is still on ipvs
[13:19:40] that needs to be purged manually
[13:23:20] vgutierrez ACK, we got rid of the plaintext config a few weeks back (still looking for ticket). Apologies for the fallout there. Can I help with the IPVS stuff or is that y'all's team?
[13:25:03] Ticket here: T193473
[13:25:04] T193473: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473
[14:07:17] so basically we need to follow point 6 of https://wikitech.wikimedia.org/wiki/LVS#Remove_the_service_from_the_load-balancers_and_the_backend_servers
[14:07:23] I can take care of it now that I got the context :)
[14:14:02] inflatador, ryankemper are you sure it's ok to drop port 80 in codfw?
[14:14:10] https://www.irccloud.com/pastebin/v5zIEciE/
[14:14:20] port 80 has some traffic
[14:14:49] that's wdqs-internal-main
[14:15:39] The client for that **should** be Kartotherian (mediawiki extension), which might not be configured to use HTTPS yet
[14:16:36] gehel: so it's ok to make Kartotherian fail?
[14:17:33] if not, you need to re-add port 80 support for wdqs-internal services
[14:17:40] Not really my call, but probably not :(
[14:18:09] inflatador: ^ can you take that on?
[14:19:38] I've just re-opened T193473
[14:19:39] T193473: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473
[14:21:08] I'd suggest doing it ASAP, we are one pybal restart or lvs reboot from having issues
[14:26:04] * gehel is checking with DPE-SRE to see who can help.
[14:33:51] I will have a look as well, in case inflatador doesn't get there before me.
[14:34:34] vgutierrez I'm looking at it now. CC elukey as he knows a bit more about Kartotherian
[14:35:18] I'm going to roll back unless anyone objects
[14:36:39] inflatador: I'd suggest re-adding port 80 but not dropping port 443, given that port 443 has traffic as well
[14:36:59] vgutierrez ACK, will move fwd with that plan
[14:37:52] OK, I'll let inflatador handle this, but I'm here if I can help with anything.
[14:43:18] vgutierrez Is it possible for us to add back port 80 without removing port 443 support? I could be missing something but I thought we'd need 2 LVS pools for that
[14:43:36] yes, you need 2 LVS pools
[14:44:09] Hmmm, that's not gonna work
[14:44:15] one for 10.2.[12].93:80 and another for 10.2.[12].93:443
[14:44:17] why?
[14:50:25] Doesn't that mean we need to start from https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service ?
[14:57:07] vgutierrez I'm up in https://meet.google.com/iqa-cyir-njo if you wanna talk about it further. I'm still not clear on the plan and I don't want to stand up new LVS endpoints if it's not absolutely necessary. Kartotherian uses envoy proxy ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kartotherian/values.yaml#42 )
[14:57:26] sorry I just finished a meeting
[14:57:42] inflatador: you're right
[14:57:49] you would be basically creating a new LVS service for port 80
[14:58:06] Looks like the envoy config for kartotherian is correct, maybe just a restart of the pods would do the trick?
[14:58:20] correct as in already pointing to port 443?
[14:58:21] Do we have a way to track where this port 80 traffic comes from? And either fix the consumer configuration or realize that we don't care about it?
[15:00:51] vgutierrez I mean envoy abstracts the actual listening port away from the pods. So I think kartotherian would need a redeploy to get the new envoy config.
[15:02:12] that's definitely a question for elukey but today he is on sick leave
[15:02:22] I don't want to break anyone else's service, but I don't know that it's wise to stand up a new LVS pool for a service with a very lax SLA (wdqs-internal-main/scholarly) when a new deploy would probably do the trick
[15:04:30] gehel I believe Valentín's already ID'd the traffic as coming from Kartotherian. The other wdqs-internal consumers are mediawiki extensions, so they get redeployed all the time.
[15:04:42] I haven't
[15:04:53] but a quick check with tcpdump on wdqs2018 shows `10-194-152-150.kartotherian-main-tls-service.kartotherian.svc.cluster.local.`
[15:06:21] hnowlan: might know more about kartotherian. Not sure who owns it these days
[15:06:35] cdanis: could you or somebody else from your team help us redeploy kartotherian given that elu.key is OoO?
[15:07:07] Moritz OOO also 😔
[15:07:33] is it specifically only codfw kartotherian?
[15:07:49] ah and the one on k8s
[15:08:15] So the wdqs-internal-main service is registered for port 6041, ref https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/services_proxy/envoy.yaml#226
[15:08:51] Code search for the envoy URL turns up the following: https://codesearch.wmcloud.org/search/?q=http%3A%2F%2Flocalhost%3A6041&files=&excludeFiles=&repos=
[15:09:41] so I think our internal consumers are MW and kartotherian
[15:11:17] we know that MW already had a problem when this was rolled out originally, ref https://etherpad.wikimedia.org/p/wdqs-internal-plaintext
[15:11:30] * hnowlan reading
[15:13:15] If I do a `helmfile diff` against kartotherian in staging, I can see that it wants to update the envoy config: https://phabricator.wikimedia.org/P84130
[15:14:36] there are a few unapplied other changes for it
[15:15:08] might just be envoy version bumps though
[15:15:48] yeah looks like it
[15:16:16] I can apply in staging and see if we can render tiles I spose
[15:16:50] brett: fyi I'm back from OOO and P&T offsite, will take today to make some progress on VCL cleanup and redirects, probably won't finish anything new today. I'll try to have the redirect patch ready by tomorrow. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194558 is ready for review though
[15:16:53] objections?
[15:17:07] hnowlan none from me, thanks for jumping on this
[15:17:09] hnowlan: +1 from me
[15:22:48] tiles look okay to me. but eh... my memory is foggy, what does kartotherian use wdqs for again? :D
[15:23:17] Show SPARQL results on a map
[15:25:59] so regular tile requests aren't going to hit that code path?
[15:26:51] hnowlan: no, probably not. I don't quite remember all the details about Kartographer, it's been a while!
[15:27:22] Example: https://www.mediawiki.org/wiki/Help:Extension:Kartographer#Mixed_types_with_SPARQL_query
[15:28:23] oh nice I can probably work with this
[15:28:43] inflatador, hnowlan: those envoy bumps are fine to roll out, I've deployed them for everything else but didn't want to deploy the wdqs routing change without another pair of eyes
[15:29:34] ack, thanks rzl
[15:30:45] you'll see some related diffs for internal_address_config which are likewise safe and deployed to ~all other services
[15:33:31] yeah wdqs queries are timing out with the new config. looking at why
[15:33:43] hnowlan I'm around if I can help
[15:40:25] lmao what
[15:40:30] second request worked
[15:40:47] weird, clearly timed out the first time
[15:43:44] okay yeah, I think we're good
[15:44:13] I can roll ahead with prod and keep an eye on error rates etc
[15:47:00] thx hnowlan <3
[15:47:22] (。♥‿♥。)
[15:54:19] my pleasure
[15:54:32] touching kartotherian again is like seeing a distant friend who owes you money
[15:54:41] (although it's in such better condition now)
[15:54:52] I think we're good
[15:55:49] Cool, I'm going to a parent/teacher conference in ~10m unless anyone needs me
[15:56:25] hnowlan: permission to bash the perfect kartotherian description? :D
[15:56:29] thanks very much hnowlan
[15:56:39] Raine: go for it :D
[15:56:58] !bash touching kartotherian again is like seeing a distant friend who owes you money
[15:56:59] Raine: Stored quip at https://bash.toolforge.org/quip/QVFWApoBffdvpiTrjnw4
[15:57:04] np np. looks like there is some weirdness with setup of first connection from kartotherian->wdqs but it's only the first https://grafana.wikimedia.org/goto/bMNBjZRvR?orgId=1
[16:07:07] it looks like port 80 has been drained
[16:07:19] TCP 10.2.1.93:80 mh (mh-port)
[16:07:19] -> 10.192.0.85:80 Tunnel 10 0 0
[16:07:19] -> 10.192.32.155:80 Tunnel 10 0 0
[16:07:19] -> 10.192.32.156:80 Tunnel 10 0 0
[16:41:18] Krinkle: Thanks for the patch! Should I unmark the WIP status?
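Note on the IPVS listing pasted at 16:07:19: the three trailing numbers on each `->` row read as Weight, ActiveConn and InActConn if the paste follows the usual `ipvsadm -L -n` column layout (an assumption, since the header row was not pasted). Under that assumption, a minimal Python sketch of the "drained" check could look like the following; the function names are hypothetical and not part of any existing tooling.

```python
import re

# Listing as pasted at 16:07:19. Column meaning is ASSUMED to follow
# `ipvsadm -L -n`: -> RemoteAddress:Port Forward Weight ActiveConn InActConn
SAMPLE = """\
TCP 10.2.1.93:80 mh (mh-port)
-> 10.192.0.85:80 Tunnel 10 0 0
-> 10.192.32.155:80 Tunnel 10 0 0
-> 10.192.32.156:80 Tunnel 10 0 0
"""

# One real-server row: address, forwarding method, weight, active, inactive.
REALSERVER = re.compile(
    r"^\s*->\s+(?P<addr>\S+)\s+(?P<forward>\S+)\s+(?P<weight>\d+)"
    r"\s+(?P<active>\d+)\s+(?P<inactive>\d+)\s*$"
)


def connection_counts(listing):
    """Map each real server to its (active, inactive) connection counts."""
    counts = {}
    for line in listing.splitlines():
        match = REALSERVER.match(line)
        if match:
            counts[match["addr"]] = (int(match["active"]), int(match["inactive"]))
    return counts


def is_drained(listing):
    """True only if at least one real server is listed and all counts are zero."""
    counts = connection_counts(listing)
    return bool(counts) and all(a == 0 and i == 0 for a, i in counts.values())


if __name__ == "__main__":
    print(connection_counts(SAMPLE))
    print(is_drained(SAMPLE))
```

An empty listing is treated as "not drained" on purpose, so a failed or truncated paste does not read as a green light.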
[16:48:01] brett: sure, feel free. I would do so after tests pass and beta works correctly. I'll do that after my next meeting.
[16:48:15] it might pass already, haven't tested yet
[17:21:14] we don't have any kind of ready-made tool that would create a pcap on x number of machines and upload the data somewhere, do we? ebernhardson is looking at some weirdness in the cirrussearch infra and we'd want to do a pcap of localhost on ~50x machines
[18:22:55] inflatador: cumin plus you can reuse its keyholder for scp
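Note on the 17:21:14 pcap question: a rough Python sketch of the per-host command that an orchestrator such as cumin could run is below. It only builds a time-bounded `tcpdump` command line and a predictable output path per host, so the files can be collected afterwards (for example with scp, as suggested at 18:22:55). The host names, output directory and 60-second default are hypothetical placeholders, and nothing here calls cumin itself.

```python
import shlex


def capture_command(host, seconds=60, interface="lo", out_dir="/tmp"):
    """Return (pcap_path, shell_command) for a bounded capture on one host.

    `timeout` (GNU coreutils) stops tcpdump after `seconds`; `-i` selects the
    interface and `-w` writes raw packets to the given file.
    """
    pcap_path = f"{out_dir}/{host}-{interface}.pcap"
    argv = ["timeout", str(seconds), "tcpdump", "-i", interface, "-w", pcap_path]
    return pcap_path, shlex.join(argv)


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    hosts = ["search-a.example", "search-b.example"]
    for host in hosts:
        path, command = capture_command(host, seconds=30)
        print(f"{host}: {command}  (collect {path} afterwards)")
```

Embedding the host name in the file name keeps ~50 captures from colliding once they are copied to one place. GNU `timeout` exits with status 124 when it has to stop the command, so whatever runs this remotely should not treat that status as a failed capture.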