[06:52:53] hello folks :)
[07:35:08] kevinbazira: https://phabricator.wikimedia.org/T317531#8285643 >> I added my thought in the task :)
[07:40:55] I like --^
[07:41:02] good morning aiko :)
[07:45:35] morning Luca!
[08:19:17] ok so I am very confused
[08:19:33] in the current settings, to have meaningful http metrics from our istio sidecar pods, we do
[08:20:01] 1) set http://api-ro.discovery.wmnet (note http and not https) in our isvc settings
[08:20:35] 2) instruct the istio sidecar container to use TLS when api-ro.discovery.wmnet:80 is requested
[08:20:54] so we have http metrics in grafana about how many http calls we do from pods etc.. and we use TLS
[08:21:17] I verified again that we use TLS via tcpdump and nsenter, the traffic to api-ro is definitely encrypted
[08:21:20] BUT
[08:21:24] in the kserve-container logs I see
[08:21:25] https://github.com/wikimedia/mediawiki/blob/master/includes/api/ApiMain.php#L1865
[08:21:39] [W 221005 08:13:09 async_session:98] - main -- {'*': 'HTTP used when HTTPS was expected.\nSubscribe to the mediawiki-api-announce mailing list at ..
[08:22:12] I presume we do see the http connects to istio in the metrics?
[08:22:25] (also, Morning! :))
[08:22:40] morning :)
[08:23:05] what do you mean with http connects?
[08:23:45] in theory the sidecar does everything via iptables, I verified https in traffic going outside the pod (didn't check localhost)
[08:24:02] So since the original point was that we get (better) metrics about http connections out of istio, I presume that the metrics are actually showing that connections happen now
[08:25:08] https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar
[08:25:34] it shows up in metrics
[08:25:44] (I built a separate dashboard some days ago)
[08:25:47] (still wip)
[08:25:52] What I wonder is if there is some place that uses the old (https) URL and thus istio can't do anything about it
[08:27:46] from tcpdump I see no traffic towards port 80
[08:27:57] only tls traffic
[08:28:13] Is this what comes out of the sidecar, or out of the model's container?
[08:28:25] the sidecar
[08:28:34] Hurm
[08:29:52] Is it possible that the connection is downgraded somehow, outside of our reach?
[08:30:22] I mean, there's a whole bunch of components involved that we do not control
[08:31:09] If we could replicate the exact request(s) that trigger the error message and see if we can reproduce them with e.g. curl, this might be easier to diagnose
[08:33:46] I can try, we can use nsenter with curl and any mw api call in theory
[08:34:55] If we don't see any port-80 traffic, but still get the error, I see three possibilities:
[08:35:04] - The error is completely bogus
[08:35:20] - A https -> http downgrade happens outside of our control
[08:35:32] - We're doing http on port 443 and it "somehow" works
[08:35:48] my suspicion is that istio-proxy acts as forward proxy, tunneling http traffic
[08:36:12] the tunnel is on TLS, but envoy on the other side sees only a simple http reuqest
[08:36:15] *request
[08:36:25] it would explain what we see
[08:37:09] Then the question is how we either avoid that error (if this scheme is ok), or get istio to do a proper end-to-end TLS connection
[08:38:30] I think the latter is the only robust way to proceed, but not sure how istio does it
[08:42:06] Does the api-ro endpoint speak http2?
[08:42:50] If so, we could consider trying https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig-H2UpgradePolicy
[08:43:19] Though I am not sure that would DTRT
[08:44:14] I don't think so, plus we may want to stick with HTTP1.1
[08:44:27] going to try a few settings on the virtual service side
[08:53:41] elukey: I have spotted something that's at least a bit inconsistent between the docs and our setup
[08:54:03] https://istio.io/latest/docs/reference/config/networking/destination-rule/#DestinationRule says for `host`: `The name of a service from the service registry. Service names are looked up from the platform’s service registry (e.g., Kubernetes services, Consul services, etc.) and from the hosts declared by ServiceEntries. Rules defined for services that do not exist in the service registry
[08:54:05] will be ignored.`
[08:54:26] Argh, now I spotted what I misread. nvm :)
[08:54:58] (I thought the `host:` in a destination_rule should refer to the name field in the service entries, not the hostname there.)
[08:59:01] ooh
[08:59:08] elukey: I spotted another thing
[08:59:20] https://istio.io/latest/docs/tasks/traffic-management/egress/egress-tls-origination/ seems to be the doc that explains how to do what we want
[08:59:42] https://istio.io/latest/docs/tasks/traffic-management/egress/egress-tls-origination/ The config example here, however, has a `targetPort` line we don't have
[09:00:21] And the DestRule section uses port 80
[09:00:44] yeah I tried multiple configs, I recall using port 80
[09:01:28] lemme try
[09:03:09] so we have a slightly different config, but I adapted it as following
[09:03:50] 1) the virtual service is set to intercept traffic for '*.wikipedia.org' and route it to api-ro.discovery.wmnet (its destination)
[09:04:21] 2) the destination rule is set like
[09:04:29] portLevelSettings:
[09:04:29] - port:
[09:04:29]     number: 80
[09:04:29]   tls:
[09:04:29]     mode: SIMPLE
[09:04:48] 3) the service entry is using the target port 443 for port 80
[09:04:56] all works, but I still see the deprecation warning in the logs
[09:05:13] I was reading that the wildcard may play some role
[09:05:43] maybe envoy is confused and uses a simple tls tunnel rather than upgrading http to https
[09:05:53] tried to set an sni in the destination rule but nothing
[09:06:19] Could we use a non-wildcard rule just for testing? To rule out that using a wildcard changes istio behaviour
[09:07:54] tried, same thing
[09:08:20] weird
[09:08:54] can you pastebin the current config snippets. Just so we're 100% on the same page
[09:09:18] you can check them on staging
[09:12:26] (all in the knative-serving namespace)
[09:15:07] on ml-staging2002 I was able to run curl with a mw api call
[09:15:09] sudo nsenter -t 1236300 -n curl -vvv --resolve en.wikipedia.org:80:208.80.153.224 "http://en.wikipedia.org/w/api.php?action=query&format=json&list=recentchanges&meta=ores&rcprop=title|timestamp|ids|oresscores&rclimit=250&rctitle=v"
[09:15:21] and the issue is reproducible
[09:15:25] (same warning)
[09:19:26] I wonder if the w.org -> wmnet rewrite messes with us
[09:20:27] IIUC the order of eval is vs -> dr + se
[09:20:53] so the vs is evaluated, the destination/route is picked up and then the destination rule is evaluated (also the service entry is validated as well)
[09:27:08] another thing that I am trying is to dump envoy config from the container (using nsenter) and check in there
[09:27:12] Just to confirm: I tcpdump'd some of the p443 traffic leaving the istio sidecar, and it is indeed TLS1.2
[09:27:32] (so not just p443, but actually encrypted)
[09:29:13] And the traffic is with 10.2.1.22, so not some random other endpoint
[09:41:55] tried to set debug logs for envoy, but didn't really notice anything tls specific
[09:41:59] very weird
[09:42:07] going to commute, bbiab
[09:42:28] I'm going for lunch+groceries, ttyl
[10:36:57] (PS1) Zabe: phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_39) - https://gerrit.wikimedia.org/r/838219 (https://phabricator.wikimedia.org/T318134)
[10:37:09] (PS1) Zabe: phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_38) - https://gerrit.wikimedia.org/r/838220 (https://phabricator.wikimedia.org/T318134)
[10:37:37] (CR) Zabe: [C: +2] phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_39) - https://gerrit.wikimedia.org/r/838219 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[10:37:42] (CR) Zabe: [C: +2] phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_38) - https://gerrit.wikimedia.org/r/838220 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[10:38:15] * elukey lunch
[10:39:35] (Merged) jenkins-bot: phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_39) - https://gerrit.wikimedia.org/r/838219 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[10:39:48] (Merged) jenkins-bot: phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_38) - https://gerrit.wikimedia.org/r/838220 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[10:41:10] (CR) Zabe: [C: +2] phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_37) - https://gerrit.wikimedia.org/r/838221 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[10:41:14] (CR) Zabe: [C: +2] phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_35) - https://gerrit.wikimedia.org/r/838222 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[10:44:16] (Merged) jenkins-bot: phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_37) - https://gerrit.wikimedia.org/r/838221 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[10:45:11] (Merged) jenkins-bot: phpunit: Use assertEqualsWithDelta [extensions/ORES] (REL1_35) - https://gerrit.wikimedia.org/r/838222 (https://phabricator.wikimedia.org/T318134) (owner: Zabe)
[11:41:12] (CR) AikoChou: "I added some thoughts on your comment. :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642 (owner: AikoChou)
[13:23:16] Morning all!
[13:23:46] eyoo o/
[16:02:36] istio won for today, will restart tomorrow :)
[16:03:04] staging is fine except for the eventgate-related settings, currently not working (so events are not sent to eventgate basically)
[16:06:54] Machine-Learning-Team, ORES, MediaWiki-Core-Preferences, Moderator-Tools-Team (Kanban), Patch-For-Review: When ORES quality filters are selected in mobile web, entries should be highlighted - https://phabricator.wikimedia.org/T314026 (eigyan) Thanks @Jdlrobson I will have a look.
[19:10:52] (PS1) Thiemo Kreuz (WMDE): Use PHPUnit's convenience shortcuts where possible [extensions/ORES] - https://gerrit.wikimedia.org/r/838876
[19:13:11] (PS1) Thiemo Kreuz (WMDE): Make use of the ?? syntax where it makes sense [extensions/ORES] - https://gerrit.wikimedia.org/r/838877