[07:11:21] hello folks
[07:21:55] Machine-Learning-Team: Future support for ores scores in RC API - https://phabricator.wikimedia.org/T343813 (elukey) Hi @Strainu, thanks for following up! So I don't know the exact details, but since we are talking about the MW php API I am almost sure that the ORES data that you are talking about comes from...
[08:24:20] (CR) Elukey: Fetch embedding from Swift (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576) (owner: Kevin Bazira)
[09:10:22] going afk for a bit folks, ttl!
[09:30:06] Mornin'!
[10:30:21] (PS8) Kevin Bazira: Fetch embedding from Swift [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576)
[10:32:30] (CR) Kevin Bazira: Fetch embedding from Swift (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576) (owner: Kevin Bazira)
[11:03:10] Machine-Learning-Team: Review HTTP 500 reported by articletopic-outlink's transformer for wikisource.org - https://phabricator.wikimedia.org/T343740 (achou) @elukey Should we add sourceswiki to the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/s...
[11:06:50] (CR) Elukey: Fetch embedding from Swift (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576) (owner: Kevin Bazira)
[11:08:29] Machine-Learning-Team: Review HTTP 500 reported by articletopic-outlink's transformer for wikisource.org - https://phabricator.wikimedia.org/T343740 (elukey) >>! In T343740#9079744, @achou wrote: > @elukey Should we add sourceswiki to the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deploymen...
[11:12:03] Machine-Learning-Team: Review HTTP 500 reported by articletopic-outlink's transformer for wikisource.org - https://phabricator.wikimedia.org/T343740 (elukey) I just noticed other 500s for: ` aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host wikisource.org:443 ssl:default [Connection res...
[11:14:13] applied new istio settings :)
[11:14:32] hopefully they will be good when ores-legacy ramps up traffic
[11:15:12] klausman: o/ qq - we have been working a lot on having two sets of istio gateways (regular and services), splitting traffic between them etc..
[11:15:16] all clear on this front?
[11:15:34] it seems working fine but it may be hard to debug if there are doubs
[11:15:38] *doubts
[11:15:44] in case lemme know and we can sync :)
[11:17:27] Machine-Learning-Team: Review HTTP 500 reported by articletopic-outlink's transformer for wikisource.org - https://phabricator.wikimedia.org/T343740 (achou) @elukey Yes I agree, we should return http 400 for those domains! I will also send a patch for that.
[11:23:31] elukey: No, I think I get the setup.
[11:24:02] Basically two istio configs: one that is for istio bits that ferry reqs to and from inference services, and one for everything else (e.g. ores-legacy).
[11:24:23] exactly yes
[11:24:30] and we split traffic using the various selectors etc..
[11:24:47] for example, in the istio config.yaml
[11:24:47] selector:
[11:24:48] service.istio.io/canonical-name: istio-ingressgateway-services
[11:24:50] when reqs go to o-l, they of course pass through both: services for talking to o-l, and then through the inference ones when talking to the inference backends that o-l sends reqs to
[11:24:50] istio: ingressgateway
[11:24:53] vs
[11:24:54] selector:
[11:24:54] service.istio.io/canonical-name: istio-ingressgateway
[11:24:56] istio: ingressgateway
[11:25:05] the above are added to the k8s services etc..
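To make the selector split above concrete, here is a minimal Python sketch of how an equality-based Kubernetes Service selector picks pods: a pod matches only if its labels contain every key/value pair in the selector. The two selectors are the ones quoted in the log; the pod names, their labels, and the `selector_matches` / `select_pods` helpers are hypothetical, invented for illustration, and are not taken from the actual istio config or deployment-charts.

```python
# Selectors quoted in the chat for the two sets of istio gateways.
SERVICES_SELECTOR = {
    "service.istio.io/canonical-name": "istio-ingressgateway-services",
    "istio": "ingressgateway",
}
INFERENCE_SELECTOR = {
    "service.istio.io/canonical-name": "istio-ingressgateway",
    "istio": "ingressgateway",
}

# Hypothetical gateway pods and their labels (names are assumptions).
PODS = {
    "gw-services-pod": {
        "service.istio.io/canonical-name": "istio-ingressgateway-services",
        "istio": "ingressgateway",
    },
    "gw-inference-pod": {
        "service.istio.io/canonical-name": "istio-ingressgateway",
        "istio": "ingressgateway",
    },
}


def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True if the labels contain every key/value pair of the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pods(selector: dict[str, str], pods: dict[str, dict[str, str]]) -> list[str]:
    """Return the sorted names of the pods that the selector would target."""
    return sorted(
        name for name, labels in pods.items() if selector_matches(selector, labels)
    )
```

With these made-up pods, each full selector targets exactly one gateway set, while a looser match on the shared `istio: ingressgateway` label alone would return both sets.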
[11:25:32] The only (very mild) concern is that if we matched by name istio-ingressgateway.* we'd get both, but it's no biggie
[11:25:58] this is why we have the selectors :)
[11:26:33] when requests go to ores-legacy they are handled only by one set of ingress pods, the "services" one
[11:26:41] because of the svc selectors, the target pods are limited
[11:26:55] and also the clusters defined in the envoy configs are completely separated
[11:27:01] so it is like having two meshes
[11:27:06] (sort of)
[11:28:43] so the forwarded requests that o-l makes to inference services never hit the non-service istio?
[11:31:48] ohno, I broke Luca's irc connection!
[11:33:51] (PS9) Kevin Bazira: Fetch embedding from Swift [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576)
[11:34:22] sorry got kicked out by IRC
[11:34:54] so to repeat my likely-missed question:
[11:34:57] so the forwarded requests that o-l makes to inference services never hit the non-service istio?
[11:35:22] what do you mean by forwarded requests?
[11:35:53] ah you mean when o-l pods call inference.discovery.wmnet?
[11:35:54] so o-l is an adapter, basically, right? It receives requests and transforms them/fans out
[11:36:04] yes yes
[11:36:21] so what mesh do those requests go over? istio-services or istio-inference?
[11:36:24] (CR) Kevin Bazira: Fetch embedding from Swift (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576) (owner: Kevin Bazira)
[11:36:56] so o-l calls inference.discovery.wmnet:30443, the VIP is served only by the istio gw pods that we use for inference
[11:37:27] the ores-legacy.discovery.wmnet:31443 (different port) is backed by the "services" gw pods
[11:37:59] the only real "mesh" is the one that we use for inference services
[11:38:04] The question is: do the o-l generated requests go over the same mesh as direct LW requests? or istio-services?
[11:38:35] direct LW requests are reqs to inference.discovery.wmnet?
[11:38:59] yeah, e.g. my curl tests (without APIGW)
[11:39:21] then yes, see above, o-l pods call inference.discovery.wmnet
[11:39:35] ok, so to put another way:
[11:39:37] they use the mesh module that serviceops offers
[11:40:28] while requests to o-l's front go over the i-services istio pods, the requests it generates to its "backend" are no different (routing-wise) than direct LW requests.
[11:40:43] correct
[11:41:01] Ok, then I got it right :)
[11:41:02] it is like we were calling another k8s cluster
[11:41:55] ack. We could still discern o-l forwarded reqs by looking at Forwarded-For
[11:43:08] or simply user agent
[11:43:14] it should be set in theory
[11:43:33] going afk for lunch! ttl!
[11:45:32] same!
[13:45:18] back!
[13:45:43] folks I don't see the team meeting for today, are we going to skip?
[13:49:11] I don't see it either. Why?
[13:49:59] maybe Chris cancelled it?
[13:54:40] mm I think so
[13:55:20] aiko: changeprop deployed :)
[13:56:12] thanks!! :)
[13:57:08] elukey: I'm working on the patch for the outlink code
[13:57:46] so would we like to have the team meeting today?
[13:57:54] (CR) Elukey: [C: +1] Fetch embedding from Swift (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576) (owner: Kevin Bazira)
[13:58:24] aiko: I think we can skip, unless there is something to discuss
[13:58:34] elukey: Chris mentioned on Slack that today is a US holiday, IIRC
[13:59:53] I am ok to skip it or not, as you prefer folks
[14:01:29] I am ok to skip too
[14:04:12] +1
[14:05:48] +1
[14:48:29] klausman, kevinbazira - one note for the next days if you want to proceed with the rec-api
[14:48:56] after the docker image is published to the registry, we'll need to create the helm chart and the helmfile config
[14:50:03] I don't think that there is a helm chart for a flask app, Ilias created the "fastapi" one but it will probably not fit
[14:50:29] to create all the scaffold for a new chart (in the deployment-charts repo), SRE offers https://gitlab.wikimedia.org/repos/sre/sextant
[14:51:01] sextant is a tool that takes as input a scaffold template and generates the chart
[14:51:23] the scaffold template for a serviceops-like service is stored in deployment-charts, _scaffold dir
[14:52:22] Machine-Learning-Team, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (JMeybohm)
[14:52:28] (we can probably create a recommendation-api-ng chart, instead of a generic one)
[14:52:35] you decide :)
[14:52:51] but the whole process should be easy to do
[14:54:26] Roger
[14:57:42] kevinbazira: lemme know if you have questions for --^
[14:58:12] Thanks elukey, will dig into the helm chart docs and let you know in case questions come up.
[14:59:05] (CR) Kevin Bazira: [C: +2] "Thanks for the reviews, Luca!" [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576) (owner: Kevin Bazira)
[14:59:39] (Merged) jenkins-bot: Fetch embedding from Swift [research/recommendation-api] - https://gerrit.wikimedia.org/r/945834 (https://phabricator.wikimedia.org/T343576) (owner: Kevin Bazira)
[15:03:43] (PS1) AikoChou: outlink: return http 400 for non-wikipedia domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740)
[15:09:17] (PS2) AikoChou: outlink: return http 400 for non-wikipedia domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740)
[15:12:21] (CR) Elukey: outlink: return http 400 for non-wikipedia domains (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[15:15:59] (PS3) AikoChou: outlink: return http 400 for non-wikipedia domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740)
[15:17:14] (CR) AikoChou: outlink: return http 400 for non-wikipedia domains (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[15:26:53] Machine-Learning-Team, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (elukey) Very interesting: ` Aug 09 00:09:33 ml-serve1001 kubelet[3980749]: E0809 00:09:33.603646 3980749...
[15:36:03] (CR) Elukey: [C: +1] "Left a nit but LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[15:45:40] (PS4) AikoChou: outlink: return http 400 for non-wikipedia domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740)
[15:47:50] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[15:48:34] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (kevinbazira) To reduce image layer sizes, in T343576 we store and fetch the ~2.8GB recommendation-api embedding from Swift as recommended in T28819...
[15:54:38] (Merged) jenkins-bot: outlink: return http 400 for non-wikipedia domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947368 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[16:02:38] Machine-Learning-Team, Patch-For-Review, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (elukey) Throttling is gone, but I still see the exec_sync elevated latency, errors...
[16:03:58] klausman: totally forgot about https://phabricator.wikimedia.org/T339231
[16:04:12] we should probably start expanding those partitions
[16:04:30] not urgent now, but I forgot about it till now :D
[16:06:11] Ah yes!
[16:06:40] ok, so all but 1001 need the same lvextend and resize2fs?
[16:07:09] I can do that tomorrow
[16:08:55] nono we can do it slowly during the next weeks
[16:09:22] not sure if we need to drain the nodes or if we could stop kubelets, resize, restart then
[16:09:25] *them
[16:09:41] anyway, we can split the work
[16:10:16] Sure. I can do eqiad in the next two weeks and leave codfw to you? Or vice versa
[16:10:30] In my experience, resizing FSes under running processes never breaks
[16:12:30] yeah but let's not do it :D
[16:12:58] the safest procedure should be to drain the node from containers, stop kubelet, resize, and do the rest in reverse
[16:13:04] but draining is very painful
[16:13:17] so maybe we could get away with simply stopping kubelet, resizing, starting
[16:13:22] maybe we can test in codfw
[16:13:27] anyway, the SLO work comes first!
[16:14:48] of course
[16:15:14] Machine-Learning-Team: Store and fetch the recommendation-api embedding from Swift - https://phabricator.wikimedia.org/T343576 (kevinbazira) A [[ https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/types/related_articles/candidate_finder.py#L119-L165 | Swift client ]] has...
[16:28:45] elukey: nooo I made a mistake.. :(((
[16:31:08] (PS1) AikoChou: outlink: fix function param for is_domain_wikipedia [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947413
[16:32:10] ah snap!
[16:32:14] totally didn't see it
[16:32:23] (CR) Elukey: [C: +1] outlink: fix function param for is_domain_wikipedia [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947413 (owner: AikoChou)
[16:32:59] (CR) AikoChou: [C: +2] outlink: fix function param for is_domain_wikipedia [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947413 (owner: AikoChou)
[16:33:45] (Merged) jenkins-bot: outlink: fix function param for is_domain_wikipedia [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947413 (owner: AikoChou)
[16:48:12] ok now back to normal
[16:51:09] thanks!
[17:04:36] going afk folks! Have a nice rest of the day
[17:09:45] bye Luca! have a nice evening :)
[17:52:12] \o