[07:18:52] hello folks
[07:44:12] morning :)
[07:46:29] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (elukey) It seems definitely an issue with the config of the envoy proxy: ` elukey@ml-staging2002:~$ curl https://thanos-swift.discovery.wmnet:443/info {"swift": {"version": "2.26.0", "strict...
[08:02:01] \o Just FYI: today is a holiday in Zurich, so I'll be properly-back tomorrow :)
[08:02:43] o/
[08:02:46] enjoy!
[08:05:10] o/
[08:05:39] need to run a quick errand, bbiab
[08:15:29] hey all!
[08:39:09] elukey: oh, one thing before I forget. I came across this last week: https://github.com/juicedata/juicefs Might come in handy in the future.
[08:46:37] * elukey back
[08:46:41] klausman: interesting!
[08:46:57] kevinbazira: o/
[08:47:28] elukey: o/
[08:47:38] so I think that the swiftclient, configured with localhost:6022, doesn't work with the current thanos swift http server config
[08:47:43] trying to get to the bottom of it
[08:50:49] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (elukey) On the Thanos FE nodes: ` elukey@thanos-fe1001:~$ curl https://localhost:443/info -k -H "Host: thanos-swift.discovery.wmnet" {"swift": {"version": "2.26.0", "strict_cors_mode": true,...
[08:53:22] elukey: true. I have come across swift eqiad and codfw specific ports. I wonder whether these would help instead of "6022": https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/profile/services_proxy/envoy.yaml#L205-L214
[08:59:19] kevinbazira: those are related to the commons swift clusters, similar interface but different data
[09:18:59] elukey: got it. thanks for the clarification.
[09:19:23] I noticed the thumbor codebase that you shared with me on Friday does not use "localhost". The more surprising bit is that they use "swift.discovery.wmnet" instead of "thanos-swift.discovery.wmnet": https://github.com/wikimedia/operations-deployment-charts/blob/2889d435ff4b2c7d3e93812f001358955162babd/helmfile.d/services/thumbor/values-codfw.yaml#L5
[09:21:35] kevinbazira: so swift.discovery.wmnet is the endpoint for commons (basically related to the same ports you found above)
[09:22:02] makes sense!
[09:22:43] thumbor fetches/uploads thumbnails to commons directly, thanos should be only for the observability team, they let us use the cluster for our use case (waiting for the MOSS cluster to be up and running, that should be another cluster for generic object storage that we'll have to migrate to in the future)
[09:23:06] but they don't use the mesh (see https://github.com/wikimedia/operations-deployment-charts/blob/2889d435ff4b2c7d3e93812f001358955162babd/helmfile.d/services/thumbor/values.yaml#L9)
[09:23:16] mesh is basically the envoy local container that we are proxying to
[09:23:35] it is maintained by service ops (the config I mean), and it is useful to get TLS / metrics / etc..
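A minimal sketch of the two access paths being discussed, assuming the /info endpoint and the local port 6022 mentioned above; this is an illustration, not the rec-api or swiftclient code. With the mesh enabled, the app speaks plain HTTP to the local envoy listener and envoy handles TLS (and metrics) towards the upstream; going direct means the client itself has to verify the upstream certificate.

```python
# Sketch only: the two ways to reach thanos-swift compared in this conversation.
import requests

# 1) Through the mesh: envoy listens on a local port (6022 in this discussion)
#    and terminates TLS towards thanos-swift.discovery.wmnet on the pod's behalf.
print(requests.get("http://localhost:6022/info", timeout=5).json())

# 2) Direct to the upstream: the client has to verify the server certificate,
#    so the image needs the WMF Root CA in its trust store.
print(requests.get("https://thanos-swift.discovery.wmnet/info", timeout=5).json())
```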
[09:37:54] ok I get it now. It explains why the flink-session-cluster disabled mesh and used the swift auth url that works for us locally: https://github.com/wikimedia/operations-deployment-charts/blob/2889d435ff4b2c7d3e93812f001358955162babd/charts/flink-session-cluster/values.yaml#L39
[09:43:03] exactly yes
[09:43:12] tegola-vector-tiles did the same thing, disabled mesh and used the swift auth url that works for us: https://github.com/wikimedia/operations-deployment-charts/blob/2889d435ff4b2c7d3e93812f001358955162babd/charts/tegola-vector-tiles/values.yaml#L39
[09:43:13] we are (of course!) the first ones doing it properly :D
[09:49:30] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/956373
[09:49:32] maybe it works
[09:54:06] kevinbazira: worked! Now we have a different error
[09:54:16] SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))
[09:58:50] (PS1) Elukey: blubber: add wmf-certificates to the production image [research/recommendation-api] - https://gerrit.wikimedia.org/r/956377 (https://phabricator.wikimedia.org/T339890)
[09:59:08] kevinbazira: --^
[10:00:10] I totally forgot about it, but it is a package that installs the Puppet/PKI Root CA certs on the docker image
[10:00:12] (CR) Kevin Bazira: [C: +1] "LGTM!" [research/recommendation-api] - https://gerrit.wikimedia.org/r/956377 (https://phabricator.wikimedia.org/T339890) (owner: Elukey)
[10:00:22] so when connecting to a TLS endpoint, we can trust the cert etc..
[10:00:48] (CR) Elukey: [C: +2] blubber: add wmf-certificates to the production image [research/recommendation-api] - https://gerrit.wikimedia.org/r/956377 (https://phabricator.wikimedia.org/T339890) (owner: Elukey)
[10:01:04] building the image
[10:02:31] (Merged) jenkins-bot: blubber: add wmf-certificates to the production image [research/recommendation-api] - https://gerrit.wikimedia.org/r/956377 (https://phabricator.wikimedia.org/T339890) (owner: Elukey)
[10:10:42] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956380
[10:10:46] hopefully the last one!
[10:12:40] looks like the new image "2023-09-11-100246-production" takes a while to show up here: https://docker-registry.wikimedia.org/wikimedia/research-recommendation-api/tags/
[10:15:58] We'll be enabling LW in enwiki and wikidata in approx 1h
[10:16:34] kevinbazira: yeah the docker registry is getting updated every 30 mins IIRC
[10:16:41] isaranto: \o/
[10:16:42] I'll be monitoring LW traffic to verify that our current configuration handles the load
[10:19:28] kevinbazira: didn't work for some reason, same error
[10:20:01] isaranto: fingers crossed!
[10:20:22] aiko: always!
[10:25:12] kevinbazira: I found the issue, we need another env variable, sending a code change in a bit
[10:25:23] I added it manually but now I see that we get OOM killed
[10:27:17] kevinbazira: do you know how much memory we need to run a single pod?
[10:28:05] elukey: locally I was using 10G. `docker run -it --memory=10g --entrypoint=/bin/bash recommendation-api`
[10:29:32] 10G?? :D
[10:29:56] ok then something may need to be refined
[10:30:16] I noticed that you used 1G for memory settings (for each pod) in the values.yaml
[10:30:22] so I kinda hoped it didn't use too much
[10:30:31] do we know why it uses 10G?
[10:30:38] are those the embeddings loaded?
[10:30:47] processing the embedding, yes!
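The question that follows is whether the 10G is a transient load-time spike or a steady-state requirement. A small Linux-only sketch like the one below (not something the rec-api does today; the /proc parsing assumes Linux) can distinguish the two from inside the container:

```python
# Sketch: compare peak vs current resident memory around the embedding load.
import resource

def rss_mib() -> float:
    """Current resident set size in MiB, read from /proc (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # kB -> MiB
    return 0.0

# ... load the embeddings here ...

# ru_maxrss is reported in kB on Linux, so this is the peak RSS so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"steady-state RSS: {rss_mib():.0f} MiB, peak RSS: {peak:.0f} MiB")
```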
[10:31:25] mmm but is it only a temporary increase when loading, or will the embeddings use 10G etc.. at steady state?
[10:31:55] I am asking since having a 10G pod requirement for recommendation-api is not ideal
[10:32:29] didn't monitor the steady state, but will likely need 10G when loading the embedding.
[10:34:00] kevinbazira: do you have some time to do some local testing? To better figure out these requirements, and if we can trim them down somehow
[10:35:04] yep, I am loading the image now to monitor the steady state.
[10:37:21] For the load part, I am sure it won't work if we set it way below 10G.
[10:39:27] isaranto: https://www.usenix.org/conference/srecon23emea/presentation/mcglohon
[10:40:24] thanks!
[10:41:57] there are many great talks! I also saw this one https://www.usenix.org/conference/srecon23emea/presentation/weichbrodt
[10:45:15] also created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956385 for rec-api-ng
[10:47:54] kevinbazira: one quick thing - I am reading https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/types/related_articles/candidate_finder.py#L167
[10:48:58] and it is probably something that we can improve
[10:49:50] for example, we could do the prep work offline and force np to load from file
[10:50:00] I've never done it but I am pretty sure it should be doable
[10:51:23] could you please check if this is doable? And also update the task with all the findings etc..
[10:52:34] yep, working on monitoring the steady-state then will check this too and share the findings.
[10:55:54] (CR) AikoChou: [V: +2 C: +2] test: add load test script and input for ores-legacy [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/955910 (owner: AikoChou)
[10:58:41] * elukey lunch!
[11:03:46] Good morning all
[11:07:50] hi Chris o/
[11:11:32] * aiko lunch
[11:13:46] Hi Aiko!
[11:30:35] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (LDlulisa-WMF)
[11:33:37] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira) @elukey, regarding the rec-api memory usage, please see the findings below got from `docker stats` after monitoring the rec-api container running locally:...
[11:38:53] Hi Chris!
[11:38:57] Heyo!
[11:42:18] folks LW is enabled in ores extension for enwiki and wikidata (all wikis now!)
[11:42:27] LETS GOOOOOO
[11:43:20] my first MW deployment with the help of the one and only Amir1: 🎉
[11:43:47] <3 <3 I wish I could help more
[11:44:02] let me know if there is anything else I need to do
[11:46:31] thank you for all the help! <3
[11:46:40] watching those requests come in !
https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=All&var-backend=enwiki-damaging-predictor-default-00009&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-1h&to=now&refresh=30s
[11:47:12] So pretty
[11:55:02] okay I finally fixed my name
[11:56:54] nooo
[11:56:56] :)
[11:57:24] DMing a bot a password in plain text feels weird
[11:57:41] stepping afk for lunch for approx 30' hope nothing breaks in the meantime. will keep an eye on irc at least
[11:58:17] have a great lunch!
[12:57:49] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (LDlulisa-WMF)
[13:26:48] kevinbazira: o/
[13:26:55] I am reviewing https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956017, I think that we should not proceed
[13:27:12] a 10G pod is huge, it is a 10x compared to an isvc
[13:27:30] I think that we should probably improve the embeddings handling
[13:29:30] isaranto: niceeeee
[13:29:42] I am checking logs in logstash, all good afaics
[13:30:07] chrisalbon: o/
[13:37:47] elukey: no problem, we can pause on increasing the memory limit to 10G. proposals are welcome on how to improve the embedding handling differently from what the original rec-api devs did.
[13:40:06] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (elukey) >>! In T339890#9155991, @kevinbazira wrote: > @elukey, on IRC you mentioned: >> one quick thing - I am reading https://github.com/wikimedia/research-recommend...
[13:40:51] kevinbazira: I added some thoughts to the task, I think that it should be a matter of testing what load_raw_embeddings actually does and improve it
[13:41:14] checking ...
[13:41:23] for example, maybe we could load the numpy array directly from a file
[13:41:28] rather than building it up in memory
[13:46:56] Can you use a smaller dtype? np.uint16 or something?
[13:48:08] chrisalbon: the function is https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/types/related_articles/candidate_finder.py#L167
[13:48:52] what I am proposing is something like saving the np array on a file, and just load it instead
[13:49:10] another dtype could also help
[13:49:19] but I am pretty sure this code is very experimental/old
[13:49:31] 10G of memory for each pod seems crazy for an app like this
[13:49:36] yeah, this is super old
[13:50:12] I suspect this isn't up to any modern engineering standards, it is just hacked together on someone's laptop 8 years ago
[13:50:19] yep
[13:50:38] the thing that I am really worried about is who will maintain the codebase :D
[13:50:49] because the trend may be the same across all modules
[13:50:53] it is not a ton of code, but..
[13:55:54] Right, probably just saving the array as a file is easiest. There are lots of ways to make this smaller but they might change the behavior of the output, which is a complication we don't want because it's just more work
[13:56:41] chrisalbon: makes sense, but at the same time we'd reduce our k8s capacity a lot if we can't run a simple app like this with less memory..
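A rough sketch of the "do the prep work offline and load the numpy array from a file" idea discussed above, assuming a word2vec-style text dump (one id plus vector components per line); the file names, format, and helpers are illustrative, not the real candidate_finder.py code.

```python
# Hypothetical sketch: precompute the embeddings array offline, load it at startup.
import numpy as np

def preprocess_offline(raw_path: str, out_path: str) -> None:
    """Run once, outside the pod: parse the raw embedding dump and save the array."""
    rows = []
    with open(raw_path) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # assumed layout: "<item_id> <v1> <v2> ... <vN>" per line
            rows.append(np.asarray(parts[1:], dtype=np.float32))
    np.save(out_path, np.stack(rows))

def load_at_startup(out_path: str) -> np.ndarray:
    # mmap_mode="r" keeps the data file-backed and paged in on demand,
    # instead of rebuilding/holding a second copy of the array in pod memory.
    return np.load(out_path, mmap_mode="r")

# embeddings = load_at_startup("embeddings.npy")   # produced by preprocess_offline()
```

Loading with mmap_mode keeps the array backed by the file and paged in on demand, which should help the load-time spike as well as the steady-state footprint.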
[13:56:52] I'd be very against running it at the moment
[13:58:42] I mean, we _can_ reduce the memory but the question is do we want to spend a week making this thing more efficient or offload the data to a file and load that in as needed
[13:59:41] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (elukey) ` elukey@mwmaint1002:~$ mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client ee474263...
[14:00:32] But you might be able to just change this line https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/types/related_articles/candidate_finder.py#L181C13-L181C93 to dtype=np.float32 and cut down the size by half.
[14:00:36] chrisalbon: ah yes! I'd say that loading from file may not be the complete solution since we'd still use a ton of memory after loading (at least I guess, but my knowledge is limited)
[14:00:46] exactly yes
[14:01:00] kevinbazira: --^
[14:01:20] so loading from file and size cut could be a good compromise
[14:01:31] trying `dtype=np.float32` now. will let you know how it goes.
[14:02:01] kevinbazira: try also the other solution if you have some spare cycles (loading the np array from file etc..)
[14:03:36] If float32 truncates the embedding values I take no responsibility for my suggestion
[14:12:09] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (isarantopoulos) The main issue I see (and the one elukey highlights) is that as a service it is not scalable. I suggest that we try to do any optimization in memory usag...
[14:13:13] o/ I agree with downcasting to 32bits and I wrote my comments on the task above
[14:13:59] I am curious how much size the np.array is compared to the initial data. if it's much smaller we should definitely bypass that (I mean do it once)
[14:18:21] isaranto: logstash looks really good, I don't see errors or similar
[14:18:55] yep 🤞
[14:19:18] however I don't see notable drop in ORES traffic, unless I'm looking at it the wrong way
[14:21:03] isaranto: https://logstash.wikimedia.org/goto/e1fef09bb1ca20449c5734bc4fb2654d
[14:22:23] ok! ofc I was looking at it wrong (had wrong filters)
[14:22:26] that looks great!
[14:23:22] it does yes!
[14:23:36] I also want to stop traffic for the SampleChangePropInstance
[14:23:39] was really looking forward to this!
[14:23:48] somebody is running changeprop and warming up ores :D
[14:23:53] hehe
[14:24:10] what am I looking at?
[14:24:50] chrisalbon: the link is about Mediawiki traffic hitting ORES
[14:25:02] dropped to zero after moving to Lift Wing (enwiki and wikidata)
[14:25:08] +1
[14:25:46] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira) @calbon suggested that we use `dtype=np.float32` to reduce memory usage. I have tested it and below are the results: | **dtype** | **on-load** | **steady-s...
[14:25:50] `dtype=np.float32` has reduced the memory usage. Here are the stats: https://phabricator.wikimedia.org/T339890#9156601
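A toy illustration of why the dtype change roughly halves the array footprint; whether float32 precision (about 7 significant digits) changes the ranked recommendations is the caveat raised above and would need checking against the current output.

```python
# Not the rec-api code: just the memory effect of downcasting a float64 array.
import numpy as np

emb64 = np.random.default_rng(0).random((100_000, 100))  # float64 by default
emb32 = emb64.astype(np.float32)                          # same values, half the bytes

print(f"{emb64.nbytes / 2**20:.1f} MiB")  # ~76.3 MiB
print(f"{emb32.nbytes / 2**20:.1f} MiB")  # ~38.1 MiB
```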
[14:34:32] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (isarantopoulos) all backends have been moved to LW and ORES has zero traffic from Mediaw...
[14:35:50] I'll be sending a follow up email to wikitech-l mailing list. let me know if you think I should add anything else other than a heads up for enwiki and wikidata (like the previous messages)
[14:37:40] @elukey, following T339890#9156601, will a 5GB pod max out our k8s capacity?
[14:42:07] kevinbazira: one thing that we discussed earlier on was also to preprocess/load the numpy array offline, did you already test it?
[14:42:28] the 5GB mark is definitely better, but we are still far from optimal in my opinion
[15:05:55] ok I prepped the change for requestctl (https://wikitech.wikimedia.org/wiki/Requestctl) to block SampleChangeProp instance
[15:06:07] (basically only that UA with /v3/precache)
[15:06:10] will enable it in a bit
[15:06:27] if we don't remove it we'll get the same traffic to ores-legacy
[15:13:24] Agree!
[15:37:37] isaranto: applied the ban
[15:38:37] isaranto: ah before we step afk, let's send an email to wikitech
[15:39:42] Yes I have it ready! Was waiting if anyone wanted to add anything as I wrote above
[15:39:57] Sending now!
[15:41:20] ack thanks!
[15:41:56] from logstash it seems that the traffic without changeprop etc.. is really around 300/400 requests/minute
[15:42:46] outstanding work Ilias!
[15:43:26] I can confirm that SampleChangeprop traffic is now zero
[15:44:39] great work elukey:!
[15:46:23] still waiting for some news in https://github.com/google/wikiloop-doublecheck/issues/444
[15:47:11] I'm seeing a MediaWiki/1.29.2 and curious to track it down to see what it is
[15:55:35] ah wow
[15:55:38] any particular IP?
[15:58:12] haven't investigated further. Will go through all the agents tomorrow and report them in the task
[16:00:24] I can see from eventstreams' metrics that only one user of revision-score is active
[16:00:45] I know only the IP, no UA etc..
[16:00:50] difficult to track it down sigh
[16:13:09] elukey: I've been working on preprocessing the numpy array offline. writing to `wikidata_ids.txt` and `decoded_lines.txt` is taking a while to complete. will update the task as soon as I have results.
[16:15:59] kevinbazira: ack thanks!
[16:28:12] going afk folks, cu tomorrow!
[16:44:29] same, have a nice rest of the day!
[20:07:08] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (prabhat) @elukey Hi Luca, Yes, we will have two clients (one for our `dev` environment and one for our `prod` environment). So, liftwing...