[05:47:14] (03PS2) 10Ilias Sarantopoulos: llm: load test langid model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/967220 (https://phabricator.wikimedia.org/T340507) [06:05:40] (03CR) 10Ilias Sarantopoulos: [V: 03+2 C: 03+2] llm: load test langid model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/967220 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [06:22:17] morning folks [06:22:52] kevinbazira: o/ before starting, did you notice a reduction in total time (if we have it in the logs) with 250 instead of 500? [06:23:34] elukey: o/ nope there was no noticable reduction in time. [06:24:41] ah snap then this is something weird [06:24:56] changing it to 100, but doesn't seem good [06:25:06] yep, the rec-api query still hangs and doesn't return results [06:26:24] MAX_CANDIDATES set to 100, pod is spinning up [06:27:29] ready to test :) [06:27:59] thanks, testing now .. [06:29:34] same results: [06:29:43] ``` [06:29:43] $ time curl 'https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Apple' [06:29:43] upstream connect error or disconnect/reset before headers. reset reason: connection termination [06:29:44] real 0m48.484s [06:29:44] user 0m0.008s [06:29:44] sys 0m0.007s [06:29:45] ``` [06:31:37] container logs: [06:31:45] ``` [06:31:45] Fri Oct 20 06:28:53 2023 - HARAKIRI [core 0] 127.0.0.1 - GET /api/?s=en&t=fr&n=3&article=Apple since 1697783317 [06:31:45] Fri Oct 20 06:28:53 2023 - HARAKIRI !!! end of worker 1 status !!! 
[06:31:45] 2023-10-20 06:28:53,941 recommendation.utils.event_logger log_api_request():39 INFO -- Logging event: {"schema": "TranslationRecommendationAPIRequests", "$schema": "/analytics/legacy/$translationrecommendationapirequests/1.0.0", "revision": 16261139, "event": {"timestamp": 1697783333, "sourceLanguage": "en", "targetLanguage": "fr", "seed": "Apple", "searchAlgorithm": "morelike"}, "webHost": [06:31:45] "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443", "client_dt": "2023-10-20T06:28:53.940930", "meta": {"stream": "eventlogging_TranslationRecommendationAPIRequests", "domain": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443"}} [06:31:46] 2023-10-20 06:28:54,450 recommendation.api.types.translation.candidate_finders get_morelike_candidates():39 INFO -- morelike returned 101 results [06:31:46] DAMN ! worker 1 (pid: 134) died, killed by signal 9 :( trying respawn ... [06:31:47] Respawned uWSGI worker 1 (new pid: 159) [06:31:48] ``` [06:34:29] from the logs it is not clear though if it bottleneck is the candidates number though [06:34:55] I don't see a clear timing, like "start of request -> etc.." [06:37:21] morning! [06:41:25] kalimera : [06:41:27] :) [06:42:38] kevinbazira: I checked with nsenter on the pods, and `netstat -tunap` (that shows all the TCP conns etc..) 
has these weird entries [06:42:41] tcp6 0 1 2620:0:860:302:cd:33276 2620:0:860:ed1a::1:443 SYN_SENT 3248538/uwsgi [06:42:44] tcp6 0 1 2620:0:860:302:cd:33266 2620:0:860:ed1a::1:443 SYN_SENT 3248538/uwsgi [06:42:47] tcp6 0 1 2620:0:860:302:cd:33256 2620:0:860:ed1a::1:443 SYN_SENT 3248538/uwsgi [06:43:01] all the other sockets are either ESTABLISHED (connection ok) or TIME_WAIT (connection already completed) [06:43:20] SYN_SENT means that the client started the conn, but didn't get an answer yet [06:43:35] so I think that this may be the issue, we are hanging waiting for something [06:44:15] and if I do a reverse dns lookup, I get text-lb.codfw.wikimedia.org [06:44:43] that is our frontend load balancer, so it seems that we are trying to connect to a wikimedia.org endpoint [06:45:01] (hanging, since it is not allowed, we need to use the envoy proxy) [06:45:47] I thought we took care of the envoy proxy issues o_o [06:47:15] maybe there is still a sneaky one that lingers [06:50:45] does this mean the rec-api is not able to access the wikimedia.org endpoint through the envoy proxy?
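A minimal sketch of the check described above: scan `netstat -tunap` output for sockets stuck in SYN_SENT and group them by remote address. The sample lines are the ones quoted in the chat; the helper name `stuck_connections` and the assumed column layout (proto, recv-q, send-q, local, remote, state, pid/program) are illustrative assumptions, not part of any existing tooling.

```python
from collections import Counter

# Sample lines copied from the netstat output quoted in the chat.
SAMPLE = """\
tcp6 0 1 2620:0:860:302:cd:33276 2620:0:860:ed1a::1:443 SYN_SENT 3248538/uwsgi
tcp6 0 1 2620:0:860:302:cd:33266 2620:0:860:ed1a::1:443 SYN_SENT 3248538/uwsgi
tcp6 0 1 2620:0:860:302:cd:33256 2620:0:860:ed1a::1:443 SYN_SENT 3248538/uwsgi
"""


def stuck_connections(netstat_output):
    """Count SYN_SENT sockets per remote address (hypothetical helper)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state, pid/program
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "SYN_SENT":
            counts[fields[4]] += 1
    return counts


if __name__ == "__main__":
    for remote, count in stuck_connections(SAMPLE).items():
        print(f"{remote}: {count} connection(s) still waiting for an answer")
```

Several SYN_SENT entries towards the same remote address, as here, suggest that the destination is unreachable from the pod rather than merely slow.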
[06:53:12] yes I think that's it, I didn't see the "pageviews" setting, it is not right [06:53:25] localhost:6500 is only for the PHP API [06:53:37] not sure what endpoint "pageviews" needs [06:54:49] ahh snap [06:54:50] query = ${endpoints:pageviews}/per-article/{source}.wikipedia/all-access/user/{title} [06:55:58] yes ok it tries to access https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageview_counts_by_article [06:56:25] that hits restbase IIRC [06:56:52] I don't recall if we have an internal endpoint, lemme think [06:57:12] it hits the AQS service, not sure if with the same URL [06:58:12] so the flow should be [06:58:39] LB (wikimedia.org) -> Restbase (proxy/cache) -> AQS LB -> AQS service [06:59:39] kevinbazira: we have the aqs listener in service proxy, port 6020 [07:00:48] sorry folks but sometimes when I switch to the irccloud tab I write sth and then past messages pop up . so some responses of mine may seem out of context [07:01:02] probably I need to restart or sth [07:01:37] elukey: great, so we should point the pageviews port to 6020 [07:03:09] kevinbazira: in theory yes, but I am not 100% sure if the URLs are the same [07:03:31] the current URL is the external/public one, exposed by restbase [07:03:41] that proxies to our internal endpoint, AQS [07:03:58] IIRC the format is not the same [07:04:04] I am trying to hit the endpoint [07:05:28] since you have access to the pod, is it possible to manually change the listener in the deployment config and port in recommendation/data/recommendation_liftwing.ini so we can test this without a gerrit patch? [07:05:56] kevinbazira: ah no that is too much magic, can't do it :( [07:06:11] ok, should I push a patch for this?
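A small sketch of how the `query` template quoted above could expand, assuming the .ini file is read with Python's `configparser` and `ExtendedInterpolation` (the `${section:option}` syntax suggests this, but the log does not confirm it). The base URL is the value elukey quotes at 07:08:18; the section name `pageviews_queries` and the helper `build_query` are placeholders for illustration.

```python
from configparser import ConfigParser, ExtendedInterpolation

# The [endpoints] value and the query template are quoted from the chat;
# the section name "pageviews_queries" is a placeholder for illustration.
INI_TEXT = """
[endpoints]
pageviews = http://localhost:6500/api/rest_v1/metrics/pageviews

[pageviews_queries]
query = ${endpoints:pageviews}/per-article/{source}.wikipedia/all-access/user/{title}
"""


def build_query(ini_text, source, title):
    """Resolve ${endpoints:pageviews}, then fill the {source}/{title} fields."""
    parser = ConfigParser(interpolation=ExtendedInterpolation())
    parser.read_string(ini_text)
    template = parser.get("pageviews_queries", "query")
    return template.format(source=source, title=title)


if __name__ == "__main__":
    print(build_query(INI_TEXT, source="en", title="Apple"))
```

Because every derived query is built from the single `pageviews` base value, correcting that one option is enough to redirect all pageviews calls to a different listener.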
[07:06:55] kevinbazira: I think we need to follow up with Data Engineering first (they manage AQS) to figure out what is the format of the URLs of the internal API [07:08:18] atm we have pageviews = http://localhost:6500/api/rest_v1/metrics/pageviews [07:08:51] elukey@stat1004:~$ curl http://aqs.discovery.wmnet:7232/api/rest_v1/metrics/pageviews [07:08:54] {"type":"https://mediawiki.org/wiki/HyperSwitch/errors/not_found#route","title":"Not found.","method":"get","uri":"/api/rest_v1/metrics/pageviews"} [07:08:57] kevinbazira: --^ [07:09:17] yes, that was configured based on: https://github.com/wikimedia/research-recommendation-api/blob/3c50d88504e66fd897416374d64235c3c1733234/recommendation/data/recommendation.ini#L3C1-L3C1 [07:09:22] if you have time you can connect to #wikimedia-analytics and ask to the data engineering folks [07:09:35] sure, let me follow up [07:09:43] I need to run an errand, bbiab [08:08:08] Afk commuting [08:50:57] back :) [08:56:34] kevinbazira: how is it going? [08:57:42] elukey: haven't got a response yet. I actually tagged you in the question in #wikimedia-analytics [09:04:28] ok IRC? [09:04:32] *on [09:04:47] I don't see it :( [09:12:32] Here is the screenshot of the message. not sure why you're not seeing it since I tagged you ... https://usercontent.irccloud-cdn.com/file/dhRZgkiu/Screenshot%20from%202023-10-20%2013-07-43.png [09:15:31] kevinbazira: I see that you joined but not msg delivered [09:16:50] I've sent it again. [09:17:01] are you able to see it? [09:17:58] for some reason it wasnt sent previously. now it is there! [09:17:59] now yes [09:18:33] okok [09:18:42] kevinbazira: when you explain on IRC try to add more info about what your goal is, so it will be easier for people to understand (like, what is the listener, etc..) 
[09:19:06] (not everybody worked with it) [09:19:43] in this case we know we want to use the AQS endpoint though, the main issue is to find the right URI [09:27:05] I may have something [09:35:17] (03PS1) 10Elukey: Fix pageviews base endpoint [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/967401 (https://phabricator.wikimedia.org/T348607) [09:35:23] kevinbazira: --^ [09:35:27] I think it should work [09:36:09] so we have to [09:36:14] 1) rebuild the docker image [09:36:28] 2) change the docker image and add the aqs listener in deployments-charts [09:36:58] (03CR) 10Kevin Bazira: [C: 03+1] "ok, let's give this a shot" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/967401 (https://phabricator.wikimedia.org/T348607) (owner: 10Elukey) [09:37:15] elukey: ok, let's give this a shot [09:38:53] will push a patch for the deployments-charts once the new image has been built [09:40:22] (03CR) 10Elukey: [C: 03+2] Fix pageviews base endpoint [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/967401 (https://phabricator.wikimedia.org/T348607) (owner: 10Elukey) [09:41:14] isaranto: if you have time later on https://gerrit.wikimedia.org/r/c/research/recommendation-api/+/966543 [09:42:00] ack, will do! [09:53:21] <3 [09:53:35] kevinbazira: +1ed the change, let's test! (if you have time, otherwise later) [09:55:02] sure, merging now so we can test ... [09:56:52] deploying on staging ... 
[09:57:32] ack [09:57:39] filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/967408 for the SLO metrics calculations [09:57:42] hope that it will work [10:01:12] elukey, the api is returning results now :) [10:01:20] ``` [10:01:20] $ time curl 'https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Apple' [10:01:20] [{"pageviews": 0, "title": "Beetroot", "wikidata_id": "Q99548274", "rank": 243.0}, {"pageviews": 0, "title": "Malus_\u00d7_zumi", "wikidata_id": "Q5990804", "rank": 239.0}, {"pageviews": 0, "title": "PRI_disease_resistant_apple_breeding_program", "wikidata_id": "Q19840594", "rank": 237.0}] [10:01:20] real 0m2.658s [10:01:20] user 0m0.013s [10:01:20] sys 0m0.000s [10:01:21] ``` [10:01:34] 🎉great work both! [10:01:51] agreed [10:04:13] yesssssss [10:05:17] kevinbazira: we could test with 500 quickly to check how fast it is [10:05:30] and possibly also with more workers, just to find a sweet spot for one pod [10:05:34] what do you think? [10:05:39] okok, are you going to do it manually or should I push a patch? [10:05:55] manually yes, is it ok now? Or is it lunch time for you? [10:06:15] no it's fine, let's not break this momentum :) [10:07:16] kevinbazira: ack, new pod is up with 500 [10:07:57] testing ... 
[10:08:17] results: [10:08:17] ``` [10:08:17] $ time curl 'https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Apple' [10:08:17] [{"pageviews": 0, "title": "Peanut_stunt_virus", "wikidata_id": "Q7157945", "rank": 496.0}, {"pageviews": 0, "title": "Kentville_Research_and_Development_Centre", "wikidata_id": "Q4816432", "rank": 495.0}, {"pageviews": 0, "title": "Specific_replant_disease", "wikidata_id": "Q889601", "rank": 494.0}] [10:08:18] real 0m2.950s [10:08:18] user 0m0.010s [10:08:19] sys 0m0.005s [10:08:19] ``` [10:09:21] I tested as well from stat1004, I get something in the range of 2.2~2.7 seconds [10:09:36] so in theory it seems that the max-candidates number is not that heavy [10:10:04] yep, I agree [10:10:36] we could try to bump the workers + cpus, maybe getting up to 4, what do you think? [10:10:46] +1 [10:11:05] after that we'll have to set up a small load test, to have better comparisons [10:11:11] but we can do a quick test now [10:12:25] * klausman lunch [10:16:22] kevinbazira: done! [10:16:30] from a quick test I don't see a big difference [10:17:21] testing ... [10:17:36] results: [10:17:36] ``` [10:17:36] $ time curl 'https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Apple' [10:17:36] [{"pageviews": 0, "title": "Peanut_stunt_virus", "wikidata_id": "Q7157945", "rank": 496.0}, {"pageviews": 0, "title": "Kentville_Research_and_Development_Centre", "wikidata_id": "Q4816432", "rank": 495.0}, {"pageviews": 0, "title": "Specific_replant_disease", "wikidata_id": "Q889601", "rank": 494.0}] [10:17:37] real 0m2.141s [10:17:37] user 0m0.010s [10:17:38] sys 0m0.005s [10:17:38] ``` [10:17:50] elukey: yep, not much is changing. [10:19:14] kevinbazira: we could think about keeping 2 cpus and 2 workers, with 500 max candidates, at least for the moment. 
We rollout the change to staging and prod via deployment-charts, so we are in a kind-of stable state [10:19:42] then I'd say that we could do a little load test in staging, with more URLs (different than the Apple one) [10:20:08] what do you think? [10:20:42] +1 on 2 cpus and 2 workers with 500 max candidates. let me push a patch for that. [10:22:01] results on testing a different article other than Apple: [10:22:01] ``` [10:22:01] $ time curl 'https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Tennis' [10:22:01] [{"pageviews": 0, "title": "2017_Australian_Open_\u2013_Men's_singles_final", "wikidata_id": "Q28600417", "rank": 499.0}, {"pageviews": 0, "title": "ATP_rankings", "wikidata_id": "Q1571500", "rank": 495.0}, {"pageviews": 0, "title": "PointTracker", "wikidata_id": "Q7207919", "rank": 491.0}] [10:22:01] real 0m2.776s [10:22:01] user 0m0.010s [10:22:02] sys 0m0.005s [10:22:02] ``` [10:30:48] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM! I just left a comment with an alternative suggestion for skipping pep8 checks. Feel free to skip it if you don't like it." [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966543 (owner: 10Elukey) [10:40:22] (03CR) 10Elukey: Fix pre-commit errors and bump version (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966543 (owner: 10Elukey) [10:43:14] deployment on staging is done. 
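A rough sketch of the "little load test with more URLs" idea mentioned above, assuming sequential requests and a caller-supplied fetch function. The helper names (`build_url`, `time_requests`, `summarize`) are hypothetical, and this is not a substitute for a concurrent tool; it only compares per-article latency.

```python
import statistics
import time
from urllib.parse import urlencode

# Staging endpoint as used in the curl commands in the chat.
BASE_URL = "https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/"


def build_url(article, source="en", target="fr", count=3):
    """Build the same query string as the curl examples."""
    params = {"s": source, "t": target, "n": count, "article": article}
    return f"{BASE_URL}?{urlencode(params)}"


def time_requests(articles, fetch, clock=time.perf_counter):
    """Call fetch(url) once per article and record the elapsed seconds."""
    timings = {}
    for article in articles:
        start = clock()
        fetch(build_url(article))
        timings[article] = clock() - start
    return timings


def summarize(timings):
    """Return min/median/max over the recorded durations."""
    values = list(timings.values())
    return {"min": min(values), "median": statistics.median(values), "max": max(values)}


if __name__ == "__main__":
    # Placeholder fetch that does nothing; swap in a real HTTP call
    # when running from a host that can reach the staging cluster.
    results = time_requests(["Apple", "Tennis", "Basketball"], fetch=lambda url: None)
    print(summarize(results))
```

Passing `fetch` and `clock` as parameters keeps the timing logic testable without network access.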
[10:43:15] results: [10:43:15] ``` [10:43:15] $ time curl 'https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Basketball' [10:43:15] [{"pageviews": 0, "title": "2019_United_States_FIBA_Basketball_World_Cup_team", "wikidata_id": "Q56042822", "rank": 495.0}, {"pageviews": 0, "title": "Molly_Bolin", "wikidata_id": "Q27451749", "rank": 482.0}, {"pageviews": 0, "title": "Dom_Flora", "wikidata_id": "Q5289728", "rank": 478.0}] [10:43:15] real 0m2.902s [10:43:16] user 0m0.009s [10:43:16] sys 0m0.005s [10:43:17] ``` [10:43:17] going to deploy to prod now ... [10:44:02] super [10:44:47] kevinbazira: we'll need to perform a more precise load test with something like wrk, to have a good idea about how many conns we can sustain etc.. nothing major, but basically like what Aiko and Ilias did for inference-services (in the repo test dir I mean) [10:45:21] sure sure [10:51:16] going afk for lunch! [10:51:32] deployment to both eqiad and codfw is done. [10:51:33] results: [10:51:33] ``` [10:51:33] $ time curl "https://recommendation-api-ng.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Basketball" [10:51:33] [{"pageviews": 0, "title": "2019_United_States_FIBA_Basketball_World_Cup_team", "wikidata_id": "Q56042822", "rank": 495.0}, {"pageviews": 0, "title": "Molly_Bolin", "wikidata_id": "Q27451749", "rank": 482.0}, {"pageviews": 0, "title": "Dom_Flora", "wikidata_id": "Q5289728", "rank": 478.0}] [10:51:33] real 0m1.597s [10:51:34] user 0m0.007s [10:51:34] sys 0m0.007s [10:51:35] ``` [10:51:35] prod is much faster than staging :) [10:51:52] weird, I wouldn't have expected that [10:52:02] we have more pods but it shouldn't change much [10:52:05] mmmm [11:13:25] (03CR) 10Ilias Sarantopoulos: [C: 03+1] Fix pre-commit errors and bump version (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966543 (owner: 10Elukey) [11:20:19] 10Machine-Learning-Team: Configure envoy settings to enable rec-api-ng container to access 
endpoints external to k8s/LiftWing - https://phabricator.wikimedia.org/T348607 (10kevinbazira) We discovered that the [[ https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageview_counts_by_article | pageviews en... [11:29:37] 10Machine-Learning-Team: Refactor LLM class and model server to run locally - https://phabricator.wikimedia.org/T349371 (10isarantopoulos) [11:29:50] 10Machine-Learning-Team: Refactor LLM class and model server to run locally - https://phabricator.wikimedia.org/T349371 (10isarantopoulos) [11:35:40] * isaranto lunch! [11:43:42] 10Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10kevinbazira) In T348607#9267956 we fixed the envoy listener for the pageviews endpoint. Now the rec-api-ng is able to access all external endpoints from k8s/LiftWing. We also run l... [11:46:40] 10Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10kevinbazira) @isaac, the rec-api-ng endpoint now works as shown below. Please let us know whether there are edge cases we might have missed: ` $ time curl "https://recommendation-ap... [11:59:03] elukey: should I wait for Kamila's review of the Golang change? Also, Effie is out til Monday [12:19:21] klausman: nono you can go ahead, Janis signed off, it was to get somebody from service ops [12:19:30] righto [12:26:23] klausman: you need to kick off the build script from root on build2001 [12:26:33] should be build-production-images etc.. [12:26:42] already done :) [12:26:44] (and git pull the commit et.c.) [12:26:45] okok [12:26:50] == Step 2: publishing == [12:26:53] == Build done! == [12:27:20] did it say what image was published? [12:27:21] it's not visible yet on https://docker-registry.wikimedia.org/, so I suspect it's just some sync delay [12:27:31] yeah but can you pull it? [12:28:08] seems good yes, okok [12:28:18] good! 
[12:28:19] yes [12:28:43] And with that Yak shaven, I can now go back to the kserve update :D [12:32:20] which still can't fetch that very image I just built (and fetched :-/ [12:39:37] (03PS7) 10Elukey: Add precommit support [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966542 [12:39:45] (03PS7) 10Elukey: Fix pre-commit errors and bump version [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966543 [12:40:05] (03CR) 10Elukey: Fix pre-commit errors and bump version (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966543 (owner: 10Elukey) [12:40:15] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add precommit support [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966542 (owner: 10Elukey) [12:42:42] (03CR) 10Elukey: [C: 03+2] Fix pre-commit errors and bump version [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966543 (owner: 10Elukey) [12:43:12] all right rec-api-ng now with pre-commit support [12:43:59] (03Merged) 10jenkins-bot: Fix pre-commit errors and bump version [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/966543 (owner: 10Elukey) [12:44:38] kevinbazira: I am going to update deployment-charts with the new image, since it uses the latest debian bullseye (so we fix some security concerns) [12:47:20] sure sure [12:53:14] isaranto: created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967442/ to add the new env variable for revert risk (to merge before your patch for bullseye updates) [12:53:24] elukey: it looks like the image_tag thing breaks with the golang image name. I suspect because there is a dot in it [12:53:41] what is the image_tag thing? [12:53:44] If I use the naked image name (no {{ .. | }}), it works [12:54:01] the very thing you mentioned on the other patch as forgetting about it [12:54:34] sure but could you explain what is "works"? :) [12:54:39] local build, etc..? 
[12:54:39] `FROM {{ 'golang1.21' | image_tag }} as build` does not work [12:54:48] `FROM docker-registry.wikimedia.org/golang1.21 as build` works [12:54:58] and yes, local build. [12:55:17] what is the error? [12:55:30] do you have golang1.21 in your local docker? [12:55:42] I think that it needs to be there if you want to test it locally [12:56:45] Ok, so without golang present, I get: [12:56:57] 2023-10-20 14:56:53,145 [docker-pkg-build] ERROR - Unexpected error building image docker-registry.wikimedia.org/kserve-build:0.11.1: Image docker-registry.wikimedia.org/golang1.21 not found (image.py:208) [12:57:14] docker pull docker-registry.wikimedia.org/golang1.21 # works fine [12:57:33] after that pull, I still get the same error as before [12:57:40] (when using build [12:57:41] ) [12:58:20] if I use said image name in the FROM line (no {{}} at all), build works fine [12:59:20] I don't think it's a local setup problem, as testing the making of that earlier golang1.21, everything was working [12:59:49] again: I suspect whatever {{ .... | image_tag }} does breaks because there is a dot in the image name [13:01:23] Friday! [13:02:01] Correct :) [13:18:47] klausman: can you send me a diff or file a patch so I can test? [13:28:24] I'll put it on Gerrit, is that ok? [13:30:18] yep [13:31:34] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/967451 [13:35:33] klausman: I commented, I think you are missing a -1 in the changelog [13:36:21] But that sure won't break building the image? 
[13:36:42] it is not a valid debian version afaics [13:37:35] I'm not saying it doesn't need to be fixed, but adding it also doesn't make the build work [13:38:03] sure sure [13:38:27] but the dot in the name shouldn't count, we already build with golang1.18 [13:38:51] Well, that didn't work, either, and not because kserve needs 1.20+ [13:39:03] maybe this broke since the last time we built kserve images [13:40:43] ah ok I found the issue [13:40:48] golang1.18 (1.18-1) wikimedia; urgency=high [13:40:49] vs [13:40:54] golang1.21 (1.21) wikimedia; urgency=high [13:41:03] so the issue is the 1.21 tag [13:41:45] oh, so the missing -1 on the golang image is the problem. Gotcha, sending a patch [13:42:06] I think so, but since you already published it we need a new entry in the changelog [13:43:42] klausman: you can try to build locally, hopefully it will work [13:46:04] Machine-Learning-Team: Refactor LLM class and model server to run locally - https://phabricator.wikimedia.org/T349371 (isarantopoulos) [13:46:05] REPOSITORY TAG IMAGE ID CREATED SIZE [13:46:08] docker-registry.wikimedia.org/golang1.21 1.21-1 acbcd2305163 35 seconds ago 621MB [13:46:18] ^^^ I now have that, but building kserve still fails with the same error [13:46:20] (PS1) Ilias Sarantopoulos: llm: refactor to run locally [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967455 (https://phabricator.wikimedia.org/T349371) [13:46:36] okok but we need to fix it anyway [13:46:50] so let's start with that code change [13:47:13] o/ doint pay attention to the above patch it is WIP for now. I'll put it as active and add reviewers when it is tested [13:47:20] *don't [13:47:51] elukey: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/967456 [13:49:09] ok perfect [13:49:55] it's only the tag in (), right, nothing else?
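A hedged sketch of the changelog comparison above: parse the first line of a Debian-style changelog entry and check whether the version carries a `-<revision>` suffix. The helper names are hypothetical. Whether docker-pkg strictly requires the suffix is not established by this log, and later messages suggest the build directory also played a role, so this only flags the difference between the two headers quoted in the chat.

```python
import re

# First line of a Debian-style changelog entry:
#   name (version) distribution; urgency=level
HEADER_RE = re.compile(
    r"^(?P<name>\S+) \((?P<version>[^)]+)\) (?P<dist>[^;]+); urgency=(?P<urgency>\S+)$"
)


def parse_header(line):
    """Split a changelog header line into its named parts."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a changelog header: {line!r}")
    return match.groupdict()


def has_debian_revision(version):
    """True when the version ends in '-<revision>', e.g. '1.21-1'."""
    upstream, sep, revision = version.rpartition("-")
    return bool(sep and upstream and revision)


if __name__ == "__main__":
    # Both header lines are quoted from the chat.
    for header in (
        "golang1.18 (1.18-1) wikimedia; urgency=high",
        "golang1.21 (1.21) wikimedia; urgency=high",
    ):
        info = parse_header(header)
        print(info["name"], info["version"], has_debian_revision(info["version"]))
```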
[13:50:58] in theory yes [13:51:08] ok, merged and building now [13:52:24] I left a comment saying "wait for Janis approval first" [13:52:30] but he did [13:52:43] super didn't see [13:52:44] my bad [13:53:46] Hey, it's Friday. [13:55:51] Still the same error when trying to build kserve. [13:55:54] the kserve build seems working now [13:56:03] at least for me locally [13:56:13] Still broken locally for me [13:56:22] try to remove all golang local images [13:56:27] 1.21 I mean [13:57:19] yeah, done that, still no joy [13:57:49] it is working for me, I can see it building [13:57:57] * Built image docker-registry.wikimedia.org/kserve-build:0.11.1-1 [13:58:45] klausman: what do you have in `docker image ls | grep golang` ? [13:58:59] Nope. [13:59:06] er, nothing [13:59:28] and if you run `docker-pkg build images` do you get the same error? [14:00:07] Running that now [14:01:04] Now it seems to work. Bloody voodoo. [14:01:08] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) Load testing was done using input text ranging from 5 to 500 words. Although 1-2 sentences would be...
[14:04:19] klausman: if you run the build for a specific directory, it is different than running it on the images dir as a whole [14:04:29] not bloody voodoo :) [14:04:47] It really shouldn't be different, or at least emit some warning [14:05:53] Machine-Learning-Team: [CI] Update pre-commit versions in inf-services repo - https://phabricator.wikimedia.org/T349382 (isarantopoulos) [14:06:00] klausman: https://github.com/wikimedia/operations-docker-images-docker-pkg#build-the-images [14:06:04] Machine-Learning-Team: [CI] Update pre-commit versions in inf-services repo - https://phabricator.wikimedia.org/T349382 (isarantopoulos) [14:06:09] it looks for the image tags in the various changelogs [14:07:07] But the golang1.21 build worked just fine that way [14:07:22] maybe because it was entirely new, not an update? [14:07:59] with the base images it is different, your only requirement in there was bookworm [14:09:34] (PS1) Ilias Sarantopoulos: ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) [14:15:28] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) (owner: Ilias Sarantopoulos) [14:17:05] made an attempt to update pre-commit versions in inf services. Things seem smooth locally [14:17:37] is there any hack in gerrit/blubber to trigger all test pipelines?
[14:18:14] isaranto: mmm you can try to log in to integration.wikimedia.org and trigger one job manually [14:18:29] (03PS2) 10Ilias Sarantopoulos: ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) [14:19:49] although: local run formats and checks all the repo so it would be fine [14:20:01] * klausman going for an eagle stomp, bbiab [14:44:30] (03CR) 10Elukey: [C: 03+1] ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) (owner: 10Ilias Sarantopoulos) [15:13:24] Going afk for the weekend folks. Continuing on Monday :). Have a great weekend all! [15:20:11] o/ [15:36:18] going afk as well! o/ [15:38:00] \o [16:19:29] 10Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10Isaac) thanks @kevinbazira ! That API call worked for me as well. A few notes: * When I add the `&pageviews` parameter, the `pageviews` values in the response all become `null` inst... [16:27:51] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10mfossati) =====Weekly update===== The SEAL pipeline is currently running in production... [16:59:31] 10Machine-Learning-Team, 10Data-Engineering, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 3), and 2 others: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ah... 
[21:33:12] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Sprint 1 (Growth Team)), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Etonkovidova) Open→Resolved