[06:09:49] good morning! [06:47:01] Good morning [07:07:32] good morning :) [07:40:29] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10MediaWiki-Recent-changes, 06Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10937274 (10isarantopoulos) [07:41:14] morning! [07:41:43] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10MediaWiki-Recent-changes, 06Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10937275 (10isarantopoulos) [08:38:28] isaranto: georgekyz: Would you have time today to take a small look at the last CI patch removing unnecessary pre-commit jobs? https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1159368 [08:41:31] sure! [08:46:23] bartosz: yeap [09:52:20] (03CR) 10Gkyziridis: [C:03+1] "Thank you for working on this one." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1159368 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:10:20] Folks, is there a limitation on the size of the images that we are pushing to docker-registry? I managed to set up kokkuri and make it work, but it is still failing to push to the registry. I already reported that to the RelEng folks. Does anybody have any more info on that? [10:10:46] There is more info here: https://phabricator.wikimedia.org/T396495#10934785 [10:24:55] (03PS4) 10Máté Szabó: Map pre-save RR scores to predefined values [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1160196 (https://phabricator.wikimedia.org/T364705) [10:24:59] (03CR) 10Máté Szabó: Map pre-save RR scores to predefined values (033 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1160196 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [10:25:57] 06Machine-Learning-Team: ML Services causing log spam - https://phabricator.wikimedia.org/T393475#10937757 (10Ladsgroup) It seems it started log flooding again. Here is just the staging causing 250K logs in 15 minutes: https://logstash.wikimedia.org/goto/f2c444ed44f453d25636ad25739a0897 If I'm reading logs... [10:51:02] 06Machine-Learning-Team: ML Services causing log spam - https://phabricator.wikimedia.org/T393475#10937850 (10isarantopoulos) 05Resolved→03Open [11:01:40] isaranto: Did the backfill script report accurate information after our changes? [11:02:26] yes it seems to work now! I'll report in the task [11:08:13] what I didn't get is that it doesn't score the number of revisions you define in the `number` cmd arg, but that probably has to do with the query [11:08:33] I mean that you may define 10k but it scores 5-6k [11:21:02] hmmm interesting... [11:21:08] 06Machine-Learning-Team: ML Services causing log spam - https://phabricator.wikimedia.org/T393475#10937900 (10isarantopoulos) Thanks for reporting! These messages seem to come from many servers (prod & staging) and from multiple model servers. So it was not related to load testing on ml-staging. Also not all of... [11:23:17] 06Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10937907 (10gkyziridis) I think there is an issue on the size of the docker-image that we are trying to push to docker-registry. I created a very simple and small `hello-world` image a...
[11:24:40] Folks, I think the issue I was facing when pushing the image to the registry is the size of the image, because it contains the model as well... I built a simple `hello-world` example using exactly the same kokkuri structure and managed to push it to the registry. So that means the issue was indeed the size of the image. More info here: https://phabricator.wikimedia.org/T396495#10937907 [12:07:34] (03PS5) 10Máté Szabó: Map pre-save RR scores to predefined values [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1160196 (https://phabricator.wikimedia.org/T364705) [12:15:22] FIRING: SLOMetricAbsent: linkrecommendation-requests - https://slo.wikimedia.org/?search=linkrecommendation-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:21:05] (03CR) 10Kosta Harlan: [C:03+2] Map pre-save RR scores to predefined values [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1160196 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [12:29:05] (03Merged) 10jenkins-bot: Map pre-save RR scores to predefined values [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1160196 (https://phabricator.wikimedia.org/T364705) (owner: 10Máté Szabó) [12:35:22] RESOLVED: SLOMetricAbsent: linkrecommendation-requests - https://slo.wikimedia.org/?search=linkrecommendation-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:56:45] kart_, klausman o/ [12:57:35] so Janis and Raine had to revert the last machine translation deployment in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162894 because they needed to deploy to codfw [12:57:58] the main issue is that the new code (https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1147812/7/entrypoint.sh) tries to fetch data from analytics.wikimedia.org [12:58:20] I think that BASE_URL doesn't get changed based on the input [12:58:37] also, one thing missing in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162894 are the network policies for Thanos Swift [13:05:58] FYI it still does not come up after reverting [13:06:01] Downloading https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb200-600M/nllb200-600M.tgz [13:06:02] failed: Connection timed out. [13:06:39] we will leave it _undeployed_ and _depooled_ in codfw, kart_, klausman, please fix [13:11:09] Kartik pinged me earlier about some S3 stuff, will take a look at the net policies [13:13:53] klausman: the entrypoint.sh script is also to be fixed, I think we'd need kart_ for it [13:13:58] :( [13:16:25] netpol patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162900 [13:17:42] klausman: could you please produce the working version from before the changes? [13:17:52] before rolling forward? [13:18:26] we need a version that is deployable to prod (e.g. the version that is deployed to eqiad) [13:18:31] Probably [13:20:45] Actually, I am not sure. I was not part of all the reviews, so digging it up will take a while [13:20:56] elukey@people1004:/home/santhosh/public_html/nllb/nllb200-600M$ ls [13:20:56] config.json model.bin sentencepiece.bpe.model shared_vocabulary.txt [13:21:10] I don't see nllb200-600M.tgz [13:21:30] there is model.bin, but no idea if it is the right one [13:22:28] (same for 2003, the actual CNAME for discovery) [13:24:37] Sorry, folks. IRC doesn't seem to be sending notifications :/ [13:24:42] jayme: I think that it is fine to keep codfw depooled for machine translation, is it blocking you right now?
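A quick way to sanity-check the "image is too big" theory before pushing is to build locally and look at the per-layer sizes; the layer that bakes in the model will dominate. A minimal sketch, assuming the image can also be built with plain docker outside of kokkuri; the tag name is purely illustrative.

  # Build locally and inspect sizes (tag name is hypothetical).
  docker build -t training-image:local .
  docker image ls training-image:local     # total image size
  docker history training-image:local      # size per layer; the model layer will stand out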
[13:24:45] hey kart_ o/ [13:25:24] elukey: unfortunately we can't depool a single ingress service from LVS, just the whole ingress [13:25:38] version 0.0.20 of the MT chart is from July last year. Was this never pushed to prod? [13:25:50] I mean the intermediate versions (21-23) [13:25:54] jayme: ah snap it is ingress [13:25:57] klausman: yes. we didn't deploy anything. [13:26:43] I can make a revert-like patch to that version (commit 65b765cc9e96115fb62f57d57253f5b8895e74bc), which should get us most of the way to having something that resembles what is now running in prod. [13:26:48] kart_: if I ssh to people2003.wikimedia.org and look for https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb200-600M/nllb200-600M.tgz in the home dir I don't find anything [13:27:06] I mean /home/santhosh/public_html/nllb/nllb200-600M [13:27:17] is there any chance that the file got moved/renamed? [13:27:22] there is a model.bin [13:28:16] the other alternative is fixing https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1147812/7/entrypoint.sh to properly set BASE_URL [13:28:24] and use the s3 endpoint [13:30:18] Yes, that's what we did in: https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1147812/7/entrypoint.sh and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1159696 [13:30:54] Then, s3cmd seems to require a config file [13:31:50] ERROR: Configuration file not available. [13:31:50] ERROR: Consider using --configure parameter to create one. [13:31:50] ERROR: /home/somebody/.s3cfg: None [13:32:00] elukey: ^ [13:32:34] I was trying to add parameters to s3cmd, so we can fix this. Had a discussion with klausman just while I was deploying in staging. [13:33:19] kart_: my understanding from Janis' errors is that the staging pod tried to connect to analytics.wikimedia.org using S3 [13:33:29] where do you see the above errors? [13:35:17] https://logstash.wikimedia.org/app/dashboards#/view/fb6c6d50-eff0-11ed-8c01-973304d7b1ca?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3A'2025-06-23T12%3A30%3A00.000Z'%2Cto%3A'2025-06-23T12%3A40%3A00.000Z')) [13:35:18] you may also need to run `s3cmd --configure` before the get, yes [13:35:46] does running --configure use the env vars to populate the file? [13:36:05] in theory it should [13:36:14] oh, in that case, SREs can probably help :) [13:36:26] kart_: but you can also see Downloading using s3cmd: https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434/nllb/nllb200-600M.tgz [13:36:31] that is clearly wrong [13:36:50] Yes, it seems to be falling back. Which is wrong. Let me fix that. [13:37:58] We've set BASE_URL properly in values.yaml, wondering how it goes back to entrypoint.sh [13:39:24] kart_: I think that at line 6 you now have "BASE_URL="https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434"" [13:39:31] it doesn't pick up the env var afaics [13:40:25] (if there is somewhere else in entrypoint.sh where you fetch it lemme know, I didn't read the whole thing) [13:40:32] Yeah, that unconditionally overwrites the k8s-provided env var [13:41:30] The function defined at line 10 is used to download the assorted files; $1 is the URL, and the saved-to file is $2 [13:44:06] kart_: I think either deleting line 6 entirely would be the right thing, since even if we wanted to fall back to the analytics URL, we can't use s3cmd to fetch those files.
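On the s3cmd side: instead of depending on a pre-generated ~/.s3cfg, the credentials and endpoint can be passed directly on the command line. A hedged sketch follows; the endpoint host, bucket/object path and the environment variable names are assumptions for illustration, not the values the chart actually injects.

  # Fetch the model tarball from the Swift/S3 endpoint without a ~/.s3cfg.
  # Endpoint, bucket/key and env var names below are hypothetical.
  s3cmd get \
    --access_key="${S3_ACCESS_KEY}" \
    --secret_key="${S3_SECRET_KEY}" \
    --host="thanos-swift.discovery.wmnet" \
    --host-bucket="thanos-swift.discovery.wmnet" \
    "s3://machinetranslation/nllb/nllb200-600M.tgz" "/models/nllb200-600M.tgz"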
[13:44:15] s/either// [13:44:58] maybe add a piece of code that emits a useful error if BASE_URL is unset at the beginning of entrypoint.sh [13:47:24] Noted. Let me work on it. Code needs more readability as well. [13:47:52] ack. feel free to ping me for code review [13:48:09] We need two different URLs for sure: one for local setup, one for production. [13:48:39] kart_: we'd need to find a quick fix for the wikikube cluster though, the service ops team is rolling out changes and they had to deploy machine translation again. They rolled back the last config, but peopleweb doesn't contain the right model anymore [13:49:03] so I'd suggest to possibly help them re-deploy prod in codfw now, and then work on the s3 fix later if possible [13:52:07] oh, that's because codfw was reimaged. Got it. [13:52:35] exactly yes, they are upgrading k8s IIUC [13:52:50] but they cannot depool only machine translation etc.. [14:00:13] Figuring out why files were moved out. Let me check if we can put that back quickly. Preparing tarball. [14:03:54] super <3 [14:14:52] elukey: can you do the following on people.wikimedia.org? [14:14:52] cp /home/kartik/public_html/nllb/nllb200-600M.tgz /home/santhosh/public_html/nllb/nllb200-600M/ [14:15:15] That's one missing file. [14:19:02] sure [14:19:06] could you please ping me/claime/Raine when you managed to re-deploy a working version to codfw so we can repool stuff? [14:21:12] deploying it now [14:22:26] (03CR) 10Bartosz Wójtowicz: [C:03+2] "Finishing the CI cleanup <3" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1159368 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [14:24:02] mmm same issue [14:24:03] Downloading https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb200-600M/nllb200-600M.tgz [14:24:06] failed: Connection timed out [14:24:36] Sounds like some netpol still missing [14:25:09] something with people.w.o? [14:25:20] afaics the network policies for people.w.o are there [14:25:56] In the file I made additions to, there are people1004 and 2003, but I don't know if that's all possible hosts [14:26:20] from deploy1003 I can fetch curl https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb200-600M/nllb200-600M.tgz [14:26:32] 2003 is the active host so it should be copied there [14:27:26] From cumin2002 I can't [14:27:39] (to note, that's the main reason why we prepared s3 storage :/) [14:28:26] I nsentered in the pod and I see [14:28:28] tcp6 0 1 2620:0:860:cabe:3:33586 2620:0:860:104:10:1:443 SYN_SENT 3966148/wget [14:28:34] from deploy2002, fetch works, so the cumin failure is a red herring [14:29:25] there was a timeout issue discussed in https://phabricator.wikimedia.org/T383750. At some point apache/envoy/varnish/whatever closes the connection when the bandwidth is too small and the download takes more time [14:30:22] (03Merged) 10jenkins-bot: ci: Remove unnecessary CI stages running lint checks on subset of repository. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1159368 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [14:30:52] jelto: o/ afaics there is a SYN that gets stuck while fetching 2620:0:860:104:10:1 [14:30:57] fetching from [14:31:27] and it is the wget in entrypoint.sh [14:32:05] ah it is probably truncated [14:32:08] inet6 2620:0:860:104:10:192:48:214/64 scope global [14:32:23] I was wondering about that [14:32:28] okok so it tries to fetch from people2004 but via ipv6, and we don't have the net policies [14:32:45] huh?
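For reference, the two options discussed for the top of entrypoint.sh look roughly like this. A sketch only: the fallback URL is the one quoted above, the error message wording is made up, and the two options are alternatives rather than meant to be combined.

  #!/bin/bash
  # Option 1: keep a fallback, but let a BASE_URL set in the container environment win
  # (the "${BASE_URL:-...}" form that comes up later in the discussion):
  BASE_URL="${BASE_URL:-https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434}"

  # Option 2: drop the hardcoded default entirely and fail fast with a useful error
  # if the deployment did not set it:
  : "${BASE_URL:?BASE_URL must be set via the helm values / container environment}"

  echo "Fetching models from ${BASE_URL}"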
[14:33:08] Oh, 200*4* [14:33:29] Should I add that to my patch? or make a separate one? [14:34:10] There is no people2004 [14:34:26] people2003.codfw.wmnet [14:34:27] no sorry 2003, the IP seems to be present for ipv6 [14:34:31] and 2003's IPv6 address (2620:0:860:104:10:192:48:214/128) is in the netpol [14:34:35] then I am not sure why the syn is stuck [14:35:17] I suspect that the cluster has some issues talking ipv6 to the other hosts [14:35:28] we could try tshark'ing on 2003, see if the SYN ever makes it there, but I dunno what the source IP would be [14:36:41] usually if it is stuck on the container's netstat it is blocked completely [14:37:02] I am not 100% sure if the wikikube clusters can talk ipv6 without issues [14:37:14] do the pods have ipv6 addresses? [14:37:58] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10938583 (10OKarakaya-WMF) research-datasets: I've updated it [here](https://gitlab.wikimedia.org/repos/research/research-datasets/-/co... [14:37:59] we could simply use wget --inet4-only [14:38:14] but we need a new docker image :( [14:39:39] klausman: please take it from here, I have to work on the I/F hackathon [14:39:45] Ack [14:39:50] klausman: seems, [14:39:50] -BASE_URL="https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434" [14:39:50] +BASE_URL="${BASE_URL:-https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434}" [14:39:50] will fix the BASE URL issue. [14:39:58] My suggestion would be to force wget to use ipv4 [14:40:27] kart_: we should change entrypoint.sh to use the flag Luca has mentioned, and then we can do a proper fix at leisure [14:40:59] sure. But that requires revert of the latest entrypoint.sh as well :/ [14:41:54] Well, there's only so much we can do in a short time. And it's not like the code is gone-gone [14:42:11] yeah :) [14:44:16] klausman: what flag do we need to force ipv4? [14:44:26] -4 or --inet4-only [14:45:00] cool. Submitting revert with this. [14:45:43] Are you autobuilding the Docker images using CI? [14:49:00] yes. it is built automatically. [14:49:23] one minute, patch is in progress. [14:49:28] ack [14:50:03] https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1162924 [14:51:25] So this is basically like a full revert, but with -4 thrown in and using {:} for BASE_URL? [14:51:44] No, full revert + ipv4 [14:52:04] BASE_URL goes back to old people.w.o host [14:52:11] Righto. [14:52:18] +1'd [14:52:30] So, I missed the : part in the new patch, will fix that later. [14:53:55] OK. So, once the patch is merged, we need to deploy the new docker image. Given the mess with config, how should we do it? [14:54:20] Revert to chart version as well? [14:55:26] So the question is if using the new image would be enough to make it work (I suspect so). The env for S3 being there shouldn't affect anything. And if the service runs ok with no diff from Helm being visible, we should be good. [14:55:41] jayme: I presume the above would be enough for your purposes? [14:56:06] yep yep, let's build and deploy [14:56:18] I'm happy with whatever creates a working machinetranslation deployment [14:56:36] ack, then let's proceed. [14:57:02] (as far as I understand it, actually fetching the binaries is the only thing broken rn, so -4 should fix it) [14:57:12] and your commitment to not merge production changes without rolling things out to production.
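The quick fix being reverted to here is simply forcing IPv4 in the download helper. A minimal sketch of what that part of entrypoint.sh could look like; the function name and paths are illustrative, only the wget flags come from the script discussed above.

  # Download helper with IPv4 forced (-4 / --inet4-only).
  fetch() {
      local url="$1" dest="$2"
      wget --inet4-only --show-progress --progress=bar:force:noscroll -O "${dest}" "${url}"
  }

  fetch "${BASE_URL}/nllb200-600M.tgz" "/models/nllb200-600M.tgz"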
We need to have deployment-charts in a deployable state all the time [14:57:36] this is why we have different releases/release-values etc. for prod and staging [14:58:04] I wasn't even aware of the accumulated difference between the repo and prod [14:58:21] not you personally...the collective ml you [14:58:45] jayme: Noted. I was in preparation for moving to s3, so it got pending changes in long term. [14:58:51] It was me klausman :) [15:00:11] https://integration.wikimedia.org/ci/job/trigger-machinetranslation-pipeline-publish/182/console DI build running [15:00:11] understood. But that's what helmfile.d/services/machinetranslation/values-staging.yaml is for. Until you actually want to change prod. [15:01:55] Noted. I think we never really used it properly. [15:03:02] We changed values.yaml and deployed to staging so far :/ [15:03:26] We can work on having a better prod/staging split later, I can help with that [15:04:40] Thanks [15:08:03] 👍 [15:10:04] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162929 need to deploy. [15:10:37] LGTM [15:16:26] klausman: should we deploy to staging? [15:16:36] That's mergeyes [15:16:44] oops, c&p garbage [15:17:08] Though oddly, I don't see the new image in the diff [15:17:29] ah, nevermind, hand't scrolled far enough [15:17:37] - image: "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-machinetranslation:2025-02-05-115716-production" [15:17:37] + image: "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-machinetranslation:2025-06-23-145751-production" [15:17:37] [15:17:58] want me to apply? [15:18:05] sure [15:18:10] go ahead [15:19:12] 06Machine-Learning-Team, 06Moderator-Tools-Team: AI/ML Infrastructure Request: Persist historical revert risk multilingual model scores for threshold analysis - https://phabricator.wikimedia.org/T397187#10938736 (10Kgraessle) Hi @SSalgaonkar-WMF apologies for the delay in responding. > Which KRs for FY2025-2... [15:19:50] much better now: [15:19:51] Downloading https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb200-600M/nllb200-600M.tgz [15:20:09] Not partying until the thing starts :) [15:21:21] mmmh. the entrypoint script calls wget with --show-progress --progress=bar:force:noscroll, should that mean we should see some progress indication in kubectl logs? [15:21:27] failed: Connection timed out. [15:21:28] exactly [15:21:29] tcp 0 1 10.64.65.54:50460 10.192.48.214:443 SYN_SENT 970832/wge [15:23:14] So helmfile.d/services/machinetranslation/values.yaml has the relevant netpols, but helmfile.d/services/machinetranslation/values-staging.yaml doesn't mention any, does that mean they are taken from the former? [15:23:26] (must be, as this used to work) [15:24:20] yes [15:26:40] the ip port is right [15:27:04] at this point maybe on people2003 there are special rules? [15:27:10] nsenter foo telnet 10.192.48.214 443 also hangs [15:27:30] so the pods are connecting directly to people ? [15:27:30] On the people machine I can just see the following firewall rules: [15:27:36] ip saddr @CACHES_ipv4 tcp dport 443 accept [15:27:36] ip saddr @DEPLOYMENT_HOSTS_ipv4 tcp dport 443 accept [15:27:36] ip6 saddr @CACHES_ipv6 tcp dport 443 accept [15:27:36] ip6 saddr @DEPLOYMENT_HOSTS_ipv6 tcp dport 443 accept [15:27:43] no mention of kubernetes subnets [15:28:18] Which immediately begs the question how this ever worked [15:28:43] did you use the discovery before or the public name people.wikimedia.org ? [15:28:57] I don't know. kart_ do you know? [15:29:03] discovery [15:29:09] Ah. 
[15:29:14] jelto: we try to get https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb200-600M/nllb200-600M.tgz [15:29:56] but yeah at this point people is missing the IP ranges for wikikube [15:30:05] I can try adding a rule which allows wikikube staging on port 443 manually, if that works we can puppetize this [15:30:22] I think it would be great, klausman ok for you? [15:30:36] give me a sec, I'll let you know when the rule is present [15:30:37] yeah, sure [15:33:42] just now switched to 'refused' instead of timeout [15:33:47] I added the rules manually, can you retry? [15:33:48] aaand there goes progress [15:33:49] ip saddr @STAGING_KUBEPODS_NETWORKS_ipv4 tcp dport 443 accept [15:33:49] ip6 saddr @STAGING_KUBEPODS_NETWORKS_ipv6 tcp dport 443 accept [15:34:00] nllb200-600M.tgz 100%[===================>] 1.35G 112MB/s in 13s [15:34:08] 🚀 [15:34:09] klausman: yes. it used to work and then boom. [15:34:22] Downloading https://peopleweb.discovery.wmnet/~santhosh/nllb/nllb-wikipedia/config.json [15:34:24] failed: Connection refused. [15:34:33] and then it just works. Odd [15:35:11] that notice seems odd for sure. [15:35:33] klausman: it worked because of the ipv4 flag? [15:35:48] There are multiple things going on [15:35:53] we have to fix it for analytics.w.o as well? [15:36:22] For one thing, the -4 did not make it into this image, hence probably the ECONN, since it still tries v6. [15:36:56] So the download works now? [15:37:05] It's still downloading stuff, so I don't want to interrupt it [15:37:13] Yes, v4 seems to work now [15:38:33] and the kubectl apply just timed out [15:38:48] Still, let's let it download everything, see if anything else fails [15:39:30] [2025-06-23 15:39:23 +0000] [135] [INFO] Booting worker with pid: 135 [15:39:45] seems OK? [15:40:21] I think it's still initializing [15:40:51] you said the apply timed out - seems like a different issue to fix, as the download will take even longer for eqiad/codfw [15:41:28] So, I think the existing deployment just succeeded since j_elto added the FW rule and the update failed because it just plain took too long. [15:41:35] I'll retry the apply [15:42:46] restarting and no more ECONNs [15:42:58] you can try increasing the helmfile/helm timeout [15:43:05] curl 'https://machinetranslation.k8s-staging.discovery.wmnet:30443/api/translate' -X POST -H 'Content-Type: application/json' --data-raw '{"source_language": "en", "target_language": "wuu", "model": "madlad-400", "format": "text", "content":"Jazz is a music genre"}' says: no healthy upstream [15:43:24] yeah, because it's restarting with the -4 change atm [15:43:33] ah ok [15:46:52] Ok, should be working now, can you test again? [15:47:16] Also, from helm's POV, the deployment succeeded [15:47:18] Yes. Worked. [15:47:25] cool. [15:47:53] Ok, we can now also apply this to eqiad [15:48:05] nice. I'll watch logs. [15:48:13] codfw, but I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1162952 needs to be deployed first [15:48:14] one sec [15:48:22] not sure if Jelto added wikikube ranges as well [15:48:41] yes give me a sec for pcc to finish, I just added staging ranges manually [15:48:50] ty! [15:52:02] klausman: I also added the wikikube ranges manually, you can proceed.
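The manual change jelto describes boils down to two nftables rules on the people host, matching the ones quoted above. A sketch of adding them by hand; the table and chain names ("inet filter" / "input") are assumptions about the host's nftables layout, only the set names and rule bodies come from the log.

  # Table/chain names are assumed; set names and rule bodies are the ones quoted above.
  sudo nft add rule inet filter input ip  saddr @STAGING_KUBEPODS_NETWORKS_ipv4 tcp dport 443 accept
  sudo nft add rule inet filter input ip6 saddr @STAGING_KUBEPODS_NETWORKS_ipv6 tcp dport 443 accept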
I'll refine the puppet patch in a sec [15:52:14] I gtg - when you've fixed the deployment, please contact jelto/claime/Raine for the repool [15:52:32] will do [15:54:00] can confirm the first downloads are happening, so it works in eqiad as well [15:54:39] yes [15:59:55] yeah, the startup timeout for this is too short, k8s just killed the pod because the downloads took too long [16:00:16] klausman: are you also deploying to codfw? [16:00:48] No, there is no deployment for this service there, AFAICT [16:01:03] Oooh, or is that due to the reimage? [16:01:13] yes [16:01:27] codfw was wiped and the service was running there before afaik [16:01:38] ok, will see to that as well [16:01:46] so it has to be redeployed to unblock pooling the cluster ingress again [16:02:41] exactly yes :) [16:03:03] otherwise Jelto will finish the codfw upgrade at midnight :D [16:03:12] kart_: we need to increase the readiness timeout, we can't deploy like this [16:03:21] klausman: q: If helm times out, will the deployment continue in the background? [16:03:41] No, the pods are killed since they don't become ready soon enough [16:03:55] klausman: similarly, like this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1125093 [16:05:15] I had a patch for staging (see values-staging.yaml) but was not confident enough about it either. [16:06:06] elukey: which one is the startup timeout again? readiness_probe or liveness_probe? [16:06:27] kart_: in principle yes, just not sure which probe type this was [16:07:41] klausman: readiness, but if you see failures in the helm timeout, you'll need to bump the max in helmfile.yaml [16:07:46] it is 10 mins by default [16:07:54] well, helm timed out since the pod started over [16:08:09] but wait, it worked in eqiad and not in codfw? [16:08:12] So it's not a helm-side problem, really [16:08:23] No, it worked in staging and so far nowhere else [16:08:33] The downloads just take forever [16:08:44] several minutes [16:09:59] in eqiad I see [16:09:59] 4m55s Warning Unhealthy pod/machinetranslation-production-748ddb8678-tx7l2 Readiness probe failed: Get "http://10.67.145.126:8989/healthz": dial tcp 10.67.145.126:8989: connect: connection refused [16:10:08] more than 10 minutes? then increase the helmfile timeout [16:10:37] The download would have completed in <10m, but not before k8s thought the startup took too long (like 5m) [16:11:08] okok then the readiness probe would need a bump, because when it finishes the retries it gives up [16:11:11] then the container started again, and eventually helm gave up [16:11:27] I'll live-hack it to 450s (7.5m), that should be enough [16:11:56] but any new deployment will suffer the same problem [16:12:22] in theory we should be able to tune .Values.app.readiness_probe [16:12:30] I am just trying to get this deployed now, and we'll see about fixing it properly afterwards [16:13:50] deployment in both eqiad and codfw is running [16:14:46] can you please tell me what settings you used? [16:15:12] + readiness_probe: [16:15:14] + initialDelaySeconds: 450 # 7.5m [16:15:40] That should cover both readiness and liveness, AIUI [16:16:32] note that the machinetranslation chart has the following [16:16:32] liveness_probe: [16:16:32] initialDelaySeconds: 300 [16:16:32] tcpSocket: [16:16:33] port: 8989 [16:16:39] but it is under main_app, not app [16:16:55] yeah, that's the 5m timeout that killed the pods earlier. [16:17:51] klausman: should we fix with a patch? I'll revert staging settings after that.
[16:18:07] Let's wait and see if 7.5m is even enough [16:18:20] because we're at 5m40s and it's not ready [16:18:23] I think the download finished in eqiad but the pod is not yet ready [16:18:25] er 6m40 [16:18:29] (from looking at the logs) [16:18:32] yep [16:18:40] I have to go, but please fix it properly before leaving [16:18:47] of course [16:19:03] seems ready now? [16:19:28] no, it's still only 2/3 ready, should be 3/3 [16:19:31] no, 2/3 containers ready [16:19:35] ah forgot. [16:20:00] eqiad ready (one pod) [16:20:04] readiness probe and liveness probe fail [16:20:26] ah yes it just became ready 🥳 [16:20:54] just in time. So 7.5m is still too close [16:21:11] Ok, codfw is ready [16:21:15] (both pods) [16:21:24] wikikube-worker1165 is downloading. [16:22:17] it was terminated in eqiad, still too slow [16:22:49] you have to bump the timeout in the helmfile.yaml [16:23:04] 5th line, it's 600 seconds [16:23:17] yarp [16:24:01] oddly enough, the helmfile apply command did not exit [16:25:32] Though you should be unblocked re: codfw now. [16:26:43] klausman: and we should also revet, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1128067 ? [16:27:09] probably [16:27:19] revert* [16:27:50] We're good to go for dinner, klausman? :) [16:28:03] so machinetranslation service has healthy endpoints in wikikube eqiad and codfw now? I see an ongoing deployment in eqiad? there is one unhealthy pod [16:28:45] deploying to eqiad with only codfw, still hammering on eqiad [16:28:51] gah, brain [16:28:56] codfw is good, eqiad not [16:29:24] Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress [16:29:27] Great. [16:29:38] What operation? :/ [16:29:58] I killed the previous push because it wouldn't terminate even after the pod failed. [16:31:49] ok then I'll pool the wikikube ingress codfw again, machinetranslation in codfw is "fixed", cc jayme, claime, raine [16:31:57] awesome [16:32:20] thanks everyone for the help. I will try and fix codfw [16:32:25] er eqiad %-) [16:32:38] Thanks a lot, all [16:32:54] klausman: Thanks! [16:33:09] Timing was just the timing! [16:35:01] Ok, got rid of the pending release, now going again with 20m timeout for both helm and the pods. [16:36:27] klausman: if you've a patch, you may tag it: https://phabricator.wikimedia.org/T386889 [16:36:42] will do [16:36:52] Thanks! [16:42:07] I also puppetized the missing firewall rules for the kubernetes pod ranges to allow these pods on people hosts. I'll be out in a few minutes, ping me if there is anything urgent [16:42:28] ack, ty! [16:43:19] Thanks jelto [16:45:25] God, 20m maty still not be enough. The first pod has been running for >10m and is still not ready [16:45:33] s/maty/may/ [16:49:20] approaching 15m(!) [16:51:04] :/ [16:51:18] [2025-06-23 16:40:13 +0000] [135] [INFO] Booting worker with pid: 135 [16:51:24] That was the last log line so far [16:51:28] what is it even doing? [16:52:01] yes, no logs after that in logstash [16:52:22] and not yet ready? I thought when it boots up it should be ready to use. [16:52:36] ooooh. [16:52:47] initialdelay. It won't even _try_ to probe [16:52:55] ahem. So that needs a fi [16:52:57] +x [16:54:36] worker is ready but no probe. that's the issue? [16:54:44] AIUI, yes [16:54:57] aaand it got killed. [16:55:01] I'll make a better patch [16:56:26] Interesting things are happening on the Monday! [16:58:02] Ok, here we go again. This time with something a little cleverer.
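The log above pins the problem down: initialDelaySeconds only postpones probing, and once the liveness probe's retries run out the container is killed mid-download. A hedged sketch of what the "something a little cleverer" tuning could look like: keep the delay short but let the probes tolerate a long download window (a startupProbe would be the more idiomatic kubernetes answer, if the chart supports one). The main_app layout follows the chart snippet quoted above; all numbers are assumptions, not the patch that was actually deployed.

  # Illustrative only: the example values are written to a scratch file; the real change
  # would go into the machinetranslation chart / helmfile values.
  cat > /tmp/machinetranslation-probe-example.yaml <<'EOF'
  main_app:
    readiness_probe:
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 40      # tolerate ~20 minutes of "not ready yet" while models download
    liveness_probe:
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 40      # do not kill the container during the download
      tcpSocket:
        port: 8989
  EOF
  # The helm-side timeout (the 600s value on line 5 of the service's helmfile.yaml,
  # mentioned above) needs a matching bump, or helmfile gives up before the pod is ready.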
[16:59:24] [2025-06-23 16:56:26 +0000] [1] [ERROR] Worker (pid:1045563) was sent SIGKILL! Perhaps out of memory? [16:59:28] this? [16:59:36] no, that was me [16:59:53] ah ok :) [17:00:41] Time for bed. Maybe you can leave slack message if I'm timing out here :) [17:00:48] will do [17:00:55] :) [17:01:05] sleep well, and ttyl [17:35:44] FIRING: LiftWingServiceErrorRate: ... [17:35:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [18:05:44] RESOLVED: LiftWingServiceErrorRate: ... [18:05:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [18:17:55] (03PS1) 10Kosta Harlan: Map pre-save RR scores to predefined values [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) [18:18:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [19:20:44] FIRING: LiftWingServiceErrorRate: ... [19:20:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [20:06:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [20:13:29] (03Merged) 10jenkins-bot: Map pre-save RR scores to predefined values [extensions/ORES] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1162998 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan) [20:55:44] RESOLVED: LiftWingServiceErrorRate: ... 
[20:55:50] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [21:36:11] 06Machine-Learning-Team, 06Moderator-Tools-Team: AI/ML Infrastructure Request: Persist historical revert risk multilingual model scores for threshold analysis - https://phabricator.wikimedia.org/T397187#10940161 (10SSalgaonkar-WMF) Thanks so much for getting back to me, and no worries at all about timing @Kgra... [21:59:44] FIRING: LiftWingServiceErrorRate: ... [21:59:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate