[05:43:19] [2025-07-10 05:39:27] s3cfg written to models/.s3cfg
[05:43:19] ERROR: Configuration file not available.
[05:43:19] ERROR: models/.s3cfg: None
[05:43:19] Seems s3cmd is still not able to find the config.
[06:07:56] The same happens with the local setup, and with --debug it is parsing the file and still throwing the ERROR.
[06:10:43] kart_: o/ do you have the logs by any chance? It would be good to see them
[06:14:59] Sorry, didn't see the message while I was updating the task
[06:15:04] Same as the above error.
[06:15:26] I've put it to https://phabricator.wikimedia.org/T335491#10990368 as well to update the status.
[06:31:27] elukey: is this the usual behavior with s3cmd elsewhere? Can you give a similar use case where we are using s3cmd?
[06:44:21] back, sorry
[06:44:35] I am unable to repro though, at least on stat1011
[06:44:54] when you say "local" what do you mean?
[06:45:07] we don't use s3cmd elsewhere that I know of
[06:45:15] most of the time it is a python-based lib
[06:48:07] also I found https://github.com/s3tools/s3cmd/issues/903, which is very interesting
[06:48:36] the "ERROR: .s3cfg: None" may mean that there is something in the config file that the parser doesn't like
[06:51:50] ah
[06:52:12] local = tested locally to check if s3cmd is getting the config file.
[06:54:20] seems the key/secret are not parsing properly? They should be.
[06:56:07] o/ good morning
[06:56:58] elukey: stat1001 with the current s3cfg works fine? In that case, we can check if there is any issue with authentication
[06:57:41] Also, we can suppress the false output. It is just noise.
[06:59:13] I currently get 403 access denied from stat1011
[06:59:19] but IIRC Tobias was able to make it work
[07:00:55] ok nono, with s3://wmf-ml-models/ it works
[07:02:20] even with the env variables etc..
[07:03:29] so, the file is there and it is correct.
[07:03:37] the ERROR is misleading.
[07:05:20] kart_: ok I was able to repro on stat1011..
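The linked s3cmd issue suggests that "ERROR: .s3cfg: None" can mean the parser choked on the config contents rather than the file being absent. A standalone way to sanity-check the file is to run it through an INI parser and verify the key fields are present — a minimal sketch, assuming a standard INI-style .s3cfg (this is not s3cmd's actual parser, and the sample values are placeholders):

```python
# Sanity-check an s3cmd-style config file: verify it parses as INI and
# that the [default] section carries non-empty access_key/secret_key.
# Standalone check, not s3cmd's internal parser.
import configparser
import tempfile

SAMPLE = """\
[default]
access_key = AKIAEXAMPLE
secret_key = abc123example
"""

def check_s3cfg(path):
    cp = configparser.ConfigParser()
    try:
        with open(path) as f:
            cp.read_file(f)
    except configparser.Error as e:
        return f"parse error: {e}"
    if not cp.has_section("default"):
        return "missing [default] section"
    for key in ("access_key", "secret_key"):
        if not cp.get("default", key, fallback=""):
            return f"missing or empty {key}"
    return "ok"

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".s3cfg", delete=False) as f:
        f.write(SAMPLE)
    print(check_s3cfg(f.name))  # prints "ok" for the sample above
```

If this reports a parse error on the real file, the misleading "None" error would be explained by the config, not by the credentials.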
if I don't export any AWS_ variable and try to execute the command, I get the same failure
[07:05:48] so for some reason, the AWS variables are not available when entrypoint.sh runs
[07:06:46] Interesting
[07:09:36] can you try to redeploy in staging so I can inspect the pod?
[07:10:04] sure
[07:10:38] started, elukey
[07:11:26] so kubectl describe pod machinetranslation-staging-5bb5449d48-dgvsx -n machinetranslation | grep AWS
[07:11:31] returns empty :D
[07:12:14] :/
[07:12:52] the secret is there
[07:14:43] ahh wait, I may know the issue
[07:16:56] yeah ok
[07:17:13] so you have a custom version of deployment.yaml in the machinetranslation chart
[07:17:53] meanwhile "app.generic.container" (under the vendor templates) is the snippet that defines the inclusion of the env variables
[07:18:01] but you don't use it, so they never get rendered from the chart
[07:22:13] Do we need {{- include "app.generic.container" . | indent 8 }}, that's all? or some more magic?
[07:22:38] kart_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1167742
[07:22:57] the include is a bit big and it applies to the definition of the whole container
[07:23:17] I'd suggest refactoring the config later on, but for the moment the one that I posted should be sufficient
[07:23:23] ah. Thanks.
[07:23:47] of course CI failed :D
[07:24:12] :/
[07:29:02] Be back after lunch.
[07:30:58] kart_: fixed!
[07:31:11] if you have a moment to review, I can then merge/deploy
[07:36:55] it is super simple and I self-merged
[07:36:58] will test it in a bit
[07:45:09] Morning!
[07:46:36] [2025-07-10 07:45:45] Extracting models/nllb200-600M.tgz into models/nllb200-600M
[07:47:49] kart_: worked! staging is up :)
[07:48:00] morning :)
[07:53:44] very nice! how long did the startup take?
[07:57:28] good morning
[07:57:39] Morning, Aiko!
[07:58:14] o/
[07:59:02] elukey: awesome. Thanks!
[08:02:40] We can plan the next deployment after watching the logs for some time.
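The root cause discussed above is a custom deployment.yaml that never pulls in the vendored "app.generic.container" template, so the Secret-backed AWS_* env vars are never rendered. The fix would look roughly like this in the chart — a sketch only; the surrounding structure and indent level are assumptions, the actual change is the Gerrit patch linked above:

```yaml
# Hypothetical excerpt of the machinetranslation chart's custom
# deployment.yaml: include the vendored container snippet so env vars
# (e.g. AWS_* credentials coming from the Secret) get rendered.
spec:
  template:
    spec:
      containers:
{{- include "app.generic.container" . | indent 8 }}
```

A quick way to verify the rendered result is the `kubectl describe pod ... | grep AWS` check quoted above: empty output means the env vars were never templated in.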
[08:03:52] klausman: should we proceed with depooling codfw in prod?
[08:04:16] Yes, I will do that in a moment
[08:05:22] ack!
[08:08:03] $ confctl --object-type discovery select 'dnsdisc=inference,name=codfw' get
[08:08:05] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=inference"}
[08:08:11] codfw should be depooled
[08:10:21] klausman: re startup, we might need to remove the liveness_probe: part from values-staging.yaml and test?
[08:11:11] you mean to see if the default values are good enough?
[08:21:07] klausman: q - after you apply the knative change in admin_ng and all the pods reschedule, should/can we run httpbb tests to verify everything is working fine? will the models be accessible when codfw is depooled?
[08:23:24] We should try it either way. If it works: great, we have successfully verified it; if all fail: well, that's depooled; if _some_ fail: we need to investigate
[08:25:05] aiko: you can definitely run httpbb, just use inference.svc.codfw.wmnet as the endpoint instead of inference.discovery.wmnet
[08:25:34] ah, yes, the site-specific disco endpoint should still work
[08:26:15] https://grafana.wikimedia.org/goto/wEQvo9sHR?orgId=1 looks like all traffic has drained (the remainder is likely just monitoring)
[08:28:15] elukey, aiko: https://phabricator.wikimedia.org/P78862 diff for ml-serve-codfw (also some external-services stuff that I haven't pastebinned)
[08:28:19] I see! thanks both for the answers :)
[08:29:13] I will apply the adminng stuff in a few moments unless someone stops me :)
[08:30:23] here we go
[08:30:56] klausman: yes.
[08:31:14] kart_: yeah, that sounds like a good idea.
[08:31:42] only for staging, we have:
[08:31:42] initialDelaySeconds: 15
[08:31:42] periodSeconds: 10
[08:31:42] failureThreshold: 6
[08:35:53] aiko: all pods have been restarted (well, except the system pods and ores-legacy, since they don't use this image)
[08:37:04] ack!
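The confctl output quoted above is plain JSON, so the depooled state can be checked mechanically rather than by eyeballing the paste — a small sketch (the helper function is hypothetical; the JSON is the actual output shown above):

```python
# Parse the confctl discovery output quoted above and confirm that
# codfw is depooled for the "inference" dnsdisc record.
import json

CONFCTL_OUTPUT = (
    '{"codfw": {"pooled": false, "references": [], "ttl": 300}, '
    '"tags": "dnsdisc=inference"}'
)

def is_depooled(raw, site):
    record = json.loads(raw)
    return record[site]["pooled"] is False

print(is_depooled(CONFCTL_OUTPUT, "codfw"))  # → True
```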
let me check if they are all in running status
[08:41:34] recommendation-api-ng doesn't use that image either, so it's not been restarted
[08:41:43] good point
[08:43:59] bartosz: really great work on the script!
[08:44:47] nice, looks like all pods are in a good state
[08:45:11] gonna run httpbb tests
[08:45:25] ack
[08:48:42] https://www.irccloud.com/pastebin/VypzcECm/
[08:49:05] \o/
[08:49:58] o/ elukey: thank you! and additional thanks for the review <3
[08:51:34] \o/ we can start deploying the credential changes
[08:52:05] let me run a scripted diff across everything in the ml-services subdir
[08:53:21] nice!
[08:53:39] Ok, looks good, besides an oddity: https://phabricator.wikimedia.org/P78864
[08:54:01] I suspect that while recapi does not use the changed secrets, they're still part of its env
[08:56:10] I'll start applying the change one NS at a time, starting with article-descriptions
[08:56:23] they might also use s3 to get the embeddings in swift? https://analytics.wikimedia.org/published/wmf-ml-models/recommendation-api/
[08:56:32] ah, right
[08:57:11] ok, changed the secrets on art-desc, bouncing the service
[08:58:25] INFO:root:Successfully copied s3://wmf-ml-models/article-descriptions/ to /mnt/models
[08:58:27] \o/
[08:59:46] hmmm. the pod is still in "Initializing" state
[09:00:00] and just as I type that it goes to Running :D
[09:00:13] yes, now it's running 2/3
[09:00:21] ... and 3/3
[09:00:30] \o/
[09:00:43] can you run the httpbb test against that service?
[09:01:31] doing it
[09:01:34] passed :)
[09:01:56] excellent
[09:02:05] continuing with the art-models NS
[09:02:29] secrets applied, bouncing art-country
[09:03:30] art-country is started and running
[09:05:31] art-quality also restarted and running. can you test them?
[09:05:48] (after this, I'll proceed with the rest of the NSes and we can then test them all at the end)
[09:09:56] mm just found out we don't have test_article-models.yaml on the deployment node?
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/files/httpbb/liftwing/production/test_article-models.yaml
[09:10:14] Interesting
[09:10:20] not under /srv/deployment/httpbb-tests/liftwing/production/
[09:11:22] Since that dir is not a git repo, I suspect the file we have in puppet is not wired up to be distributed
[09:12:10] Yep, missing from modules/profile/manifests/httpbb.pp
[09:13:15] making a patch
[09:14:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167827
[09:16:39] thanks for the patch! +1
[09:16:49] also missing for staging, adding that now
[09:18:21] Hmmm. Logo Detection and Rec-API are also not wired up.
[09:19:45] ahh
[09:20:27] re: next deployment, I think we can deploy the revscoring-editquality-goodfaith NS and revscoring-editquality-damaging NS individually because they have quite a lot of models, and then we can do the rest at once. wdyt?
[09:21:03] sure, that works
[09:22:07] Updated the patch to add the missing tests
[09:25:35] thanksss!
[09:28:36] this probably needs to be noted in our liftwing doc.. when we add new httpbb tests, we also need to update modules/profile/manifests/httpbb.pp
[09:29:36] yes, agreed
[09:33:27] klausman: could you also merge this patch since I don't have +2 rights? (httpbb tests for edit check) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1149634
[09:33:59] That patch is missing the very bit I added above :)
[09:35:31] Added a comment that explains it
[09:37:42] aiko: the article-models (and rec-api + logo-detect) tests are now available on the deployment machine, can you retry the test?
[09:38:29] yes! just tested, and passed :)
[09:38:37] excellent, thank you.
[09:38:47] Will now push the change to revscoring-editquality-goodfaith
[09:39:03] ack
[09:39:28] oh got it, thanks!
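The failure mode found here — a test YAML committed to the puppet repo but never distributed to the deployment host because it was not wired into httpbb.pp — is easy to catch mechanically. A hypothetical helper (both paths are illustrative; it simply diffs the two directory trees):

```python
# Report httpbb test files that exist in the puppet source tree but are
# missing from the deployed directory. Paths below are examples only.
from pathlib import Path

def missing_tests(source_dir, deployed_dir):
    src = {p.name for p in Path(source_dir).glob("test_*.yaml")}
    dep = {p.name for p in Path(deployed_dir).glob("test_*.yaml")}
    return sorted(src - dep)

if __name__ == "__main__":
    for name in missing_tests(
        "modules/profile/files/httpbb/liftwing/production",
        "/srv/deployment/httpbb-tests/liftwing/production",
    ):
        print(f"possibly not wired up in httpbb.pp: {name}")
```

Running something like this in CI would have flagged test_article-models.yaml, Logo Detection, and Rec-API before they were missed at deploy time.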
will add that
[09:40:12] It's very easy to miss, as we all did :)
[09:40:34] I'll now bounce the pods in revscoring-editquality-goodfaith
[09:45:30] ok, all pods restarted, you can commence tests
[09:46:24] cool! doing it
[09:47:07] PASS: 35 requests sent to inference.svc.codfw.wmnet. All assertions passed. \o/
[09:47:22] excellent. proceeding with rs-eq-damaging
[09:51:24] config pushed and all pods bounced
[09:51:37] niceee
[09:52:26] PASS: 35 requests sent to inference.svc.codfw.wmnet. All assertions passed!
[09:52:54] great. I'm gonna run a few errands and get lunch, and when I'm back we can proceed with the other NSes
[09:53:31] no problem! sounds good
[09:54:08] I'll ping when I'm back
[09:56:54] alrighty
[10:02:56] * aiko gets lunch too :)
[10:07:53] I added the missing bit in the edit check httpbb patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1149634
[11:57:37] * aiko back
[12:23:38] * klausman back as well
[12:24:30] aiko: any preference regarding the next NS to do?
[12:27:52] maybe the rest of the revscoring NSes?
[12:28:00] yeah, sgtm.
[12:28:28] If we do them alphabetically, rs-articlequality is next
[12:28:43] ack!
[12:29:53] config pushed, bouncing pods
[12:30:57] and they're all back
[12:32:24] niceee
[12:34:27] elukey: Should we file a task about updating deployment.yaml for machinetranslation/MinT?
[12:36:07] then we have revscoring-articletopic, revscoring-draftquality, revscoring-drafttopic, and revscoring-editquality-reverted
[12:36:10] kart_: if you want, yes; it is not strictly needed at the moment but it may be good
[12:36:20] aiko: I can do them all in one go
[12:36:49] that'd be great, and I'll test them all when they're up and running
[12:36:56] wfm
[12:38:51] elukey: Can you do that? I'll miss the background/context. Maybe low priority as of now.
[12:44:20] aiko: all RS pods done
[12:45:43] ok! running httpbb tests
[12:48:05] greattt, all passed :)
[12:48:24] Good.
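The cycle being repeated per namespace above (push the credential change, bounce the pods, run httpbb against the site-specific endpoint) can be sketched as a dry-run plan generator — the helper and the exact CLI invocations are illustrative assumptions, not the team's actual tooling:

```python
# Generate the command sequence for rolling a secrets change through a
# list of namespaces, one NS at a time. Dry-run sketch only: the
# helmfile/kubectl/httpbb invocations below are illustrative.
def rollout_plan(namespaces, endpoint="inference.svc.codfw.wmnet"):
    plan = []
    for ns in namespaces:
        plan.append(f"helmfile -e ml-serve-codfw -n {ns} apply")    # push config
        plan.append(f"kubectl -n {ns} rollout restart deployment")  # bounce pods
        plan.append(f"httpbb test_{ns}.yaml --hosts {endpoint}")    # verify
    return plan

if __name__ == "__main__":
    for cmd in rollout_plan(["revscoring-editquality-goodfaith",
                             "revscoring-editquality-damaging"]):
        print(cmd)
```

Keeping the verification step inside the loop matches what was done here: each namespace is tested before moving on, so a bad credential push is caught one NS at a time.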
I'll do all of the following NSes now: articletopic-outlink, experimental, llm, logo-detection, readability, recommendation-api-ng, revertrisk, revision-models
[12:48:53] edit-check has an unrelated diff, so I'll coordinate with Ilias on that one. and the other NSes should have no diff anyway
[12:49:23] the edit-check one is mine
[12:49:37] I merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1167565
[12:49:37] ah, right
[12:50:26] there are no services in the experimental NS on codfw
[12:50:27] Should I do that, then?
[12:50:53] I'll still push the cred update on experimental, there just won't be pods to bounce
[12:51:05] ("that" above referring to editcheck)
[12:51:27] yeah we can do that, I just tested it in staging
[12:51:55] roger
[12:52:01] re experimental: got it!
[12:58:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167858 and the httpbb test for edit-check needs to be updated.. whenever you have a moment
[12:58:59] +1'd for now. will +2 and merge once I'm done with the current set of services
[12:59:03] aiko you have a future in SRE :D
[12:59:59] really!!! :D
[13:01:12] I agree with Luca
[13:01:51] that's a big compliment, thank u <3
[13:02:48] well deserved!
[13:05:26] All NSes updated and pods bounced; did another pass to see if there are any NSes with diffs, and there are none
[13:06:48] ack, gonna run tests
[13:11:54] only recommendation-api-ng fails, all others passed
[13:11:57] looking
[13:15:23] ah, the host for recommendation-api-ng should be different, right?
[13:15:29] error msg: https://phabricator.wikimedia.org/P78877
[13:17:23] Probably, I am not sure
[13:19:02] I don't get why the host header is https://recommendation-api-ng.discovery.wmnet:31443
[13:22:01] I know this service works differently from our other isvcs, but I'm not so familiar with how it works
[13:23:23] let me investigate
[13:23:29] yeah, I'll also dig a bit
[13:36:45] I think this is because codfw is still drained.
[13:38:21] so the site-specific endpoint doesn't work for recommendation-api-ng
[13:38:28] Or rather: the inference endpoint.... yes
[13:39:35] yeah, it doesn't use the inference endpoint, because it's not based on kserve
[13:39:43] curl -vv 'https://recommendation-api-ng.svc.codfw.wmnet:31443/service/lw/recommendation/api/v1/translation' works (sorta, it's missing query params)
[13:40:01] I saw a query example here https://phabricator.wikimedia.org/T371465#10080149
[13:40:05] time curl "https://recommendation-api-ng.discovery.wmnet:31443/service/lw/recommendation/api/v1/translation?source=en&target=fr&count=3&seed=Apple"
[13:41:27] httpbb test_recommendation-api-ng.yaml --hosts recommendation-api-ng.svc.codfw.wmnet --https_port 31443 --insecure <- this also sorta works, but has the wrong params
[13:41:40] (--insecure gets around the SSL cert failure)
[13:42:12] How about I re-pool codfw, we quickly test, and if it still doesn't work, I depool again and we continue investigating?
[13:45:13] https://www.irccloud.com/pastebin/mOc8mUFT/
[13:45:15] this works
[13:45:42] Well, then I'd mark the service as working and we re-pool?
[13:46:00] and if httpbb still doesn't work afterwards, the test is broken, not the service
[13:46:40] yes, we need to fix the httpbb test
[13:47:23] pooling now
[13:47:41] done
[13:50:39] \o/ great, thank you!!
[13:51:05] I'll keep an eye on traffic #s and error codes
[13:54:01] nice work folks!
[13:57:45] ty <3
[14:02:26] is eqiad going to be done next week?
[14:04:28] 06Machine-Learning-Team, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10992180 (10Jclark-ctr) ml-serve1015 is now racked into E 12 and added to netbox @elukey Let me know when you’re finished with any testing you want to do. I’ll stay...
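The working request above differs from the bare curl that "sorta works" only in its query string. Building the URL programmatically from the example's parameters (endpoint and parameter names taken verbatim from the messages above; the snippet only constructs the URL, it does not call the service):

```python
# Build the recommendation-api-ng translation query from the example
# above; only the query string distinguishes it from the bare curl.
from urllib.parse import urlencode

BASE = ("https://recommendation-api-ng.discovery.wmnet:31443"
        "/service/lw/recommendation/api/v1/translation")

params = {"source": "en", "target": "fr", "count": 3, "seed": "Apple"}
url = f"{BASE}?{urlencode(params)}"
print(url)
```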
[15:02:40] 06Machine-Learning-Team: Spark Job in airflow-devenv cannot access Hive Metastore because of Kerberos Authentication Failure - https://phabricator.wikimedia.org/T398907#10992522 (10kevinbazira) Thanks to @brouberol, who has been super helpful when resolving WMF Airflow issues. He also provided clarity on where o...
[15:03:15] klausman: re staging start and end time for downloads, Start: Jul 10, 2025 @ 07:45:00, End: Jul 10, 2025 @ 07:47:02
[15:03:31] elukey: I think that'd be the plan, as Friday is not ideal for deployment
[15:04:51] aiko: you already talk like an SRE!
[15:05:13] lollll
[15:06:18] isaranto: we should stop looking for the new SRE
[15:07:14] aiko telling luca not to deploy on fridays :D go aikooooooooooo
[19:36:52] 06Machine-Learning-Team: Spark Job in airflow-devenv cannot access Hive Metastore because of Kerberos Authentication Failure - https://phabricator.wikimedia.org/T398907#10993364 (10brouberol) I have merged a change into `airflow-dags` (see https://phabricator.wikimedia.org/T394297#10993349) that //should// resol...