[06:54:55] good morning
[06:55:30] morning!
[07:09:05] good morning
[07:09:36] morning folks o/
[07:10:43] Lift-Wing, Machine-Learning-Team, Patch-For-Review: revertrisk model servers should return a 400 response for non canonical language names - https://phabricator.wikimedia.org/T399437#11003256 (isarantopoulos) a:kevinbazira
[07:11:02] Lift-Wing, Machine-Learning-Team, Patch-For-Review: revertrisk model servers should return a 400 response for non canonical language names - https://phabricator.wikimedia.org/T399437#11003258 (isarantopoulos) p:Triage→Medium
[07:20:52] Machine-Learning-Team, Essential-Work: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533#11003272 (kevinbazira) As we prepare to deploy in eqiad, at the moment, these 2 httpbb tests fail: ` kevinbazira@deploy1003:~$ httpbb /...
[07:30:43] Machine-Learning-Team, Essential-Work: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533#11003276 (isarantopoulos) The edit check failure is likely due to the recent change of schema that now requires the `page_title` (so in...
[07:36:38] morning morning
[07:36:38] elukey: o/ I am ready for the eqiad deployments whenever you get a minute: https://phabricator.wikimedia.org/T398533#11000373
[07:36:38] in the meantime, I am looking into the httpbb tests that are failing: https://phabricator.wikimedia.org/T398533#11003272
[07:41:56] Machine-Learning-Team, Essential-Work: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533#11003283 (achou) @kevinbazira The first issue (edit-check) is occurring because the patch (updating edit-check httpbb tests for page_ti...
[07:44:44] Machine-Learning-Team, Essential-Work: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533#11003284 (achou) Ah, I can't merge it. Only SREs have +2 for puppet changes.
[07:45:14] Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11003285 (gkyziridis) a:gkyziridis
[07:48:22] kevinbazira: o/ the first issue is related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167858 , but needs SRE to merge it
[07:56:52] for the second issue related to the rec-api, I'm wondering what command you used for testing when you originally added the httpbb test for rec-api in this patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064021
[08:05:01] aiko: ack! let me dig into phab to find it ...
[08:10:29] aiko: o/ patch merged!
[08:10:37] kevinbazira: gimme 5 mins and I should be ready
[08:10:58] okok
[08:17:56] kevinbazira: for the rec-api httpbb test, we can address it later. for now we can just use curl to test the service: curl "https://recommendation-api-ng.svc.eqiad.wmnet:31443/service/lw/recommendation/api/v1/translation?source=en&target=fr&count=3&seed=Apple"
[08:18:39] elukey: thank u!!
[08:21:29] kevinbazira: for the third issue (article-description), I ran it and it passed, not sure why you got that error
[08:21:47] https://www.irccloud.com/pastebin/wStmC8Ew/
[08:22:21] yep, the third issue was puzzling as I am able to run it locally without any issues
[08:24:03] could have been an intermittent issue
[08:25:47] kevinbazira: so I am going to depool eqiad and apply the knative change (already merged, it needs an SRE to deploy though). This will cause all the isvc pods to be restarted, could you please verify afterwards that all are up etc..?
[08:27:15] elukey: yes, I'll monitor the pods and run httpbb tests when they are up
[08:28:03] done!
Pods should be restarted now
[08:28:39] please also spot-check if docker-registry.discovery.wmnet/knative-serving-queue:1.7.2-7 is picked up (for example using kubectl describe pod $pod-id -n $namespace | grep ...
[08:29:07] ack, checking ...
[08:30:08] article-descriptions currently at 0/3
[08:31:03] this is the first step, the second one is to deploy to all isvc namespaces, you should see a diff related to AWS_ credentials
[08:31:37] those are the ones used by the storage initializer, so any issue that may arise should be concerned in the pod's bootstrap phase (when the model is downloaded)
[08:31:54] once deployed and verified that we are good, I'll repool
[08:32:02] in article-models: both article-country and articlequality-predictor are up
[08:35:53] article-country is now using the latest proxy image:
[08:35:53] ```
[08:35:53] kubectl describe pod article-country-predictor-00013-deployment-8758c89ff-4kx9w
[08:35:53] ...
[08:35:53] Image: docker-registry.discovery.wmnet/knative-serving-queue:1.7.2-7
[08:35:53] ...
[08:35:53] ```
[08:37:38] all right it is working :)
[08:45:08] all pods are up. now running httpbb tests
[08:47:15] I excluded the 2 failing tests reported in https://phabricator.wikimedia.org/T398533#11003272 and the rest passed:
[08:47:15] ```
[08:47:15] kevinbazira@deploy1003:~$ httpbb /srv/deployment/httpbb-tests/liftwing/production/!(test_editcheck.yaml|test_recommendation-api-ng.yaml) --hosts inference.svc.eqiad.wmnet --https_port 30443
[08:47:15] Sending to inference.svc.eqiad.wmnet...
[08:47:15] PASS: 117 requests sent to inference.svc.eqiad.wmnet. All assertions passed.
[08:47:15] ```
[08:47:52] very nice :)
[08:47:58] I think we can proceed with the deployments
[08:48:42] the rec-api is working too:
[08:48:42] ```
[08:48:42] $ time curl "https://recommendation-api-ng.svc.eqiad.wmnet:31443/service/lw/recommendation/api/v1/translation?source=en&target=fr&count=3&seed=Apple"
[08:48:42] [{"title":"Aphis spiraecola","pageviews":0,"wikidata_id":"Q10415221","rank":96.0,"langlinks_count":9,"collection":null},{"title":"Pristine apple","pageviews":0,"wikidata_id":"Q19840599","rank":64.0,"langlinks_count":0,"collection":null},{"title":"Golden Russet","pageviews":0,"wikidata_id":"Q19597352","rank":418.0,"langlinks_count":0,"collection":null}]
[08:48:42] real 0m0.423s
[08:48:42] user 0m0.015s
[08:48:42] sys 0m0.000s
[08:48:43] ```
[08:49:06] yes, please proceed with the deployments
[08:50:51] kevinbazira: in this case the queue-proxy container is running only on isvc pods, so rec-api and ores-legacy shouldn't have been touched (lemme know if you see otherwise)
[08:52:29] yep ores-legacy and rec-api weren't touched. the pods are old 40d and 33d:
[08:52:30] ```
[08:52:30] ores-legacy-main-59c998bcbb-c2cxj 2/2 Running 0 40d
[08:52:30] recommendation-api-ng-main-6584ddd684-5s6vw 2/2 Running 0 33d
[08:52:30] ```
[09:06:58] elukey: I've checked and s3 credentials are yet to be deployed: https://phabricator.wikimedia.org/P79065
[09:07:08] should I proceed to deploy them?
[09:09:15] yep
[09:09:35] ack ...
[09:12:35] article-descriptions done
[09:13:43] article-models done
[09:15:01] articletopic-outlink done
[09:16:22] the edit-check diff shows no changes to deploy ...
o_0
[09:17:48] experimental done
[09:19:09] llm done
[09:20:06] logo-detection done
[09:25:16] readability done
[09:25:26] recommendation-api-ng done
[09:27:28] revertrisk done
[09:28:28] revision-models done
[09:29:17] revscoring-articlequality done
[09:30:30] revscoring-articletopic done
[09:31:57] revscoring-draftquality done
[09:33:16] revscoring-drafttopic done
[09:35:04] revscoring-editquality-damaging done
[09:37:13] revscoring-editquality-goodfaith done
[09:39:03] revscoring-editquality-reverted done
[09:40:55] elukey: all s3 credentials' deployments are completed.
[09:47:44] kevinbazira: very nice! Let's re-run httpbb and then I'll repool
[09:48:09] the ML clusters should be free now from old Buster images \o/
[09:48:53] httpbb tests passed:
[09:48:54] ```
[09:48:54] kevinbazira@deploy1003:~$ httpbb /srv/deployment/httpbb-tests/liftwing/production/!(test_editcheck.yaml|test_recommendation-api-ng.yaml) --hosts inference.svc.eqiad.wmnet --https_port 30443
[09:48:54] Sending to inference.svc.eqiad.wmnet...
[09:48:54] PASS: 117 requests sent to inference.svc.eqiad.wmnet. All assertions passed.
[09:48:54] ```
[09:51:20] all right eqiad repooled! Thanks kevinbazira !
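Note on the httpbb command above: the `!(a|b)` part of the path is a bash extended glob. It matches every name except the listed ones and is only recognised when the `extglob` shell option is on. A minimal sketch of the same exclusion, assuming bash is installed; the helper name `list_selected_tests` is hypothetical and not part of httpbb:

```shell
# Hypothetical helper: print the names in directory $1, skipping the two
# test files that were excluded in the httpbb run above.
# bash is started with -O extglob so the !(a|b) pattern is enabled before
# the command string is parsed.
list_selected_tests() {
  bash -O extglob -c \
    'cd "$1" && printf "%s\n" !(test_editcheck.yaml|test_recommendation-api-ng.yaml)' \
    _ "$1"
}
```

Calling `list_selected_tests /srv/deployment/httpbb-tests/liftwing/production` would print the names of the test files left after the exclusion, one per line.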
[09:51:27] the rec-api is up and running too:
[09:51:28] ```
[09:51:28] $ time curl "https://recommendation-api-ng.svc.eqiad.wmnet:31443/service/lw/recommendation/api/v1/translation?source=en&target=fr&count=3&seed=Apple"
[09:51:28] [{"title":"Plum pox","pageviews":0,"wikidata_id":"Q1788571","rank":33.0,"langlinks_count":5,"collection":null},{"title":"Nurse grafting","pageviews":0,"wikidata_id":"Q24897497","rank":401.0,"langlinks_count":1,"collection":null},{"title":"Carabao mango","pageviews":0,"wikidata_id":"Q18343413","rank":447.0,"langlinks_count":2,"collection":null}]
[09:51:28] real 0m0.445s
[09:51:28] user 0m0.007s
[09:51:28] sys 0m0.008s
[09:51:29] ```
[09:51:30] we touched it this time round:
[09:51:30] ```
[09:51:31] recommendation-api-ng-main-7dcb586d7-4plzc 2/2 Running 0 26m
[09:51:31] ...
[09:51:32] ```
[09:52:36] elukey: as always, thanks a lot for supporting us :)
[09:55:11] Machine-Learning-Team, Essential-Work: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533#11003758 (kevinbazira) Thanks to @elukey, who provided support on this task, we have deployed the new knative queue proxy image and Swi...
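Note on the queue-proxy spot-check earlier in the log: one way to turn the manual `kubectl describe pod` check into a yes/no answer is to filter its output for the expected image. A minimal sketch, assuming the output contains `Image:` lines like the one pasted at 08:35:53; the helper name `uses_image` is hypothetical:

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and
# succeed (exit 0) only if some "Image:" line names exactly the image
# given as $1. "Image ID:" lines are ignored because their first field
# is "Image", not "Image:".
uses_image() {
  awk -v want="$1" '
    $1 == "Image:" && $2 == want { found = 1 }
    END { exit !found }
  '
}
```

A possible use, with `$pod` and `$namespace` as placeholders: `kubectl describe pod "$pod" -n "$namespace" | uses_image docker-registry.discovery.wmnet/knative-serving-queue:1.7.2-7 && echo "queue-proxy updated"`.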
[13:15:17] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model - https://phabricator.wikimedia.org/T398970#11004366 (isarantopoulos)
[13:15:19] Machine-Learning-Team, EditCheck: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211#11004367 (isarantopoulos)
[13:28:10] Machine-Learning-Team, Essential-Work: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533#11004430 (isarantopoulos) a:kevinbazira
[13:31:37] artificial-intelligence, cloud-services-team: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#11004450 (Raymond_Ndibe) I found this repo with a list of llms and their licenses https://github.com/eugeneyan/open-llms/blob/main/README.md I don't believe we should wait...
[14:01:56] will be late ~5mins to the meeting, sorry!
[14:04:45] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11004586 (elukey) For future notes, these are the BIOS's Attributes: ` {'ACPICSTC2Latency': 800, 'ACPISRATL3CacheAsNUMADomain': 'Auto', 'ACSEnable': 'Auto', 'APBD...
[14:05:03] ack, no worries!
[14:39:15] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11004718 (isarantopoulos)
[14:39:19] Machine-Learning-Team, Add-Link, Growth-Team: Make airflow-dag for addalink training pipeline output compatible with deployed model - https://phabricator.wikimedia.org/T388258#11004719 (isarantopoulos)
[14:41:33] Machine-Learning-Team: Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11004734 (isarantopoulos)
[14:44:32] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11004752 (isarantopoulos)
[14:44:49] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update kserve to 0.13.1 - https://phabricator.wikimedia.org/T367048#11004753 (isarantopoulos)
[14:45:36] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Build Tone Check Model feedback-based retraining pipeline - https://phabricator.wikimedia.org/T393103#11004756 (SSalgaonkar-WMF) a:SSalgaonkar-WMF→None
[14:50:55] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Patch-For-Review: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668#11004783 (isarantopoulos) a:gkyziridis
[14:51:56] Machine-Learning-Team, Essential-Work: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533#11004788 (isarantopoulos) Open→Resolved
[14:54:23] (CR) Thiemo Kreuz (WMDE): [C:+2] build: Updating mediawiki/mediawiki-phan-config to 0.16.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/1168437 (owner: Libraryupgrader)
[17:21:24] Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release
FY2025-26): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#11005753 (Mazevedo) a:Tsevener→Seddon
[19:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[19:16:33] Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#11006197 (Ottomata) > there an external API to pull this event stream / data lake table information? O...
[19:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[20:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[20:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[21:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[21:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[22:14:42] FIRING: AlertLintProblem: Linting problems found for
ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[22:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[23:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[23:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem