[08:14:10] hello!
[09:07:08] o/
[09:12:07] Machine-Learning-Team, Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10367102 (kevinbazira)
[09:13:26] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[09:25:57] (PS1) Kevin Bazira: article-country: handle empty country results gracefully [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099158 (https://phabricator.wikimedia.org/T371897)
[09:55:42] (PS1) Nik Gkountas: Improve re-ordering for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162
[10:42:56] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10367264 (isarantopoulos) The above sequence of actions failed again. The logs are available in [[ https://phabricator.wikimedia.org/P71372 | this paste ]] We have observed...
[11:27:30] (CR) Ilias Sarantopoulos: [C:+1] test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:34:23] (PS1) Ilias Sarantopoulos: llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344)
[11:34:42] (CR) CI reject: [V:-1] llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: Ilias Sarantopoulos)
[11:35:29] (CR) Kevin Bazira: [C:+2] test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:35:39] (PS2) Ilias Sarantopoulos: llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344)
[11:36:11] (Merged) jenkins-bot: test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:37:16] (CR) CI reject: [V:-1] llm: move dir under src/models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099166 (https://phabricator.wikimedia.org/T369344) (owner: Ilias Sarantopoulos)
[11:42:09] (PS1) Kevin Bazira: ci: remove backward compatibility from entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099167 (https://phabricator.wikimedia.org/T360120)
[11:48:51] isaranto: o/ would you like to merge https://gerrit.wikimedia.org/r/1097986 so that we can proceed with
https://gerrit.wikimedia.org/r/1099167 ?
[11:49:08] o/
[11:49:55] yes, thanks for reminding. I was wondering why CI didn't run. everything is set properly in integration/config repo but for some reason it didn't run
[11:51:31] I'm going to merge and see
[11:51:35] (CR) Ilias Sarantopoulos: [V:+2 C:+2] ci: update reference-quality to support latest ci entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097986 (https://phabricator.wikimedia.org/T360120) (owner: Ilias Sarantopoulos)
[11:51:39] (PS6) Ilias Sarantopoulos: ci: update reference-quality to support latest ci entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097986 (https://phabricator.wikimedia.org/T360120)
[11:51:47] (CR) Ilias Sarantopoulos: [V:+2 C:+2] ci: update reference-quality to support latest ci entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097986 (https://phabricator.wikimedia.org/T360120) (owner: Ilias Sarantopoulos)
[11:55:20] (PS2) Kevin Bazira: ci: remove backward compatibility from entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099167 (https://phabricator.wikimedia.org/T360120)
[11:55:52] okok this --^ is the last patch and we close this task :)
[11:58:35] klausman: can you deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1099152 for testing rec-api on staging with a local change to the liveness probe? (ie https://phabricator.wikimedia.org/P71348)
[12:11:00] looking...
[12:13:51] kart_: yeah, +2'd your change and will finagle in the liveness probe change for deployment
[12:14:07] Actually, let me make that a proper change so it doesn't get lost
[12:17:13] (CR) Ilias Sarantopoulos: [C:+1] ci: remove backward compatibility from entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099167 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[12:17:23] o/
[12:17:41] klausman: I have already made that change :)
[12:17:43] kart_: let's go with failureThreshold=6, so the compound timeout would be 15s+6*10s=75s
[12:17:46] was waiting for ci to test
[12:17:50] oh
[12:18:28] klausman: nice!
[12:18:33] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1099171
[12:18:43] feel free to use yours if you already have one
[12:19:00] nah, it's not even a commit yet
[12:19:47] I'd also like someone to review the reverts for the readiness probe so that we don't forget about it
[12:19:47] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098911?usp=dashboard
[12:19:47] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098912?usp=dashboard
[12:20:01] thaaanks
[12:23:39] Rebasing ^^
[12:25:19] +2ed
[12:36:52] klausman: Did we deploy to staging?
[12:37:37] No, Ilias's change still needs a +2, and a reply on whether my suggested threshold is ok with him
[12:38:46] OK!
[12:39:19] ah, I see he already made the change
[12:39:30] It's now +2'd, waiting for CI
[12:42:07] applying changes...
[12:42:35] aaand it's throwing errors
[12:46:20] kart_: https://phabricator.wikimedia.org/P71435
[12:46:48] I think the base error is: Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='int = 10', input_type=str]
[12:50:57] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10367614 (MunizaA) >>!
In T371344#10367264, @isarantopoulos wrote: > The above sequence of actions **failed** again. The logs are available in [[ https://phabricator.wikimedi...
[12:51:07] ah
[12:51:50] API_CONCURRENCY_LIMIT should be 10, not int 10!
[12:52:25] I totally missed that :-/
[12:54:30] * isaranto lunch!
[12:55:27] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1099176
[12:56:06] +2'd
[12:57:48] deploying
[12:57:57] And success
[12:59:27] \0/
[13:00:25] Thanks a lot klausman and isaranto
[13:00:37] \o/
[13:00:58] klausman: We can keep things on staging, do more testing till Monday, and then deploy to production.
[13:01:09] sgtm
[13:05:03] Thanks again and have a nice weekend! I can relax on my upcoming bike rides :D
[13:07:46] enjoy! and ride safely :)
[13:33:47] (CR) Kevin Bazira: [C:+2] ci: remove backward compatibility from entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099167 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[13:34:35] (CR) Kevin Bazira: [V:+2 C:+2] ci: remove backward compatibility from entrypoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1099167 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[13:37:29] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10367763 (Gehel)
[13:38:18] Machine-Learning-Team, Data-Platform-SRE (2024.11.30 - 2024.12.20): Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - https://phabricator.wikimedia.org/T380279#10367775 (Gehel)
[13:58:51] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata.org, Wikidata Dev Team (Epics & Stalled): Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419#10368004 (karapayneWMDE)
[14:38:11] (CR) Sbisson: Improve re-ordering
for section translation recommendations (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162 (owner: Nik Gkountas)
[14:56:13] klausman: o/ I'm having https://github.com/pybind/pybind11/issues/1728 this issue when building triton-flash-attention on ml-lab, can we install python3-dev apt package on ml-lab?
[14:56:32] logs: https://phabricator.wikimedia.org/P71211#285270
[14:58:16] aiko: o/ did you try pip install pybind11 in your env?
[14:58:25] https://pypi.org/project/pybind11/
[14:58:57] yeah it is installed
[14:59:39] ack, it is similar (if not the same issue) with the one I had
[15:00:07] I was just going to jump on this again now . check this comment from Muniza https://phabricator.wikimedia.org/T371344#10367614
[15:00:19] (it is not a solution :P )
[15:05:39] ohh yeah it's a similar issue!
[15:05:45] I'm wondering if it'll be solved by installing python3-dev package
[15:06:53] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10368271 (MunizaA) If you have miniconda installed, maybe you could try running the following? I just tried this again and was able to build CK FA2 from scratch: ` git clone...
[15:18:12] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10368303 (isarantopoulos) Thanks a lot for all the help @MunizaA!! I will try it to check if it will work. We'll need to figure out the proper setup afterwards anyway cause v...
[15:21:56] (CR) Nik Gkountas: Improve re-ordering for section translation recommendations (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162 (owner: Nik Gkountas)
[15:22:05] actually I didn't face that issue, but Muniza did when she reran this
[15:22:42] I am running it again to see what is going on, perhaps using conda would work
[15:23:00] ack!
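Editor's note on the python3-dev question above: on Debian, the python3-dev package is what provides the CPython headers (Python.h) that extension builds such as pybind11 compile against. The log does not confirm that a missing Python.h is the actual cause of the ml-lab build failure, so the following is only a minimal sketch, under that assumption, of how one might check whether the active interpreter has its headers available. The function names are hypothetical and not taken from any repo mentioned here.

```python
import sysconfig
from pathlib import Path


def python_header_path() -> Path:
    """Return where Python.h is expected for the running interpreter."""
    # sysconfig reports the include directory the interpreter was built for.
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


def headers_available() -> bool:
    """True if the C headers needed to build extensions are present."""
    return python_header_path().is_file()


if __name__ == "__main__":
    path = python_header_path()
    status = "found" if headers_available() else "missing"
    print(f"Python.h {status}: {path}")
```

Running this inside the virtualenv and inside a conda environment on the same host would show whether the two environments differ in header availability, which may be one reason the conda route was suggested.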
[15:32:07] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10368352 (MunizaA) >>! In T371344#10368303, @isarantopoulos wrote: > > The issues we are having seem to be related to hipcc so I will download the original image to see what...
[15:36:55] (CR) Sbisson: Improve re-ordering for section translation recommendations (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162 (owner: Nik Gkountas)
[15:45:44] (CR) Nik Gkountas: Improve re-ordering for section translation recommendations (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162 (owner: Nik Gkountas)
[15:47:09] (PS2) Nik Gkountas: Improve re-ordering for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162
[15:47:15] (CR) Nik Gkountas: Improve re-ordering for section translation recommendations (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162 (owner: Nik Gkountas)
[16:30:57] (CR) Sbisson: [C:+2] Improve re-ordering for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162 (owner: Nik Gkountas)
[16:31:37] (Merged) jenkins-bot: Improve re-ordering for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1099162 (owner: Nik Gkountas)
[16:43:27] have a nice weekend folks :) see u next week (llm sprint!)
[16:56:05] nighty aiko o/ cu next week!
[16:56:43] \o heading out as well
[17:05:49] o/ Tobias, have a nice weekend.
I'm building one last thing and will head out as well
[17:45:52] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10368603 (isarantopoulos) I managed to get a successful build with what you suggested (using conda) 🎉 Here is the [[ https://phabricator.wikimedia.org/P71372#286174 | result ]]...
[18:22:55] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10368656 (isarantopoulos) I took a look at one of the official rocm/pytorch images to see what hipconfig looks like over there. I used the image [[ https://hub.docker.com/lay...