[00:15:01] (PS4) Jdlrobson: Don't use live configuration [extensions/ORES] - https://gerrit.wikimedia.org/r/957970 (https://phabricator.wikimedia.org/T345922) (owner: Jsn.sherman)
[03:26:57] Machine-Learning-Team, ORES, Beta-Cluster-Infrastructure, PageTriage, Patch-For-Review: Special:NewPagesFeed broken on beta cluster testwiki - https://phabricator.wikimedia.org/T349635 (MPGuy2824) a: MPGuy2824
[06:52:29] (CR) Ilias Sarantopoulos: article-descriptions: add article-descriptions model server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[06:56:39] (CR) Ilias Sarantopoulos: article-descriptions: add article-descriptions model server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[07:05:31] (PS9) Kevin Bazira: article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123)
[07:13:58] (CR) CI reject: [V: -1] article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[07:22:56] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[07:33:50] hello folks
[07:39:07] kevinbazira: o/ kserve 0.11.2 is out, so we can use it directly and we'll not need to upgrade later on
[07:41:04] elukey: o/ morning
[07:42:19] looks like 0.11.1 is the latest on pypi: https://pypi.org/project/kserve/0.11.1/
[07:45:13] Meanwhile, for some reason the llm isvc CI pipeline seems to have started failing even when the article-descriptions patch did change anything on the llm isvc: https://integration.wikimedia.org/ci/job/inference-services-pipeline-llm/123/execution/node/84/log/
[07:46:04] rather the article-descriptions patch did *not change anything on the llm isvc
[07:52:15] ah snap, they didn't fix the issue yet; yesterday Ilias filed a github issue
[07:52:24] we can go with 0.11.1 atm then
[07:53:04] "fatal: could not read Username for 'https://github.com': No such device or address"
[07:53:07] kevinbazira: --^
[07:53:13] I think it is a temporary issue
[07:53:19] okok
[07:53:44] but it failed two times in a row
[07:53:45] mmmm
[07:53:53] yep
[07:56:21] kicked off a manual rebuild: https://integration.wikimedia.org/ci/job/inference-services-pipeline-llm/124/console
[07:56:31] if it fails we need to follow up with releng :)
[07:57:00] yeah already failed
[07:57:12] * elukey bbiab
[08:00:10] before we reach out to releng: it looks like a github project that the llm isvc relies on is down: https://github.com/Titaniumtown/bitsandbytes-rocm
[08:02:41] isaranot: --^
[08:02:56] isaranto: --^
[08:04:58] Ouch! I wonder what the issue is
[08:06:05] I'll look into this
[08:20:17] wow
[08:21:28] I see https://gitlab.com/users/Titaniumtown/projects now
[08:21:41] but no bits and bytes
[08:22:26] I suspect with recent reporting and opinion pieces around MS and GH, this will be a common thing for GH repos :-/
[08:24:07] I have found two forks: https://github.com/broncotc/bitsandbytes-rocm https://github.com/broncotc/bitsandbytes-rocm both of which do not have very recent updates, but I dunno when the now-gone root repo had its last update
[08:24:20] er, second one is https://github.com/RockeyCoss/bitsandbytes-rocm
[08:25:12] (it may also be completely unrelated to the original repo, who knows)
[08:26:59] Another option is of course to try and contact Titaniumtown on GL and see if they still have a copy of the code
[08:27:15] (which would still raise questions about maintenance, of course)
[08:39:13] Machine-Learning-Team, Research: Upgrade xgboost in knowledge_integrity - https://phabricator.wikimedia.org/T350389 (elukey) Thanks a lot folks for this work! And also @MunizaA thanks a lot for the sha512! \o/
[08:46:39] This was needed to load a quantized version of an LLM which still had its failures. I can totally remove it for now or try to switch to another fork since it is an experimental feature
[08:48:57] +1
[08:51:15] isaranto: o/ next Thursday there is the monthly moderator tools meetup, lemme know if you need info about it etc.
[08:52:02] I can attend as well to inform them that you'll be following the use case for them from now on (if you are still available of course)
[08:53:22] We can have a chat early next week!
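[Editor's note: the by-hand pypi.org check at [07:42:19] can be scripted. A minimal sketch — the `https://pypi.org/pypi/<package>/json` endpoint is PyPI's real public JSON API, but the helper names here are illustrative, not anything from the log:]

```python
import json
import urllib.request


def latest_from_payload(payload: dict) -> str:
    """Extract the newest release version from a PyPI JSON API payload."""
    return payload["info"]["version"]


def latest_pypi_version(package: str) -> str:
    """Query PyPI's JSON API for the newest released version of a package."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        return latest_from_payload(json.load(resp))


# e.g. latest_pypi_version("kserve") returned "0.11.1" at the time of this log
```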
[09:12:59] (CR) Elukey: "Did a second pass, left some notes :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[09:14:07] (CR) Elukey: article-descriptions: add article-descriptions model server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[09:26:54] elukey: I am doing the puppet v7 migration for serve-eqiad today, starting again with etcd, then ctrl, then workers. Since the roles are shared between eqiad and codfw, I'll do individual machine switches for them only, and continue the same way with codfw next week, and then switch the roles and clean up after all is done. any comments/objections?
[09:28:59] klausman: sounds good, remember that you also have the dc-specific hiera option, which you could use to target only some subset of hosts (rather than using the host-specific hiera)
[09:29:43] Yeah, I was strongly considering that. The first host of each role will still be individual, of course.
[09:29:50] but if you have already moved one host of each kind I think that you can proceed with all the rest
[09:29:58] Ack.
[09:30:11] disable puppet, merge the role-specific one, and proceed with a couple at a time
[09:30:31] :+1:
[09:31:23] klausman: do you want me to help with the ores code deprecation in puppet?
[09:31:33] That would be lovely.
[09:32:41] okok I can file some patches
[09:40:19] elukey: So e.g. hieradata/role/eqiad/etcd/v3/ml_etcd.yaml would be the right place for eqiad-wide etcd stuff, right?
[09:42:04] yep, check with pcc to be sure though
[09:43:36] ack
[09:44:04] btw, the "auto" selection of hosts doesn't seem to work for me, but my pcc install might be outdated
[09:45:01] I struggled to make it work as well
[09:51:34] pcc looks good, proceeding with puppet-disable and cookbook
[09:53:43] aaand there's a snag
[09:54:34] (PS1) Ilias Sarantopoulos: llm: remove bitsandbytes-rocm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975212
[09:54:49] https://phabricator.wikimedia.org/P53538 Kinda weird.
[09:55:05] I was discussing with kevin to create a fork of the article-descriptions repo under the wikimedia domain on GH (or in our gitlab). Since it is a direct dependency I think we should have a repo under WMF's ownership and do any changes there (add tags etc). wdyt?
[10:11:28] (CR) Kevin Bazira: [C: +1] llm: remove bitsandbytes-rocm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975212 (owner: Ilias Sarantopoulos)
[10:11:37] Ah, the puppet v7 thing was me forgetting a step %-)
[10:12:38] isaranto: it is ok for me, one thing I'd be worried about is extra TOIL in maintaining the repo/tags/etc. (every time we'd need to rebase from upstream, tag, etc.). If we document everything it should be good though (just raising some thoughts, the idea is good)
[10:22:08] (CR) Elukey: [C: +1] llm: remove bitsandbytes-rocm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975212 (owner: Ilias Sarantopoulos)
[10:38:40] all code reviews for ores deprecation in puppet are out
[10:38:49] I may have missed something but the bulk should be there
[10:39:55] thank you so much!
[10:40:22] Do they have any particular order I should look at them in?
[10:40:46] they are chained, so I'll merge one after the other
[10:40:51] ack
[11:00:04] Just migrated 1008 to puppet v7. Will let that soak over lunch, then do the remainder of 100x in chunks
[11:02:50] ack
[11:03:04] thanks for the reviews, going to merge the clean up
[11:03:44] I had one question about conftool, feel free to answer here or on the review
[11:05:05] ah yes, I think that the conftool data will be removed by puppet-merge
[11:06:00] :+1:
[11:22:32] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey) Updated Wikitech and Mediawiki documentation pages about ORES with a deprecation banner.
[11:22:48] Added a lot of deprecation banners in ORES mediawiki/wikitech pages
[11:23:00] I have surely missed some, but for the moment the message should be clear
[11:24:06] * elukey lunch
[11:26:21] thanks Luca!
[11:26:58] (CR) Ilias Sarantopoulos: [C: +2] llm: remove bitsandbytes-rocm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975212 (owner: Ilias Sarantopoulos)
[11:27:45] (Merged) jenkins-bot: llm: remove bitsandbytes-rocm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975212 (owner: Ilias Sarantopoulos)
[12:58:37] (PS10) Ilias Sarantopoulos: article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[13:03:54] I created a fork of the descartes repo in WMF Github. https://github.com/wikimedia/descartes
[13:03:54] I chose GH since the original repo is over there and syncing (if needed) is super easy. kevinbazira you can use this repository with the release I made and tag v1.0.0 which I created manually.
[13:03:54] I want to also add a tagging mechanism to the repo but for now, in order to not block you, we can proceed with this
[13:03:59] * isaranto lunch
[13:04:35] this is the release https://github.com/wikimedia/descartes/releases/tag/1.0.0 and the tag is pinned on the latest commit of the transformers-wrapper branch
[13:05:10] great. thank you for creating the repo isaranto. I am going to use the v1.0.0 tag.
[13:19:02] elukey: puppet runs are failing on cumin hosts, related to your ORES removal, maybe one of the patches isn't merged yet?
[13:19:04] Error: /Stage[main]/Profile::Httpbb/Httpbb::Test_suite[ores/test_ores.yaml]/File[/srv/deployment/httpbb-tests/ores/test_ores.yaml]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/httpbb/ores/test_ores.yaml
[13:19:45] moritzm: fixing, I have probably missed one config removal
[13:19:51] thanks
[13:23:12] moritzm: should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/975267
[13:23:39] drive-by-+1'd
[13:23:51] thanks!
[13:23:59] merging once CI gives the +2
[13:24:30] lgtm, thanks
[13:50:05] (PS1) AikoChou: revert-risk: update knowledge integrity to 0.5.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844)
[13:50:42] (PS2) AikoChou: revert-risk: update knowledge integrity to 0.5.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844)
[13:55:38] elukey: halp, I messed up
[13:56:03] what happened? :)
[13:56:39] The v7 migrations on 1001 and 1003 failed. I think it was because I was too eager re-enabling puppet on them before the migration cookbook could run
[13:56:51] I thought the cookbook could handle multiple hosts, but it can't
[13:57:23] So I did two at a time in parallel, but Puppet (v5?) hit 1001 and 1003 and now the cookbook fails (paste in a second)
[13:58:04] https://phabricator.wikimedia.org/P53545
[13:58:18] (CR) Ilias Sarantopoulos: [C: +1] "I suggest we upgrade kserve in this patch as well. Any reason not to do it?" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844) (owner: AikoChou)
[13:59:00] aiko: o/ thanks for working on revertrisk! I haven't deployed the model servers. I'll wait for the xgboost upgrade and I'll deploy them all monday morning
[13:59:32] klausman: so you only migrated/worked on 100[13], got the errors and then reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/975039
[13:59:38] right? So the other nodes are ok
[14:00:33] the other nodes are ok
[14:01:03] okok, let me try something
[14:01:16] can I work on ml-serve1001 or are you working on it now?
[14:01:18] I tried making an override for 1001 and 1003, but that didn't work, so I reverted that
[14:01:18] klausman: --^
[14:01:29] I have my hands off of both 1001 and 1003
[14:05:07] so I tried to clean up the puppet cert files like described in https://wikitech.wikimedia.org/wiki/Server_Lifecycle, but I don't see the new CSRs on puppet master 1001, and I see that there is a banner on 1001 saying "the host has been migrated to puppet 7"
[14:05:24] checking what the cookbook does, there is surely a way to revert
[14:06:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/975039 has a rollback procedure, I just realized
[14:07:19] I can try that, but I don't want to interfere with what you're doing, so I'll try it on 1003
[14:07:30] (CR) AikoChou: revert-risk: update knowledge integrity to 0.5.0 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844) (owner: AikoChou)
[14:07:41] klausman: write in here what you are planning to do first :)
[14:09:37] Machine-Learning-Team, Research: Upgrade xgboost in knowledge_integrity - https://phabricator.wikimedia.org/T350389 (achou) I verified the sha512 checksum. The model file has been uploaded to Swift: ` aikochou@stat1005:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/revertrisk/language-agn...
[14:10:21] $ sudo puppetserver ca clean --certname ml-serve1003.eqiad.wmnet
[14:10:23] Error:
[14:10:25] Could not find files to clean for ml-serve1003.eqiad.wmnet
[14:10:53] That's the first step. I figure the migration maybe already did that, so I tried the next step (ca destroy on the p-m)
[14:11:10] But looking at the script... where would the puppet.conf in the first line come from?
[14:11:36] where did you find these steps?
[14:11:46] I was trying a similar thing but with the right commands for 1001
[14:11:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/975039
[14:11:51] this is why I asked for coordination
[14:11:55] The ticket for the migration
[14:12:10] And I am only touching 1003, as I said.
[14:14:24] the script is indeed a little strange
[14:14:43] I will try with a puppet.conf from another eqiad host. the file looks very generic.
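[Editor's note: the sha512 verification mentioned in the T350389 updates can be done with Python's stdlib hashlib. A minimal sketch — the file name and expected digest below are placeholders, not values from the log:]

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks
    so large model binaries don't need to fit in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# expected = "<digest published alongside the model artifact>"  # placeholder
# assert sha512_of("model.bin") == expected, "checksum mismatch!"
```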
[14:15:29] I am wondering if we could simply re-apply the puppet 7 settings
[14:15:32] clean up the certs
[14:15:38] and retry the cookbook
[14:16:32] so https://gerrit.wikimedia.org/r/c/operations/puppet/+/975040
[14:17:13] klausman: --^
[14:17:34] Currently the+1'd
[14:17:43] oops, half a sentence there
[14:17:53] thanks merged, let's see
[14:18:29] I also need to merge
[14:18:32] oops,
[14:18:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/975041
[14:18:58] +1
[14:19:16] when you are done with puppet-merge lemme know, I'll start the cookbook
[14:20:17] all merged
[14:20:19] super
[14:21:06] Looks like I've woken up to a pretty dramatic day
[14:21:40] (CR) Ilias Sarantopoulos: [C: +1] revert-risk: update knowledge integrity to 0.5.0 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844) (owner: AikoChou)
[14:22:27] chrisalbon: morning! No, nothing dramatic
[14:22:38] puppet always hates us
[14:23:16] lol
[14:23:41] Yes, puppet hates us, not "Tobias once again was too eager re-enabling Puppet" ;-/
[14:26:12] I asked John for some help, I think that he knows the magic to make this work
[14:28:17] ack.
[14:31:55] klausman: should be all good now
[14:32:05] 1003 too?
[14:32:05] ml-serve1001 at least, did you need help with the other?
[14:32:17] I'll do that now, one sec
[14:32:25] I am not sure I didn't mess that up further. What is the cert cleaning procedure for v7?
[14:32:58] for 7 you run the following on puppetserver1001
[14:33:00] ml-serve1001.eqiad.wmnet
[14:33:04] sudo puppetserver ca clean --certname ml-serve1001.eqiad.wmnet
[14:33:43] ah, I see. Thanks
[14:35:11] jbond: thanks! So I was basically missing the ca clean step on puppetserver right?
[14:35:24] np, ml-serve1003 puppet7 is running now
[14:35:48] elukey: i guess it depends if you were trying to roll back or fix forward :)
[14:36:00] jbond: the latter :D
[14:36:02] if the latter then yes, you just needed to clean on the puppetserver and then sign there
[14:36:12] super thanks for the explanation and the fix <3
[14:36:17] no problem
[14:36:50] Ty a lot. I promise not to mess up the remainder of the hosts. Or at least not in the same way.
[14:39:18] why are you moving on a per-host basis and not for the whole role, BTW?
[14:39:46] Because I didn't want to do codfw and eqiad at the same time
[14:40:01] For the staging cluster that was easy because there is only one.
[14:41:09] And for cassandra it was easier because they don't see prod traffic
[14:41:17] (our Cassandra, that is)
[14:42:20] isaranto: while cleaning up I found https://grafana.wikimedia.org/d/000000263/ores-extension?orgId=1, not sure if we want to keep it or not
[14:42:32] ok. all the other kubernetes clusters are migrated already, BTW, including the main wikikube, so the risk seems very minimal
[14:42:53] Yeah, I'll do eqiad as whole-roles next week
[14:43:07] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey)
[14:45:10] elukey: I don't remember this dashboard. But since we are monitoring the extension through the official mediawiki dashboards I think we're covered
[14:45:26] ack deleting it :)
[14:45:48] ok!
[14:45:51] thanks
[14:47:33] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey)
[14:47:42] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey) @klausman everything should be done, except the work in T349632, lemme know if anything is missing, otherwise this is done.
[14:50:00] (PS3) AikoChou: revert-risk: upgrade Kserve 0.11.1 and knowledge integrity 0.5.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844)
[14:50:30] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (klausman) >>! In T347278#9340974, @elukey wrote: > @klausman everything should be done, except the work in T349632, lemme know if anything is missing, otherwise this is done....
[14:53:47] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844) (owner: AikoChou)
[15:04:31] (Merged) jenkins-bot: revert-risk: upgrade Kserve 0.11.1 and knowledge integrity 0.5.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975274 (https://phabricator.wikimedia.org/T349844) (owner: AikoChou)
[16:40:52] have a nice weekend folks!
[17:00:41] ciao Luca! I'm going afk as well. cu Monday folks, enjoy the weekend!
[17:11:01] bye Ilias and luca! have a nice weekend :D
[17:37:53] (PS11) Kevin Bazira: article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123)
[17:40:03] bye all!
[17:47:04] (CR) Kevin Bazira: article-descriptions: add article-descriptions model server (6 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[19:46:33] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Patch-For-Review: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (KStoller-WMF)