[09:01:55] morning :)
[09:02:07] \o
[09:08:38] hello :)
[09:18:26] aiko: o/ one qs about https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/849627
[09:18:30] what is the idea for staging?
[09:19:29] because you are swapping the nsfw isvc with the revert-risk one afaics
[09:30:36] elukey: the idea is I only want to deploy revert-risk to staging, because nsfw isvc had been deployed to prod, so it doesn't need to be in staging.
[09:31:05] elukey: and we don't want too many pods in staging
[09:34:47] aiko: yeah but we'd need a representative of each model server in staging to test
[09:35:03] so I'd say to remove the staging override, and then we are good to go
[09:35:26] these are small model server deployments so we can afford it
[09:35:32] does it sound good?
[09:40:12] elukey: ok considering we need a representative of each model server for test purpose, that sounds good
[09:40:28] elukey: I'll update the patch :)
[09:48:43] updated :)
[09:52:16] merged :)
[10:15:56] https://aws.amazon.com/de/blogs/containers/using-prometheus-to-avoid-disasters-with-kubernetes-cpu-limits/ is really nice
[10:17:38] going to lunch in a bit!
[10:19:51] Machine-Learning-Team, ContentTranslation, SRE, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF)
[10:20:32] Machine-Learning-Team, ContentTranslation, SRE, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF)
[12:09:50] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kevinbazira) The conclusion on the backtesting results is that most of the languages look fine besides: - crhwiki and cuwiki whose precision is below...
[12:25:54] elukey: when you're back, I'd love some help with that Postgres Puppet change
[13:00:50] (meeting now, til 15:00)
[13:12:07] I am back :)
[13:45:31] above time is wrong, meeting ends at 16:00 :)
[13:46:57] I added some comments to the code review :)
[13:47:11] merci!
[14:02:34] elukey: the weird thing is that I still get a PCC error that says that the role can't be found (https://puppet-compiler.wmflabs.org/pcc-worker1001/37807/wikilabels-database-02.wikilabels.eqiad1.wikimedia.cloud/prod.wikilabels-database-02.wikilabels.eqiad1.wikimedia.cloud.err)
[14:04:57] (meeting)
[14:06:08] (ack)
[14:12:28] Morning all!
[14:23:37] \o hey crhis
[14:23:42] chris*
[14:29:48] klausman: checking pcc now
[14:29:51] hello Chris
[14:35:39] klausman: I think it is a weird error due to the fact that the role is not merged yet, I wouldn't worry about it
[14:37:36] Alright.
[14:38:41] aiko: wow I got a weird error for the process pool while testing
[14:38:41] concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
[14:42:23] perfs are still not good with the process pool
[14:42:28] will need to work on it more
[14:54:17] elukey: really! does that error happen frequently?
[15:37:18] aiko: sporadically afaics, but probably the exception needs to be handled
[15:43:03] elukey: you think the postgres patch is ready to merge?
[15:44:31] +1ed
[15:44:32] I take that as a yes :)
[15:45:14] the code with the process pool seems to be slow when traffic hits the model server sigh :(
[15:45:50] I will have to run more comparisons
[15:57:40] elukey: one last puppet fix :)
[16:16:41] ship it!
[16:16:50] going afk!
[16:51:49] \o
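
Note on the BrokenProcessPool error mentioned at 14:38 and 15:37: the log does not show how the model server uses its process pool, so the following is only a minimal sketch of one way the exception could be handled. It relies on the documented behaviour that a broken ProcessPoolExecutor cannot be reused, so the sketch discards it and builds a new one before retrying. The class name RebuildingPool, its run() method, the rebuilds counter, and the executor_factory parameter are all hypothetical names introduced for illustration; they do not come from the patch under discussion.

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


class RebuildingPool:
    """Hypothetical wrapper that replaces the executor once it breaks."""

    def __init__(self, executor_factory=None):
        # The factory is injectable (an assumption of this sketch) so the
        # retry logic can be exercised without spawning real processes.
        self._factory = executor_factory or (
            lambda: ProcessPoolExecutor(max_workers=2)
        )
        self._executor = self._factory()
        self.rebuilds = 0  # how many times the pool has been replaced

    def run(self, fn, *args, retries=1):
        """Run fn(*args) in the pool, retrying up to `retries` times
        if the pool is reported broken."""
        attempt = 0
        while True:
            try:
                # submit() raises if the pool is already broken; result()
                # raises if a worker dies while this task is pending.
                return self._executor.submit(fn, *args).result()
            except BrokenProcessPool:
                # A broken executor never recovers: drop it and start a
                # fresh one so later callers are not affected.
                self._executor.shutdown(wait=False)
                self._executor = self._factory()
                self.rebuilds += 1
                if attempt >= retries:
                    raise
                attempt += 1

    def close(self):
        """Shut down the current executor."""
        self._executor.shutdown(wait=True)
```

With a real ProcessPoolExecutor, the calling code should sit under an `if __name__ == "__main__":` guard and the submitted function must be picklable. Retrying re-runs the task, so this only suits work that is safe to repeat. It does nothing for the slowness under traffic noted at 15:45, which is a separate problem.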