[08:32:59] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (jnuche) When you're ready to publish your docs/coverage with [[ https://gitlab.wikimedia.org/repos/releng/docpub | docpub ]], your...
[08:58:17] o/
[08:58:26] lol my gerrit UI fails when I expand the file https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/922512
[09:00:46] Lift-Wing, Machine-Learning-Team: Move Revert-risk language agnostic model from staging to production - https://phabricator.wikimedia.org/T332998 (elukey) @diego Hiiiii! Do you have a model card that we can review? :)
[09:20:47] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (jnuche)
[09:26:42] \o
[09:27:02] elukey: I can push the changes (pod#) to codfw any time you're ready
[09:27:59] Machine-Learning-Team, ORES, artificial-intelligence, ML-Governance, Documentation: Use data template on English Wikipedia ORES model cards - https://phabricator.wikimedia.org/T337830 (kevinbazira)
[09:33:38] klausman: o/ anytime!
[09:34:14] One thing: there are older pods for revert risk stuck in "Terminating", any idea what's going on there?
[09:35:03] ah weird
[09:36:13] Ok, diff for the replica-# looks good, applying.
[09:37:41] Hm. I sorta expected the 3/3 -> 5/15, but now there are extra lines with 3/3
[09:39:22] 3/3 is each pod (#containers)
[09:40:13] ah, right. I thought it was replicas, for some reason
[09:43:30] Also, some of them remain pending. Is that for the 5~15 elasticity?
[09:45:22] in theory no, it should be 5x3/3
[09:45:34] maybe we need to tweak the ceiling for max cpu/memory
[09:45:55] yeah, assorted google results say it's usually insufficient resources
[09:47:00] Oh.
[09:47:22] kubectl get events -n etc.. should tell you
[09:47:32] https://phabricator.wikimedia.org/P48666
[09:47:49] ml-staging2001.codfw.wmnet is marked as "Not Ready"
[09:50:31] ah snap
[09:51:36] let's uncordon
[09:52:02] It's not cordoned
[09:52:33] the ctrl nodes think it's actually unreachable
[09:52:47] I've already tried restarting the kubelet, didn't make a difference
[09:53:03] No systemd units are marked as failed there. Currently examining logs
[10:01:55] definitely strange
[10:02:23] Dunno what to do about it. Rebooting seems a bit desperate/papering over a problem
[10:04:07] I agree, could it be cadvisor related?
[10:04:29] maybe, the time correlates
[10:04:36] I'll poke Filippo
[10:04:50] let's try to stop the service first
[10:04:58] doing it
[10:05:53] done, kubelet also restarted
[10:06:25] Still tainted
[10:07:42] ah wait, systemctl cat cadvisor points to kubelet + override
[10:09:42] Oh, so it might not even be part of the rollout yet
[10:30:30] klausman: one thing that I don't get though is why we have 5 pods in staging for revertrisk, in theory I applied an override to have only one
[10:30:36] (aside from cadvisor)
[10:33:57] you mean five pods with 3c each?
[10:35:28] we should have only one pod for each rr model
[10:35:31] So one difference I see from a YAML perspective is that the 5/15 thing is in an `inference_services` stanza, while the override is in `inference`. Not sure if that is how it's supposed to stack
[10:36:34] ahhh wait https://integration.wikimedia.org/ci/job/helm-lint/10808/console
[10:36:45] I got fooled by the diff in --^
[10:36:54] only one of the RR has 1 pod set, the other has 5
[10:37:04] so something in the template isn't playing as I expected
[10:37:13] ok mystery solved, I'll investigate
[10:40:40] (PS24) Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170)
[10:42:06] (CR) CI reject: [V: -1] feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[10:42:12] (PS25) Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170)
[10:42:49] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked to this task...
[10:43:36] (CR) CI reject: [V: -1] feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[10:46:09] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked t...
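The READY column confusion from the 09:37 to 09:45 exchange (3/3 counts ready containers per pod, not replicas) can be sketched with a small parser. This is only an illustration: the sample text imitates the usual `kubectl get pods` columns, and the pod names and rows are made up, not taken from the cluster discussed here.

```python
from collections import Counter

# Hypothetical sample in the shape of `kubectl get pods` output.
# Pod names, statuses, and ages are invented for illustration only.
SAMPLE = """\
NAME                    READY   STATUS        RESTARTS   AGE
example-predictor-a     3/3     Running       0          5m
example-predictor-b     3/3     Running       0          5m
example-predictor-c     0/3     Pending       0          2m
example-predictor-old   3/3     Terminating   0          3d
"""


def summarize(text):
    """Return (pod count per STATUS, list of (ready, total) containers per pod)."""
    rows = [line.split() for line in text.strip().splitlines()[1:]]
    statuses = Counter(row[2] for row in rows)
    # READY is "<ready containers>/<containers in the pod>", one row per pod.
    containers = [tuple(int(n) for n in row[1].split("/")) for row in rows]
    return statuses, containers


statuses, containers = summarize(SAMPLE)
print(len(containers), "pods:", dict(statuses))
```

Each row is one pod, so the replica count is the number of rows for a service, while READY only says how many of that pod's containers are up.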
[10:50:26] (PS26) Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170)
[10:51:09] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked to this task...
[10:51:54] (CR) CI reject: [V: -1] feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[10:52:09] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked t...
[11:07:16] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked t...
[11:09:16] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked to this task...
[11:09:34] * elukey lunch!
[11:09:38] ditto
[11:10:15] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked t...
[11:19:31] (PS27) Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170)
[11:20:09] (PS28) Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170)
[11:27:41] (PS1) AikoChou: revert-risk: handle unsupported edit types for wikidata model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924912 (https://phabricator.wikimedia.org/T333125)
[11:30:43] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked t...
[11:33:36] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked to this task...
[11:33:51] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked to this task...
[11:34:34] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarantopoulos-WMF using patch(es) linked t...
[11:39:59] * isaranto going for lunch
[12:52:17] (CR) Kevin Bazira: revert-risk: handle unsupported edit types for wikidata model (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924912 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[13:01:27] klausman: I think that we can proceed with the deployments in prod, and with the api-gateway too
[13:01:39] staging is kinda not working due to cadvisor
[13:11:01] We are going to deploy the changes for mediawiki-config tomorrow morning https://wikitech.wikimedia.org/wiki/Deployments?venotify=saved#deploycal-item-20230601T0700
[13:15:33] 🤞
[13:22:04] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (isarantopoulos) The patches for the switch are the following: - 922512: ORES: add model versions configuration and thresholds | h...
[13:23:14] isaranto: nice!
[13:29:13] elukey: agreed re: pod numbers.
[13:29:28] I'll create the change in a minute, unless you want to.
[13:30:16] Machine-Learning-Team, ORES, artificial-intelligence, ML-Governance, Documentation: Use data template on English Wikipedia ORES model cards - https://phabricator.wikimedia.org/T337830 (kevinbazira) The data transclusion template has been added to 5/6 English Wikipedia ORES model cards. The e...
[13:30:55] klausman: what change do you have in mind? I checked a bit and it is not super easy, we'd need to copy/paste the isvc declaration from values.yaml to values-ml-staging-codfw.yaml afaics
[13:31:26] Machine-Learning-Team, ORES, artificial-intelligence, ML-Governance, Documentation: Use data template on English Wikipedia ORES model cards - https://phabricator.wikimedia.org/T337830 (kevinbazira)
[13:32:57] elukey: I don't quite understand. What deployments did you mean? I thought you were referring to the increase in pod numbers? Wouldn't that be the same change as 924544, just for ml-serve?
[13:45:53] Ah wait, the pod increase doesn't need a change since prod derives from the same def, but doesn't have an override
[13:46:11] I clearly need more caffeine
[13:48:06] yes yes sorry
[13:49:13] I can do the push rn if you're game. Starting with codfw
[13:51:11] Or we can wait until after the meeting
[13:52:15] folks commuting to the office, may be 5 mins late to the meeting, please start without me
[13:52:27] klausman: go ahead anytime
[13:52:33] roger!
[13:59:35] One of the pods took a long time to terminate, but we've now switched all, so there are five rr-la pods in codfw
[13:59:47] (prod)
[14:05:59] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (pfischer) I wrote a [[ https://docs.google.com/spreadsheets/d/1ao5HaKaZvAneM2zrTS1r3ZjXSaGnK9M61EQtTR2cAL...
[14:14:41] this is the page related to Deployment training https://wikitech.wikimedia.org/wiki/Deployments/Training
[14:18:19] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (pfischer) I wrote a [[ https://docs.google.com/spreadsheets/d/1ao5HaKaZvAneM2zrTS1r3ZjXSaGnK9M...
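The point made at 13:45 (prod derives from the same definition but has no override, so it picks up the new pod numbers without a separate change) can be sketched as a layered values merge. This is a minimal sketch, assuming Helm-style behaviour where nested maps from a later values file are merged key by key over the defaults. The model names, the `min_replicas`/`max_replicas` keys, and the dict contents are hypothetical and are not the real chart values; the `inference` and `inference_services` stanza names come from the 10:35 message, but whether that mismatch played any role was left open in the log.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with `override` layered on top.

    Nested dicts are merged key by key; any other value is replaced.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-in for the shared defaults (values.yaml).
defaults = {
    "inference_services": {
        "model-a": {"min_replicas": 5, "max_replicas": 15},
        "model-b": {"min_replicas": 5, "max_replicas": 15},
    }
}

# Hypothetical staging override that only names one of the two models.
staging_override = {
    "inference_services": {"model-a": {"min_replicas": 1, "max_replicas": 1}}
}

# Hypothetical override placed under a different top-level stanza.
misplaced_override = {"inference": {"model-a": {"min_replicas": 1}}}

staging = deep_merge(defaults, staging_override)
prod = deep_merge(defaults, {})  # no override file for prod
misplaced = deep_merge(defaults, misplaced_override)

for name, values in (("staging", staging), ("prod", prod), ("misplaced", misplaced)):
    replicas = {m: s["min_replicas"] for m, s in values["inference_services"].items()}
    print(name, replicas)
```

In this toy setup, a model that the override does not mention keeps the default of 5, prod with no override matches the defaults exactly, and an override under a different stanza leaves `inference_services` untouched. The actual cause of the staging pod count was left for later investigation in the log, so this shows only how such results can arise, not what happened.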
[14:41:22] Applying pod increase in eqiad now
[14:42:02] elukey: I sunced with Hugh re: API GW limit increase and he gave me the go-ahead, so I'll deploy that once this is done (and has settled without errors for a few minutes)
[14:42:06] synced*
[14:44:46] okok nice
[14:49:18] Alright, new pods have been up for 8m and the old pods are all gone, pushing the limit increase for the API GW in a few moments
[14:50:48] check also logs etc..
[14:51:28] yarp
[14:52:53] https://phabricator.wikimedia.org/P48672 Is this known?
[14:53:02] aiko: ^^^
[15:12:17] klausman: what's that? I don't have permission to view that
[15:12:28] oh, oops, sec
[15:12:56] Does it work now?
[15:13:18] no
[15:13:21] now?
[15:13:32] oh yes
[15:15:02] yeah the warning is fine
[15:15:24] Ok. I figured it was, since I suspect if you get that bit actually wrong, it breaks in toehr ways as well
[15:15:33] other*
[15:16:09] ok, taking a small break before the rest of today's meetings
[15:16:29] (PS1) Ilias Sarantopoulos: fix: remove unused tox command [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924971
[15:17:27] (CR) Elukey: [C: +1] fix: remove unused tox command [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924971 (owner: Ilias Sarantopoulos)
[15:26:03] (CR) Ilias Sarantopoulos: [C: +2] fix: remove unused tox command [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924971 (owner: Ilias Sarantopoulos)
[15:27:17] (Merged) jenkins-bot: fix: remove unused tox command [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924971 (owner: Ilias Sarantopoulos)
[16:29:18] going afk folks! Have a nice rest of the day
[17:24:37] \o
[22:57:04] Lift-Wing, Machine-Learning-Team: Move Revert-risk language agnostic model from staging to production - https://phabricator.wikimedia.org/T332998 (diego) >>! In T332998#8891525, @elukey wrote: > @diego Hiiiii! Do you have a model card that we can review? :) here: https://meta.wikimedia.org/wiki/Machine_...
[23:10:28] Lift-Wing, Machine-Learning-Team: Move Revert-risk language agnostic model from staging to production - https://phabricator.wikimedia.org/T332998 (diego)