[06:08:27] Good morning!
[06:43:08] o/
[07:52:26] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (WMDE-leszek)
[07:55:42] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (WMDE-leszek) Hello, regarding the wikibase/termbox service -- we'd be fine with a move to gitlab but have a question for ourselves...
[08:39:24] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (kevinbazira)
[08:41:54] Machine-Learning-Team, GitLab (Project Migration): Move add-a-link to gitlab - https://phabricator.wikimedia.org/T334605 (kevinbazira) Thank you for the confirmation. We have given the green light to migrate. There will be communication in case pipeline tests break.
[09:06:40] Good morning :)
[09:46:43] elukey: I guess this is where I can create a new namespace, right?
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/ml-serve.yaml#156
[09:48:55] I want to use a TLS certificate as well and then configure it in the fastapi app
[09:54:42] isaranto: yes, but there are also other puppet-related settings to add, we didn't document those yet :(
[09:55:00] if you are blocked I can work on it today
[09:56:08] I'm not blocked since I am working on the mediawiki extension
[09:58:15] once we add this I could add that documentation to make sure I understand how things work
[09:58:48] we also need to create a load balanced service, that is a little complicated
[09:58:58] see https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[09:59:03] (one for staging, one for prod)
[10:00:02] ah wait, past Luca added some docs https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#Add_a_new_helmfile_config
[10:00:39] yay, past luca!
[10:00:58] current Luca doesn't remember a lot, this is not great :D
[10:01:20] ah well, the root problem of all knowledge: discovery :)
[10:01:31] At least if it's written down, it can be indexed
[10:02:26] klausman: do you want to add the namespace?
[10:03:10] I can probably do that and not injure myself :)
[10:05:14] Have we decided on what the namespace will be called?
[10:05:34] ores-legacy? not 100% sure, this is what I remember
[10:05:45] seems good
[10:06:32] thank you both :) please add me as reviewer. I want to document the procedure of adding a new service/namespace/endpoint etc
[10:06:43] Sure!
[10:10:33] also filed all the code reviews to work around the permission issues for the gpu devices, we'll see (already asked Moritz's opinion as well)
[10:11:57] elukey: the secrets for the new ns should just be o-l and o-l-deploy in the private repo, right?
[10:13:38] in theory yes, but please check what we did for the other ns..
the puppet fake private repo is also a good source of truth, we should have everything that needs to be set in it
[10:13:52] ack
[10:14:36] Should we maybe also add a values-ml-staging-codfw.yaml to _example_ at some point?
[10:15:32] not sure if needed but it could be good
[10:16:37] Yeah, it's not super important, but might prevent someone from forgetting about staging (like I almost did :))
[10:17:54] it is not mandatory to have it, only if we vary some settings
[10:18:01] otherwise the values.yaml file is good enough
[10:18:09] I think the main use is turning off monitoring in staging
[10:18:41] not sure what monitoring does, we haven't disabled it in other places (also, why should it be disabled?)
[10:19:06] It's disabled e.g. on articletopic-outlink in staging
[10:19:21] as for why: no idea.
[10:19:28] let's try to figure it out
[10:21:05] so I see
[10:21:06] monitoring:
             # If enabled is true, monitoring annotations will be added to the deployment.
             enabled: false
[10:21:15] this is the "app" module
[10:21:51] what are the consequences of those annotations?
[10:22:34] that prometheus.io/etc.. annotations are added, so gathering metrics
[10:23:06] but it seems oriented to serviceops-like apps
[10:23:14] I think that we don't really need to set it to false
[10:23:52] I'll omit the staging file then, since at least at first, turning off those would be the only content
[10:30:03] elukey: does https://gerrit.wikimedia.org/r/c/labs/private/+/909974 look sensible? I want to commit that (and do the puppetmaster side stuff) before committing the main change
[10:31:15] yep looks good
[10:31:49] * isaranto goes afk for lunch
[10:32:33] same!
[10:32:44] \o
[10:39:50] Ok, faux and real secrets done, time for Lunch :)
[11:37:26] Machine-Learning-Team, Epic: Add GPUs to the Machine Learning infrastructure - https://phabricator.wikimedia.org/T333462 (elukey)
[12:31:47] elukey: The _example_ contains `createNamespace: false`, will that have to be true for the first commit?
[12:32:30] klausman: that option it so keep helm from creating the namespace itself, we do it via admin_ng
[12:32:43] *is to
[12:33:34] ah, right. thx
[12:34:08] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/909992 is ready, then
[12:34:27] aaaand jenkins already hates it
[12:34:43] klausman: Ilias already created that bit :)
[12:34:52] oooh.
[12:35:08] you'd need to file a change for admin_ng
[12:35:16] otherwise we cannot proceed with Ilias' one
[12:37:40] That's just a tlsExtraSANs entry, right?
[12:38:39] ah, and a stanza of course
[12:40:48] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/909993 should be better
[13:11:25] elukey: I am seeing extra diffs, about resource quotas (180 -> 240 for cpu and mem)
[13:12:05] klausman: for what cluster?
[13:12:11] And two IP blocks are being removed, 10.192.0.35/32 and 2620:0:860:101:10:192:0:35/128
[13:12:14] staging
[13:12:20] (ml-staging-codfw)
[13:12:21] yeah those were for prod, I think it is ok
[13:12:37] the two IP blocks - what do they refer to?
[13:12:48] Alright, syncing until you yeHchecking netbox
[13:13:00] ??
[13:13:11] sorry, changed my mind in the middle of typing :D
[13:13:17] I am checking netbox for the IPs.
[13:13:23] okok
[13:14:39] Both IPs are listed as unused in netbox. Is there any history?
[13:15:04] if they are in the global config we should be fine
[13:15:37] It's in cert-manager, cfssl-issuer, NetworkPolicy (networking.k8s.io)
[13:16:08] The policy for allowing egress to those IPs
[13:16:14] So yeah, probably fine.
[13:16:18] Syncing staging
[13:17:10] synced and diff is now empty
[13:18:27] And the namespace is visible
[13:18:55] # kubectl get namespace |grep -E '(NAME|ores)'
[13:18:56] NAME          STATUS   AGE
[13:18:58] ores-legacy   Active   2m10s
[13:19:22] ok let's add the info to the task
[13:19:25] so it is documented
[13:20:16] What specific info?
[13:20:28] what you just wrote above :)
[13:20:33] that the namespace is created etc..
[13:20:35] right :)
[13:21:48] I left a comment for Ilias in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/908191 but it is a quick/minor thing, then we can merge and deploy
[13:21:52] I haven't synced the prod namespaces yet. Do you want to do any further testing before I do that?
[13:22:19] let's wait for the first deployment, but it should be ok
[13:22:23] diff on ml-serve-eqiad shows nothing unusual (the IPs from above and the quota changes are absent)
[13:24:32] okok
[13:24:41] * elukey taking a little break
[13:46:31] isaranto: change 909992 is abandoned, we'll use yours (908191)
[13:48:38] yeah sry my bad, I never saw that abandoned tag
[13:51:23] isaranto: left a comment in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/908191, just a precaution, then we can merge and deploy
[13:51:27] things should be ready to go
[13:51:46] then me and Tobias will work on adding the lvs endpoint for staging
[13:56:57] done, thanks both!
[13:59:44] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (klausman) Namespace has been created on staging, and is visible: ` # kubectl get namespace |grep -E '(NAME|ores)' NAME STATUS...
[14:57:56] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:58:44] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:59:55] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui)
[15:01:34] elukey: I broke puppet on the ctrl plane :-/ https://alerts.wikimedia.org/?q=%40state%3Dactive&q=instance%3D~%28%5Eml%7C%5Eores%29
[15:02:56] klausman: ack so what is the next step :) ?
[15:03:09] I think there is a missing config in puppet proper.
[15:03:11] Due to:
[15:03:13] parameter 'all_infrastructure_users' entry 'ml-serve' entry 'ores-legacy' entry 'groups' expects a value of type Undef or Array, got String
[15:04:44] * elukey nods
[15:04:53] Working on that rn
[15:15:19] My entire PCC setup is broken because Python3.11 does not allow local user library install anymore :(
[15:16:10] running it :)
[15:16:15] thank you
[15:16:48] klausman: the change is good and needed, but the error above makes me wonder if the '-deploy' entry that I see in your private repo diff may be related
[15:17:00] in theory no, but I have no idea how deep the yaml horror is
[15:17:32] Well, the other NSes have a -deploy entry as well
[15:18:42] sure sure, they have - space deploy though
[15:18:46] this is what I was referring to
[15:19:00] anyway, no space left on the pcc node
[15:19:40] space deploy?
[15:20:37] "- deploy" vs "-deploy"
[15:22:05] I am not seeing that in the (faux) private repo
[15:22:27] Oh! you mean the groups: thing
[15:22:59] So those are absent in the faux private repo, but I added them on the puppetmaster. I think.
[15:24:04] yep, they're there
[15:26:43] ok, merged, running on 22002
[15:27:28] klausman: I fixed the "-deploy" thing in puppet private
[15:27:36] ml-serve-ctrl1001 works nicely now
[15:27:38] but...
[15:27:41] IDGI
[15:28:02] I think the "-deploy" without the space was considered a string
[15:28:02] what exactly was missing on the private repo? And why did it affect this NS, but not the others?
[15:28:11] see the error above
[15:28:23] 2002 r-p-a worked
[15:28:54] "parameter 'all_infrastructure_users' entry 'ml-serve' entry 'ores-legacy' entry 'groups' expects a value of
[15:28:57] type Undef or Array, got String
[15:29:00] "
[15:29:01] I thought you had meant that the `groups: - deploy` thing was missing on the fake private repo
[15:29:28] YAML will be the end of us all.
[15:29:51] thank you for saving my behind once more :)
[15:30:11] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[15:30:25] np, it was in the error msg :)
[15:33:53] going afk folks! have a good rest of the day
[15:34:05] cu tomorrow! ciao
[15:43:40] bye Luca :)
[17:08:57] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[17:09:35] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[18:22:29] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 65 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson) Graph done in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Graph/+/902213
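
Note on the "-deploy" vs "- deploy" exchange (15:03 to 15:29): the Puppet error "expects a value of type Undef or Array, got String" fits how YAML treats a dash with no space after it. A minimal sketch follows, assuming the nesting implied by the error message (`all_infrastructure_users` > `ml-serve` > `ores-legacy` > `groups`). The real layout of the private repo is not shown in this log, so treat the surrounding keys as hypothetical.

```yaml
# Hypothetical layout, inferred only from the error message in the log.
# Broken variant: without a space after the dash, "-deploy" is a plain
# scalar, so "groups" holds the String "-deploy" instead of a list.
all_infrastructure_users:
  ml-serve:
    ores-legacy:
      groups: -deploy
---
# Fixed variant: "- " (dash plus space) starts a sequence entry, so
# "groups" is an Array with the single element "deploy".
all_infrastructure_users:
  ml-serve:
    ores-legacy:
      groups:
        - deploy
```

Both variants are valid YAML, which is likely why nothing complained until Puppet type-checked the value on the control plane hosts.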