[03:03:22] anyone know where I can find the files from which this image was built? docker-registry.tools.wmflabs.org/toolforge-bullseye0-builder:latest not on gitlab (or at least not easy to find)
[08:26:30] Raymond_Ndibe: it's on gerrit, but we are not using that anymore https://gerrit.wikimedia.org/g/cloud/toolforge/buildpacks
[09:44:42] I suspect that connection error thing is either some rate limit on the wikimedia CDN, or some network issue with the new k8s workers.
[09:47:42] let me check which workers they are running on
[10:00:51] taavi: listeria is running on the new nfs workers yes, chie-bot seems to be running crons, so they would spawn in different places I guess, but it might have been the new workers too
[10:08:11] I think chiebot was running also on the new NFS nodes
[10:10:17] this is going to be impossible to debug without a way to reproduce or at least exact times of when it has been happening
[10:12:10] given that it seems to happen to several tools, we should be able to reproduce with some code snippet
[10:24:38] hmm, just restarted harbor to clean up caches, and now I seem to be unable to log in :/
[10:25:59] I'm looking at the container_network_transmit_packets_dropped_total prometheus metric and it's showing a few workers that have dropped some transmitted packets somehow. one of them is worker-nfs-5, another is worker-82 and then ingress-4 and -5
[10:34:26] is tools-harbor seems down? there's an alert and I cannot log in
[10:34:42] s/seems //
[10:34:54] dcaro: ^
[10:35:17] sorry missed the message from dcaro just above :)
[10:36:13] still looking for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/993693 btw
[10:38:38] hmm, it's strange that it does not let me log in :/
[10:48:49] interesting, took all harbor down (docker-compose down), re-ran the prepare script, and brought it up and it seemed to do the trick
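The Harbor recovery described above roughly amounts to the sequence below; the install directory is a guess and may differ on tools-harbor, so treat this as a sketch of the steps mentioned rather than the exact commands that were run:

    cd /srv/harbor          # hypothetical install directory, adjust to wherever the Harbor compose files live
    docker-compose down     # stop all Harbor containers
    ./prepare               # regenerate the runtime config from harbor.yml
    docker-compose up -d    # bring Harbor back up in the background

The prepare step is the interesting part: it rebuilds Harbor's generated configuration, which may be what cleared the login problem after the cache-cleanup restart.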
[11:01:06] dhinus: did you have any exim specialists to ask for reviews in mind?
[11:01:39] not really :)
[11:01:58] hmm
[11:02:17] I'm not sure who's worked with exim in the past
[11:02:54] I was hoping we could identify someone either in the team or outside... but if we can't I'm fine with merging :)
[11:04:50] I added Keith and Jesse since they seem to have touched the prod exim config, I'll just merge this evening if we don't get a response
[11:06:05] sgtm!
[12:43:00] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/21
[12:43:12] just copy-pasted the scripts from the other clis
[12:51:12] dcaro: left a comment
[15:14:17] I'm restarting harbor, some alerts might trigger
[15:42:19] * bd808 yawns and waves
[15:42:52] my brain has no idea what timezone it is in. UTC+12? UTC-7? :shrug:
[16:12:50] \o welcome back!
[16:17:58] 🌴 🏖️
[17:01:23] taavi: for the wmcs-cookbooks repo the Gerrit setting "submit type" was "fast-forward only". I've changed it to "rebase if necessary", which is what we have in operations/puppet.
[17:01:42] oh that should do it. thanks.
[17:03:18] dcaro: is there a reason you didn't open an MR for https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/tree/allow_kind_pull_from_harbor?ref_type=heads? (in my mind this had been merged so I was very, very confused for a bit, but I think we just merged one branch into another but never to main?)
[17:05:45] it's here https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/93
[17:06:26] it was merged
[17:07:10] and it's in main: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/commit/954b7e72c214d7d7fbd4d48ffa7622ca4f3a973f
[17:07:23] not sure why it did not delete the branch :/
[17:08:06] I think there's an extra commit on that branch, weird
[17:08:15] might have re-pushed to it somehow
[17:09:02] just deleted it
[17:12:26] hmm, I was just trying to test dhinu.s changes and the changes from that branch don't seem to be there
[17:15:44] oh, probably needs a rebase (dhinu.s branch)
[17:17:33] nope, I just checked and my branch is up to date
[17:19:05] ok, might be my local environment then
[17:26:23] hmmm....
[18:05:32] * dcaro off
[18:31:18] andrewbogott: if you're planning to remove many more k8s workers, we will need to provision matching capacity in new nodes
[18:31:59] Yep. I think I'm done for now, I removed six workers.
[18:32:11] do you want to add the new ones? You probably have the command already in your recent bash history ")
[18:32:19] um... not sure what ") means
[18:33:26] ok, these are smaller nodes so I'll add 3 new larger ones
[18:33:36] the command is `cookbook wmcs.toolforge.add_k8s_node --cluster-name tools --role worker_nfs` ftr
[18:37:44] thanks!
[18:53:37] has anyone seen this certificate error before? https://phabricator.wikimedia.org/P55900 that did not go away after trying to remove and re-create that instance
[19:02:07] probably you need 'cert clean' on the puppetmaster
[19:02:12] Not sure how it got into that state though
[19:04:21] * bd808 lunch
[19:17:05] i think this is a race condition with how the image gets accessed. https://gerrit.wikimedia.org/r/c/operations/puppet/+/992677 and an image rebuild should fix it.
[19:33:20] that patch looks right but I don't yet understand how it affects puppet certs
[19:35:33] basically the cookbook thinks that the first puppet run is complete when it is not
[19:36:15] oh, the cookbook checks the cloud-init flag?
[19:36:22] then that makes sense
[19:37:01] yep, since the root key it uses is embedded in the base images now
[23:12:07] Could I get a +1 on https://phabricator.wikimedia.org/T356195
[23:30:05] Rook: I don't think taavi and I have much confidence that just moving from toolforge to a dedicated project will change the error they are having at https://github.com/dpriskorn/WikidataTopicCurator/issues/5. This seems like a wild guess by the developer.
[23:34:43] * bd808 comments on the task
[23:36:36] That's fine. Thanks for commenting on it
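On the race condition discussed around 19:36: a minimal sketch of the kind of readiness check involved, assuming the standard cloud-init completion marker; this is illustrative only and not the actual wmcs-cookbooks code:

    # cloud-init writes this marker once first boot finishes, which can happen
    # before the first full puppet run has completed, hence the race
    timeout 600 bash -c 'until [ -f /var/lib/cloud/instance/boot-finished ]; do sleep 10; done'
    # a stricter check would also wait for evidence of a completed puppet run,
    # e.g. a recent last_run_summary.yaml in puppet's state directory (exact path depends on packaging)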