[07:06:15] morning
[07:38:52] I think we have a k8s worker stuck on nfs \o/, more debugging
[07:49:31] morning
[08:14:37] dcaro: after the latest changes, do you want the api-gateway patch tested/reviewed, or are you fine with just merging it?
[08:14:51] o/
[08:15:35] blancadesal: would be nice if you review it too :)
[08:16:11] ok!
[08:46:19] Hello. Just FYI, I'm looking at the rsync_nginxlogs failures on clouddumpos100[1-2]. They are related to work I did with this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028866
[08:46:27] Rook: if you have a use for LBSaaSv2, please do file a task about it :-) I don't think anyone has planned to enable it so far, but that's mostly just because no-one has asked for it
[08:47:16] btullis: thanks! I started (but stopped because something else popped up) on task T364820, feel free to take it
[08:47:16] T364820: SystemdUnitDown - https://phabricator.wikimedia.org/T364820
[08:47:37] dcaro: Many thanks, I will.
[08:52:04] Nifty auto-created task.
[08:55:09] Any critical alert attached to team=wmcs will get a task created, you can configure that as a 'receiver' in the alertmanager config for whatever alerts you want :)
[09:28:25] I see maintain-kubeusers contains some prometheus counters, and is listening with an http server on tcp/9000. Do we use this for anything? for alerts maybe?
[09:31:13] I think that there's no alerts set for it yet
[09:32:11] we gather the metrics though
[09:38:23] ok, so if I change the metrics and/or the counters, I won't be breaking much
[09:39:05] I don't think so, no
[09:39:45] ok
[09:39:49] thanks!
[10:39:08] we have the toolforge check-in and the monthly meeting happening on the same day (today!), should we move one of them?
[10:39:24] (I'm not around next tuesday though)
[10:40:10] there's nothing in the monthly agenda though
[10:41:52] maybe we can move the monthly one?
[10:41:57] I propose having the monthly today, and reusing it as 'check-in' if there aren't many subjects to discuss, then moving the next monthly one week off (so from Jun 11th to Jun 18th)
[10:44:23] sounds ok, I'll only be able to attend about half the monthly today though
[10:45:54] hmm, then maybe do the check-in at the usual time, and a very short monthly (probably just say hi if nobody adds anything to the agenda)
[10:45:56] another option would be to push the weekly forward by one week (permanently), that too would de-sync them
[10:49:06] I'm not here next tuesday, but that works for me too
[10:55:26] Moved the check-in to next week's Wednesday, let me know if that does not work for someone
[11:06:04] thanks!
[11:49:45] arturo: topranks: proposed announcement for cloud-announce@, please review: https://etherpad.wikimedia.org/p/T364459
[11:53:30] 👀
[11:54:57] taavi: LGTM, fixed what I believe is a typo
[11:55:38] 🚢 🇮🇹
[11:58:39] lgtm too, but what's up with the italian titanic?
[12:01:08] "ship it"?
[12:06:27] yes! ^^U
[12:07:18] hahahaha
[12:07:46] looks good to me too
[12:09:15] sent
[12:12:17] ahaaahaa
[12:34:59] https://phabricator.wikimedia.org/T326436 would be the ticket for LBSaaSv2, it might need an additional tag for visibility
[12:58:57] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/281
[13:02:27] dcaro: left some comments
[13:06:14] thanks! fixed it :)
[13:07:18] usual question: is anyone interested in running the toolforge monthly meeting later today?
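For context on the maintain-kubeusers metrics mentioned at 09:28: with the prometheus_client library, exposing counters over an HTTP server on tcp/9000 usually looks roughly like the sketch below. This is a hypothetical illustration, not the actual maintain-kubeusers code; the counter name, label, and reconcile loop are invented for the example.

```python
# Hypothetical sketch of exposing Prometheus counters on tcp/9000 with
# prometheus_client; not the real maintain-kubeusers implementation.
import time

from prometheus_client import Counter, start_http_server

# Counter name and label are made up for illustration only.
USERS_PROCESSED = Counter(
    "maintain_kubeusers_users_processed_total",
    "Number of user accounts processed per reconciliation loop",
    ["result"],
)

def reconcile_once() -> None:
    # Real code would create/update kubeconfigs, namespaces, quotas, etc.
    USERS_PROCESSED.labels(result="ok").inc()

if __name__ == "__main__":
    start_http_server(9000)  # serves /metrics for Prometheus to scrape
    while True:
        reconcile_once()
        time.sleep(60)
```

Since no alerts query these series yet (per the 09:31 reply), renaming or reshaping counters like this mostly affects whatever dashboards or ad-hoc queries happen to use them.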
[13:12:16] I can run it for a change if nobody else wants
[13:15:54] quick fix: https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/19 missed one letter xd
[13:17:47] dcaro: approved
[13:17:58] thanks!
[13:30:20] \o/ we got the openapi joint def running on toolsbeta and tools
[13:30:40] \o/
[13:41:19] 🎉
[13:41:37] yay, created a silly app to show the openapi docs xd https://api-docs.toolforge.org/docs
[13:42:06] (requests don't work of course)
[13:43:29] that's still really cool!
[13:44:55] anyone want to investigate the NFS issues? (https://phabricator.wikimedia.org/T364822), I'm a bit busy and half out of ideas, if nobody wants to try I'll just reboot the host
[13:46:45] do we want to get rid of the /api in the jobs api (`/jobs/api/v1/images`), or add it to the other apis? right now we have `/builds/v1/build`, etc, same for envvars
[13:48:23] I think so yes, we might also want to prepend all routes with `/tool/`
[13:49:02] the main reason to remove it is that we will not be adding non-api urls there
[13:54:26] dhinus: is that one wiki replica view change still on your mind?
[13:55:23] Is someone using s3/swift/rados? there's a spike in the request time
[14:47:09] taavi: yes, I was writing a comment on the task just before the meeting
[14:48:09] I can do it tomorrow morning. I saw the migration patch has been merged, does that mean it's already applied to prod?
[14:57:33] taavi: I was writing in the task "I'll do it tomorrow morning, but it's not a blocker for the migration in prod", then I wondered if the migration in prod has already been done. are prod migrations applied straight after merging, or later?
[14:58:11] migrations on the production mediawiki databases are always applied by the DBAs, in general there's a separate task for doing that
[14:58:29] s/migrations/schema changes/
[14:58:51] ack. does my comment make sense to you? (not a blocker because of what you mentioned in the wikireplicas patch)
[14:59:26] iirc the commit that was merged is the one to add a script to move data away from that column, the column can only be dropped after that script has been merged, deployed and run
[15:00:23] ah right, I misinterpreted what that commit was about
[15:00:52] andrewbogott: I raised the alert for rgw, can you review/tell me if that's ok? https://gerrit.wikimedia.org/r/c/operations/alerts/+/1031494
[15:00:54] it doesn't seem to be that urgent to comment anything until the maintain-views change is deployed
[15:01:52] dcaro: I'll look eventually!
[15:02:26] taavi: thanks
[16:01:58] * arturo offline
[16:02:47] bd808: can I leave you the final say on this one? https://toolsadmin.wikimedia.org/tools/membership/status/1700
[16:03:38] dhinus: I think you can approve it. The naming still looks bad, but :shrug:
[16:05:01] ok thanks!
[16:07:05] I learned a thing about kubernetes namespaces and services yesterday. The DNS service makes it possible for something in namespace A to reach a service in namespace B by simply constructing the proper hostname. `http://${TOOL}.tool-${TOOL}.svc.tools.local:8000` is the URL pattern in our deployment to reach a `webservice`-managed Service
[16:10:20] I am getting an error message from `toolforge envvars create ...` saying "EnvvarsClientError: Error contacting k8s cluster.". Any tips on troubleshooting?
[16:10:40] `toolforge envvars list` is working from the same account
[16:15:16] I think I may have figured out the problem. It looks like this account is out of "secrets" quota
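A minimal sketch of how the cross-namespace Service DNS pattern mentioned at 16:07:05 can be used from inside the cluster. The URL pattern comes from the chat; the tool name, the /healthz path, and the use of the requests library are assumptions for illustration, and the hostname only resolves from pods running inside the Kubernetes cluster.

```python
# Minimal sketch of reaching another tool's webservice-managed Service
# through cluster DNS, using the URL pattern quoted above.
import requests  # assumes the requests library is available in the image

TOOL = "some-tool"  # hypothetical target tool name

# Pattern: <service>.<namespace>.svc.<cluster-domain>:<port>; webservice
# Services are named after the tool and live in the tool-<name> namespace.
url = f"http://{TOOL}.tool-{TOOL}.svc.tools.local:8000/healthz"

resp = requests.get(url, timeout=5)
resp.raise_for_status()
print(resp.text)
```

Because this goes through DNS rather than an injected IP, it keeps working if the target Service is recreated (modulo normal DNS caching).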
[16:15:21] * bd808 will file a bug
[16:17:53] yes please, we should send back a nicer message
[16:18:05] T364878
[16:18:07] T364878: Running out of "secrets" quota (envvars) produces unhelpful error message from `toolforge envvars create` - https://phabricator.wikimedia.org/T364878
[16:19:44] 30 envvars is maybe too low of a limit too. I've been meaning to suggest raising the services limit for everyone, maybe I should also suggest a higher secrets limit
[16:19:46] bd808: you already have an environment variable populated by k8s with the name _SERVICE_* https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/71#note_81786
[16:20:23] bd808: +1 to raising the limit, the only reason it's not done is because nobody requested it yet and we are very bad at guessing future usage xd
[16:23:33] hmm, though those have the IPs directly, not the dns-resolved name, so you'll have to restart the pod when the service gets restarted (though you'd also have to force-reload the connection if the ip changes when using DNS, as it's usually cached)
[16:24:36] dcaro: the _SERVICE_{HOST,PORT} thing is good to know. For this particular service that knowledge doesn't help me. I am already connecting to things in the same namespace using the DNS-exposed Service mappings.
[16:25:29] The tool that has hit the secrets limit is a testing deployment of a 12-factor thing that needs to override many defaults to build a test deployment.
[16:25:36] 👍
[16:41:21] gtg, but there's the default quota increase: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/24 (might be back around later to test/merge it if anyone reviewed)
[16:54:46] Is there anything going on with volumes in codfw? Since yesterday they seem to be having issues detaching from magnum cluster nodes
[17:53:43] Rook: not that I know of, but a few things broke when I replaced cloudcontrol2001-dev. I can have a look, what's a specific example?
[17:55:30] 7bcba237-5c73-4d92-a9ae-528434da2258 is, mostly, no longer appearing in horizon but still appears in the cli. It's attached to a cluster which has also mostly been removed, but still has bits about
[17:57:56] ok, I'll look in a few
[17:58:22] Thanks!
[18:02:50] ideally that volume would be deleted entirely, right?
[18:03:23] yes
[18:20:53] this is producing an entirely new log message that I've never seen before :(
[18:21:00] heh
[18:59:22] Rook: inelegant, but it seems to work to do 'openstack volume set --detached --state available <volume>' and then delete the volume.
[18:59:24] Are there others?
[19:00:03] hmm... I could try that. Let me see if I can remake one
[19:01:02] Yes, I believe there are some others: 7e2a4701-df98-4482-adbc-16a8b92acde7, and maybe d21cd3f7-a5c4-4f2e-bb89-eebde2c3b6be, though that one usually doesn't end up deleted. Still, the first one is listed as deleting rather than deleted. Is it gone from your view?
[19:01:58] uhoh
[19:04:05] guess I will go back to the logfiles
[19:51:36] Rook: I've deleted those three volumes and I think I've fixed the underlying issue (which is that the volumes were still vaguely associated with cloudcontrol2001-dev, which doesn't exist anymore.)
[19:51:47] please let me know if that happens again or if I've left things sitting around still
[19:52:16] Will do, thanks!
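For reference, the inelegant-but-working cleanup quoted at 18:59:22 can be scripted when several volumes are stuck attached to already-deleted Magnum nodes. The sketch below just wraps those same two openstack CLI calls; it assumes admin credentials are already loaded in the environment, and the volume ID is a placeholder.

```python
# Sketch of the "reset state, then delete" workaround for a cinder volume
# stuck attached to a node that no longer exists. Drives the openstack CLI
# via subprocess; assumes credentials are sourced (e.g. via clouds.yaml).
import subprocess

STUCK_VOLUME = "00000000-0000-0000-0000-000000000000"  # placeholder ID

def openstack(*args: str) -> None:
    """Run an openstack CLI command, raising if it exits non-zero."""
    subprocess.run(["openstack", *args], check=True)

# Force the volume back to a detached/available state so cinder will
# accept the delete, then delete it.
openstack("volume", "set", "--detached", "--state", "available", STUCK_VOLUME)
openstack("volume", "delete", STUCK_VOLUME)
```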