[07:06:15] morning
[07:38:52] I think we have a k8s worker stuck on nfs \o/, more debugging
[07:49:31] morning
[08:14:37] dcaro: after the latest changes, do you want the api-gateway patch tested/reviewed, or are you fine with just merging it?
[08:14:51] o/
[08:15:35] blancadesal: would be nice if you review it too :)
[08:16:11] ok!
[08:46:19] Hello. Just FYI, I'm looking at the rsync_nginxlogs failures on clouddumpos100[1-2]. They are related to work I did with this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028866
[08:46:27] Rook: if you have a use for LBSaaSv2, please do file a task about it :-) I don't think anyone has planned to enable it so far, but that's mostly just because no-one has asked for it
[08:47:16] btullis: thanks! I started (but stopped because something else popped up) on task T364820, feel free to take it
[08:47:16] T364820: SystemdUnitDown - https://phabricator.wikimedia.org/T364820
[08:47:37] dcaro: Many thanks, I will.
[08:52:04] Nifty auto-created task.
[08:55:09] Any critical alert attached to team=wmcs will get a task created, you can configure that as a 'receiver' in the alertmanager config for whatever alerts you want :)
[09:28:25] I see maintain-kubeusers contains some prometheus counters, and is listening with an http server on tcp/9000. Do we use this for anything? for alerts maybe?
[09:31:13] I think that there's no alerts set for it yet
[09:32:11] we gather the metrics though
[09:38:23] ok, so if I change the metrics and/or the counters, I won't be breaking much
[09:39:05] I don't think so, no
[09:39:45] ok
[09:39:49] thanks!
[10:39:08] we have the toolforge check-in and the monthly meeting happening on the same day (today!), should we move one of them?
[10:39:24] (I'm not around next tuesday though)
[10:40:10] there's nothing in the monthly agenda though
[10:41:52] maybe we can move the monthly one?
[10:41:57] I propose having the monthly today, and reusing it as 'check-in' if there aren't many subjects to discuss, then moving the next monthly one week off (so from Jun 11th to Jun 18th)
[10:44:23] sounds ok, I'll only be able to attend about half the monthly today though
[10:45:54] hmm, then maybe do the check-in at the usual time, and a very short monthly (probably just say hi if nobody adds anything to the agenda)
[10:45:56] another option would be to push the weekly forward by one week (permanently), that too would de-sync them
[10:49:06] I'm not here next tuesday, but that works for me too
[10:55:26] Moved the check-in to next week's Wednesday, let me know if that does not work for someone
[11:06:04] thanks!
[11:49:45] arturo: topranks: proposed announcement for cloud-announce@, please review: https://etherpad.wikimedia.org/p/T364459
[11:53:30] 👀
[11:54:57] taavi: LGTM, fixed what I believe is a typo
[11:55:38] 🚢 🇮🇹
[11:58:39] lgtm too, but what's up with the italian titanic?
[12:01:08] "ship it"?
[12:06:27] yes! ^^U
[12:07:18] hahahaha
[12:07:46] looks good to me too
[12:09:15] sent
[12:12:17] ahaaahaa
[12:34:59] https://phabricator.wikimedia.org/T326436 would be the ticket for LBSaaSv2, it might need an additional tag for visibility
[12:58:57] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/281
[13:02:27] dcaro: left some comments
[13:06:14] thanks! fixed it :)
[13:07:18] usual question: is anyone interested in running the toolforge monthly meeting later today?
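For context on the maintain-kubeusers metrics mentioned at 09:28: with the prometheus_client library, exposing counters over an HTTP server on tcp/9000 usually looks roughly like the sketch below. This is a hypothetical illustration, not the actual maintain-kubeusers code; the counter name, label, and reconcile loop are invented for the example.

```python
# Hypothetical sketch of exposing Prometheus counters on tcp/9000 with
# prometheus_client; not the real maintain-kubeusers implementation.
import time

from prometheus_client import Counter, start_http_server

# Counter name and label are made up for illustration only.
USERS_PROCESSED = Counter(
    "maintain_kubeusers_users_processed_total",
    "Number of user accounts processed per reconciliation loop",
    ["result"],
)

def reconcile_once() -> None:
    # Real code would create/update kubeconfigs, namespaces, quotas, etc.
    USERS_PROCESSED.labels(result="ok").inc()

if __name__ == "__main__":
    start_http_server(9000)  # serves /metrics for Prometheus to scrape
    while True:
        reconcile_once()
        time.sleep(60)
```

Since no alerts query these series yet (per the 09:31 reply), renaming or reshaping counters like this mostly affects whatever dashboards or ad-hoc queries happen to use them.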
[13:12:16] I can run it for a change if nobody else wants
[13:15:54] quick fix: https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/19 missed one letter xd
[13:17:47] dcaro: approved
[13:17:58] thanks!
[13:30:20] \o/ we got the openapi joint def running on toolsbeta and tools
[13:30:40] \o/
[13:41:19] 🎉
[13:41:37] yay, created a silly app to show the openapi docs xd https://api-docs.toolforge.org/docs
[13:42:06] (requests don't work of course)
[13:43:29] that's still really cool!
[13:44:55] anyone want to investigate the NFS issues? (https://phabricator.wikimedia.org/T364822), I'm a bit busy and half out of ideas, if nobody wants to try I'll just reboot the host
[13:46:45] do we want to get rid of the /api in the jobs api (`/jobs/api/v1/images`), or add it to the other apis? right now we have `/builds/v1/build`, etc, same for envvars
[13:48:23] I think so yes, we might also want to prepend all routes with `/tool/`
[13:49:02] the main reason to remove it is that we will not be adding non-api urls there
[13:54:26] dhinus: is that one wiki replica view change still on your mind?
[13:55:23] Is someone using s3/swift/rados? there's a spike in the request time
[14:47:09] taavi: yes, I was writing a comment on the task just before the meeting
[14:48:09] I can do it tomorrow morning. I saw the migration patch has been merged, does that mean it's already applied to prod?
[14:57:33] taavi: I was writing in the task "I'll do it tomorrow morning, but it's not a blocker for the migration in prod", then I wondered if the migration in prod has already been done. are prod migrations applied straight after merging, or later?
[14:58:11] migrations on the production mediawiki databases are always applied by the DBAs, in general there's a separate task for doing that
[14:58:29] s/migrations/schema changes/
[14:58:51] ack. does my comment make sense to you? (not a blocker because of what you mentioned in the wikireplicas patch)
[14:59:26] iirc the commit that was merged is the one to add a script to move data away from that column, the column can only be dropped after that script has been merged, deployed and run
[15:00:23] ah right, I misinterpreted what that commit was about
[15:00:52] andrewbogott: I raised the alert for rgw, can you review/tell me if that's ok? https://gerrit.wikimedia.org/r/c/operations/alerts/+/1031494
[15:00:54] it doesn't seem to be that urgent to comment anything until the maintain-views change is deployed
[15:01:52] dcaro: I'll look eventually!
[15:02:26] taavi: thanks
[16:01:58] * arturo offline
[16:02:47] bd808: can I leave you the final say on this one? https://toolsadmin.wikimedia.org/tools/membership/status/1700
[16:03:38] dhinus: I think you can approve it. The naming still looks bad, but :shrug:
[16:05:01] ok thanks!
[16:07:05] I learned a thing about kubernetes namespaces and services yesterday. The DNS service makes it possible for something in namespace A to reach a service in namespace B by simply constructing the proper hostname. `http://${TOOL}.tool-${TOOL}.svc.tools.local:8000` is the URL pattern in our deployment to reach a `webservice`-managed Service
[16:10:20] I am getting an error message from `toolforge envvars create ...` saying "EnvvarsClientError: Error contacting k8s cluster.". Any tips on troubleshooting?
[16:10:40] `toolforge envvars list` is working from the same account
[16:15:16] I think I may have figured out the problem. It looks like this account is out of "secrets" quota
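A minimal sketch of how the cross-namespace Service DNS pattern mentioned at 16:07:05 can be used from inside the cluster. The URL pattern comes from the chat; the tool name, the /healthz path, and the use of the requests library are assumptions for illustration, and the hostname only resolves from pods running inside the Kubernetes cluster.

```python
# Minimal sketch of reaching another tool's webservice-managed Service
# through cluster DNS, using the URL pattern quoted above.
import requests  # assumes the requests library is available in the image

TOOL = "some-tool"  # hypothetical target tool name

# Pattern: <service>.<namespace>.svc.<cluster-domain>:<port>; webservice
# Services are named after the tool and live in the tool-<name> namespace.
url = f"http://{TOOL}.tool-{TOOL}.svc.tools.local:8000/healthz"

resp = requests.get(url, timeout=5)
resp.raise_for_status()
print(resp.text)
```

Because this goes through DNS rather than an injected IP, it keeps working if the target Service is recreated (modulo normal DNS caching).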
[16:15:21] * bd808 will file a bug
[16:17:53] yes please, we should send back a nicer message
[16:18:05] T364878
[16:18:07] T364878: Running out of "secrets" quota (envvars) produces unhelpful error message from `toolforge envvars create` - https://phabricator.wikimedia.org/T364878
[16:19:44] 30 envvars is maybe too low of a limit too. I've been meaning to suggest raising the services limit for everyone, maybe I should also suggest a higher secrets limit
[16:19:46] bd808: you already have an environment variable populated by k8s with the name _SERVICE_* https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/71#note_81786
[16:20:23] bd808: +1 to raising the limit, the only reason it's not done is because nobody requested it yet and we are very bad at guessing future usage xd
[16:23:33] hmm, though those have the IPs directly, not the dns-resolved name, so you'll have to restart the pod when the service gets restarted (though you'd also have to force-reload the connection if the ip changes when using DNS, as it's usually cached)
[16:24:36] dcaro: the _SERVICE_{HOST,PORT} thing is good to know. For this particular service that knowledge doesn't help me. I am already connecting to things in the same namespace using the DNS-exposed Service mappings.
[16:25:29] The tool that has hit the secrets limit is a testing deployment of a 12-factor thing that needs to override many defaults to build a test deployment.
[16:25:36] 👍
[16:41:21] gtg, but there's the default quota increase: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/24 (might be back around later to test/merge it if anyone reviewed)
[16:54:46] Is there anything going on with volumes in codfw? Since yesterday they seem to be having issues detaching from magnum cluster nodes
[17:53:43] Rook: not that I know of, but a few things broke when I replaced cloudcontrol2001-dev. I can have a look, what's a specific example?
[17:55:30] 7bcba237-5c73-4d92-a9ae-528434da2258 is, mostly, no longer appearing in horizon but still appears in the cli. It's attached to a cluster which has also mostly been removed, but still has bits about
[17:57:56] ok, I'll look in a few
[17:58:22] Thanks!
[18:02:50] ideally that volume would be deleted entirely, right?
[18:03:23] yes
[18:20:53] this is producing an entirely new log message that I've never seen before :(
[18:21:00] heh
[18:59:22] Rook: inelegant, but it seems to work to do 'openstack volume set --detached --state available <volume>' and then delete the volume.
[18:59:24] Are there others?
[19:00:03] hmm... I could try that. Let me see if I can remake one
[19:01:02] Yes, I believe there are some others: 7e2a4701-df98-4482-adbc-16a8b92acde7, and maybe d21cd3f7-a5c4-4f2e-bb89-eebde2c3b6be, though that one usually doesn't end up deleted. Still, the first one is listed as deleting rather than deleted. Is it gone from your view?
[19:01:58] uhoh
[19:04:05] guess I will go back to the logfiles
[19:51:36] Rook: I've deleted those three volumes and I think I've fixed the underlying issue (which is that the volumes were still vaguely associated with cloudcontrol2001-dev, which doesn't exist anymore.)
[19:51:47] please let me know if that happens again or if I've left things sitting around still
[19:52:16] Will do, thanks!
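For reference, the inelegant-but-working cleanup quoted at 18:59:22 can be scripted when several volumes are stuck attached to already-deleted Magnum nodes. The sketch below just wraps those same two openstack CLI calls; it assumes admin credentials are already loaded in the environment, and the volume ID is a placeholder.

```python
# Sketch of the "reset state, then delete" workaround for a cinder volume
# stuck attached to a node that no longer exists. Drives the openstack CLI
# via subprocess; assumes credentials are sourced (e.g. via clouds.yaml).
import subprocess

STUCK_VOLUME = "00000000-0000-0000-0000-000000000000"  # placeholder ID

def openstack(*args: str) -> None:
    """Run an openstack CLI command, raising if it exits non-zero."""
    subprocess.run(["openstack", *args], check=True)

# Force the volume back to a detached/available state so cinder will
# accept the delete, then delete it.
openstack("volume", "set", "--detached", "--state", "available", STUCK_VOLUME)
openstack("volume", "delete", STUCK_VOLUME)
```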