[00:05:21] <RichSmith>	 https://grid-deprecation.toolforge.org/t/cluebotng-review shows me as being a member, but... richs@tools-sgebastion-10:~$ become cluebotng-review
[00:05:22] <RichSmith>	 You are not a member of the group tools.cluebotng-review.
[00:05:22] <RichSmith>	 Any existing member of the tool's group can add you to that.
[00:34:28] <autom-frwikt>	 Hey. Would someone know why I get "Your account is pending approval from your GitLab administrator and hence blocked. Please contact your GitLab administrator if you think this is an error." when trying to sign in to gitlab? (at https://gitlab.wikimedia.org/users/sign_in )
[00:34:43] <autom-frwikt>	 (it's been 4 days)
[00:49:53] <bd808>	 autom-frwikt: did you file a Phabricator task asking to have your GitLab account approved? You should be able to login, but then see a banner linking to https://phabricator.wikimedia.org/maniphest/task/edit/form/117/ to request someone review and approve your Developer account to use GitLab.
[00:50:35] <bd808>	 This is lame anti-spam stuff we have implemented because GitLab has poor built-in tools for defending against vandalism.
[00:52:02] <bd808>	 RichSmith: hmmm... it looks to me like you should have inherited membership in https://toolsadmin.wikimedia.org/tools/id/cluebotng-review via https://toolsadmin.wikimedia.org/tools/id/cluebotng
[00:52:37] <RichSmith>	 bd808: Yea, that's what I thought
[00:53:36] * bd808 stares at the code in `become`
[00:55:46] <bd808>	 I wonder if we've messed this up at some point? It isn't obvious to me how `become` would be looking up indirect membership via another tool.
[00:57:37] <RichSmith>	 Heh
[00:58:01] <bd808>	 heh. the part of the `become` script that does the group lookup hasn't changed since 2014 so if it is messed up that's surprising.
[00:59:24] <bd808>	 Oh! I know how it works. RichSmith, you have to first `become cluebotng` and then as that user `become cluebotng-review`.
[00:59:36] <RichSmith>	 Oh, okay
[01:00:03] <bd808>	 Its clunky, but that's how it has always worked apparently
[01:09:43] <autom-frwikt>	 bd808: thanks for the tip, I filed a new task. (The banner was missing a link to that form)
[02:09:14] <legoktm>	 I know there's some process limiting on the bastions - is that per user or per tool?
[02:20:19] <gifti>	 is there a way to see if any of your tools still use the grid? a webinterface maybe?
[02:27:32] <legoktm>	 yes one second
[02:27:44] <legoktm>	 gifti: https://grid-deprecation.toolforge.org/
[02:27:56] <gifti>	 thanks!
[04:17:20] <wm-bb>	 <AinieBaldu> BANTUAN RM1,000 DIKREDITKAN DALAM MYKAD PENERIMA B40: INI CARA MENGGUNAKANNYA
[04:17:21] <wm-bb>	 <AinieBaldu> 
[04:17:22] <wm-bb>	 <AinieBaldu> Baca Lanjut👇
[04:17:24] <wm-bb>	 <AinieBaldu> 
[04:17:25] <wm-bb>	 <AinieBaldu> https://bitly.ws/33zUp
[04:17:27] <wm-bb>	 <AinieBaldu> 
[04:17:28] <wm-bb>	 <AinieBaldu> #bantuankerajaan
[08:58:59] <taavi>	 legoktm: the grid bastion process limits are per user, you get assigned a systemd slice when logging in via ssh
[10:24:15] <proc>	 Is there some kind of issue with Horizon applicationcredential auth?
[10:24:51] <proc>	 I have a set of creds in a file on my bastion, they have no expiry, and I've always just ran `source creds.sh` then `terraform plan` etc just fine. Today I'm getting a `Error creating OpenStack container infra client: Authentication failed`
[11:05:02] <dhinus>	 proc: I upgraded OpenStack yesterday, so it might be related. unfortunately I don't have old appcreds I can use to test if they still work. have you tried creating new ones?
[11:08:43] <stw>	 Yeah, I just tried with old appcreds too and it looks like I'm getting the same issue. I'll try new ones later.
[11:18:47] <stw>	 OK, new credentials seem to get past that error for me.
[11:20:02] <stw>	 That said, the web proxy API now appears to be throwing an Internal Server Error
[11:23:02] <dhinus>	 stw: looking
[11:39:10] <dhinus>	 stw: I seem to be able to list, create and delete web proxies. are you still getting errors?
[11:39:38] <dhinus>	 and are you doing it from horizon, cli or terraform?
[11:46:40] <stw>	 I'm doing it from Terraform, and indeed it appears to be working now - not sure what happened. :/
[11:47:59] <stw>	 Huh, looks like in my tests, it actually failed, succeeded, failed again, then all subsequent attempts succeeded.
[11:50:03] <stw>	 https://stwalkerster.co.uk/workspace/Screenshot_20231201_114832.png was the error I was getting, though it's fairly non-specific.
[11:55:03] <taavi>	 !log admin restart neutron-rpc-server.service on eqiad1 cloudcontrols
[11:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:54:23] <proc>	 dhinus: will generate a new pair
[12:57:32] <proc>	 dhinus: it doesn't seem to let me create new credentials. I just see the "working..." spinner then eventually: Danger: There was an error submitting the form. Please try again.
[12:57:41] <proc>	 so I guess it's timing out
[14:33:35] <dhinus>	 proc: we're having some issues on openstack that might explain the error. I'll let you know when these are resolved so you can retry
[14:42:15] <proc>	 ok, thanks!
[15:10:30] <taavi>	 we should be back
[15:15:34] <wm-bot>	 !log tools.bridgebot <lucaswerkmeister> Double IRC messages to other bridges
[15:18:52] <lucaswerkmeister>	 testing bridgebot…
[15:18:56] <wm-bb>	 <lucaswerkmeister> ok
[15:19:25] <lucaswerkmeister>	 !log tools.bridgebot Double IRC messages to other bridges
[15:19:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[15:19:29] <lucaswerkmeister>	 (is stashbot around now?)
[15:19:30] <lucaswerkmeister>	 yay
[15:20:39] <wm-bb>	 <lucaswerkmeister> (for the Telegram side: we lost a couple of bridged messages from IRC due to a network issue which also affected the bridgebot but seems to be resolved now)
[15:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[15:26:41] <dhinus>	 !status mostly OK after a network outage
[15:30:52] <wm-bot>	 !log tools.stewardbots <anticomposite> SULWatcher/manage.sh restart # SULWatchers not restarting on their own
[15:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[15:48:52] <andrewbogott>	 !log reimaging cloudcontrol1005 due to widespread misbehavior
[15:48:54] <stashbot>	 andrewbogott: Unknown project "reimaging"
[15:48:59] <andrewbogott>	 !log admin reimaging cloudcontrol1005 due to widespread misbehavior
[15:49:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:10:12] <legoktm>	 taavi: gotcha thanks (I was deploying to multiple tools in parallel and seeing if increasing the parallelism would make it faster, answer: nope)
[16:13:19] <dhinus>	 proc: can you retry creating credentials now?
[17:03:45] <bd808>	 legoktm: the main way you can gain more bastion quota is to fan out to other bastions. :)
[19:12:11] <wm-bot>	 !log tools.lexeme-forms <lucaswerkmeister> deployed 7acef657d0 (update Croation noun Wikifunctions)
[19:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[19:17:52] <wm-bot>	 !log tools.stewardbots <superpes> Restarted StewardBot stucked on IRC
[19:17:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[19:19:54] <wm-bot>	 !log tools.lexeme-forms <lucaswerkmeister> deployed ba19a1cd5f (l10n updates: ja, sk, zh-hans)
[19:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[19:41:16] <proc>	 dhinus: Credential recreation worked. I can't seem to create the cluster though.
[19:41:26] <proc>	 cluster creation fails with: `CREATE_FAILED` and reason: `ERROR: Internal Error`
[19:41:51] <proc>	 cluster UID ac2e0261-c6f7-4fc9-934e-fe7959135b1a
[21:01:25] <AntiComposite>	 ruh roh, wmopbot just died and https://signatures.toolforge.org/ isn't loading either.
[21:03:40] <wm-bb>	 <lucaswerkmeister> https://lexeme-forms.toolforge.org/ still loads for me, so I don’t think all of Toolforge is down yet
[21:04:57] <taavi>	 yeah, looks like ceph was unhappy for a few moments. ceph itself is back to being happy but I fear NFS clients will not
[21:10:35] <taavi>	 at least signatures recovered itself.. let's see if wmopbot does too
[21:10:44] <taavi>	 is anyone seeing issues with other tools?
[21:22:20] <andrewbogott>	 !log tools rebooting tools-sgeweblight-10-[18,21,32].tools.eqiad1.wikimedia.cloud to recover from nfs lockup
[21:22:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:18:15] <danilo>	 I got the following mensage when using 'kubectl describe pod ...' for wmopbot: "Events: ... Warning  FailedScheduling  4m25s (x70 over 23m)  default-scheduler  0/64 nodes are available ..."
[23:27:36] <bd808>	 danilo: There was a ceph storage blip that messed up NFS so that the kubernetes nodes needed to be restarted. I think what you are seeing there was a side effect of the restart script that t.aavi ran to do the restarts. I was just able to schedule a new pod and `kubectl get node` is now showing all nodes as "Ready".
[23:30:24] * bd808 sees that pod is still pending and looks closer
[23:31:27] <bd808>	 Looks like the problem is "58 Insufficient cpu." How many cores is that pod trying to get?
[23:33:15] <bd808>	 Looks like the default limits. This isn't making sense to me yet. The pod is asking for cpu:250m which is tiny.
[23:36:43] <bd808>	 danilo: I am going to delete that pending pod and see if anything different happens when the ReplicaSet tries to recreate it.
[23:37:39] <bd808>	 huh. same issue.
[23:37:43] <danilo>	 ok, feel free to delete, restart or do anythink that can help to solve the problem
[23:39:34] <bd808>	 "0/64 nodes are available: 3 node(s) had taint {ingressgen2: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 58 Insufficient cpu." is the FailedScheduling description for those following along at home.
[23:41:00] <taavi>	 bd808: hm. I wonder if we're genuinely running out of capacity with all of the grid migrations going on
[23:42:23] <bd808>	 taavi: it's possible I guess. There are 20-40 pods running on each worker node right now
[23:42:26] <taavi>	 uh wat. https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1&forceLogin&from=now-2d&to=now&viewPanel=36
[23:43:27] <taavi>	 that's a very rapid increase in the total CPU requests starting at around 21 UTC
[23:45:24] <AntiComposite>	 interesting that memory requests go down and cpu requests go up until 23:00, at which they smoothly go up at similar rates
[23:45:51] * bd808 waits for someone to have the ah! moment
[23:46:52] <taavi>	 I added a line with the max capacity of the cluster, and it seems we've been very close to that very soon
[23:47:33] <taavi>	 s/very soon/for a while/. dero
[23:47:37] <taavi>	 derp*
[23:48:10] <bd808>	 taavi: do we still have a script to crank out new worker instances?
[23:48:29] <taavi>	 we have a cookbook. I'll start it, we clearly need more regardless of the cause
[23:48:43] <AntiComposite>	 https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1&forceLogin&from=now-6h&to=now&viewPanel=36 (now-6h) seems to tell a different story than now-2d at least between 21:00 and 23:00
[23:49:16] <AntiComposite>	 in that there was a massive spike at 22:40
[23:51:07] <AntiComposite>	 the overall rate of rise is the same though