[00:52:38] * bd808 off
[04:14:40] <andrewbogott>	 The central cloud-vps puppetserver remains mostly broken. If someone is feeling excited about certs and pki please have a look at T361772, otherwise I'll resume after I get some sleep.
[04:14:40] <stashbot>	 T361772: Expired cert failure on cloudinfra-cloudvps-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T361772
[08:15:08] <arturo>	 o/
[08:22:53] <blancadesal>	 morning
[08:23:36] <arturo>	 what the heck
[08:23:38] <arturo>	 https://www.irccloud.com/pastebin/HNQGsrVN/
[08:23:52] <arturo>	 why would a puppet client use the wrong CN in the certificate?
[08:44:17] <arturo>	 I'm giving a try to T361772
[08:44:17] <stashbot>	 T361772: Expired cert failure on cloudinfra-cloudvps-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T361772
[08:49:16] <dcaro>	 arturo: I think puppet uses it to identify the node and (you can) decide what things to allow or not (ex. auto-approve all certs with cn *.admin-monitoring.eqiad1.wikimedia.cloud), and well, to revoke them too xd
[08:49:56] <dcaro>	 oh wait, *the wrong* cn
[08:50:03] <arturo>	 but canary VMs are not in the admin-monitoring project even
[08:50:09] <dcaro>	 that's probably copy-pasted, is it using a pre-built image?
[08:50:23] <dcaro>	 (as in, the cert has been copy-pasted)
[08:50:48] <arturo>	 I don't think so, just a normal VM created by normal means, but I could be wrong
[08:51:09] <dcaro>	 puppetmasters had been carrying around the original puppetmaster ca cert for example (the cn was something like puppetmaster-1.... on some of them)
[08:52:39] <dcaro>	 2021-01-12 01:22:48 is some time ago xd
[08:52:52] <dcaro>	 (from the CN name)
[08:53:31] <dcaro>	 hmm, I suspect that as we did a cleanup of certs, that old cert might have been in the stack to clean up
[08:53:38] <dcaro>	 and now it's revoked, so that's why it started failing
[08:54:25] <arturo>	 the question stands: why would canary-something use a cert from fullstack-something. Two VMs in two different projects. Your theory is that the client cert was put there by hand by someone?
[08:54:48] <dcaro>	 maybe not by hand, but in the base image somehow
[08:54:55] <dcaro>	 or by the script that sets it up
[08:56:03] <arturo>	 ok, so a placeholder, then with the puppetserver that never got to work for real, the cert was never refreshed
[08:56:06] <arturo>	 that could make sense!
[09:07:00] <blancadesal>	 ugh, harbor robot permissions are still driving me crazy. There are discrepancies in the permissions that can be set via the UI vs the API. There are also discrepancies between the docs and reality. AND bugs introduced with 2.10
[09:07:54] <taavi>	 :(
[09:10:50] <taavi>	 huh. apparently metricsinfra-puppetserver-1 is out of indes on /srv???
[09:12:26] <dcaro>	 hmm, maybe puppet7 opens many files at the same time for some reason (that might be part of why it uses so much memory)
[09:13:35] <taavi>	 lots of '/srv/puppet_code/environments_staging/oot_branch_202404xxxxxx/' directories apparently?
[09:16:28] <dcaro>	 hmm, that sounds weird, looks like part of the code updating process? maybe the updating of the puppet code is leaving stuff around?
[09:17:37] <taavi>	 i think a branch with a name like that is used by the git updater script
[09:18:22] <taavi>	 I deleted those directories and branches, and it seems to work now
[09:18:49] <arturo>	 ok, I made some changes on cloudinfra-cloudvps-puppetserver-1, then the puppetserver daemon fails to start, failing to create /etc/puppet/puppetserver .... which exists already? 
[09:26:20] <dcaro>	 hmm, permissions?
[09:26:42] <arturo>	 they look good to me
[09:26:44] <arturo>	 https://www.irccloud.com/pastebin/p5FL80Qc/
[09:27:16] <taavi>	 the target of that symlink does not exist
[09:27:46] <arturo>	 I think I can fix that
[09:27:54] <arturo>	 I saved the ca/ file from an earlier movement
[09:28:00] <arturo>	 the ca/ dir*
[09:28:14] <taavi>	 please do restore that, we don't want to generate a new CA here I think
[09:28:45] <taavi>	 usually "Not using expired certificate for ca from cache" on the client means the client has an outdated copy somewhere - why are you messing with the server CA setup at all?
[09:28:48] <arturo>	 oh, per T361772 we may want to generate a new one
[09:28:49] <stashbot>	 T361772: Expired cert failure on cloudinfra-cloudvps-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T361772
[09:29:13] <arturo>	 this is what andrew was trying yesterday
[09:29:24] <arturo>	 (to generate a new CA)
[09:29:54] <taavi>	 I don't think we want to do that
[09:30:50] <arturo>	 ok
[09:31:55] <arturo>	 the CA I'm about to restore was generated yesterday anyway
[09:32:04] <arturo>	 https://www.irccloud.com/pastebin/YhYBI8Fk/
[09:33:41] <arturo>	 just restored it
[09:34:03] <taavi>	 are you saying the old CA is gone??
[09:34:26] <arturo>	 per the comments in the ticket by andrew yesterday, I believe he was trying to re-generate the CA
[09:34:35] <taavi>	 why??????????????????????????????
[09:34:46] <taavi>	 is there a backup of the old one somewhere??
[09:35:06] <dcaro>	 private git repo maybe?
[09:35:07] <arturo>	 maybe? hopefully!
[09:36:00] <taavi>	 seems like there is /srv/puppetbroken/server/sslbroken/ - I'm going to restore that
[09:36:01] <dcaro>	 it's ok, we'll sort it out in any case
[09:36:14] <arturo>	 https://www.irccloud.com/pastebin/QAjnMyGc/
[09:36:22] <arturo>	 taavi: that one is just from the day before :-P
[09:37:11] <arturo>	 could you share what is your concern regarding the re-generation?
[09:37:45] <arturo>	 isn't this covered by the cattle not pets principle? -- or is this just a stupid question
[09:39:55] <blancadesal>	 quick +1 here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016724 
[09:40:06] <taavi>	 because I don't want to replace the certs in all the certs in all the projects if I don't have to
[09:40:25] <arturo>	 last puppetserver error seems to be `Parent directory '/srv/puppet/server/ssl/ca' is not writable` which looks easier to understand
[09:41:01] <taavi>	 seriously though, to me it seems there's been many people messing with the puppet ca in the last few days without understanding how it is used nor documenting everything they've done, and now it's much more a mess than it could be
[09:44:06] <arturo>	 I think you are right
[09:44:34] <arturo>	 I'll write on T361772 what I did today
[09:44:35] <stashbot>	 T361772: Expired cert failure on cloudinfra-cloudvps-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T361772
[09:45:35] <taavi>	 I am going to go eat lunch. please no-one do anything on the puppetserver before I come back
[09:45:45] <arturo>	 ok
[09:46:39] <blancadesal>	 does this mean puppet is broken right now?
[09:48:21] <arturo>	 blancadesal: depends on the project. In tools/toolsbeta, most likely no
[09:49:05] <blancadesal>	 ok
[09:51:17] <blancadesal>	 dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016724  if you have a moment so I can go ahead with the harbor upgrade in tools
[09:51:57] <dcaro>	 blancadesal: yep
[09:52:32] <dcaro>	 blancadesal: how do you want to do it?
[09:52:54] <blancadesal>	 dcaro: in what sense?
[09:53:49] <dcaro>	 blancadesal: anything you want me to do besides being around and checking harbor logs/etc.?
[09:55:00] <blancadesal>	 if it's like last time, it should be painless, i.e. merge patch -> force puppet run -> run prepare script -> docker-compose down/up 🤞
[09:55:22] <dcaro>	 make sure to do some backups of the DB and the data, just in case
[09:55:30] <blancadesal>	 that's done
[09:56:20] <blancadesal>	 if you're around in case it goes sideways, I'll of course be grateful :)) 
[09:56:46] <dcaro>	 awesome, yes I'm here :)
[09:59:30] <dcaro>	 blancadesal: +1d the patch, oh, can you merge?
[10:00:37] <blancadesal>	 dcaro: don't think so, it also doesn't automerge on 2 x +1?
[10:00:51] <dcaro>	 nope, I'll merge for you :/
[10:00:55] <blancadesal>	 thanks
[10:01:50] <dcaro>	 you might want to disable puppet on the harbor instance to avoid it running it without you knowing (not a big issue, as it should not be really changing anything until you manually run the prepare)
[10:03:58] * taavi is back
[10:04:19] <arturo>	 taavi: that was quick. https://phabricator.wikimedia.org/T361772#9687731 contains the report of what I did today
[10:04:52] <arturo>	 I will log off the server and go work on other stuff unless you request assistance
[10:05:38] <blancadesal>	 hmm, is there a lag between merging and puppet picking it up? forcing a puppet run didn't change anything
[10:05:50] <taavi>	 thanks, that report is helpful
[10:05:58] <arturo>	 blancadesal: try  `git-sync-upstream` on the tools puppetserveer
[10:06:04] <dcaro>	 blancadesal: yes, that yes
[10:06:23] <taavi>	 arturo: unfortunately even the oldest backup I found on cloudinfra-cloudvps-puppetserver-1 had the CA cert generated yesterday evening
[10:06:43] <dcaro>	 blancadesal: it also has to be manually merged on puppetmaste1001 after submitting it in gerrit
[10:06:50] <arturo>	 taavi: ok. Also note the mess with CNs and altnames. You should check those too
[10:06:52] <taavi>	 I'm starting to suspect this mess is beyond repair
[10:06:54] <dcaro>	 (done)
[10:07:21] <blancadesal>	 dcaro: what do you mean by manually merged?
[10:07:48] <dcaro>	 wm-ssh puppetmaster1001 + sudo puppet-merge
[10:08:14] <dcaro>	 (by hand, it shows a diff of the changes that it will bring from grerrit, and you have to review + ack)
[10:08:32] <dcaro>	 as in, the gerrit repository and the actual puppetmaster code are not automatically synced
[10:08:50] <blancadesal>	 dcaro: by (done) do you mean you did it? :))
[10:09:20] <dcaro>	 moritzm did, but yes :)
[10:09:20] <arturo>	 I guess blancadesal doesn't have puppet-merge privilege because she doesn't have the SRE title
[10:09:31] <dcaro>	 yep, that's a long-standing issue :/
[10:11:14] <blancadesal>	 ok, now it seemed to work
[10:11:21] <arturo>	 taavi: I see in https://openstack-browser.toolforge.org/project/cloudinfra that the old general puppetmasters are still around. Maybe all the VMs still use them, and the blast radius of this mess is not that big after all?
[10:12:09] <arturo>	 well, `puppetmaster.cloudinfra.wmflabs.org` already points to the newer server... so nevermind
[10:14:19] <taavi>	 I commented https://phabricator.wikimedia.org/T361772#9687752 about the two main options forward that I see at the moment
[10:15:03] <taavi>	 also, there might be one more complication, which is that I'm not sure if puppet 5 clients can enroll to a new puppet 7 ca
[10:18:06] <blancadesal>	 harbor upgrade is done, seemingly without any issues
[10:18:09] <dcaro>	 that puppetserver cert is not in the private repo?
[10:18:14] <dcaro>	 blancadesal: \o/
[10:18:28] <arturo>	 dcaro: usually not, just the filesystem
[10:18:30] <dcaro>	 let's run a toolforge build + webservice to test
[10:18:37] <arturo>	 but I could be wrong
[10:18:46] <arturo>	 the priv repo usually host just priv hiera and little else
[10:18:51] <dcaro>	 arturo: :/, I remember having to copy stuff to the private repo before
[10:19:07] <dcaro>	 (when I refreshed the certs with puppet 5)
[10:19:12] <blancadesal>	 dcaro: build + docker pull went fine, i'll do webservice too
[10:19:21] <dcaro>	 \o/
[10:19:46] <arturo>	 dcaro: yeah, but I think always stuff used inside puppet manifests (code), not by the puppet server proc itself
[10:21:50] <dcaro>	 From the wiki ca cert renewal page https://www.irccloud.com/pastebin/3HzBqFJv/
[10:21:58] <dcaro>	 (not sure it applies for puppet 7 though)
[10:22:03] <blancadesal>	 dcaro: webservice went fine too
[10:22:21] <dcaro>	 blancadesal: awesome :)
[10:22:35] <taavi>	 so I restored cloudinfra-cloudvps-puppetserver-1:/etc/puppet/puppetserver/ca/ca_crt.pem from an instance, and now we're at the original problem andrew was looking at yesterday ("Info: Not using expired certificate for ca from cache; expired at 2024-04-03 14:54:05 UTC")
[10:22:37] <arturo>	 dcaro: oh! cc taavi 
[10:22:43] <blancadesal>	 I'll keep the backups around for a few days
[10:23:04] <dcaro>	 blancadesal: if they are not very big, we can keep them longer, until the next upgrade
[10:23:07] <dcaro>	 (just in case)
[10:23:52] <arturo>	 taavi: how did you restore the priv key of the CA?
[10:24:43] <taavi>	 i realized that the private key had not been changed when the cert itself was renewed
[10:24:56] <blancadesal>	 dcaro: 6G on toolsbeta and 28G on tools, iirc. there's enough space to keep them
[10:25:14] <dcaro>	 ack, we can leave them until the next upgrade then (or we start running out of space)
[10:27:07] <blancadesal>	 i should probably move the db dumps, they are in /tmp right now
[10:27:27] <dcaro>	 oh
[10:27:58] <dcaro>	 yes, move them to /srv/..., /tmp is tricky (gets removed on restart, defaults to everyone can read, might be all in memory, etc.)
[10:29:39] <dcaro>	 you can gzip them too if they are big (and not gzipped yet)
[10:30:47] <blancadesal>	 I gzipped the /data folder backup, the db dump is just 25M on tools
[10:31:04] <dcaro>	 nice, it was bigger before xd
[10:33:18] * dcaro rebuilding my lima-kilo env after all the goodies that were merged yesterday
[10:33:56] <dcaro>	 blancadesal: can you check also maintain-harbor if you have not? just see if it fails with the newer harbor
[10:34:16] <arturo>	 that reminds me that I should upload the new toolforge-jobs cli with the dumps function
[10:34:39] <blancadesal>	 dcaro: I haven't checked, will do after lunch
[10:34:52] <dcaro>	 arturo: it's deployed already I think (checked on the default bastion)
[10:35:02] <dcaro>	 blancadesal: ack, thanks
[10:36:18] <blancadesal>	 dcaro, arturo, fyi, I will leave the openapi stuff for next week. I want to get to the bottom of the harbor permissions issues first
[10:36:27] <taavi>	 question: where is the server certificate used by the puppetserver process located?
[10:39:32] <taavi>	 arturo: /srv/puppet/server/ssl/certs/cloudinfra-cloudvps-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud.pem was modified at 9:43 this morning - was that you?
[10:39:37] <dcaro>	 taavi: have you restarted the process lately? (just thinking that if not, it might not be there anymore)
[10:40:18] <taavi>	 that is now the wrong certificate unfortuanately, issued by the cloudinfra-internal CA and not the cloud-wide one
[10:40:26] <taavi>	 dcaro: I think I have, but let me try that again
[10:48:51] <arturo>	 taavi: I have no idea at this point. Everything I did is the phab comment I sent you earlier 
[10:48:54] <dcaro>	  /srv/puppet/server/ssl/certs/puppetmaster.cloudinfra.wmflabs.org.pem.old maybe? (Not Before: Mar  5 15:42:45 2024 GMT, comes from the pre-puppet 7 setup it seems though)
[10:50:38] <taavi>	 status update: the server seems to work again, https://phabricator.wikimedia.org/T361772#9687944
[10:51:00] <taavi>	 what's left is running `mv /var/lib/puppet/ssl/certs/ca{,-old}.pem` on all the broken instances hooked up to the cloud-wide puppetserver
[10:52:42] <dcaro>	 so the issue was the alt names in the config?
[10:52:53] <dcaro>	 (partially at least)
[10:54:06] <taavi>	 the original reason why instances started breaking is that `/var/lib/puppet/ssl/certs/ca.pem` was updated when the ca certificate was renewed
[10:54:32] <taavi>	 all of what I commented was to just fix the situation caused by people trying to renew the server certs when the server certs were totally fine and the issue was on the client side
[10:55:23] <dcaro>	 ack, but changing that part of the puppet config is something we were missing before?
[10:56:07] <dcaro>	 (just curious, as if it was working before without the config setting, feels weird we need it now)
[10:56:41] <taavi>	 the DNS alt name setting missing was one problem, another was trying to manually generate certs that are usually automatically generated by puppetserver on startup
[10:57:08] <dcaro>	 ack
[10:57:56] <taavi>	 I'm seeing some recoveris from the puppet is broken alert, I'm going to wait half an hour or so to give things time to self-recover and after that I'll look at fixing the remaining instances
[10:58:10] <dcaro>	 👍
[10:58:38] <arturo>	 thanks!
[10:59:14] <dcaro>	 things seem to be fixing themselves yes, not sure why the cached cert removal is needed sometimes (cache invalidation xd)
[11:00:33] <arturo>	 dcaro: could you please remind me how we ship deb packages for the toolforge cli? I did not upload the package to aptly yesterday. Who did?
[11:01:02] <dcaro>	 arturo: might have been raymond, he released also the --health-check-script option for toolforge jobs run
[11:01:13] <arturo>	 oh! that makes sense, then
[11:01:16] <dcaro>	 you had merged your patch right?
[11:01:21] <arturo>	 yes
[11:01:38] <dcaro>	 yep, that might have been it then
[11:01:54] <arturo>	 👍
[11:02:35] <dcaro>	 I tested it slightly, but would appreciate a more thorough test (I'm not sure raymond noticed that he was deploying two things)
[11:05:05] <arturo>	 will do, also I will write some docs on wikitech
[11:05:56] <dcaro>	 awesome, I added an entry in the toolforge changelog already, feel free to modify it to point to the docs :)
[11:06:17] <dcaro>	 https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog
[11:06:25] * dcaro lunch
[11:08:24] <arturo>	 👍
[11:29:09] <arturo>	 different pyyaml versions :-(
[11:29:15] <arturo>	 between the bastions
[11:29:34] <arturo>	 https://www.irccloud.com/pastebin/y4ypc3ii/
[11:30:04] <arturo>	 also different python tabulate versions:
[11:30:22] <arturo>	 https://www.irccloud.com/pastebin/gGSH2scD/
[11:40:40] <moritzm>	 dcaro: I'd go ahead and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013521 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013542 (by disabling puppet and enabling on an initial host), or what that mess with anything right now?
[11:42:22] <moritzm>	 these are step stones o move the cloudceph nodes to nftables, so that evetually we can apply DSCP to eventually mitigate a scenario like the recent outage
[11:42:38] <arturo>	 moritzm: I think it should be safe to go, I don't think we are in the middle of any ceph operation right at this moment
[11:45:10] <arturo>	 +1'd both changes
[11:50:24] <moritzm>	 cheers
[11:52:53] <taavi>	 arturo: something is up with the puppet setup on canary1039-3.cloudvirt-canary, are you ok with me just recreating it?
[11:53:10] <arturo>	 taavi: yeah, no problem. Try the cookbook
[11:53:19] <taavi>	 that was my plan. thanks
[12:13:42] <moritzm>	 the two cloudceph patches are merged now, I'll doublecheck whether there's additional ferm services which need to be migrated and next prepare patches to switch one canary server each (mon and osd) to nftables
[12:50:10] <taavi>	 what is the correct place/command to run a cloud-wide cumin these days?
[12:51:54] <arturo>	 taavi: I would assume cloudcumin?
[12:54:59] <dhinus>	 arturo: yes, but T346453
[12:54:59] <stashbot>	 T346453: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453
[12:55:12] <arturo>	 ah!
[12:55:19] <dhinus>	 funny coincidence that I spoke with riccardo yesterday about it
[12:55:36] <dhinus>	 and I'm currently trying to set up a local devstack vm to test some things
[12:56:14] <dhinus>	 in the meantime, taavi I think you can try applying manually the patch in that task
[12:56:44] <dhinus>	 unless you mean cloud* physical vms
[12:56:54] <dhinus>	 they should work from cloudcumin without patches
[12:57:00] <dhinus>	 *physical hosts
[12:57:11] <taavi>	 `taavi@cloudcumin1001 ~ $ sudo cumin "O{*}"` did work (except T361831 and that it tries to connect to trove instances)
[12:57:12] <stashbot>	 T361831: cloudcumin can't reach bastion-restricted itself - https://phabricator.wikimedia.org/T361831
[12:57:49] <dhinus>	 hmm interesting, so it's not failing with the Unauthorized error anymore?
[12:58:33] <taavi>	 not this time it seems
[12:59:47] <taavi>	 (what I did was related to fixing the old puppet certificate: https://phabricator.wikimedia.org/T361772#9688382)
[13:00:51] <dhinus>	 I'll try to find out what changed
[13:04:27] <dhinus>	 ok it's simply that cloudcumin1001 is /already/ manually patched :)
[13:08:11] <dhinus>	 I added a note to T346453
[13:08:12] <stashbot>	 T346453: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453
[13:09:07] <andrewbogott>	 taavi: thank you for the rescue. I'm still not sure I understand why the server regenerated the certs for you on startup but didn't for me? (I only touched the CA cert after about 1000 different restart/rebuild attempts)
[13:09:32] <blancadesal>	 re licenses, we have a lot of different ones across the toolforge repos: at least AGPL, GPL, Apache, and MIT. Is there a reason? Should we agree on one? 
[13:21:36] <taavi>	 andrewbogott: hard to say without knowing what exactly you did :/
[13:30:35] <andrewbogott>	 taavi: basically this https://www.puppet.com/docs/puppet/7/ssl_regenerate_certificates.html
[13:30:52] <andrewbogott>	 but I'll build a less important server and see if I can reproduce
[13:31:02] <taavi>	 that is about the puppet client certtificates, not the puppet server certificate that needed regenerating here
[13:34:38] <andrewbogott>	 Well for some reason that page won't let me c/p but the first section is for primary server with alt names
[13:35:22] * taavi paged
[13:36:18] <dcaro>	 redis?
[13:36:21] <taavi>	 toolschecker redis check failing with 'You can't write against a read only replica.'
[13:36:22] <taavi>	 yep
[13:36:39] <dcaro>	 you on it?
[13:36:59] <taavi>	 yeah, I'll have a look
[13:37:12] <dcaro>	 ack, let me know if you want help/need anything
[13:37:47] <dcaro>	 toolforge tools seem to be having issues too
[13:38:08] <dcaro>	 oh, maybe not
[13:38:14] <dcaro>	 got a timeout from one tool, but it's back up
[13:39:48] <taavi>	 a-ha:
[13:39:49] <taavi>	 root@tools-redis-5:~# redis-cli info replication
[13:39:49] <taavi>	 ERR max number of clients reached
[13:40:17] <taavi>	 and that blocks the keepalived health check, which in turn routes traffic to the wrong  place
[13:40:47] <dcaro>	 oh
[13:41:09] <taavi>	 i restarted redis
[13:41:18] <taavi>	 and there comes the recovery
[13:41:25] <dcaro>	 ack, if it was a stray bot might happen again
[13:42:07] <taavi>	 yep :/
[13:47:33] <arturo>	 mmm I did not get paged at all
[13:47:38] * arturo checks victorops in the phone
[13:49:21] <arturo>	 I'm not part of the WMCS team :-(
[13:50:24] <taavi>	 hmm, seems like everybody except me is a team admin, so I can't fix that
[13:50:26] <dcaro>	 :faceplam:
[13:51:12] <dcaro>	 taavi: I see you as team admin (I think: `TeamsWMCS *=Team Admin` in your profile)
[13:51:55] <dcaro>	 wait no, that's just the legend, an asterisk in the team name means admin
[13:51:57] <taavi>	 dcaro: looking at teams (from the top nav) -> wmcs -> team members, I see that star next to everybody except me and devnull
[13:52:19] <dcaro>	 done
[13:52:33] <arturo>	 thanks!
[13:52:43] <dcaro>	 interestingly enough, flagging a user as team admin is clicking the `pencil` icon on the right
[13:52:57] <taavi>	 does arturo need to be added to the rotations too?
[13:53:03] <dcaro>	 https://usercontent.irccloud-cdn.com/file/K8z8lhdn/image.png
[13:53:23] <dcaro>	 yes, adding him
[13:55:03] <dcaro>	 there was some weird override setting `arturo` shift to `devnull` user
[13:55:04] <dcaro>	 changed it
[13:55:25] <arturo>	 👍
[13:55:48] <dcaro>	 arturo: I think it's done, can you double check?
[13:57:18] <arturo>	 dcaro: LGTM
[13:57:21] <arturo>	 thanks!
[13:57:40] <dcaro>	 the google calendar looks weird to me now xd
[13:59:35] <taavi>	 andrewbogott: what is cloudinfra-cloudvps-puppetserver-2?
[13:59:51] <andrewbogott>	 taavi: A failed attempt at a scratch rebuild, I'll delete it.
[14:10:38] <andrewbogott>	 I have whatever the opposite of midas touch is
[14:10:40] <blancadesal>	 dcaro: the harbor upgrade breaks the delete-stale-toolforge-artifacts job that runs once a week on wednesdays. It seems that non-admin users no longer have permissions to `get immutabletagrules` (admin users still can) all other jobs seem to be able to run. at least there's some time to fix this before next wednesday xd
[14:11:02] <andrewbogott>	 taavi: any idea what's happening with puppet on cloudbackup1002-dev.eqiad.wmnet?  Seems like the same issue except this one is in prod
[14:11:18] <andrewbogott>	 And in this case all I've done so far is apply a new puppet role
[14:11:27] <dcaro>	 blancadesal: good catch, can you open a task?
[14:11:52] <dcaro>	 andrewbogott: what's the error?
[14:12:09] <andrewbogott>	 https://www.irccloud.com/pastebin/RSyrPiNW/
[14:12:28] <dcaro>	 hmm, it says nothing about the cache as the others
[14:12:29] <andrewbogott>	 huh, why is cloud services in there at all?
[14:12:56] <taavi>	 andrewbogott: did you accidentally switch it from puppet5 to puppet7 or the other way around when changing the role?
[14:14:07] <andrewbogott>	 possibly
[14:14:18] <andrewbogott>	 that's at least a good thing to look for
[14:14:24] <andrewbogott>	 (actual change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016447)
[14:14:35] <andrewbogott>	 anyway, meeting time, I'll investigate that more after
[14:16:20] <blancadesal>	 dcaro: T361842
[14:16:21] <stashbot>	 T361842: [harbor, maintain-harbor] Harbor upgrade 2.10 breaks delete-stale-toolforge-artifacts cron job - https://phabricator.wikimedia.org/T361842
[14:16:28] <dcaro>	 thanks!
[14:16:30] <dcaro>	 blancadesal: meeting?
[14:16:36] <blancadesal>	 coming
[15:14:45] <arturo>	 just noticed our registry admission controller don't enforce the registry URL for initcontainers https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/blob/master/server/registryadmission.go?ref_type=heads#L75
[15:15:03] <arturo>	 which means that most likely pods with initcontainer can run code from outside our docker registry
[15:15:26] <arturo>	 and same for ephemerealcontainers
[15:16:12] <arturo>	 not that this feels like a big deal anyway, given any user can just run wget whatever from within the container
[16:00:59] * arturo offline
[16:07:13] <bd808>	 LOL. I'm reviewing a plan from Search to make a new home for non-CirrusSearch opensearch indexes. Toolhub i listed as needing "<1MB" of disk for its current index. The next smallest need is 9GB. :)
[16:20:10] <taavi>	 what's up with that toolforge-push email?
[16:27:08] <dhinus>	 I was also wondering. it looks like it's a pretty old github account https://github.com/toolforge-push
[16:27:18] <bd808>	 It wasn't me. The "firefox" tag on it made me think it wasn't Phorge either (which is where the credentials are properly used)
[16:28:00] <bd808>	 The account is used from Phorge (phabricator) to mirror repos into the toolforge GitHub org account
[16:28:16] <bd808>	 https://github.com/toolforge/
[16:28:55] <bd808>	 https://github.com/toolforge/admin is an example of a repo that those credentials are used to keep updated
[16:31:03] <bd808>	 The account is wired into phabricator in places like https://phabricator.wikimedia.org/source/tool-admin-web/uri/view/18249/
[16:32:12] <dhinus>	 thanks bd808, do you know of other people who interacted with that account in the past?
[16:32:22] <dhinus>	 is the password shared somewhere?
[16:32:22] <bd808>	 just me.
[16:33:11] <bd808>	 It may be in the SRE pwstore? I can't remember if I gave it to Brooke os someone for safekeeping there
[16:33:19] <bd808>	 *or someone
[16:34:43] <bd808>	 I "own" the toolforge-push and composer-rate-limits-suck github accounts today
[16:35:10] <bd808>	 the https://github.com/composer-ratelimits-suck account is used in Jenkins
[16:38:54] <bd808>	 T242898 has some historical info related to the push account. The account and the service it facilitates are things I should have documented on wikitech. I'll make myself a task to do that.
[16:38:54] <stashbot>	 T242898: Mirroring Diffusion repositories to GitHub seems to be broken - https://phabricator.wikimedia.org/T242898
[16:46:44] <bd808>	 T361859
[16:46:45] <stashbot>	 T361859: Document diffusion->github mirroring to https://github.com/toolforge/ on wikitech - https://phabricator.wikimedia.org/T361859
[16:48:41] <dhinus>	 thanks bd808!
[16:49:48] <dhinus>	 I'm not sure if we should also rotate the password or if it's not worth it, as whoever tried to log in also needs the code that was sent to tools.admin@tools.wmflabs.org
[16:51:06] <dhinus>	 oh this page gives you a list of IPs with active sessions! https://github.com/settings/sessions
[16:51:34] <bd808>	 I could rotate the password. The only "authorized" workflow for the account uses a non-password token so it shouldn't be difficult to reset.
[16:51:54] * bd808 fires up an incognito session to check
[16:52:51] <bd808>	 expect another validation email for a Chrome user-agent....
[16:53:16] * bd808 is in
[16:56:06] <dhinus>	 any other active sessions in the list?
[16:56:15] <bd808>	 nope. just me
[16:56:35] <bd808>	 I'll go ahead and rotate the password anyway
[16:56:45] <dhinus>	 sounds good
[16:57:59] <dhinus>	 a weird but possible explanation (it happened to me once!) is someone still having that password in their pwd manager, and mistakenly auto-filling with the wrong user :)
[16:59:33] * dhinus finds there's also https://github.com/settings/security-log
[17:00:41] <dhinus>	 I'm logging off, thanks bd808 for looking into this!
[17:10:51] * dcaro off
[17:10:53] <dcaro>	 cya tomorrow
[17:10:54] <bd808>	 that security-log url does show the ip that tried to login. It geolocates to "McKees Rocks, Pennsylvania, United States"
[17:13:40] <bd808>	 Rook: any chance that login attempt for https://github.com/toolforge-push was you?
[17:15:06] <Rook>	 It was
[17:15:18] <bd808>	 mystery solved then :)
[17:15:39] <bd808>	 I rotated the password. If you need it I can share
[17:15:50] <Rook>	 I found another way. Though how does one access the associated email?
[17:16:28] <bd808>	 the emails go to the tools.admin@tools.wmflabs.org shared alias for all Toolforge admin tool maintainers
[17:18:22] <Rook>	 Oh there it is, thanks!
[18:05:24] * bd808 lunch
[18:32:46] <Rook>	 Does the troubleshooting page (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting) describe what to do when no nova-compute processes are found (nova-compute proc minimum)?
[18:34:47] <andrewbogott>	 Rook: Are you looking at 1031?
[18:34:54] <Rook>	 yes
[18:35:16] <andrewbogott>	 I'm looking too -- I think this is a consequence of something arturo investigated.  I'll open a task.
[18:35:23] <andrewbogott>	 It's not immediately urgent, OK to just ack for now.
[18:35:52] <Rook>	 Good to know. Though does the troubleshooting page give guidance on what to do for this error?
[18:36:54] <andrewbogott>	 I'm not sure.
[18:37:11] <andrewbogott>	 It's one of those alerts that could be many different things
[18:38:17] <Rook>	 alrighty, I'll just ack it. Let's see if it resolves itself for long enough to make another alert again
[18:39:11] <andrewbogott>	 it might, it restarts on every puppet run and then crashes.
[18:39:19] <andrewbogott>	 I have a vague idea of what's happening
[18:39:50] <andrewbogott>	 although not why it's happening now, 50 days after reimage
[18:55:42] <andrewbogott>	 Rook: I think I have a fix, can you just keep mashing the ack button in the meantime?
[18:56:07] <Rook>	 Can do
[18:58:58] <andrewbogott>	 I suspect that something lost track of the uuid for that host, and now it's trying to re-pool itself with a different id and crashing.  The nova database has an ID which is now in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017124
[19:06:29] <andrewbogott>	 Rook: I think that worked
[19:06:52] <Rook>	 Neat
[19:26:34] <Rook>	 Do we have a response page for ProjectProxyMainProxyDown?
[19:30:52] <andrewbogott>	 that was me -- did it recover?
[19:31:12] <Rook>	 yep, looks happy now
[19:31:32] <andrewbogott>	 All my interruptions have sub-interruptions.
[19:31:59] <andrewbogott>	 Rook: it's probably worth documenting all these missing runbooks if the alerts don't have links.
[19:33:07] <Rook>	 Where should we document them?
[19:34:39] <andrewbogott>	 Open a phab task or tasks suggesting that someone write some docs :)
[19:34:58] <Rook>	 How would you recommend it be tagged?
[19:35:32] <andrewbogott>	 probably cloud-vps for the ones from today
[19:36:25] <Rook>	 Alrighty
[21:26:04] <arturo>	 !bash <andrewbogott> All my interruptions have sub-interruptions.
[21:26:04] <stashbot>	 arturo: Stored quip at https://bash.toolforge.org/quip/-vEBq44BhuQtenzvKxHN
[21:41:08] <andrewbogott>	 I'm pretty sure that nothing new is currently broken so I'm going to make my escape
[22:07:18] <bd808>	 https://bash.toolforge.org/search?q=andrewbogott is a nice collection :)
[22:11:39] <bd808>	 Neha Jha (2018 GSoC intern) poked me about places where she could work on leveling up some of her backend skills. Among other things I told her y'all are doing lots of Toolforge things and might be read for coding help. T190638 was her internship.
[22:11:39] <stashbot>	 T190638: GSoC 2018 proposal for Improvements for the Toolforge 'webservice' command - https://phabricator.wikimedia.org/T190638
[22:13:11] <bd808>	 She reached out because she saw my linkedin post about the grid shutdown, so I guess that maybe did some good besides letting Chase, Brooke, and a few others know it actually happened.