[06:28:37] <XioNoX>	 cwhite: noted, thx !
[06:57:23] <jayme>	 XioNoX: good morning o/
[07:08:26] <XioNoX>	 jayme: hello!
[07:08:41] <XioNoX>	 finishing up a few things then we can get started
[07:15:11] <jayme>	 XioNoX: ack, lmk when you're ready. I've prepared patches yesterday
[07:19:05] <XioNoX>	 jayme: which host should we try the rename on ?
[07:19:59] <jayme>	 kubernetes2023 and 2032, see https://phabricator.wikimedia.org/T365571 
[07:20:50] <jayme>	 we can start with kubernetes2023 -> wikikube-worker2001
[07:21:12] <jayme>	 prereq https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034956 , https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034976
[07:23:24] <XioNoX>	 jayme: +1
[07:24:02] <XioNoX>	 jayme: then I'll run `cumin1002:~$ test-cookbook -c 1008818 --dry-run sre.hosts.rename -t T365571 kubernetes2023 wikikube-worker2001`
[07:24:05] <stashbot>	 T365571: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571
[07:24:38] <jayme>	 til test-cookbook is a thing
[07:24:49] <XioNoX>	 https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging
[07:25:00] <jayme>	 let me merge and run puppet on the install servers
[07:29:53] <jayme>	 ...taking it's time
[07:32:46] <jayme>	 XioNoX: done
[07:34:08] <XioNoX>	 cool
[07:34:43] <XioNoX>	 jayme: running the cookbook then
[07:37:17] <XioNoX>	 great...
[07:37:26] <XioNoX>	 https://www.irccloud.com/pastebin/D4IzTbUb/
[07:38:31] <jayme>	 uh
[07:39:09] <jayme>	 I wasn't aware this is ruby...I'm shocked
[07:39:10] <XioNoX>	 gpg --decrypt works, I'll look ad pws later on
[07:40:39] <XioNoX>	 https://www.irccloud.com/pastebin/CcgrYkLv/
[07:40:44] <jayme>	 I don't see a log being created on cumin1002 or 2002. Should the test-cookbook log like a regular one?
[07:41:07] <XioNoX>	 jayme: probably because it's on dry-run
[07:41:35] <XioNoX>	 so `NetboxHostNotFoundError: wikikube-worker2001` makes sens because nothing really got renamed
[07:42:07] <XioNoX>	 ah, and JSONDecodeError: Expecting value: line 1 column 1 (char 0) too, because it doesn't send the PATCH
[07:42:27] <jayme>	 yeah :D
[07:42:37] <XioNoX>	 not sure how much we can handle the --dry-run case...
[07:44:35] <XioNoX>	 jayme: let's go without --dry-run ?
[07:45:10] <jayme>	 XioNoX: yeah, I'd say so. No idea about dry-run either
[07:45:40] <jayme>	 I think the redfish call should probably not fail, even on empty response
[07:47:58] <XioNoX>	 yeah might be a bug in https://github.com/wikimedia/operations-software-spicerack/blob/master/spicerack/redfish.py#L341 (/cc volans)
[07:49:04] <XioNoX>	 jayme: running without dry-run
[07:49:39] <volans>	 the logs are in your home
[07:49:48] <XioNoX>	 https://www.irccloud.com/pastebin/10oyaB13/
[07:49:56] <volans>	 ~/cookbook_testing/logs  # The log directory where all cookbooks will log into
[07:50:03] <volans>	 from the help message
[07:51:28] <XioNoX>	 at least rollback worked as expected, changed the name and back in Netbox https://netbox.wikimedia.org/dcim/devices/4399/changelog/
[07:52:28] <volans>	 XioNoX: iDRAC.Embedded.1/iDRAC.Embedded.1
[07:52:32] <volans>	 it's there twice
[07:53:00] <volans>	 oob_manager includes that
[07:53:26] <XioNoX>	 https://github.com/wikimedia/operations-software-spicerack/blob/master/spicerack/redfish.py#L821
[07:53:29] <XioNoX>	 indeed :)
[07:53:35] <volans>	 as i said in the reivew, you had to replace that with f'{self.redfish.oob_manager}/Ethern....' ;)
[07:53:56] <volans>	 missed that in the last review sorry
[07:56:16] <XioNoX>	 volans, jayme https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1008818/7..8
[07:56:44] <volans>	 +1ed
[07:57:26] <jayme>	 ditto
[07:58:14] <XioNoX>	 updating dns...
[08:00:19] <XioNoX>	 dns done, sync-netbox-hiera in progress
[08:01:54] <jayme>	 exciting 
[08:02:27] <XioNoX>	 switch port updated, removed from debmon, removed from puppet
[08:02:41] <XioNoX>	 Rename completed 👍 - now please run the re-image cookbook on the new name with --new
[08:02:41] <XioNoX>	 Updated Phabricator task T365571
[08:02:43] <stashbot>	 T365571: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571
[08:03:00] <jayme>	 nice
[08:03:21] <jayme>	 now the vlan thing or the reimage thing first?
[08:04:20] <XioNoX>	 jayme: only the re-image for now I'd say to make sure the rename is fully working on its own
[08:05:13] <XioNoX>	 then we can merge the rename cookbook
[08:05:43] <jayme>	 okay. I can run the reimage...should hopefully be done within an hour before I have to leave :|
[08:05:46] <XioNoX>	 then we do the move-vlan on the same host, with yet another re-image
[08:06:13] <XioNoX>	 jayme: yeah, it's slow but it's the safest path
[08:06:45] <XioNoX>	 but if we try the move-vlan reimage right now and there is an issue it might be a pain to troubleshot
[08:07:47] <volans>	 agree
[08:07:58] <volans>	 I'll be offline for ~40m shortly
[08:08:23] <jayme>	 reimage is running
[08:08:34] <jayme>	 pws still worked for me btw XioNoX
[08:10:01] <XioNoX>	 jayme: I upgraded to Ubuntu 24.04 yesterday evening, so that's probably why :)
[08:10:29] <jayme>	 yeah, these things don't happen to debian stable :p
[08:12:08] <jayme>	 XioNoX: should we run homer already in parallel with the reimage?
[08:14:03] <XioNoX>	 jayme: I was waiting for the re-image to complete, but we could. With the rename cookbook, the only things that needs to happen on the network side is renaming the BGP peer description/name, so only aesthetic
[08:14:25] <jayme>	 ah, makes sense. Let's wait then
[08:20:32] <moritzm>	 pws works fine for me on Debian unstable FWIW. Possibly Canonical moved GNUPG to a snap :-)
[08:23:11] <jayme>	 moritzm: didn't you do a rewrite of pws?
[08:23:49] <XioNoX>	 from a random reddit comment:  "Are you using Ruby 3? Because exists? was deprecated (it was a duplicated method) and ultimately removed. It’s just exist? now. Update your gem or downgrade your Ruby. "
[08:24:04] <XioNoX>	 the error I'm getting is `/usr/bin/pws:787:in `initialize': undefined method `exists?' for FileTest:Module (NoMethodError)
[08:24:08] <moritzm>	 I had started some ago, but it's not usable yet, need some time to pick it up
[08:24:40] <moritzm>	 what does ruby --version show?
[08:25:11] <XioNoX>	 ruby 3.2.3
[08:26:32] <moritzm>	 ah yes, Debian has 3.2 and 3.1, but the ruby interpreter still defaults in 3.1 in unstable, then it's possibly some deprecation in 3.2 after all
[08:26:44] <XioNoX>	 I renamed all the `exists?` to `exist?` and now getting a different error
[08:26:49] <XioNoX>	 .users file is signed by AB48C7022E543EABE8021D6FB29E1E6371FDBFB6 which is not in /home/xionox/.pws-trusted-users
[08:26:59] <XioNoX>	 but looks like the tool itself works again
[08:28:46] <jayme>	 first puppet run started on the renamed node
[08:28:59] <moritzm>	 we updated the list of users who do updates after John left, see https://office.wikimedia.org/wiki/Pwstore#User_database for thecurrent version of .pws-trusted-users
[08:29:00] <XioNoX>	 jayme: looks like it's all good then?
[08:29:23] * jayme trying not to jinx it
[08:29:43] <XioNoX>	 moritzm: yep, all good now
[08:33:35] <moritzm>	 ack. I'll update our forked script in wmf-laptop-sre in the next days
[08:34:13] <XioNoX>	 does that means ubuntu 24.04 LTS is less stable than debian-unstable ? :)
[08:40:39] <jayme>	 now rebooting after the first puppet run
[08:45:04] <jayme>	 XioNoX: I think we're good
[08:45:12] <XioNoX>	 jayme: nice!
[08:45:17] <jayme>	 ah, there is the finish call from the cookbook
[08:45:58] <jayme>	 XioNoX: well...calico is not yet running
[08:46:04] <jayme>	 let me check
[08:48:52] <jayme>	 2024-05-23 08:44:40.575 [WARNING][9] startup/startup.go 984: Calico node 'kubernetes2023.codfw.wmnet' is already using the IPv4 address 10.192.16.39. - fair enough
[08:49:52] <XioNoX>	 I opened that minor task for the small bug we found https://phabricator.wikimedia.org/T365680
[08:51:24] <jayme>	 I'm running puppet on all k8s workers in codfw, I think the ferm rule for connecting to typha from the renamed nodes might not be there on all of the nodes
[08:59:32] <Emperor>	 probably-dumb puppet question: is it easy to get the network a node's in in CIDR notation? AFAICT there are facts with network and netmask in (from which one could construct the x.x.x.x/y notation), is there a canned answer or do I need to roll my own?
[08:59:54] <XioNoX>	 Emperor: what's the end goal ?
[09:01:47] <Emperor>	 XioNoX: bootstrapping a ceph cluster requires (inter alia) specifying the allowed networks that mon nodes can be placed in ( e.g. I currently have config: public_network: 10.64.16.0/22,10.64.32.0/22,10.64.136.0/24 ). I have in hiera the list of mon nodes by hostname.
[09:02:02] <jayme>	 XioNoX: calico still not coming up
[09:02:22] <XioNoX>	 jayme: you mean bgp sessions or something else?
[09:02:51] <XioNoX>	 where does it keep states?
[09:03:06] <jayme>	 XioNoX: something else I suppose ... still not sure. It had trouble connecting to typha, which seemed fine by that time. But now I'm not sure what the issue is
[09:03:43] <jayme>	 XioNoX: what states? It terminates after some point because it does not get ready, then retries after exponential backoff
[09:04:06] <jayme>	 "kubectl -n kube-system logs calico-node-mlx67 --follow" for logs (after kube-env admin codfw)
[09:04:16] * Emperor finds wmf::mask2cidr
[09:04:45] <Emperor>	 wmflib even
[09:04:46] <XioNoX>	 jayme: I mean " Calico node 'kubernetes2023.codfw.wmnet' is already using the IPv4 address 10.192.16.39" so it still remembers the old name somewhere?
[09:05:12] <jayme>	 XioNoX: ah, that was because I had not deleted the old node from the k8s api at that point
[09:05:26] <XioNoX>	 Emperor: yeah, I'm wondering if there is value in having that level of granularity vs. for example allowing all of 10/8 ?
[09:05:50] <XioNoX>	 or 0/0 :)
[09:07:57] <Emperor>	 it's a fair question. I can see that we might e.g. want to have our RGWs not be pickable as mons (and they currently live in separate networks)
[09:08:45] <XioNoX>	 RGW?
[09:09:08] <XioNoX>	 but right now 10.64.16.0/22,10.64.32.0/22,10.64.136.0/24 are whole vlans
[09:09:30] <Emperor>	 XioNoX: the nodes that provide the S3 API endpoint (rather than having any storage on them)
[09:09:36] <XioNoX>	 jayme: do you have an example of destination IP it tries to connect to ?
[09:10:11] <XioNoX>	 Emperor: what do you mean here that they live in different networks ?
[09:10:11] <jayme>	 XioNoX: 10.192.32.151:5473,10.192.48.62:5473,10.192.5.6:5473
[09:10:28] <jayme>	 those hosts lack iptables rules allowing the new node to connect, still
[09:10:40] <XioNoX>	 ok, was about to look at it
[09:10:41] <jayme>	 not sure why that has not been applied by ferm when running puppet
[09:12:53] <jayme>	  /etc/ferm/conf.d/10_calico-typha contains the new node, but I still don't see it in iptables -L
[09:13:26] <jayme>	 it does come up after a manual ferm restart, though
[09:14:27] <XioNoX>	 jayme: as iptables is ipbased, and we're not changing the IP it shouldn't even be removed then readded
[09:15:07] <jayme>	 XioNoX: right. But ferm receives the hostname and does a DNS lookup
[09:15:17] <moritzm>	 ferm is only restarted by Puppet if a puppet-managed ferm resource is changed (ferm::service or ferm::rule)
[09:16:05] <moritzm>	 jayme: which specific ferm:.service is that?
[09:16:09] <XioNoX>	 jayme: oh, I have a theory
[09:16:13] <jayme>	 So it might remove the ip from iptables when merging this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034976/1/hieradata/common/kubernetes.yaml and the new name does not hvae an ip then
[09:16:28] <moritzm>	 with the legacy resolution the resolve() function of ferm does the DNS lookup
[09:16:41] <moritzm>	 and that only performs a re-lookup if a Puppet resource changes
[09:16:48] <jayme>	 when dns starts working, ferm is not refreshed because the file did not change
[09:16:56] <Emperor>	 XioNoX: well, moss-fe1002 is in 10.64.48.0/22 , for example (which I think isn't in any of those ranges in public_network)
[09:17:09] <moritzm>	 with the new srange() parameter it is resolved on the Puppet server side with every Puppet run
[09:17:12] <XioNoX>	 jayme: yeah exactly!
[09:17:29] <jayme>	 XioNoX: restarted ferm where required, calico came up, BGP sessions established
[09:17:45] <XioNoX>	 jayme: so maybe the solution is to add the new name without removing the old one
[09:18:05] <XioNoX>	 and then remove the old one when the host is being re-imaged?
[09:18:09] <jayme>	 yeah, probably
[09:18:58] <jayme>	 maybe relaxing the ferm rule would also be possible mid-term...that would spare a puppet on all nodes of the cluster when adding new ones I think
[09:19:15] <XioNoX>	 Emperor: that's one of eqiad's row D vlan, so it's kind of random depending on server placement, not depending on actual server role
[09:20:00] <XioNoX>	 Emperor: if you had a server in each rack we would have to list all of our private ranges, so close to 10/8
[09:20:38] <XioNoX>	 jayme: is there some kind of auth on the service listening on that port?
[09:21:00] <jayme>	 XioNoX: I need to double check. I think nowdays calico can do mTLS
[09:21:58] <XioNoX>	 I'm leaving it to the security people, but that sounds like a good plan
[09:25:20] <Emperor>	 XioNoX: Hm, yeah, I should probably save myself hassle and just say 10/8 :)
[09:25:58] <XioNoX>	 Emperor: then what about IPv6? :)
[09:26:52] <XioNoX>	 and is there ferm filtering as well?
[09:28:00] <XioNoX>	 maybe it could be a 0/0 in Ceph, but the filtering done in ferm itself to leverage the automation we already have to keep it simpler? (just some ideas, I don't know the full setup)
[09:31:31] <jayme>	 XioNoX: created https://phabricator.wikimedia.org/T365687 for that
[09:32:06] <XioNoX>	 cool, making progres!
[09:32:19] <XioNoX>	 I didn't know about srange()!
[09:32:20] <jayme>	 XioNoX: unfortunately I gtg in a bit so I can't really continue (availability will be spotty during commute)
[09:32:55] <jayme>	 node is still cordoned and not pooled, though. So maybe someone else from my team can pick the work up with you
[09:33:22] <XioNoX>	 no pb! hnowlan let me know if/when you want to resume that
[09:34:40] <jayme>	 or maybe akosiaris - if he has bandwith
[09:44:23] <hnowlan>	 I can pick the pool/cordon stuff up in a few minutes
[09:46:48] <jayme>	 hnowlan: we where planning on testing the v-lan move as well. But it's fine if you dont have time. I can pick that up after vacation then
[09:46:59] <jayme>	 gtg - cheers
[10:39:24] <btullis>	 effie: Is it OK to merge 9c7deac003 ?
[10:39:44] <btullis>	 Or feel free to merge mine 7f8b802394
[10:39:49] <effie>	 btullis: is that the one on regex.yaml ?
[10:39:59] <btullis>	 Yes
[10:40:03] <effie>	 yes please go ahead
[10:40:17] <btullis>	 Ack, done. Thanks.
[10:42:25] <hnowlan>	 XioNoX: just catching up - the plan is to change the vlan and reimage wikikube-worker2001 again, right? 
[10:58:56] <effie>	 fabfur: any objections in merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034108 and  https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034514 ?
[10:59:17] <fabfur>	 hello let me check
[11:01:39] <effie>	 cheers :)
[11:11:01] <fabfur>	 looks good to me
[11:11:25] <fabfur>	 not the greatest expert in this|lua
[11:13:40] <effie>	 fabfur: well, I was planning to merge it :)
[11:13:51] <effie>	 ok to merge those two?
[11:14:09] <fabfur>	 ok
[11:30:27] <XioNoX>	 hnowlan: yep exactly 
[11:45:11] <XioNoX>	 in theory it's jsut about running a cookbook :)
[11:51:25] <vgutierrez>	 eoghan, XioNoX: time to depool upload@esams before enabling IPIP encapsulation
[11:51:36] <eoghan>	 Good luck!
[11:52:56] <XioNoX>	 last one!
[11:59:10] <vgutierrez>	 XioNoX: hopefully :)
[12:22:18] <XioNoX>	 hnowlan: oh, before I forget, we should also ask DCops to update the physical label on the server
[12:51:01] <topranks>	 XioNoX: awesome work on the cookbook, great to see a successful test :)
[12:51:16] <sukhe>	 nice work all!
[12:51:39] <sukhe>	 though I am so used to not renaming hosts anymore that it will take time for me to actually believe it is possible :P
[12:56:36] <XioNoX>	 hopefully it won't be used often, other than for the mw hosts of course :)
[13:19:00] <hnowlan>	 XioNoX: sounds good! I have depooled and cordoned it
[13:31:17] <hnowlan>	 Filed T365712 for relabelling
[13:31:18] <stashbot>	 T365712: Relabel codfw Kubernetes hosts  - https://phabricator.wikimedia.org/T365712
[13:31:24] <hnowlan>	 (also a related one for eqiad) 
[14:13:21] <XioNoX>	 hnowlan: just saw your message, does that mean it's ok to proceed with testing the move vlan cookbook? maybe we can schedule it for tomorrow morning?
[14:14:48] <hnowlan>	 XioNoX: yep! sure, sounds good
[14:33:51] <moritzm>	 XioNoX, hnowlan: when you run the next test, can you leave a note wrt https://phabricator.wikimedia.org/T365687#9825949 please?
[14:41:24] <hnowlan>	 moritzm: ack, thanks
[14:42:57] <hnowlan>	 heads up eoghan, XioNoX, arnoldokoth and others - in like 15-20 minutes I'm going to try to roll out new certs for sessionstore in codfw and then eqiad. This uses the old puppet-based ecdsa script which hasn't been used in a long time afaik
[14:43:29] <hnowlan>	 testing in staging has been okay so far but sessionstore is obviously pretty critical and can be a bit tricky so the risk level is medium
[14:43:39] <hnowlan>	 more info in https://phabricator.wikimedia.org/T363996 
[14:43:55] <XioNoX>	 if it can happen after 17min that would be perfect, that's when my oncall shift ends :)
[14:44:24] <hnowlan>	 heh, it'll be about that :P 
[15:01:48] <hnowlan>	 proceeding with the sessionstore changes in private 
[15:18:04] <hnowlan>	 deploying in codfw
[15:21:26] <hnowlan>	 key looks okay, no errors on sessionstore or session loss that I can see 
[15:21:48] <hnowlan>	 rps is up as a result of the deploy, waiting until the graphs calm down 
[15:25:07] <hnowlan>	 so riddle me this - wtf does echostore do? We have zero docs on it anywhere :D I know it's just another kask but what does it affect? 
[15:25:41] <hnowlan>	 proceeding with eqiad sessionstore
[15:28:27] <bblack>	 yeah it's not super-documented, but I assume it's either the main tables or just the unread-tracking db for: https://www.mediawiki.org/wiki/Extension:Echo
[15:29:22] <hnowlan>	 eqiad is done, traffic graphs are wacky but no errors and no session loss
[15:32:52] <AntiComposite>	 looks like https://phabricator.wikimedia.org/T234286 was the original creation task for echostore, but it conventiently doesn't use that name.
[15:33:30] <hnowlan>	 echostore also doesn't have a cert in puppet yet it uses one to serve traffic 🫠
[15:34:04] * jhathaway impressed by AntiComposite's sleuthing
[15:34:51] <hnowlan>	 anyway, echostore expires in October. Hopefully we will be using the mesh by then and won't have to worry about that
[15:43:55] <brouberol>	 dcaro: there's a pending puppet/private change related to gitlab. Is that safe to deploy?
[15:44:19] <brouberol>	 (it seems to be, but I'd rather ask)
[15:45:01] <dcaro>	 brouberol: yes :0
[15:45:04] <dcaro>	 * :)
[15:45:10] <brouberol>	 gotcha, thanks
[16:08:03] <dcaro>	 thanks!
[16:41:17] <RhinosF1>	 hnowlan: -operations is too noisy with all that
[16:41:33] <RhinosF1>	 hnowlan: 17:37:33 <dduvall> !log destroying all blubberoid deployments as part of its decommissioning (T318289)
[16:41:33] <stashbot>	 T318289: Deprecate Blubber's CLI and microservice (blubberoid) interfaces - https://phabricator.wikimedia.org/T318289
[16:41:50] <hnowlan>	 ack, thanks 
[20:50:09] <mutante>	 if a scheduled maintenance is like "BGP session may flap several times" but that's it. no point in adding it to calendar I assume?
[20:51:18] <jhathaway>	 I think a flapping bgp session is probably worth adding, in case something goes wrong, and it does more then flap
[20:51:47] <jhathaway>	 Does anyone know why we have ip4:74.121.51.111 in our spf record for wikimedia.org?
[20:52:54] <mutante>	 ok
[20:54:02] <mutante>	 jhathaway: it's related to fundraising tech, donatewiki
[20:54:18] <mutante>	 you will see when you search Phab for the string "mkt4477"
[20:54:46] <mutante>	 links.email.donate.wikimedia.org
[20:55:21] <mutante>	 (or maybe "was" related)
[20:55:53] <jhathaway>	 interesting, do they still said mail for us?
[20:56:58] <mutante>	 probably not, I asked in 2016 if we can remove that from DNS, but gotta ask fr-tech to be sure
[20:57:16] <jhathaway>	 nod, thanks
[21:07:48] <dwisehaupt>	 jhathaway: i believe that is the ip related to what is now branded acoustic which sends out our mass donor mailings.
[21:07:53] <dwisehaupt>	 i can verify.
[21:10:15] <jhathaway>	 thanks dwisehaupt that would be great
[21:11:24] <dwisehaupt>	 yeah, verified that is in their network space and we definitely still use them.
[21:11:53] <dwisehaupt>	 looks like it's traced back to T80999
[21:13:31] <mutante>	 releasing that ticket to public, nothing secret in it but NDAed simply because it was RT imported
[21:13:58] <jhathaway>	 I surprised it not an include, something like include:_spf.acoustic.com or similar
[21:14:10] <jhathaway>	 but I suppose they have had the same IP for a long time
[21:14:44] <dwisehaupt>	 yeah, that i don't know.
[21:15:19] <dwisehaupt>	 i'll bring it up in our standup soon and see if we can verify it's still what we want.
[21:16:12] <jhathaway>	 thanks