[06:28:37] cwhite: noted, thx !
[06:57:23] XioNoX: good morning o/
[07:08:26] jayme: hello!
[07:08:41] finishing up a few things then we can get started
[07:15:11] XioNoX: ack, lmk when you're ready. I've prepared patches yesterday
[07:19:05] jayme: which host should we try the rename on ?
[07:19:59] kubernetes2023 and 2032, see https://phabricator.wikimedia.org/T365571
[07:20:50] we can start with kubernetes2023 -> wikikube-worker2001
[07:21:12] prereq https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034956 , https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034976
[07:23:24] jayme: +1
[07:24:02] jayme: then I'll run `cumin1002:~$ test-cookbook -c 1008818 --dry-run sre.hosts.rename -t T365571 kubernetes2023 wikikube-worker2001`
[07:24:05] T365571: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571
[07:24:38] til test-cookbook is a thing
[07:24:49] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging
[07:25:00] let me merge and run puppet on the install servers
[07:29:53] ...taking its time
[07:32:46] XioNoX: done
[07:34:08] cool
[07:34:43] jayme: running the cookbook then
[07:37:17] great...
[07:37:26] https://www.irccloud.com/pastebin/D4IzTbUb/
[07:38:31] uh
[07:39:09] I wasn't aware this is ruby... I'm shocked
[07:39:10] gpg --decrypt works, I'll look at pws later on
[07:40:39] https://www.irccloud.com/pastebin/CcgrYkLv/
[07:40:44] I don't see a log being created on cumin1002 or 2002. Should the test-cookbook log like a regular one?
[07:41:07] jayme: probably because it's on dry-run
[07:41:35] so `NetboxHostNotFoundError: wikikube-worker2001` makes sense because nothing really got renamed
[07:42:07] ah, and JSONDecodeError: Expecting value: line 1 column 1 (char 0) too, because it doesn't send the PATCH
[07:42:27] yeah :D
[07:42:37] not sure how much we can handle the --dry-run case...
[07:44:35] jayme: let's go without --dry-run ?
[07:45:10] XioNoX: yeah, I'd say so. No idea about dry-run either
[07:45:40] I think the redfish call should probably not fail, even on empty response
[07:47:58] yeah, might be a bug in https://github.com/wikimedia/operations-software-spicerack/blob/master/spicerack/redfish.py#L341 (/cc volans)
[07:49:04] jayme: running without dry-run
[07:49:39] the logs are in your home
[07:49:48] https://www.irccloud.com/pastebin/10oyaB13/
[07:49:56] ~/cookbook_testing/logs # The log directory where all cookbooks will log into
[07:50:03] from the help message
[07:51:28] at least rollback worked as expected, changed the name and back in Netbox https://netbox.wikimedia.org/dcim/devices/4399/changelog/
[07:52:28] XioNoX: iDRAC.Embedded.1/iDRAC.Embedded.1
[07:52:32] it's there twice
[07:53:00] oob_manager includes that
[07:53:26] https://github.com/wikimedia/operations-software-spicerack/blob/master/spicerack/redfish.py#L821
[07:53:29] indeed :)
[07:53:35] as I said in the review, you had to replace that with f'{self.redfish.oob_manager}/Ethern....' ;)
[07:53:56] missed that in the last review, sorry
[07:56:16] volans, jayme https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1008818/7..8
[07:56:44] +1ed
[07:57:26] ditto
[07:58:14] updating dns...
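The JSONDecodeError above comes from trying to decode an empty body when --dry-run skips the PATCH. A minimal sketch of the defensive handling suggested in the discussion, not the actual spicerack.redfish code, assuming a requests-style response object (status_code / text attributes):

```python
# Illustration only: tolerate an empty Redfish response body instead of
# letting json.loads() raise JSONDecodeError. The real fix in spicerack
# may look different; this just shows the idea discussed above.
import json


def parse_redfish_body(response):
    """Return the decoded JSON body, or an empty dict if there is no body."""
    if not response.text or not response.text.strip():
        # e.g. a dry-run that never sent the request, or a 204 No Content
        return {}
    return json.loads(response.text)
```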
[08:00:19] dns done, sync-netbox-hiera in progress
[08:01:54] exciting
[08:02:27] switch port updated, removed from debmon, removed from puppet
[08:02:41] Rename completed 👍 - now please run the re-image cookbook on the new name with --new
[08:02:41] Updated Phabricator task T365571
[08:02:43] T365571: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571
[08:03:00] nice
[08:03:21] now the vlan thing or the reimage thing first?
[08:04:20] jayme: only the re-image for now I'd say, to make sure the rename is fully working on its own
[08:05:13] then we can merge the rename cookbook
[08:05:43] okay. I can run the reimage... should hopefully be done within an hour before I have to leave :|
[08:05:46] then we do the move-vlan on the same host, with yet another re-image
[08:06:13] jayme: yeah, it's slow but it's the safest path
[08:06:45] but if we try the move-vlan reimage right now and there is an issue it might be a pain to troubleshoot
[08:07:47] agree
[08:07:58] I'll be offline for ~40m shortly
[08:08:23] reimage is running
[08:08:34] pws still worked for me btw XioNoX
[08:10:01] jayme: I upgraded to Ubuntu 24.04 yesterday evening, so that's probably why :)
[08:10:29] yeah, these things don't happen to debian stable :p
[08:12:08] XioNoX: should we run homer already in parallel with the reimage?
[08:14:03] jayme: I was waiting for the re-image to complete, but we could. With the rename cookbook, the only thing that needs to happen on the network side is renaming the BGP peer description/name, so only aesthetic
[08:14:25] ah, makes sense. Let's wait then
[08:20:32] pws works fine for me on Debian unstable FWIW. Possibly Canonical moved GNUPG to a snap :-)
[08:23:11] moritzm: didn't you do a rewrite of pws?
[08:23:49] from a random reddit comment: "Are you using Ruby 3? Because exists? was deprecated (it was a duplicated method) and ultimately removed. It's just exist? now. Update your gem or downgrade your Ruby."
[08:24:04] the error I'm getting is `/usr/bin/pws:787:in `initialize': undefined method `exists?' for FileTest:Module (NoMethodError)
[08:24:08] I had started some time ago, but it's not usable yet, need some time to pick it up
[08:24:40] what does ruby --version show?
[08:25:11] ruby 3.2.3
[08:26:32] ah yes, Debian has 3.2 and 3.1, but the ruby interpreter still defaults to 3.1 in unstable, so it's possibly some deprecation in 3.2 after all
[08:26:44] I renamed all the `exists?` to `exist?` and now I'm getting a different error
[08:26:49] .users file is signed by AB48C7022E543EABE8021D6FB29E1E6371FDBFB6 which is not in /home/xionox/.pws-trusted-users
[08:26:59] but looks like the tool itself works again
[08:28:46] first puppet run started on the renamed node
[08:28:59] we updated the list of users who do updates after John left, see https://office.wikimedia.org/wiki/Pwstore#User_database for the current version of .pws-trusted-users
[08:29:00] jayme: looks like it's all good then?
[08:29:23] * jayme trying not to jinx it
[08:29:43] moritzm: yep, all good now
[08:33:35] ack. I'll update our forked script in wmf-laptop-sre in the next days
[08:34:13] does that mean ubuntu 24.04 LTS is less stable than debian-unstable? :)
[08:40:39] now rebooting after the first puppet run
[08:45:04] XioNoX: I think we're good
[08:45:12] jayme: nice!
[08:45:17] ah, there is the finish call from the cookbook
[08:45:58] XioNoX: well... calico is not yet running
[08:46:04] let me check
[08:48:52] 2024-05-23 08:44:40.575 [WARNING][9] startup/startup.go 984: Calico node 'kubernetes2023.codfw.wmnet' is already using the IPv4 address 10.192.16.39. - fair enough
[08:49:52] I opened that minor task for the small bug we found https://phabricator.wikimedia.org/T365680
[08:51:24] I'm running puppet on all k8s workers in codfw, I think the ferm rule for connecting to typha from the renamed nodes might not be there on all of the nodes
[08:59:32] probably-dumb puppet question: is it easy to get the network a node is in, in CIDR notation? AFAICT there are facts with network and netmask in them (from which one could construct the x.x.x.x/y notation); is there a canned answer or do I need to roll my own?
[08:59:54] Emperor: what's the end goal ?
[09:01:47] XioNoX: bootstrapping a ceph cluster requires (inter alia) specifying the allowed networks that mon nodes can be placed in (e.g. I currently have config: public_network: 10.64.16.0/22,10.64.32.0/22,10.64.136.0/24). I have in hiera the list of mon nodes by hostname.
[09:02:02] XioNoX: calico still not coming up
[09:02:22] jayme: you mean bgp sessions or something else?
[09:02:51] where does it keep state?
[09:03:06] XioNoX: something else I suppose ... still not sure. It had trouble connecting to typha, which seemed fine by that time. But now I'm not sure what the issue is
[09:03:43] XioNoX: what state? It terminates after some point because it does not get ready, then retries after exponential backoff
[09:04:06] "kubectl -n kube-system logs calico-node-mlx67 --follow" for logs (after kube-env admin codfw)
[09:04:16] * Emperor finds wmf::mask2cidr
[09:04:45] wmflib even
[09:04:46] jayme: I mean "Calico node 'kubernetes2023.codfw.wmnet' is already using the IPv4 address 10.192.16.39", so it still remembers the old name somewhere?
[09:05:12] XioNoX: ah, that was because I had not deleted the old node from the k8s api at that point
[09:05:26] Emperor: yeah, I'm wondering if there is value in having that level of granularity vs. for example allowing all of 10/8 ?
[09:05:50] or 0/0 :)
[09:07:57] it's a fair question. I can see that we might e.g. want to have our RGWs not be pickable as mons (and they currently live in separate networks)
[09:08:45] RGW?
[09:09:08] but right now 10.64.16.0/22,10.64.32.0/22,10.64.136.0/24 are whole vlans
[09:09:30] XioNoX: the nodes that provide the S3 API endpoint (rather than having any storage on them)
[09:09:36] jayme: do you have an example of a destination IP it tries to connect to ?
[09:10:11] Emperor: what do you mean here that they live in different networks ?
[09:10:11] XioNoX: 10.192.32.151:5473,10.192.48.62:5473,10.192.5.6:5473
[09:10:28] those hosts lack iptables rules allowing the new node to connect, still
[09:10:40] ok, was about to look at it
[09:10:41] not sure why that has not been applied by ferm when running puppet
[09:12:53] /etc/ferm/conf.d/10_calico-typha contains the new node, but I still don't see it in iptables -L
[09:13:26] it does come up after a manual ferm restart, though
[09:14:27] jayme: as iptables is IP-based, and we're not changing the IP, it shouldn't even be removed and then re-added
[09:15:07] XioNoX: right. But ferm receives the hostname and does a DNS lookup
[09:15:17] ferm is only restarted by Puppet if a puppet-managed ferm resource is changed (ferm::service or ferm::rule)
[09:16:05] jayme: which specific ferm::service is that?
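A rough sketch of the manual check described just above (is the renamed node actually present in the live ruleset, or does ferm still need a reload?). The worker list and the use of iptables-save are illustrative only, and it has to run as root on a typha-hosting node:

```python
# Resolve each worker that should be allowed to reach typha and check
# whether its IP appears in the live iptables ruleset. If the ferm config
# file lists the host but the IP is missing here, ferm was not reloaded
# after the name became resolvable.
import socket
import subprocess

WORKERS = ["wikikube-worker2001.codfw.wmnet"]  # illustrative list

ruleset = subprocess.run(
    ["iptables-save"], capture_output=True, text=True, check=True
).stdout

for host in WORKERS:
    ip = socket.gethostbyname(host)
    state = "present" if ip in ruleset else "MISSING - ferm reload needed?"
    print(f"{host} ({ip}): {state}")
```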
[09:16:09] jayme: oh, I have a theory
[09:16:13] So it might remove the ip from iptables when merging this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034976/1/hieradata/common/kubernetes.yaml and the new name does not have an ip at that point
[09:16:28] with the legacy resolution, the resolve() function of ferm does the DNS lookup
[09:16:41] and that only performs a re-lookup if a Puppet resource changes
[09:16:48] when dns starts working, ferm is not refreshed because the file did not change
[09:16:56] XioNoX: well, moss-fe1002 is in 10.64.48.0/22, for example (which I think isn't in any of those ranges in public_network)
[09:17:09] with the new srange() parameter it is resolved on the Puppet server side with every Puppet run
[09:17:12] jayme: yeah exactly!
[09:17:29] XioNoX: restarted ferm where required, calico came up, BGP sessions established
[09:17:45] jayme: so maybe the solution is to add the new name without removing the old one
[09:18:05] and then remove the old one when the host is being re-imaged?
[09:18:09] yeah, probably
[09:18:58] maybe relaxing the ferm rule would also be possible mid-term... that would spare a puppet run on all nodes of the cluster when adding new ones, I think
[09:19:15] Emperor: that's one of eqiad's row D vlans, so it's kind of random depending on server placement, not depending on actual server role
[09:20:00] Emperor: if you had a server in each rack we would have to list all of our private ranges, so close to 10/8
[09:20:38] jayme: is there some kind of auth on the service listening on that port?
[09:21:00] XioNoX: I need to double check. I think nowadays calico can do mTLS
[09:21:58] I'm leaving it to the security people, but that sounds like a good plan
[09:25:20] XioNoX: Hm, yeah, I should probably save myself the hassle and just say 10/8 :)
[09:25:58] Emperor: then what about IPv6? :)
[09:26:52] and is there ferm filtering as well?
[09:28:00] maybe it could be a 0/0 in Ceph, but the filtering done in ferm itself, to leverage the automation we already have and keep it simpler? (just some ideas, I don't know the full setup)
[09:31:31] XioNoX: created https://phabricator.wikimedia.org/T365687 for that
[09:32:06] cool, making progress!
[09:32:19] I didn't know about srange()!
[09:32:20] XioNoX: unfortunately I gtg in a bit so I can't really continue (availability will be spotty during commute)
[09:32:55] node is still cordoned and not pooled, though. So maybe someone else from my team can pick the work up with you
[09:33:22] no pb! hnowlan let me know if/when you want to resume that
[09:34:40] or maybe akosiaris - if he has bandwidth
[09:44:23] I can pick the pool/cordon stuff up in a few minutes
[09:46:48] hnowlan: we were planning on testing the vlan move as well. But it's fine if you don't have time. I can pick that up after vacation then
[09:46:59] gtg - cheers
[10:39:24] effie: Is it OK to merge 9c7deac003 ?
[10:39:44] Or feel free to merge mine 7f8b802394
[10:39:49] btullis: is that the one on regex.yaml ?
[10:39:59] Yes
[10:40:03] yes please go ahead
[10:40:17] Ack, done. Thanks.
[10:42:25] XioNoX: just catching up - the plan is to change the vlan and reimage wikikube-worker2001 again, right?
[10:58:56] fabfur: any objections to merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034108 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034514 ?
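Coming back to Emperor's facts-to-CIDR question from earlier: the canned answer found above is wmflib's mask2cidr, and the conversion itself is just combining the network and netmask facts. A small Python illustration with the standard ipaddress module (the values and fact paths are examples, not taken from a real host):

```python
# Turn separate network/netmask values (as exposed by the networking facts)
# into the x.x.x.x/y notation needed for e.g. ceph's public_network setting.
import ipaddress

network = "10.64.16.0"     # example value (e.g. from the networking facts)
netmask = "255.255.252.0"  # example value

net = ipaddress.ip_network(f"{network}/{netmask}")
print(net.with_prefixlen)  # -> 10.64.16.0/22
```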
[10:59:17] hello, let me check
[11:01:39] cheers :)
[11:11:01] looks good to me
[11:11:25] not the greatest expert in this|lua
[11:13:40] fabfur: well, I was planning to merge it :)
[11:13:51] ok to merge those two?
[11:14:09] ok
[11:30:27] hnowlan: yep exactly
[11:45:11] in theory it's just about running a cookbook :)
[11:51:25] eoghan, XioNoX: time to depool upload@esams before enabling IPIP encapsulation
[11:51:36] Good luck!
[11:52:56] last one!
[11:59:10] XioNoX: hopefully :)
[12:22:18] hnowlan: oh, before I forget, we should also ask DCops to update the physical label on the server
[12:51:01] XioNoX: awesome work on the cookbook, great to see a successful test :)
[12:51:16] nice work all!
[12:51:39] though I am so used to not renaming hosts anymore that it will take time for me to actually believe it is possible :P
[12:56:36] hopefully it won't be used often, other than for the mw hosts of course :)
[13:19:00] XioNoX: sounds good! I have depooled and cordoned it
[13:31:17] Filed T365712 for relabelling
[13:31:18] T365712: Relabel codfw Kubernetes hosts - https://phabricator.wikimedia.org/T365712
[13:31:24] (also a related one for eqiad)
[14:13:21] hnowlan: just saw your message, does that mean it's ok to proceed with testing the move vlan cookbook? maybe we can schedule it for tomorrow morning?
[14:14:48] XioNoX: yep! sure, sounds good
[14:33:51] XioNoX, hnowlan: when you run the next test, can you leave a note wrt https://phabricator.wikimedia.org/T365687#9825949 please?
[14:41:24] moritzm: ack, thanks
[14:42:57] heads up eoghan, XioNoX, arnoldokoth and others - in like 15-20 minutes I'm going to try to roll out new certs for sessionstore in codfw and then eqiad. This uses the old puppet-based ecdsa script which hasn't been used in a long time afaik
[14:43:29] testing in staging has been okay so far, but sessionstore is obviously pretty critical and can be a bit tricky, so the risk level is medium
[14:43:39] more info in https://phabricator.wikimedia.org/T363996
[14:43:55] if it can happen after 17min that would be perfect, that's when my oncall shift ends :)
[14:44:24] heh, it'll be about that :P
[15:01:48] proceeding with the sessionstore changes in private
[15:18:04] deploying in codfw
[15:21:26] key looks okay, no errors on sessionstore or session loss that I can see
[15:21:48] rps is up as a result of the deploy, waiting until the graphs calm down
[15:25:07] so riddle me this - wtf does echostore do? We have zero docs on it anywhere :D I know it's just another kask but what does it affect?
[15:25:41] proceeding with eqiad sessionstore
[15:28:27] yeah it's not super-documented, but I assume it's either the main tables or just the unread-tracking db for: https://www.mediawiki.org/wiki/Extension:Echo
[15:29:22] eqiad is done, traffic graphs are wacky but no errors and no session loss
[15:32:52] looks like https://phabricator.wikimedia.org/T234286 was the original creation task for echostore, but it conveniently doesn't use that name.
[15:33:30] echostore also doesn't have a cert in puppet yet it uses one to serve traffic 🫠
[15:34:04] * jhathaway impressed by AntiComposite's sleuthing
[15:34:51] anyway, the echostore cert expires in October. Hopefully we will be using the mesh by then and won't have to worry about that
[15:43:55] dcaro: there's a pending puppet/private change related to gitlab. Is that safe to deploy?
[15:44:19] (it seems to be, but I'd rather ask)
[15:45:01] brouberol: yes :0
[15:45:04] * :)
[15:45:10] gotcha, thanks
[16:08:03] thanks!
[16:41:17] hnowlan: -operations is too noisy with all that
[16:41:33] hnowlan: 17:37:33 !log destroying all blubberoid deployments as part of its decommissioning (T318289)
[16:41:33] T318289: Deprecate Blubber's CLI and microservice (blubberoid) interfaces - https://phabricator.wikimedia.org/T318289
[16:41:50] ack, thanks
[20:50:09] if a scheduled maintenance is like "BGP session may flap several times" but that's it, no point in adding it to the calendar I assume?
[20:51:18] I think a flapping bgp session is probably worth adding, in case something goes wrong and it does more than flap
[20:51:47] Does anyone know why we have ip4:74.121.51.111 in our spf record for wikimedia.org?
[20:52:54] ok
[20:54:02] jhathaway: it's related to fundraising tech, donatewiki
[20:54:18] you will see when you search Phab for the string "mkt4477"
[20:54:46] links.email.donate.wikimedia.org
[20:55:21] (or maybe "was" related)
[20:55:53] interesting, do they still send mail for us?
[20:56:58] probably not, I asked in 2016 if we can remove that from DNS, but gotta ask fr-tech to be sure
[20:57:16] nod, thanks
[21:07:48] jhathaway: i believe that is the ip related to what is now branded Acoustic, which sends out our mass donor mailings.
[21:07:53] i can verify.
[21:10:15] thanks dwisehaupt, that would be great
[21:11:24] yeah, verified that is in their network space and we definitely still use them.
[21:11:53] looks like it's traced back to T80999
[21:13:31] releasing that ticket to the public, nothing secret in it but NDAed simply because it was RT imported
[21:13:58] I'm surprised it's not an include, something like include:_spf.acoustic.com or similar
[21:14:10] but I suppose they have had the same IP for a long time
[21:14:44] yeah, that i don't know.
[21:15:19] i'll bring it up in our standup soon and see if we can verify it's still what we want.
[21:16:12] thanks
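For checking what the SPF record currently publishes (and whether that ip4 mechanism is still listed), a quick sketch assuming dnspython is installed (`pip install dnspython`):

```python
# List the v=spf1 mechanisms published for a zone and check whether a
# specific ip4 mechanism is still among them.
import dns.resolver

DOMAIN = "wikimedia.org"
MECHANISM = "ip4:74.121.51.111"

for rdata in dns.resolver.resolve(DOMAIN, "TXT"):
    txt = b"".join(rdata.strings).decode()
    if txt.startswith("v=spf1"):
        mechanisms = txt.split()[1:]
        print(f"SPF mechanisms for {DOMAIN}:")
        for mech in mechanisms:
            print(" ", mech)
        print(f"{MECHANISM} listed:", MECHANISM in mechanisms)
```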