[00:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:47:34] Netbox's dependencies upgraded in prod too
[07:59:14] thx!
[08:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:55] XioNoX: could you please "clear bgp neighbor 10.64.16.55" on cr1-eqiad.wikimedia.org and cr2-eqiad.wikimedia.org ?
[08:56:08] jayme: done
[08:56:13] <3
[08:56:14] jayme: what's going on ? :)
[08:56:36] change of ip block size in staging-eqiad
[08:57:19] seems like it was not related to changing the blocksize to /30 ...maybe the routers get upset for some other reason
[08:57:30] (as I went straight from /26 to /28 this time)
[09:47:15] moritzm: I think magru might be a good occasion to have someone else but you set up the ganeti cluster for better knowledge sharing ;)
[09:49:37] nah. not really, next clusters will surely use routed ganeti anyway
[09:50:38] if you want to keep all the suffering to yourself I'll not stop you :D
[09:50:49] but you're right, things will most likely change
[09:52:56] on the matter of magru; shall we install the install7001 with Bookworm? we'll need to move these to Bookworm anyway at some point, so we can simply use the chance of the new setup?
[09:54:02] what's the catch?
[09:54:15] ^ asking the real questions
[09:56:21] don't expect any issues, but you never know
[09:56:27] isc-dhcp didn't change a lot
[09:56:32] 4.4.1->4.4.3
[09:56:52] squid moved from 4 to 5, but we hardly use any features
[09:58:03] atftpd moved from 0.7 to 0.8, but wouldn't expect any major changes either
[09:58:07] +1 on using bookworm then
[09:58:30] agree +1
[09:58:50] we can always fall back to another install server in case of issues
[09:58:55] (fyi, cr1/2-magru upgraded to latest Juniper recommended version)
[10:00:28] so we expect new and fancy bugs?
[10:06:28] sgtm, using bookworm for install7001, then
[10:25:38] XioNoX: nice! weird little issue but glad it didn't turn out to be anything too annoying
[10:30:02] jayme: ping me if you want to discuss / look at that K8s prefix size / limit thing
[10:30:22] that host tripped our limit of 50 prefixes earlier on (it sent 51)
[10:30:57] topranks: thanks, X.ioNoX already gave some background. I think this was a temporary problem during my migration to smaller ip block sizes as calico announced a bunch of /32 prefixes
[10:31:02] currently it's sending 5 x /28, even with /30 size networks that would be covered by 48 of them, so it's not clear to me why it was exceeded
[10:31:31] there are some advantages with /32s as we don't need to do longest-prefix match on them and TCAM can be used differently
[10:31:41] is there a task on this or wider background on what's going on?
[11:19:57] topranks: any news on the last remaining transit? :)
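(An aside on the prefix-limit / "clear bgp neighbor" exchange above: a rough sketch of the Junos operational commands involved. Only the neighbor IP 10.64.16.55 is taken from the log; the rest is generic CLI, run on cr1/cr2-eqiad.)

    show bgp summary                              # per-neighbor session state at a glance
    show bgp neighbor 10.64.16.55                 # details, incl. received-prefix count vs. any configured prefix limit
    show route receive-protocol bgp 10.64.16.55   # the prefixes the calico node is actually announcing
    clear bgp neighbor 10.64.16.55                # reset the session once the announcements look sane again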
[11:20:17] mostly since we have to update comms at some point tomorrow
[11:20:55] no hopefully they'll come back today
[11:20:59] ok thanks!
[11:21:07] yesterday was a holiday in Brazil so I didn't chase them then, Arzhel was on this morning
[11:21:21] the second transit is also proving to be less than ideal, in terms of their upstream connectivity
[11:21:32] ah yeah, it was a day off there
[11:21:32] we definitely need the 3 of them in place before we go live for that reason
[11:21:42] oh
[11:21:58] the one that's not working is just some odd arp issue, guaranteed it's a line of config wrong or something simple
[11:22:04] and so where are we on the third? has the process started or will it start now? (again, simply for the communication perspective)
[11:22:04] so hopefully they find and resolve quick
[11:22:11] topranks: note that their support or turn up team seems to be in Peru
[11:22:12] but that said it's been the case for a week or so now :(
[11:22:24] XioNoX: ok good to know
[11:22:38] and there is no RX light on that link since 2 days ago
[11:22:55] so there is change but not in the good direction
[11:23:03] sukhe: you should ask Arzhel these questions I'm clearly way out of date here :D
[11:23:04] XioNoX: :]
[11:23:08] ok thanks
[11:23:13] yeah that's not good at all
[11:23:20] yeah I forgot he is back today
[11:23:38] someone messing with the xconnect to solve what _looks_ like a logical issue
[11:23:59] For momentum/novacore, we should maybe see with Willy if there is any way to ditch them sooner
[11:24:03] don't believe our side would have shown 'up' if the physical path only worked on one strand. anyway we'll see where it lands
[11:24:16] yeah they are not very inspiring
[11:26:14] I pulled some quick stats from the routes we see from them - there are 1400 or so ASNs they seem to be peered with
[11:26:58] but how much diversity that adds isn't clear
[11:27:09] peeringdb lists them as being at Equinix Sao Paolo, so hopefully some
[11:28:16] peeringdb also shows they need 100k IPv4 prefixes, making them one of the larger ISPs on the planet :(
[11:29:28] or one of the less capable :)
[11:29:34] and it's a mistake
[11:30:05] indeed
[11:33:16] sukhe: I got distracted, re: your comment on that dns patch
[11:33:37] I guess you are saying it's safe to add the prometheus CNAME now?
[11:35:07] sorry I meant for later
[11:35:15] if you merge this now, no
[11:35:30] ok so I'll go ahead with patch as it is then
[11:35:40] but what I meant and failed to say was that you can do
[11:35:54] ; prometheus7001 instead of todo
[11:35:56] oh I can have the proper thing in the comment
[11:35:58] sure
[11:36:05] yep
[11:36:56] moritzm: can you ping me when we're ready to create VMs in magru? So I can take care of netflow
[11:38:18] will do, once the magru DNS include patch is merged, I'll create the first master node
[11:38:26] sukhe: if that looks good for a +1 I'll merge and unblock the ganeti stuff
[11:39:03] +1
[11:39:21] thanks :)
[11:48:23] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:12] XioNoX: first node is live, second should be joining in a bit. I'll add netflow7001 with role::insetup to test VM creation, then you can turn it into a proper netflow node
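(On the "no RX light on that link" comment at 11:22: the usual way to confirm that from the router side is to read the optics diagnostics. A generic Junos sketch; the interface name is a placeholder, not the actual magru transit port.)

    show interfaces diagnostics optics et-0/0/0
    # compare "Laser output power" (what we transmit) with
    # "Receiver signal average optical power" (what comes back from the far end)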
[11:51:56] moritzm: feel free to add it directly with the netflow role
[11:52:04] netinsight I think
[11:54:11] ok
[12:08:24] * volans lunch
[12:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:19] 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9763711 (10MoritzMuehlenhoff)
[13:20:25] FIRING: [4x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:22:02] netbox_ganeti_magru01_sync.service is failing on netbox1002 with "RuntimeError: Cluster group magru01 does not exist. It must be created on Netbox before running this script"
[13:22:30] I don't remember that this was necessary so far and it's also not covered in the existing ganeti docs, did something change there?
[13:24:14] let me check
[13:24:30] cheers
[13:26:38] yeah I think it needs to be manually created on netbox and associated with the VIP, let me do that
[13:27:31] done, forcing a run of netbox_ganeti_magru01_sync.service
[13:27:46] it should be magru01, though?
[13:27:56] we'll have two clusters
[13:28:26] there is only 1 ip though
[13:28:26] https://netbox.wikimedia.org/search/?q=svc.magru&obj_type=
[13:29:31] yeah, the second one will be created when the second cluster is setup
[13:29:35] Summary of performed actions: Counter({'Clusters created': 1, 'Nodes added': 1})
[13:29:38] ok
[13:29:41] done for magru01
[13:30:19] https://netbox.wikimedia.org/virtualization/cluster-groups/104/
[13:30:35] automatically one cluster called B3 was created
[13:30:47] with one node assigned (ganeti7001)
[13:31:58] ack, also added the steps in https://wikitech.wikimedia.org/wiki/Ganeti#Add_the_Ganeti_group_in_Netbox
[13:35:25] FIRING: [4x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:45] ^ this is me, I'll fix
[13:45:07] topranks: flapping BGP session for dns700x.
[13:45:10] sessions
[13:45:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:26] wait though hmm
[13:45:30] > Not accepting/receiving prefixes from anycast BGP peer
[13:45:44] this means that's because we are not advertising the prefixes though right?
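(For the record, the missing cluster group could also be created against the Netbox REST API instead of the UI. A minimal sketch: the URL is the production Netbox from the log, the token is a placeholder, and any extra fields the ganeti sync script may expect are not covered here.)

    # hypothetical: create the "magru01" cluster group via the Netbox API
    curl -s -X POST https://netbox.wikimedia.org/api/virtualization/cluster-groups/ \
      -H "Authorization: Token $NETBOX_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"name": "magru01", "slug": "magru01"}'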
[13:45:49] but why did it trigger now I wonder
[13:45:57] nah they've been up for 17 hours no flaps
[13:45:58] sukhe: it has been triggering for a while
[13:46:23] oh probably since I added the peerings yesterday
[13:46:40] that's fine I guess, I figured we're not alerting on magru, makes sense we see that if no routes announced
[13:46:49] yeah sorry for the noise
[13:46:52] looks good and adds up
[13:46:56] np!
[13:47:03] just wanted to make sure it's not a repeat of the weird bird issues we once saw
[13:47:08] that self resolved so who knows :)
[14:18:58] moritzm: hahahah
[14:18:59] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026577
[14:19:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026573
[14:30:18] XioNoX: topranks: can you please let me know if there is an update on the transits or point me to where I can follow said updates
[14:30:23] context is the same Comms thing above :)
[14:31:40] sukhe: no news so far, we can message you when we hear more for either of the two
[14:31:57] thanks!
[14:32:34] XioNoX: haha, all great minds think alike!
[14:32:35] I will stop asking you now but yeah, I wanted to say that if you hear something just let me know
[14:33:13] we are still waiting on the research question so that's another blocker
[14:35:06] sukhe: will do! We're pushing both providers to give us some progress
[14:41:46] I have to step out to take my Mam to the doctors, back in an hour or two
[14:49:32] sukhe: the magru01 Ganeti cluster is set up, you can go ahead and create one durum, doh and ncredir node there
[14:49:42] magru02 should be ready tomorrow
[14:49:55] moritzm: thanks!
[14:59:31] XioNoX: netflow7001 is up, I'll merge the patch for the kafka brokers/ferm
[14:59:43] moritzm: sweet!
[15:13:23] puppet fails to install gnmic on netflow7001, does that ring a bell?
[15:13:56] it's installed on e.g. 6001, but seems missing on apt.w.o
[15:14:12] ah, right: https://phabricator.wikimedia.org/T347461
[15:15:07] :)
[15:15:19] do you still have the deb used on the current hosts around, so that we can install it on 7001 as well?
[15:15:45] moritzm: you can use the most recent one from upstream https://github.com/openconfig/gnmic/releases/download/v0.36.2/gnmic_0.36.2_Linux_x86_64.deb
[15:17:17] ack, that's also what we use on e.g. 6001; installed
[15:22:27] moritzm: we have no unified task for VM commissioning in magru right?
[15:22:30] I can create one but asking
[15:25:36] we don't yet, but it would be great to have one :-)
[15:25:41] https://phabricator.wikimedia.org/T364016
[15:27:51] I'll add a proposed distribution of VMs to clusters
[15:31:38] I've updated the task with a proposed list of hostnames and how to spread them over the two clusters, let me know if it looks good
[15:32:26] ok great thanks!
[15:32:35] so group would be b3 for the first and b4 for the second
[15:32:37] moritzm: why install7002 ?
[15:32:58] XioNoX: do you mean why not 7001?
[15:33:04] oh, copy paste I guess it also says drmrs
[15:33:16] sukhe: yeah
[15:33:26] same for bast or prometheus
[15:33:27] some emacs magic gone wrong, fixed the hostnames
[15:35:49] let me know if I should create the magru02 cluster group in netbox
[15:36:45] would that cause any issues if the underlying ganeti cluster isn't set up yet?
[15:37:03] if so, please go ahead, otherwise I'd do it tomorrow when the cluster is up
[15:37:23] that's a 1M$ question, but until you deploy the config for the ganeti-netbox sync for it
[15:37:30] I don't think it should create any issue
[15:37:45] ok, let's just do it, then. worst case some sync timer will fail until tomorrow
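(For reference, installing that upstream gnmic .deb by hand looks roughly like this; the URL is the one pasted at 15:15, and running it on netflow7001 with sudo is assumed rather than quoted from the log.)

    # fetch the upstream package and install it via apt so dependencies get resolved
    wget https://github.com/openconfig/gnmic/releases/download/v0.36.2/gnmic_0.36.2_Linux_x86_64.deb
    sudo apt install ./gnmic_0.36.2_Linux_x86_64.deb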
[15:38:22] yep
[15:38:37] {done} https://netbox.wikimedia.org/virtualization/cluster-groups/105/
[15:39:57] cheers
[16:09:00] volans: around?
[16:09:06] yes
[16:09:16] so dns7x are in an outdated state with netbox commits and I wanted to bring them up to date
[16:09:21] sudo cookbook sre.dns.netbox "force update dns7x" --force e4ecebd8e04119d710db5039a5b6678939f55d95
[16:09:28] any concerns with me running this?
[16:09:39] sukhe: reply from Telxius saying they're working on it
[16:09:48] XioNoX: thanks!
[16:09:52] (transit)
[16:09:55] yeah
[16:10:45] sukhe: e4ecebd8e04119d710db5039a5b6678939f55d95 is the last commit, go ahead
[16:10:49] volans: basically, dns7x were not pooled and so didn't get the authdns-update for a while. and now, the magru one is leading to an outdated zone file
[16:10:52] thanks for confirming!
[16:11:41] it should be a noop on the other hosts due to gdnsd being smart IIRC
[16:12:02] but might reload everywhere anyway
[16:12:07] yeah, I think so as well
[16:12:09] I don't recall the fine detail sorry
[16:12:25] np, I will verify everything but I wanted to check with you about --force
[16:13:40] it just sets the sha1 to the given one in case there are no changes
[16:13:48] nothing else changes in the cookbook execution
[16:14:08] instead of bailing out allows to continue the deploy of that specific sha1
[16:14:25] sukhe@cp7012:~$ dig ganeti01.svc.magru.wmnet +nsid +noanswer +all
[16:14:48] er +noall but
[16:14:51] ; NSID: 64 6e 73 37 30 30 32 ("dns7002")
[16:14:54] this was missing, so all is good now :)
[16:15:05] yay
[16:15:14] thanks for reviewing!
[16:15:27] anytime :)
[20:09:04] 10netops, 06Infrastructure-Foundations, 06SRE, 10MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), 13Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9765219 (10CDanis) 05Open→03Resolved
[20:09:46] 10netops, 06Infrastructure-Foundations, 06SRE, 10MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), 13Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9765222 (10CDanis)
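(A quick way to reproduce the NSID check at 16:14 from any host: dig with +nsid asks the responding server to include its NSID, which here decodes to the hostname of the DNS box that answered. Generic sketch; only the queried name and the sample output line are taken from the log.)

    dig +nsid ganeti01.svc.magru.wmnet | grep -i nsid
    # expected to show something like:
    # ; NSID: 64 6e 73 37 30 30 32 ("dns7002")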