[00:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:47:34] Netbox's dependencies upgraded in prod too
[07:59:14] thx!
[08:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:55] XioNoX: could you please "clear bgp neighbor 10.64.16.55" on cr1-eqiad.wikimedia.org and cr2-eqiad.wikimedia.org ?
[08:56:08] jayme: done
[08:56:13] <3
[08:56:14] jayme: what's going on ? :)
[08:56:36] change of ip block size in staging-eqiad
[08:57:19] seems like it was not related to changing the blocksize to /30 ...maybe the routers get upset for some other reason
[08:57:30] (as I went straight from /26 to /28 this time)
[09:47:15] moritzm: I think magru might be a good occasion to have someone else but you set up the ganeti cluster for better knowledge sharing ;)
[09:49:37] nah. not really, next clusters will surely use routed ganeti anyway
[09:50:38] if you want to keep all the suffering to yourself I'll not stop you :D
[09:50:49] but you're right, things will most likely change
[09:52:56] on the matter of magru; shall we install the install7001 with Bookworm? we'll need to move these to Bookworm anyway at some point, so we can simply use the chance of the new setup?
[09:54:02] what's the catch?
[09:54:15] ^ asking the real questions
[09:56:21] don't expect any issues, but you never know
[09:56:27] isc-dhcp didn't change a lot
[09:56:32] 4.4.1->4.4.3
[09:56:52] squid moved from 4 to 5, but we hardly use any features
[09:58:03] atftpd moved from 0.7 to 0.8, but wouldn't expect any major changes either
[09:58:07] +1 on using bookworm then
[09:58:30] agree +1
[09:58:50] we can always fall back to another install server in case of issues
[09:58:55] (fyi, cr1/2-magru upgraded to latest Juniper recommended version)
[10:00:28] so we expect new and fancy bugs?
[10:06:28] sgtm, using bookworm for install7001, then
[10:25:38] XioNoX: nice! weird little issue but glad it didn't turn out to be anything too annoying
[10:30:02] jayme: ping me if you want to discuss / look at that K8s prefix size / limit thing
[10:30:22] that host tripped our limit of 50 prefixes earlier on (it sent 51)
[10:30:57] topranks: thanks, X.ioNoX already gave some background. I think this was a temporary problem during my migration to smaller ip block sizes as calico announced a bunch of /32 prefixes
[10:31:02] currently it's sending 5 x /28, even with /30 size networks that would be covered by 48 of them, so it's not clear to me why it was exceeded
[10:31:31] there are some advantages with /32s as we don't need to do longest-prefix match on them and TCAM can be used differently
[10:31:41] is there a task on this or wider background on what's going on?
[11:19:57] topranks: any news on the last remaining transit? :)
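(An aside on the prefix-limit / "clear bgp neighbor" exchange above: a rough sketch of the Junos operational commands involved. Only the neighbor IP 10.64.16.55 is taken from the log; the rest is generic CLI, run on cr1/cr2-eqiad.)

    show bgp summary                              # per-neighbor session state at a glance
    show bgp neighbor 10.64.16.55                 # details, incl. received-prefix count vs. any configured prefix limit
    show route receive-protocol bgp 10.64.16.55   # the prefixes the calico node is actually announcing
    clear bgp neighbor 10.64.16.55                # reset the session once the announcements look sane again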
[11:20:17] mostly since we have to update comms at some point tomorrow
[11:20:55] no hopefully they'll come back today
[11:20:59] ok thanks!
[11:21:07] yesterday was a holiday in Brazil so I didn't chase them then, Arzhel was on this morning
[11:21:21] the second transit is also proving to be less than ideal, in terms of their upstream connectivity
[11:21:32] ah yeah, it was a day off there
[11:21:32] we definitely need the 3 of them in place before we go live for that reason
[11:21:42] oh
[11:21:58] the one that's not working is just some odd arp issue, guaranteed it's a line of config wrong or something simple
[11:22:04] and so where are we on the third? has the process started or will it start now? (again, simply for the communication perspective)
[11:22:04] so hopefully they find and resolve quick
[11:22:11] topranks: note that their support or turn up team seems to be in Peru
[11:22:12] but that said it's been the case for a week or so now :(
[11:22:24] XioNoX: ok good to know
[11:22:38] and there is no RX light on that link since 2 days ago
[11:22:55] so there is change but not in the good direction
[11:23:03] sukhe: you should ask Arzhel these questions I'm clearly way out of date here :D
[11:23:04] XioNoX: :]
[11:23:08] ok thanks
[11:23:13] yeah that's not good at all
[11:23:20] yeah I forgot he is back today
[11:23:38] someone messing with the xconnect to solve what _looks_ like a logical issue
[11:23:59] For momentum/novacore, we should maybe see with Willy if there is any way to ditch them sooner
[11:24:03] don't believe our side would have shown 'up' if the physical path only worked on one strand. anyway we'll see where it lands
[11:24:16] yeah they are not very inspiring
[11:26:14] I pulled some quick stats from the routes we see from them - there are 1400 or so ASNs they seem to be peered with
[11:26:58] but how much diversity that adds isn't clear
[11:27:09] peeringdb lists them as being at Equinix Sao Paolo, so hopefully some
[11:28:16] peeringdb also shows they need 100k IPv4 prefixes, making them one of the larger ISPs on the planet :(
[11:29:28] or one of the less capable :)
[11:29:34] and it's a mistake
[11:30:05] indeed
[11:33:16] sukhe: I got distracted, re: your comment on that dns patch
[11:33:37] I guess you are saying it's safe to add the prometheus CNAME now?
[11:35:07] sorry I meant for later
[11:35:15] if you merge this now, no
[11:35:30] ok so I'll go ahead with patch as it is then
[11:35:40] but what I meant and failed to say was that you can do
[11:35:54] ; prometheus7001 instead of todo
[11:35:56] oh I can have the proper thing in the comment
[11:35:58] sure
[11:36:05] yep
[11:36:56] moritzm: can you ping me when we're ready to create VMs in magru? So I can take care of netflow
[11:38:18] will do, once the magru DNS include patch is merged, I'll create the first master node
[11:38:26] sukhe: if that looks good for a +1 I'll merge and unblock the ganeti stuff
[11:39:03] +1
[11:39:21] thanks :)
[11:48:23] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:12] XioNoX: first node is live, second should be joining in a bit. I'll add netflow7001 with role::insetup to test VM creation, then you can turn it into a proper netflow node
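(On the "no RX light on that link" comment at 11:22: the usual way to confirm that from the router side is to read the optics diagnostics. A generic Junos sketch; the interface name is a placeholder, not the actual magru transit port.)

    show interfaces diagnostics optics et-0/0/0
    # compare "Laser output power" (what we transmit) with
    # "Receiver signal average optical power" (what comes back from the far end)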
[11:51:56] moritzm: feel free to add it directly with the netflow role
[11:52:04] netinsight I think
[11:54:11] ok
[12:08:24] * volans lunch
[12:10:25] FIRING: [3x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:19] 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9763711 (10MoritzMuehlenhoff)
[13:20:25] FIRING: [4x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:22:02] netbox_ganeti_magru01_sync.service is failing on netbox1002 with "RuntimeError: Cluster group magru01 does not exist. It must be created on Netbox before running this script"
[13:22:30] I don't remember that this was necessary so far and it's also not covered in the existing ganeti docs, did something change there?
[13:24:14] let me check
[13:24:30] cheers
[13:26:38] yeah I think it needs to be manually created on netbox and associated with the VIP, let me do that
[13:27:31] done, forcing a run of netbox_ganeti_magru01_sync.service
[13:27:46] it should be magru01, though?
[13:27:56] we'll have two clusters
[13:28:26] there is only 1 ip though
[13:28:26] https://netbox.wikimedia.org/search/?q=svc.magru&obj_type=
[13:29:31] yeah, the second one will be created when the second cluster is setup
[13:29:35] Summary of performed actions: Counter({'Clusters created': 1, 'Nodes added': 1})
[13:29:38] ok
[13:29:41] done for magru01
[13:30:19] https://netbox.wikimedia.org/virtualization/cluster-groups/104/
[13:30:35] automatically one cluster called B3 was created
[13:30:47] with one node assigned (ganeti7001)
[13:31:58] ack, also added the steps in https://wikitech.wikimedia.org/wiki/Ganeti#Add_the_Ganeti_group_in_Netbox
[13:35:25] FIRING: [4x] SystemdUnitFailed: prometheus-postfix-exporter.service on mx-out1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:45] ^ this is me, I'll fix
[13:45:07] topranks: flapping BGP session for dns700x.
[13:45:10] sessions
[13:45:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:26] wait though hmm
[13:45:30] > Not accepting/receiving prefixes from anycast BGP peer
[13:45:44] this means that's because we are not advertising the prefixes though right?
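(For the record, the missing cluster group could also be created against the Netbox REST API instead of the UI. A minimal sketch: the URL is the production Netbox from the log, the token is a placeholder, and any extra fields the ganeti sync script may expect are not covered here.)

    # hypothetical: create the "magru01" cluster group via the Netbox API
    curl -s -X POST https://netbox.wikimedia.org/api/virtualization/cluster-groups/ \
      -H "Authorization: Token $NETBOX_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"name": "magru01", "slug": "magru01"}'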
[13:45:49] but why did it trigger now I wonder
[13:45:57] nah they've been up for 17 hours no flaps
[13:45:58] sukhe: it has been triggering for a while
[13:46:23] oh probably since I added the peerings yesterday
[13:46:40] that's fine I guess, I figured we're not alerting on magru, makes sense we see that if no routes announced
[13:46:49] yeah sorry for the noise
[13:46:52] looks good and adds up
[13:46:56] np!
[13:47:03] just wanted to make sure it's not a repeat of the weird bird issues we once saw
[13:47:08] that self resolved so who knows :)
[14:18:58] moritzm: hahahah
[14:18:59] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026577
[14:19:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026573
[14:30:18] XioNoX: topranks: can you please let me know if there is an update on the transits or point me to where I can follow said updates
[14:30:23] context is the same Comms thing above :)
[14:31:40] sukhe: no news so far, we can message you when we hear more for either of the two
[14:31:57] thanks!
[14:32:34] XioNoX: haha, all great minds think alike!
[14:32:35] I will stop asking you now but yeah, I wanted to say that if you hear something just let me know
[14:33:13] we are still waiting on the research question so that's another blocker
[14:35:06] sukhe: will do! We're pushing both providers to give us some progress
[14:41:46] I have to step out to take my Mam to the doctors, back in an hour or two
[14:49:32] sukhe: the magru01 Ganeti cluster is set up, you can go ahead and create one durum, doh and ncredir node there
[14:49:42] magru02 should be ready tomorrow
[14:49:55] moritzm: thanks!
[14:59:31] XioNoX: netflow7001 is up, I'll merge the patch for the kafka brokers/ferm
[14:59:43] moritzm: sweet!
[15:13:23] puppet fails to install gnmic on netflow7001, does that ring a bell?
[15:13:56] it's installed on e.g. 6001, but seems missing on apt.w.o
[15:14:12] ah, right: https://phabricator.wikimedia.org/T347461
[15:15:07] :)
[15:15:19] do you still have the deb used on the current hosts around, so that we can install it on 7001 as well?
[15:15:45] moritzm: you can use the most recent one from upstream https://github.com/openconfig/gnmic/releases/download/v0.36.2/gnmic_0.36.2_Linux_x86_64.deb
[15:17:17] ack, that's also what we use on e.g. 6001; installed
[15:22:27] moritzm: we have no unified task for VM commissioning in magru right?
[15:22:30] I can create one but asking
[15:25:36] we don't yet, but it would be great to have one :-)
[15:25:41] https://phabricator.wikimedia.org/T364016
[15:27:51] I'll add a proposed distribution of VMs to clusters
[15:31:38] I've updated the task with a proposed list of hostnames and how to spread them over the two clusters, let me know if it looks good
[15:32:26] ok great thanks!
[15:32:35] so group would be b3 for the first and b4 for the second
[15:32:37] moritzm: why install7002 ?
[15:32:58] XioNoX: do you mean why not 7001?
[15:33:04] oh, copy paste I guess it also says drmrs
[15:33:16] sukhe: yeah
[15:33:26] same for bast or prometheus
[15:33:27] some emacs magic gone wrong, fixed the hostnames
[15:35:49] let me know if I should create the magru02 cluster group in netbox
[15:36:45] would that cause any issues if the underlying ganeti cluster isn't set up yet?
[15:37:03] if so, please go ahead, otherwise I'd do it tomorrow when the cluster is up
[15:37:23] that's a 1M$ question, but until you deploy the config for the ganeti-netbox sync for it
[15:37:30] I don't think it should create any issue
[15:37:45] ok, let's just do it, then. worst case some sync timer will fail until tomorrow
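(For reference, installing that upstream gnmic .deb by hand looks roughly like this; the URL is the one pasted at 15:15, and running it on netflow7001 with sudo is assumed rather than quoted from the log.)

    # fetch the upstream package and install it via apt so dependencies get resolved
    wget https://github.com/openconfig/gnmic/releases/download/v0.36.2/gnmic_0.36.2_Linux_x86_64.deb
    sudo apt install ./gnmic_0.36.2_Linux_x86_64.deb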
[15:38:22] yep
[15:38:37] {done} https://netbox.wikimedia.org/virtualization/cluster-groups/105/
[15:39:57] cheers
[16:09:00] volans: around?
[16:09:06] yes
[16:09:16] so dns7x are in an outdated state with netbox commits and I wanted to bring them up to date
[16:09:21] sudo cookbook sre.dns.netbox "force update dns7x" --force e4ecebd8e04119d710db5039a5b6678939f55d95
[16:09:28] any concerns with me running this?
[16:09:39] sukhe: reply from Telxius saying they're working on it
[16:09:48] XioNoX: thanks!
[16:09:52] (transit)
[16:09:55] yeah
[16:10:45] sukhe: e4ecebd8e04119d710db5039a5b6678939f55d95 is the last commit, go ahead
[16:10:49] volans: basically, dns7x were not pooled and so didn't get the authdns-update for a while. and now, the magru one is leading to an outdated zone file
[16:10:52] thanks for confirming!
[16:11:41] it should be a noop on the other hosts due to gdnsd being smart IIRC
[16:12:02] but might reload everywhere anyway
[16:12:07] yeah, I think so as well
[16:12:09] I don't recall the fine detail sorry
[16:12:25] np, I will verify everything but I wanted to check with you about --force
[16:13:40] it just sets the sha1 to the given one in case there are no changes
[16:13:48] nothing else changes in the cookbook execution
[16:14:08] instead of bailing out allows to continue the deploy of that specific sha1
[16:14:25] sukhe@cp7012:~$ dig ganeti01.svc.magru.wmnet +nsid +noanswer +all
[16:14:48] er +noall but
[16:14:51] ; NSID: 64 6e 73 37 30 30 32 ("dns7002")
[16:14:54] this was missing, so all is good now :)
[16:15:05] yay
[16:15:14] thanks for reviewing!
[16:15:27] anytime :)
[20:09:04] 10netops, 06Infrastructure-Foundations, 06SRE, 10MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), 13Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9765219 (10CDanis) 05Open→03Resolved
[20:09:46] 10netops, 06Infrastructure-Foundations, 06SRE, 10MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), 13Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9765222 (10CDanis)
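(A quick way to reproduce the NSID check at 16:14 from any host: dig with +nsid asks the responding server to include its NSID, which here decodes to the hostname of the DNS box that answered. Generic sketch; only the queried name and the sample output line are taken from the log.)

    dig +nsid ganeti01.svc.magru.wmnet | grep -i nsid
    # expected to show something like:
    # ; NSID: 64 6e 73 37 30 30 32 ("dns7002")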