[08:59:10] puppet-compiler, Infrastructure-Foundations, User-dcaro: PCC Remove .configs file support under worker.py - https://phabricator.wikimedia.org/T294541 (dcaro) In progress→Open
[08:59:24] puppet-compiler, Infrastructure-Foundations, Patch-For-Review, User-dcaro: PCC: add a less arbitrary success condition - https://phabricator.wikimedia.org/T295030 (dcaro) In progress→Open
[08:59:39] puppet-compiler, Infrastructure-Foundations, Patch-For-Review, User-dcaro: PCC: add automatic style checker (black + isort) - https://phabricator.wikimedia.org/T295063 (dcaro) In progress→Open
[09:53:38] SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, and 2 others: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jcrespo)
[09:54:02] SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, and 2 others: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jcrespo)
[10:08:18] netops, Infrastructure-Foundations: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (ayounsi) p:Triage→High
[10:09:19] netops, Infrastructure-Foundations, SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (ayounsi) Thanks for taking care of it. Proper fix is most likely T295672.
[10:15:51] SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, observability: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jcrespo) Open→Resolved a:jbond The patch + running puppet fixed the issue. I...
[10:43:46] volans: Are the instructions for creating a VM here still the best approach:
[10:43:48] https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM
[10:43:55] netops, Infrastructure-Foundations, SRE: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1001 for hosts: `rpki1001.eqiad.wmnet` - rpki1001.eqiad.wmnet (**PASS**) - Downtimed hos...
[10:44:21] i.e. running the sre.ganeti.makevm cookbook?
[10:44:25] topranks: yeah, those are still accurate
[10:44:56] yep ^^^
[10:44:59] moritzm: cool, seems easy enough, some other comments I seen confused me.
[10:45:03] thanks both!
[10:45:44] which ones? if there are still references to the old script (which predated the cookbook) we can remove them
[10:49:12] none on the wiki, was just some comments on irc I think I mis-interpreted
[10:51:57] moritzm: would now be a good time to run a fleet wide (buster+) debdeploy to upgrade python3-wmflib?
[10:52:37] volans: sure, go ahead :-)
[10:52:42] thx
[11:11:13] XioNoX, topranks: I know you have played on netbox-next for cable representation options, would you mind if I import a fresh clean DB backup from netbox prod? It would nuke any local data modification. If you need those I can find another way to test the import script patch
[11:11:26] +1 for me
[11:11:48] Fire ahead volans I think everyone got to see the example so it's fine to thrash it
[11:12:06] ack, wanna make a screenshot just in case before I nuke it topranks ?
[11:12:59] yeah actually not a bad idea.
[11:13:01] one sec
[11:13:20] take your time
[11:13:24] no hurry here
[11:13:29] ok fire away thanks :)
[11:16:20] thank you!
[11:48:18] moritzm: Not sure if something has gone wrong creating rpki1001 VM.
[11:48:33] The instance exists but when I connect to the console there is no output, have rebooted it but same thing.
[11:49:18] Ganeti says it's running and from what I can tell status looks ok
[11:49:46] perhaps it's just a matter of waiting and I'm being impatient.
[11:51:30] netops, Infrastructure-Foundations, SRE, Patch-For-Review: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (cmooney) Please ignore the above, unrelated CRs. I pasted the wrong task ID when doing the commit.
[11:54:55] you mean when you connect via "sudo gnt-instance console rpki1001.eqiad.wmnet" there's no output?
[11:55:03] yes
[11:58:16] interesting! that's the very same error I'm currently running into in the new ganeti-test* cluster
[11:59:01] if you have a look at the processes on ganeti1009 (where rpki1001 was created) you'll see a kvm-console-wrapper zombie process
[11:59:14] ok. The virtual console did "detach" when I rebooted the instance, so it kind of looks like the command is attaching to _something_
[11:59:29] and the socat command which would have connected to the instance froze
[12:00:06] ah ok yeah there are two of them there alright
[12:01:11] in a meeting now
[12:01:15] no rush on this
[12:24:16] netops, Infrastructure-Foundations, SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (cmooney) a:cmooney
[12:27:55] SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, observability: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jbond) @jcrespo To be clear there was nothing wrong with your config, this is something...
[12:38:43] netops, Infrastructure-Foundations, SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (cmooney) > My guess would be that this is Charter filtering traffic on their IXP port to only routers they have peerings with, for security/anti-DDoS reasons. > >...
[12:47:34] FYI I've re-imported the latest backup into netbox-next, feel free to edit things as needed like before :) I'll be testing a mass run of the puppetdb import script for all devices in the afternoon
[14:50:21] netops, Infrastructure-Foundations: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (ayounsi) p:Triage→Low
[14:54:37] netops, Infrastructure-Foundations, fundraising-tech-ops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (ayounsi) p:Triage→Low
[17:31:27] FYI (cc XioNoX, topranks): https://vincent.bernat.ch/en/blog/2021-source-of-truth-network
[17:31:50] thx
[17:31:55] that points to https://blog.networktocode.com/post/nautobots-rollback/ TIL fo rme
[17:31:58] *for me
[17:32:04] although it's from a few months ago
[17:33:38] yeah, it's one of the cool features of Nautobot
[17:33:51] yeah it definitely looks good, I seen it for the first time last week.
[17:34:09] btw on footnote 1 we're quoted :D
[17:34:18] s/quoted/mentioned/
[17:35:32] traceroute www.wikimedia.org
[17:36:34] jbond: ?
[17:36:42] that a bad paste or you suspect we have issues John?
[17:37:01] oh sorry bad paste :)
[17:37:15] jbond: Password:
[17:37:18] :)
[17:37:23] lol
[17:37:30] lol
[17:47:03] hunter2
[20:55:30] what's with the "BGP peer above prefix limit global" that's been alerting since Thursday?
[21:06:02] paravoid: IPv6 IX sessions have a cut-off of 4000, with an alerting threshold of 80%, looks like AS4230 is hitting that
[21:06:31] paravoid: it only emails peering@ though
[21:12:08] interesting, looks like it keeps sending us too many prefixes then gets kicked then goes back to normal
[21:13:57] oh, I see that topranks set a higher threshold not long ago
[21:14:01] all good then
[21:14:07] sry yeah was just coming here to update
[21:14:10] alert has not cleared.
[21:14:38] not?
[21:14:58] no, but maybe librenms just hasn't polled.. I've not timed exactly but it's been more than 5 mins I think
[21:14:58] the device logs are clean since
[21:15:33] I mean in LibreNMS / alertmanager it's still showing. Router looks good.
[21:16:38] yeah it will recover within 5 or 10min I guess
[21:17:07] yeah assume so
[21:18:19] it's gone now :)
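
Note on the prefix-limit alert discussed above: with the 4000-prefix cut-off and 80% alerting threshold quoted in the channel, the alert would start at 3200 received prefixes. The sketch below only illustrates that arithmetic. The function and constant names are hypothetical, and the exact comparison operators (alerting at or above the threshold, session dropped only above the cut-off) are assumptions, not the actual LibreNMS rule or router configuration.

```python
# Illustrative only: names are hypothetical; the two values are the ones
# quoted in the channel for IPv6 IX sessions.
PREFIX_LIMIT = 4000    # cut-off: number of prefixes accepted from a peer
ALERT_PERCENT = 80     # alerting threshold, as a percentage of the cut-off


def alert_threshold(limit: int = PREFIX_LIMIT, percent: int = ALERT_PERCENT) -> int:
    """Prefix count at which the alert would start firing (integer math)."""
    return limit * percent // 100


def classify(received: int, limit: int = PREFIX_LIMIT,
             percent: int = ALERT_PERCENT) -> str:
    """Classify a peer by received prefix count.

    Assumption: alerting starts at or above the threshold, and the peer is
    considered over the limit only when it exceeds the cut-off.
    """
    if received > limit:
        return "over-limit"
    if received >= alert_threshold(limit, percent):
        return "alerting"
    return "ok"


if __name__ == "__main__":
    print(alert_threshold())
    for count in (3000, 3200, 4001):
        print(count, classify(count))
```

A peer that hovers around the cut-off would move between "alerting" and "over-limit", which matches the pattern described in the channel of the peer sending too many prefixes, getting kicked, and then going back to normal.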