[00:55:23] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10216097 (10Papaul) We did phase 2 today, all the 1G nodes are now connected to the new fasw2-c8a/b. We will me moving the 10G nodes next we...
[02:52:30] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:25:00] <slyngs>	 moritzm: Related to my IDP failover?
[06:44:06] <moritzm>	 yeah, there was some confusion since idp.w.o wasn't actually failed over, when I pushed a revert dnsauth-update showed an empty diff, so probably yo merged the DNS change in git, but didn't deploy it yesterday :-)
[06:44:10] <moritzm>	 all resolved
[06:45:21] <moritzm>	 at some point we should have a cookbook for this, also to allow the sider SRE group to failover if needed in other situations
[06:45:30] <slyngs>	 dns1004 was out, the documentation said to use 2004 instead... And I'm fairly sure I ran the DNS update, because the cat was walking around the keyboard.
[06:47:05] <slyngs>	 Also I still have the output of dig, which says: idp1004 at 16:01
[06:49:00] <moritzm>	 might be, reading backscroll from -sre there was also this note later:
[06:49:02] <moritzm>	 [18:01] <sukhe> please hold off on running authdns-update or netbox DNS cookbook 
[06:49:03] <moritzm>	 [18:01] <sukhe> resolving a broken state on a DNS host that was rebooting and failed for some reason (that I will figure out later once I restore it)
[06:49:10] <moritzm>	 (times in CET)
[06:49:23] <slyngs>	 That makes sense then :-)
[06:49:44] <moritzm>	 any it's working now and we should have a cookbook anyway :-)
[06:51:29] <slyngs>	 We should :-)
[06:52:30] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:53:30] <slyngs>	 As for "why does it matter which IDP host your hiting?" I matters if the client hits idp1004 but the server thinks it's idp2004, then they don't have the same ticket. I would be interested in seeing if that would work when we have Redis in place.
[06:58:41] <moritzm>	 yeah, up to CAS 6.6 we had all memcached sessions relayed to both nodes by mcrouter so we only missed very little (just the packets sent while mcrouter can't receive during the reboot), but be the same again once redis support is live
[07:10:46] <elukey>	 o/
[07:11:04] <elukey>	 we can proceed with the irc.w.o move if you guys want
[07:12:15] <moritzm>	 we announced it for 8 UTC, let's wait until then
[07:14:38] <elukey>	 sure sure, it doesn't change much in my opinion
[07:16:42] <moritzm>	 let's wait 45m, just in case there is one community who members who's actually actively looking after their bot during the designated window
[07:17:09] <moritzm>	 very unlikely, but still :-)
[07:23:31] <elukey>	 :)
[07:47:22] <volans>	 rotfl
[08:01:06] <moritzm>	 I'm merging the patch to point irc.w.o to 1003 now
[08:03:01] <moritzm>	 and deployed to the DNS servers
[08:03:05] <moritzm>	 ttl is 5 mins
[08:03:36] <moritzm>	 there is one broken bot (flink-bot), which drops out of the channel 1-2 times per minute, that one should be visible soon
[08:04:18] <volans>	 drops and reconnects?
[08:04:53] <moritzm>	 seems to be on the bot side, it joins and then leaves shortly after
[08:05:19] <moritzm>	 possily bot still running somewhere, but broken by something else or so
[08:11:14] <moritzm>	 irc.w.o should now resolve properly to 1003 on all TTL-respecting DNS implementations, next I'd reboot irc1002 to force bots to reconnect, unless anyone wants to test something else beforehand?
[08:12:51] <volans>	 moritzm: I joined my old irssi configured irc  and I can connect fine and get the messages
[08:13:12] <volans>	 Connection to irc.wikimedia.org established
[08:13:12] <volans>	 Welcome to IRCStream
[08:13:17] <volans>	 so I guess is the new one :D
[08:13:46] <moritzm>	 yeah, you can e.g. join #en.wikimedia to see events coming in
[08:14:07] <volans>	 with get the messages I meant just that
[08:14:14] <volans>	 that I was getting the updates from channels
[08:16:43] <moritzm>	 slyngs, elukey: I'd go ahead with the irc1002 reboot unless you want to test some more?
[08:17:11] <elukey>	 I was about to say: green light from me
[08:17:36] <moritzm>	 ack, going ahead
[08:22:15] <moritzm>	 I see no bots connection on irc1003 following the reboot
[08:22:57] <moritzm>	 #en.wikipedia has just me and rc-pmtpa
[08:26:08] <moritzm>	 I'll roll back
[08:26:52] <RhinosF1>	 i was able to connect fine directly to it. I did notice it showed no one else in #en.wikipedia moritzm but I was there for a bit
[08:27:25] <moritzm>	 yeah, there's some problem which bots not connecting
[08:27:47] <RhinosF1>	 moritzm: i don't see you as in #en.wikipedia now
[08:27:51] <RhinosF1>	 if you're there
[08:27:58] <RhinosF1>	 it might just be the member list
[08:28:53] <moritzm>	 so you only see rc-pmtpa and yourself as well?
[08:28:59] <RhinosF1>	 moritzm: yup
[08:29:12] <RhinosF1>	 although i've stopped getting events from rc-pmtpa
[08:29:41] <RhinosF1>	 that was working
[08:29:44] <RhinosF1>	 at first
[08:29:52] <moritzm>	 I'm getting events just fine, I'll go forward with the revert for now
[08:30:18] <RhinosF1>	 ack
[08:30:49] <moritzm>	 we need the member list working so that we can the amount of bots properly connecting etc.
[08:31:01] <volans>	 that's expected
[08:31:07] <volans>	 the new IRC server doesn't show you the other users
[08:31:36] <volans>	 it implements the IRC protocol but in a limited way designed to be optimized for the very specific use case of the service
[08:31:50] <RhinosF1>	 events are coming in for #meta.wikimedia fine but not #en.wikipedia
[08:31:53] <volans>	 moritzm: youc an get that from grafana
[08:32:35] <volans>	 https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream?orgId=1
[08:33:00] <volans>	 but I agree I'm not getting any more events on #en.wikipedia
[08:33:06] <moritzm>	 forcing a reboot to fail over clients back
[08:34:27] <volans>	 in the dashboard is clear that irc1003 is not relaying messages while irc2003 is continuing to do so
[08:34:59] <volans>	 but the clients are all connected to irc1003
[08:36:16] <moritzm>	 the global number of connected channels isn't really useful, we need to be able to easily compare connected bots per channel
[08:36:42] <slyngs>	 Yeah, I need to get that patch up to Faidon standard
[08:36:45] <moritzm>	 otherwise we have no real insight if the known list of bots for a channel is actually properly able to re-connect
[08:37:34] <moritzm>	 either that or a live channel memberlist on the IRC level
[08:38:00] <slyngs>	 Probably both, otherwise we don't know which bots are connected
[08:38:15] <slyngs>	 Just that X number of bots are connect to some channel
[08:38:32] <volans>	 that can be a prometheus metric I guess
[08:38:36] <moritzm>	 bots are starting to reconnect to the old setup
[08:39:05] <slyngs>	 Faidon was concerned about the number of metrics that would generate, something about each label being a timeserie in Prometheus
[08:39:55] <volans>	 you don't need the bot names, just the counter of bots per channel, there are 833 channels AFAICT
[08:41:33] <slyngs>	 That would still be 833 extra timeserie... I mean that still the plan
[08:41:55] <volans>	 I was always told that in prometheus terms that's nothing
[08:42:24] <slyngs>	 Perfect, then I just need to make my patch approvable :-)
[08:45:29] <volans>	 for example prometheus1005 right now has ~10M metrics
[08:45:49] <volans>	 check with o11y, but AFAIK 1k metrics is not a problem
[08:45:58] <slyngs>	 Then I'd assume that you are correct that 8-900 extra won't be an issue
[08:46:12] <elukey>	 moritzm: sorry just seen the msgs, but shoudn't we just rely on the connected bots metric to figure out if it is healthy or not?
[08:46:19] <elukey>	 I don't think we need extra metrics
[08:46:31] <elukey>	 I wouldn't rollback
[08:46:59] <moritzm>	 it's already rolled back, feeds were complete, see above by RhinosF1
[08:47:04] <elukey>	 the bots per channel is an extra but I don't think it will give us more insights if they work or not
[08:47:45] <volans>	 elukey: the problem is that irc1003 stopped relaying messages
[08:47:45] <moritzm>	 the number of connected globally isn't really useful, how would you otherwise want to compare the list of connected bots for channel foo with the old setup and after moving to the new service?
[08:47:55] <volans>	 not completely but almost
[08:48:16] <volans>	 https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream?orgId=1&from=1728545641044&to=1728550088655&viewPanel=1
[08:48:20] <moritzm>	 well, for assessing whether the migration works fine,  almost is't enough
[08:48:45] <volans>	 moritzm: I was continuing my previois phrase, not responding to you, sorry for the confusion :)
[08:48:53] <volans>	 was to update luca
[08:48:55] <moritzm>	 ah, ok :-)
[08:50:16] <elukey>	 okok I missed the msg relayed, but for the metric I don't know if knowing if a bot is connected to a channel makes us more aware if they are consuming messages or not, this is my point
[08:51:04] <moritzm>	 well, but if a given bot is not connected at, we know 100% for sure it's not consuming messages
[08:51:49] <volans>	 are there any sopecific logs related to the message relaying?
[08:52:15] <volans>	 I'm looking at ircstream and beside the "spam" of the prometheus connecting/disconnecting messages every few seconds
[08:52:35] <volans>	 and the real logs from bots after the switch I don't see anything related to teh ingestion/relaying of messages
[08:53:30] <volans>	 moritzm: btw if it's just for the switch we can get that information from the logs, we have bot name and channel, see the 'User subscribed to feed' lines
[08:56:48] <elukey>	 +1
[08:57:49] <elukey>	 I am still wondering what happened
[08:59:26] <elukey>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=irc1003&var-datasource=thanos&var-cluster=misc&from=now-1h&to=now
[08:59:39] * volans meeting
[09:00:01] <elukey>	 could it be that for ircstreams we need more than 1 vcore + 2G of ram?
[09:00:39] <volans>	 was it ever tested with more than few clients/channels connected?
[09:00:58] <elukey>	 we tested it with one/two bots plus us connected, so no
[09:01:02] <slyngs>	 Clients no, channels yes
[09:01:10] <elukey>	 I think we assumed that the settings would have been the same for irc1002
[09:01:12] <moritzm>	 don't think so, the host overview shows no sign of CPU or memory exhaustion
[09:01:33] <moritzm>	 but maybe we were hitting some internal limits
[09:01:54] <moritzm>	 like defaults within the python libraries or similar
[09:02:20] <elukey>	 moritzm: there is a hole in metrics reporting (some minutes), and also I am wondering if maybe asyncio needs more than one cpu for ircstreams to handle ~300 clients
[09:02:58] <elukey>	 in theory the control loop is one, so one cpu is fine, but maybe there was contention with other things
[09:03:13] <moritzm>	 that's from the reboot, I rebooted 1003 to force connected clients back to irc1002
[09:03:23] <elukey>	 okok that explains, nevermind then
[09:06:20] <elukey>	 also I see only one thread for ircstream
[09:06:27] <elukey>	 the python3 process I mean
[09:09:33] <elukey>	 so IIUC from the code, upon receiving a UDP msg
[09:09:34] <elukey>	 https://github.com/paravoid/ircstream/blob/main/ircstream/rc2udp.py#L47
[09:09:42] <elukey>	 it broadcasts it, async, to all bots
[09:09:56] <elukey>	 that is https://github.com/paravoid/ircstream/blob/main/ircstream/ircserver.py#L829
[09:10:07] <elukey>	 so a for loop that sends priv messages to all bots
[09:10:17] <elukey>	 if there are errors we should see the correspondent metric, in theory
[09:11:42] <elukey>	 everything should be non-blocking async, unless some cpu bound code is executed but I don't see any
[09:12:49] <elukey>	 and the relayed msgs seems to be the metric incremented at the end of the broadcast
[09:12:53] <elukey>	 ircstream_messages_total
[09:14:58] <elukey>	 it really smells as if the broadcast loop slows down when more clients are connected
[09:15:09] <elukey>	 could it be max numbers of asyncio tasks reached?
[09:15:18] <elukey>	 wondering what is the default
[09:19:57] <elukey>	 can't find anything substancial
[09:31:06] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879 (10aborrero) 03NEW
[09:43:43] <elukey>	 created 10 bots that hit ircstreams.w.o at the same time using Moritz's script
[09:43:55] <moritzm>	 one thing we could look into is to setup a prerouting rule in nftables on irc1002 and forward all 6667/tcp traffic to irc1003
[09:43:57] <elukey>	 I'll ramp up until I see a failure
[09:44:09] <elukey>	 moritzm: really good idea
[09:44:27] <moritzm>	 and then we have a mirror of all live traffic and can use that as a repro cas
[09:45:04] <elukey>	 Faidon at the time wrote to me "iptables -t mangle -A PREROUTING -p udp --dport 9390 -j TEE --gateway ${ircstream_ip}" but we have nftables there :D
[09:45:37] <elukey>	 ah wait sorry you meant for 6667
[09:45:41] <elukey>	 nevermind
[09:45:58] <moritzm>	 ah wait, irc1002 is actually still on ferm, only 1003 is using nftables
[09:46:59] <elukey>	 I have now 100 bots connected and I don't see sign of failures
[09:49:24] <elukey>	 ramping up to 300
[09:50:01] <elukey>	 no issues
[09:51:03] <elukey>	 code is https://phabricator.wikimedia.org/P69588
[09:51:36] <elukey>	 so I don't think it is a bottleneck issue :D
[09:52:44] <elukey>	 at this point the metric might be just telling us that some bots have trouble reconnecting for some reason, but we would see some errors in the stats
[09:52:47] <elukey>	 in theory
[09:53:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:43] <moritzm>	 hmmh, maybe we should add a metric on client connects and disconnects?
[09:57:14] <moritzm>	 and for "User subscribed to feed"
[09:57:37] <moritzm>	 since in general these should be relatively scarce for an otherwise stable bot
[09:58:24] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:56] <elukey>	 while reviewing logs on irc1003 I found client-id:jbond <3
[10:07:25] <elukey>	 I tried to isolate some ips among client disconnects, and I see too many of them for each IP
[10:07:30] <elukey>	 in the irc1003 logs I mean
[10:10:56] * volans back from meeting
[10:12:16] <slyngs>	 I've cleaned up the patch to ircstream and updated the merge request. If that's approved I can try to add the connects and disconnects as well.
[10:12:37] <volans>	 elukey: you're connecting to only one channel
[10:12:44] <volans>	 and it's always the same
[10:12:52] <volans>	 so the relay has to just relay those messages to all clients
[10:13:05] <elukey>	 good point
[10:13:06] <volans>	 not messages from many different streams to many different set of clients
[10:13:30] <volans>	 add like ~10-20 channels and make each bot connect to a random subset of them
[10:13:36] <elukey>	 in theory we should have seen some isuse anyway with 300 bots
[10:13:38] <volans>	 like ~5-10
[10:18:43] <moritzm>	 we could also grab the channel list from 1002 and make the bots join them all
[10:19:58] <elukey>	 starting another load test, 300 bots connected to 3 channels each (sample from a list of 10)
[10:21:07] <elukey>	 no problem so far
[10:22:52] <moritzm>	 in my earlier tests with 1003 during the hackathon I also had my client connected for days and events were flowing in without any issues
[10:23:04] <moritzm>	 but obviously just with our test users
[10:24:21] <volans>	 earlier the problem started ~2 minutes after the bulk of clients connected
[10:25:03] <volans>	 elukey: is there a specific reason irc1003 has so much less channels right now?
[10:25:31] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 13Patch-For-Review: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10217030 (10aborrero) 05Open→03Resolved
[10:25:47] <moritzm>	 volans: that's due to the reboot
[10:26:01] <moritzm>	 the channel only gets opened when an event arrives 
[10:26:20] <moritzm>	 and for rare languages and less used wikis like wikivoyage is takes days for some event to flow in
[10:26:53] <elukey>	 doing another load test, 300 bots, 8 channels for each randomly picked up
[10:27:38] <volans>	 got it
[10:27:45] <elukey>	 I imagine Faidon smiling at me the more I push load to ircstreams
[10:28:39] <elukey>	 "bring it on, we'll see"
[10:29:16] <elukey>	 updated https://phabricator.wikimedia.org/P69588 as well
[10:29:18] <elukey>	 so far nothing
[10:31:49] <moritzm>	 some kick feature would be useful, like where you send ircstream a signal and then it keeps all channels open, but forcibly disconnects all clients
[10:37:29] <elukey>	 moritzm: this is different but very nice https://github.com/paravoid/ircstream/blob/4866e6aad3d532ab50a288efbb680fcd21c156f9/ircstream/ircserver.py#L246
[10:39:25] <elukey>	 and I am wondering if there are missing logs in https://github.com/paravoid/ircstream/blob/4866e6aad3d532ab50a288efbb680fcd21c156f9/ircstream/ircserver.py#L291 that we should have
[10:39:54] <elukey>	 it should be the part in which the client is disconnected
[10:40:13] <moritzm>	 we should also sprinkly in random     asyncio.terminate(msg='Please move to Eventstreams')    calls
[10:41:28] <elukey>	 definitely
[10:41:44] <elukey>	 going afk in a bit for the lunch break + errands, will keep checking later
[10:51:04] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10217086 (10aborrero) 05Resolved→03In progress p:05Triage→03Medium I have detected there is no V...
[10:54:59] <slyngs>	 I have a "half" patch for listing all nicks connected to channel
[10:55:33] <slyngs>	 It doesn't prevent a client from connecting with an existing nick, so two clients can share a nick.
[10:56:56] <moritzm>	 nice. the latter wouldn't be an issue
[11:15:43] <slyngs>	 https://github.com/paravoid/ircstream/pull/4
[11:24:18] <jbond>	 elukey:do i have a rouge irc connection causing issues?
[11:25:08] <jbond>	 hmm seems i am on irc.wikimedia.org but not in any rooms, probably set up for troublshooting something ill disconnect
[11:29:23] * jbond done https://github.com/b4ldr/vps-config/commit/1f89170214c3ec87ba307caaf36860388e5faebb
[11:34:27] <moritzm>	 jbond: no no :-)
[11:34:42] <moritzm>	 we're currently working on replacing the old irc.wikimedia.org setup
[11:35:04] <moritzm>	 based on a patched ircd-ratbox release from 2003 and a relay using Py2
[11:35:16] <moritzm>	 towards https://github.com/paravoid/ircstream
[11:35:24] <jbond>	 ahh nice
[11:35:55] <moritzm>	 that you were connected was just some random fun fact found by Luca when debugging, you're free to hang on the new instance as well ofc :-)
[11:35:56] * volans lunch
[11:36:00] <volans>	 hey jbond :D
[11:36:19] <jbond>	 hi :D
[11:36:58] <jbond>	 ahh ok but tbh , i don't think i need to be on irc.w.o :)
[11:37:16] <moritzm>	 unless you turned into a bot since you left us, no :-)
[11:37:21] <jbond>	 hehe
[11:57:49] <slyngs>	 No one really NEEDS to be on irc.w.o, yet here we are :-)
[11:58:07] <slyngs>	 Trying to get them to leave.
[12:09:04] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#10217316 (10cmooney) 05Open→03Declined
[12:29:03] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 13Patch-For-Review: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10217375 (10aborrero) still not working. I saw this weird tcpdump capture on cloud...
[12:29:33] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Move codfw dns hosts to per-rack vlans and BGP peer with top-of-rack switch - https://phabricator.wikimedia.org/T376894 (10cmooney) 03NEW p:05Triage→03Low
[12:39:18] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635#10217448 (10Aklapper)
[12:50:04] <elukey>	 jbond: nono I was happy to see you around, that's it :)
[12:56:22] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10217541 (10cmooney) One thing I might be messing you up is the "authentication" section in /etc/keepali...
[13:10:04] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10217599 (10aborrero) >>! In T376879#10217541, @cmooney wrote: > One thing I might be messing you up is...
[13:10:26] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10217600 (10aborrero) there is also this warning in the logs:  Oct 10 13:07:05 cloudgw2002-dev Keepalive...
[13:18:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:25:33] <wikibugs>	 10CAS-SSO, 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702#10217669 (10SLyngshede-WMF) The weird account name is due to the "taavi" user already existing in the netbox database. The Django OIDC module (social-app-django) Net...
[13:33:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:34:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:36:08] <wikibugs>	 10CAS-SSO, 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702#10217733 (10ayounsi) No objection to that. Seems like a good idea. In the short term we can delete the old account too.
[13:37:13] <wikibugs>	 10CAS-SSO, 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702#10217736 (10SLyngshede-WMF) Alternatively: Manually link the correct account in the database.    ` from social_django.models import UserSocialAuth from users.models....
[13:39:43] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10217752 (10cmooney)
[13:42:31] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10217761 (10cmooney)
[13:42:34] <wikibugs>	 10CAS-SSO, 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702#10217762 (10SLyngshede-WMF) Looking at the data for taavi UID linkning won't work as the preferred_username and uid doesn't match, so we might be limited to manual f...
[13:46:52] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10217784 (10cmooney) 05Open→03Resolved This is now complete, the cloudsw is set up to route the networks are required and announcing them upst...
[13:50:16] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10217807 (10cmooney) >>! In T375847#10195673, @aborrero wrote: > `lang=shell-session > root@ipv6-test-1:~# ip -br a > lo...
[13:54:11] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 06SRE: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10217836 (10cmooney) The edge (cloudsw/cr) networking is now complete, elements in the range are reachable externally.  ` cathal@officepc:~$ mtr -z -b...
[14:03:42] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:04:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:58:42] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:01:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:44:07] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10218317 (10Papaul)
[15:46:08] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10218350 (10Papaul)
[15:59:26] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10218489 (10aborrero) >>! In T376879#10217600, @aborrero wrote: > there is also this warning in the logs...
[16:26:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:28:26] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:16:54] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10218766 (10cmooney) Reverse delegation is now working for the ranges we've assigned to OpenStack.  I've not gotten an ans...
[17:31:42] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:33:26] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:01:42] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:04:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[18:52:23] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10219169 (10Papaul)
[19:29:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:31:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:42:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:15:43] <wikibugs>	 10netops, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879#10219564 (10Multichill) Ipv6 vrrp is all link-local if I recall correctly. Did you configure it like that?
[22:01:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:02:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:22:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:25:26] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:56:42] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:00:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:42:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:51:42] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:55:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop