[08:15:48] taavi: good morning. I'm online for when you want to do the network debugging
[08:44:31] arturo: is there a way to get a second-level subdomain on a cloud vps instance? something like thing1.my-instance.my-project.eqiad1.wikimedia.cloud, thing2.my-instance.my-project.eqiad1.wikimedia.cloud, etc.?
[08:45:04] it should be possible, yes, but we most likely don't have docs for it
[08:45:33] maybe the `.svc.` subdomain does what you are looking for?
[08:45:50] `something.svc.my-project.eqiad1.wikimedia.cloud` and you point it to your VM
[08:46:10] the `.svc.` MAY be there already for you to use
[08:48:25] hmm, how would that work? It's Santhosh asking for https://www.mediawiki.org/wiki/User:Santhosh.thottingal/WikiFamily, but on a cloud vps instance instead of a local dev machine. So they'd like a sub-subdomain for each language
[08:49:50] there could be many entries
[08:49:53] why does the wiki domain need to be a subdomain of the instance fqdn in the first place?
[08:50:20] `en_wiki.svc.my-project.eqiad1.wikimedia.cloud`
[08:51:00] also, remember that you can't access the instance ip address externally anyway, you would need to either ssh tunnel (in which case the address you're accessing in your browser is localhost anyway), or use the web proxy, in which case the address used is something.wmcloud.org
[08:56:09] taavi: about whether the wiki domain needs to be a subdomain of the instance fqdn, it probably doesn't. I think they just want to run all the testwikis on the same instance, and have some way to route to them
[08:57:06] they say they do need them to be on a public ip though, for integration testing with selenium
[09:00:56] on the other hand, they say they will need a dozen or so testwikis in total, which is maybe not that much? So maybe it would be easier to just have each on its own instance?
[09:01:35] if they need public IPs, they will likely need the web proxy
[09:01:43] which has some limitations on the names it accepts
[09:02:05] they will need to craft stuff like `code-myproject.wmcloud.org`
[09:02:26] example: `en-myproject.wmcloud.org` and `es-myproject.wmcloud.org`
[09:02:27] morning
[09:02:33] and point them to the different VMs (or the same one)
[09:02:49] o/
[09:02:52] blancadesal: meh, there's nothing there that would make it need a single instance per wiki
[09:03:02] true
[09:03:39] arturo: give me a few mins, then let's start looking at the irc connection issues?
[09:03:48] sure
[09:04:40] santhosh is not on irc. should I tell him to open a ticket and discuss options there?
[09:05:42] i don't see any cloud vps infra-level changes that are needed, so a cloud-l thread or a task in their project-specific project feels more appropriate to me
[09:09:31] ok, I told them to open a task and ping you and arturo on it
[09:09:46] i'll go back to being sick now
[09:13:24] :-P
[09:14:09] ok
[09:14:52] so it looks like wikibugs has been relatively stable last night, or at least the connection to libera.chat has
[09:15:18] ok, do we know if the libera.chat address changes often?
[09:15:42] I see it has like 4 addresses in the resolution, which will complicate things
[09:15:45] https://www.irccloud.com/pastebin/01OOwwaF/
[09:15:49] i think they rotate the addresses in resolution fairly often
[09:16:24] but if you do /whois on a bot, you can see the individual server it's connected to at that moment
[09:16:41] how do I do that?
[09:16:56] in a priv message?
[09:17:58] ok, gotcha
[09:18:01] standard IRC whois
[09:18:04] yeah
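For reference, a minimal sketch of the two checks being discussed here; the round-robin name `irc.libera.chat` is an assumption about what the pastebin above showed, and the nicks are just the bots mentioned in this conversation:

```
# libera.chat publishes several A/AAAA records behind its round-robin name,
# so the address a bot lands on varies between connects:
dig +short irc.libera.chat

# from any IRC client, a standard whois shows which individual server a nick
# is connected to at that moment (the "is connected on ..." reply):
#   /whois stashbot
#   /whois wikibugs
```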
[09:18:25] seems to be connected to molybdenum.libera.chat
[09:18:32] which one?
[09:18:39] wikibugs
[09:18:45] which bot are we going to debug?
[09:19:23] mmmm we need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008814
[09:20:17] stashbot and wm-bb seem to have had issues most recently, about an hour ago
[09:20:17] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[09:20:25] ok
[09:20:54] just pick one :-)
[09:21:13] let's pick stashbot
[09:21:28] that one is connected to erbium.libera.chat
[09:22:42] moritzm: just merged the conntrack patch FYI
[09:23:02] ack, ok!
[09:23:06] in case you want to run puppet on a few servers and double-check
[09:24:03] the only change will be that the roles with nftables will get the CLI tool installed, I'll confirm it on a test host, but don't expect any issues
[09:25:13] ack
[09:25:29] taavi: so, only 2 connections to erbium flowing through cloudgw at the moment:
[09:25:31] https://www.irccloud.com/pastebin/dyD1EwLz/
[09:30:13] ok
[09:30:35] so, stashbot is running on tools-k8s-worker-nfs-51, so 172.16.2.115
[09:30:42] yeah
[09:31:09] so there is a double NAT involved because of k8s
[09:31:18] aborrero@tools-k8s-worker-nfs-51:~$ sudo conntrack -L --dst 82.96.96.60
[09:31:18] tcp 6 431991 ESTABLISHED src=192.168.246.246 dst=82.96.96.60 sport=57442 dport=6697 src=82.96.96.60 dst=172.16.2.115 sport=6697 dport=8458 [ASSURED] mark=0 use=1
[09:33:59] and of course, it would be nice to have a failure case happening now, no?
[09:35:01] yeah, but I don't think we have a way to reproduce this on demand
[09:35:25] is your theory that either of those conntrack entries gets lost somehow?
[09:36:01] yes
[09:36:27] the double NAT makes it really hard for a connection to recover
[09:37:12] but it should not be a big deal if everything is well configured
[09:37:21] I mean, this is the bread and butter of the internet, NAT everywhere
[09:37:32] mhm
[09:37:40] what does the recovery process for a connection look like?
[09:39:34] dcaro: it depends on what the failure was, and what the connection is doing, etc. But for example, if one of the 2 NAT entries expires before the other, I don't know how that can ever recover
[09:40:13] for it to expire it should have been idle for a while, no? what's the timeout?
[09:41:58] so there are a few different errors we've seen on irc https://phabricator.wikimedia.org/T357729#9552141
[09:42:18] i guess we could try deleting the conntrack entries on both hosts, one at a time, and see what errors they produce
[09:43:09] ok
[09:43:34] dcaro: yes, I think there is no timeout if the packets are flowing
[09:44:52] i'm going to delete the stashbot conntrack entry from the k8s worker and see what happens
[09:45:07] wow, that was super fast!
[09:45:16] yeah, that was like instant
[09:45:32] i was expecting that to take more time
[09:45:43] did the bot die (in k8s)?
[09:45:58] or, how did the code react to the conntrack entry being deleted?
[09:46:39] so far it has not noticed
[09:47:18] last time it looks like it took stashbot 20 minutes to reconnect
[09:47:21] which is not nice, because in the meantime the bot is not working
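A sketch of the conntrack commands behind this test, using the addresses already pasted above; the exact filter flags used on cloudgw are an assumption, and deleting entries on a production gateway obviously disrupts the matching flow:

```
# inner NAT (k8s worker): the pod address 192.168.246.246 is translated to the
# instance address 172.16.2.115
sudo conntrack -L --dst 82.96.96.60

# outer NAT (cloudgw): the instance address is translated again to the
# project's routable source address
sudo conntrack -L -p tcp --orig-src 172.16.2.115 --dport 6697

# the forced failure used above: delete the worker-side entry and watch how
# the bot (and the cloudgw entry) react
sudo conntrack -D -p tcp --orig-dst 82.96.96.60 --dport 6697
```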
[09:48:06] does the irc connection send any heartbeats/keepalives/etc.?
[09:48:29] (seems like that should both keep the connection open, and quickly detect when it does not work)
[09:48:36] BTW the NAT entry in cloudgw also died
[09:49:24] the irc protocol does have a ping mechanism, https://modern.ircdocs.horse/#ping-message
[09:49:32] although i don't know how often it does that
[09:51:08] I see now
[09:51:11] https://www.irccloud.com/pastebin/ZrCfYdif/
[09:51:22] taavi: i think it's about 2-3 minutes
[09:51:38] and same in cloudgw
[09:51:38] tcp 6 262 ESTABLISHED src=172.16.2.115 dst=82.96.96.60 sport=44687 dport=6697 [UNREPLIED] src=82.96.96.60 dst=185.15.56.1 sport=6697 dport=44687 mark=0 use=1
[09:51:55] "UNREPLIED"?
[09:52:17] it means conntrack saw packets flowing only in the original direction
[09:52:23] i.e., the server did not reply
[09:52:38] ok, and that seems to make sense as the source port (after nat) has changed?
[09:53:07] correct, there is a different port involved, the server no longer knows this connection
[09:53:55] even worse, the client seems to believe the connection is still established, so it's not even trying to re-establish it
[09:54:11] let me see if I can tcpdump
[09:55:23] arturo: how long is the timeout in conntrack for a connection to have no activity before the entry is cleaned up?
[09:55:36] I don't remember
[09:56:38] https://www.irccloud.com/pastebin/VMnP6ACi/
[09:57:03] 300 seconds?
[10:00:28] this is the client code trying to keep the connection alive?
[10:00:32] https://www.irccloud.com/pastebin/WW11uqXF/
[10:01:05] stashbot sends a ping every 300s https://github.com/bd808/python-ib3/blob/main/src/ib3/mixins.py#L29
[10:01:18] maybe that can be shortened to every minute?
[10:02:12] that may be a good idea, indeed
[10:02:30] it also looks like the counter is in the wrong branch of the try https://github.com/bd808/python-ib3/blob/main/src/ib3/mixins.py#L52
[10:04:02] no it is not, the counter is reset in the pong function
[10:04:33] stashbot finally noticed it got disconnected, it seems
[10:04:34] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[10:05:49] I was able to catch two conntrack events in cloudgw
[10:05:58] https://www.irccloud.com/pastebin/aKjcEZve/
[10:06:02] but stashbot specifically doesn't seem to be a very frequent flapper, only 4 times this year in my logs, and that includes my forced test disconnect
[10:06:22] the first is an UPDATE to set the FIN_WAIT flag, the second is destroying the entry
[10:09:27] i would assume that's because stashbot's timeout logic (that you just linked) finally noticed so many pings went unanswered, which made it disconnect, so it explicitly closed the connection
[10:12:54] jouncebot also uses the same code
[10:19:45] ok, so maybe another test
[10:19:54] let's let the connection flow normally
[10:20:44] and then we will drop some packets in cloudgw, so that the connection suffers, but we won't let the conntrack entries die. After a couple of minutes, we can stop dropping packets and see if the connection recovers
[10:21:50] ok
[10:22:26] btw. tools-static went down and up
[10:22:34] (the blackbox connection check)
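A rough sketch of how the timers and events mentioned above can be inspected, plus one way the proposed packet-drop test could be run; the sysctl names are standard netfilter ones, but the iptables spelling is an assumption (cloudgw may well be nftables-managed, in which case an equivalent nft rule would be used instead):

```
# conntrack timers: an ESTABLISHED entry counts down from ~5 days by default
# (hence the 431991 seen on the worker), but once packets go unanswered the
# much shorter retransmission timeout applies -- presumably the ~300s above
sysctl net.netfilter.nf_conntrack_tcp_timeout_established
sysctl net.netfilter.nf_conntrack_tcp_timeout_max_retrans

# watch conntrack events (NEW/UPDATE/DESTROY) for the IRC flow, as in the
# capture above
sudo conntrack -E | grep 82.96.96.60

# the proposed test: drop traffic towards the IRC server for a couple of
# minutes without letting the conntrack entries expire, then remove the rule
sudo iptables -I FORWARD -d 82.96.96.60 -p tcp --dport 6697 -j DROP
# ...wait, observe the bot...
sudo iptables -D FORWARD -d 82.96.96.60 -p tcp --dport 6697 -j DROP
```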
[10:23:35] taavi: I assume stashbot is still running on the same worker, no?
[10:23:45] should be
[10:24:22] ok, it seems to be using a different server now
[10:24:32] molybdenum.libera.chat
[10:25:45] brb
[10:31:09] just noticed that for each IRC message in a channel the bot is currently present in, there will be additional TCP packet chatter
[10:31:22] which seems obvious if you think about it :-P
[10:34:19] ok, dropping packets now
[10:37:21] stashbot survived the packet drop just fine
[10:37:21] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[10:41:44] maybe we are just seeing server-side throttles
[10:47:59] so i got a bit sidetracked when looking at the wikibugs tool specifically, and now i found that the admin tool is making tons of redis connections and not closing them
[10:48:20] I'll "solve" that problem by making https://admin.toolforge.org/tools redirect to toolhub
[10:59:03] let me know if you want to keep digging on the IRC stuff
[10:59:09] if/when
[11:04:33] should it redirect to https://toolsadmin.wikimedia.org/tools/ instead?
[11:22:58] maybe, imo the toolhub interface is quite a bit nicer for searching even if it includes non-toolforge tools
[11:30:07] arturo: sorry, got quite distracted when looking at redis. it seems like wikibugs-irc is having quite a bit of trouble staying connected, Bryan has some ideas about that at T359097
[11:30:08] T359097: Frequent "Redis listener crashed; restarting in a few seconds." errors logged - https://phabricator.wikimedia.org/T359097
[11:32:01] ack
[11:32:12] happy to continue some other time
[11:34:10] we'll probably want to revisit the irc connection issue after the wikibugs redis connectivity has been fixed
[11:34:52] 👍
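The log doesn't show how the leaked connections were spotted; two hypothetical ways to count them, assuming the default redis port 6379:

```
# from the host running the admin tool: count established connections to redis
ss -tn | grep ':6379' | wc -l

# from the redis side: connected clients grouped by source address
redis-cli client list | grep -o 'addr=[0-9.]*' | sort | uniq -c | sort -rn
```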
[14:56:06] * arturo be back in a bit
[14:58:16] taavi: can I get a +1 on the revised quota request in T358477 (see the last 2 comments)?
[14:58:17] T358477: Request for more compute and storage for the GLAMS dashboard project - https://phabricator.wikimedia.org/T358477
[15:00:18] dhinus: hmm, I don't know what to do with that one
[15:00:37] the graph I linked in my comment is also showing very little RAM usage most of the time
[15:01:03] hmm interesting, I trusted them on the RAM usage and I didn't double-check
[15:05:07] I can try increasing the Cinder quota only, and ask them to provide more evidence of RAM needs
[15:16:47] dhinus: I added saved searches to the toolforge dashboard, is that what you meant? https://phabricator.wikimedia.org/project/board/539/
[15:18:05] dcaro: nice, that's what I had in mind, yeah
[15:19:17] actually not quite :) I was thinking of having those links, but pointing them to a filtered workboard
[15:19:27] like the link "Quarterly Goals" on https://phabricator.wikimedia.org/project/board/6960/
[15:20:54] that makes more sense, yes xd, I'll try that
[15:21:50] how do I save the filter?
[15:22:26] I didn't "save" it, when you create a new filter it gets a random ID and I just copied the URL
[15:22:39] they seem to persist, but I wondered if they will persist forever...
[15:22:43] oh, ok, some of the links have nice names
[15:23:12] yeah, only the "default" filters I think ("all", "assigned")
[15:23:28] I assume those ones are built-in, but I'm not 100% sure
[15:23:43] is that pile of dns alerts because someone is changing toolforge things?
[15:24:11] uooo, `title:jobs-api` seems to work as I expected xd
[15:24:26] andrewbogott: not me
[15:24:46] it's not a total dns failure, I can still log into VMs...
[15:24:58] which DNS alerts?
[15:25:11] there are a few that triggered 7 mins ago
[15:25:15] https://alerts.wikimedia.org/?q=team%3Dwmcs
[15:25:19] per -security there seems to be some alerting maintenance going on?
[15:25:37] from cloudservices*, of the type "Check DNS resolution of www.wmcloud.org"
[15:26:18] I can't find anything that's actually broken so far
[15:26:24] * andrewbogott reads _security backscroll
[15:27:23] ok found it
[15:27:35] ?
[15:27:48] they failed over the alerting host from alert1001 to alert2001
[15:27:56] but in hiera, the ip address of alert2001 is wrong
[15:28:16] i'll send a patch to the relevant people
[15:28:22] well, that must've been noisy for everyone then!
[15:28:25] thank you taavi
[15:29:43] uh
[15:29:44] no
[15:29:51] i can't read
[15:30:51] it was not that, I have no clue where I got that wrong address
[15:31:30] I predict that the firewall on our dns servers doesn't allow access from 2001, trying to check that now
[15:32:09] so just to be sure, this is the thing that's failing:
[15:32:10] taavi@alert2001 ~ $ host www.wmcloud.org 10.64.151.4
[15:32:10] ;; communications error to 10.64.151.4#53: connection refused
[15:33:11] yeah, it doesn't work on alert1001 either though
[15:33:54] although I don't get connection refused there
[15:34:00] so why is it failing just now?
[15:35:53] https://www.irccloud.com/pastebin/NrJ4S3Wi/
[15:36:04] not sure yet, I still suspect firewalls someplace
[15:36:23] none of the powerdns services are bound on the 10.x host interface
[15:37:06] can this just be an expired/removed downtime? and those alerts really need removing? (I remember porting something equivalent to metricsinfra at some point, but removing them from icinga)
[15:38:56] if it's an expired downtime, seems like that should be visible on the UI someplace...
[15:39:49] icinga at least thinks it's a new alert, as of 0d 0h 56m 19s
[15:40:47] but I agree that this seems like it never could've worked with that IP
[15:40:53] where is that IP coming from though?
[15:42:48] that's the primary wikirealm address of that host
[15:43:11] the icinga server was just switched over, which might explain it looking like a new alert
[15:45:26] true
[15:47:56] I don't follow, it's supposed to be checking ns0.openstack.eqiad1.wikimediacloud.org, which is a public IP...
[15:48:57] oh, nope, /those/ checks are working properly
[15:50:57] the check_dig checks (which take the address of the host) are working, the check_dns checks (which don't) probably never worked
[15:51:00] I'll fix or remove them
[16:01:25] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008887
[16:02:00] +1
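To make the difference concrete, a sketch contrasting the two kinds of query above: the service name resolves fine against the public nameserver it is meant to be checked through, while querying the host's internal 10.x address is refused because the powerdns services are not bound there:

```
# works: query the public nameserver address the check is supposed to target
dig +short www.wmcloud.org @ns0.openstack.eqiad1.wikimediacloud.org

# fails with "connection refused": query the cloudservices host's internal
# 10.x wikirealm address, where the powerdns services are not listening
host www.wmcloud.org 10.64.151.4
```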
[16:02:04] can I get a quick +1 here? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/215
[16:02:33] taavi, arturo: was there consensus that the ping-pong behavior of stashbot, jouncebot, etc. needs to be on a shorter delay to make something better? I got a bit lost in the backscroll.
[16:03:16] bd808: I think we were unable to decide on the root cause of the network problems. Reducing the ping-pong timeout a bit may help
[16:03:17] bd808: that will help those bots reconnect more quickly, but those bots seem to be flapping much less than wikibugs
[16:03:27] it should not hurt
[16:03:29] yes
[16:03:43] (and it might improve something xd)
[16:03:44] for wikibugs, right now its main issue seems to be staying connected to redis
[16:03:58] unless it would make the bot flap more because of some other intermittent network issue
[16:04:43] I think a minute of network issues might be enough to force a flap (iirc it's 5 min now)
[16:04:52] taavi: I think I figured that one out last night. https://phabricator.wikimedia.org/T359097#9599234
[16:04:52] as in, it might be reasonable
[16:06:03] If we wanted stashbot to reconnect after not seeing the irc server for 1 minute, then I'd need to dial the ping window down to 30s. Easy enough to do.
[16:06:36] wait, toolforge-deploy merge requests are now created automagically, hehe
[16:06:48] dhinus: I think I went a bit overboard with the side-links xd https://phabricator.wikimedia.org/project/board/539/query/open/ wdyt?
[16:07:28] arturo: yep, Raymond_Ndibe did that :), make sure to deploy and merge asap after one is created though
[16:07:35] bd808: too low a value may be counterproductive though
[16:07:51] dcaro: looks great!
[16:07:58] dcaro: ack
[16:08:50] dhinus: feel free to add/remove more (well, anyone can, too)
[16:11:53] arturo: yeah, that's one of the things I wonder about. Finding the "right" balance. I do anecdotally know that zppixbot thrashed on connectivity all the time in part because it had like a 5 second ping wait timeout.
[16:13:43] yeah
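To make that trade-off concrete, here is a minimal, framework-agnostic sketch of a shorter ping window. This is not ib3's actual implementation (see the mixins.py linked earlier, which hooks into the irc library's event loop instead); the server, nick and interval values are placeholders:

```python
import socket
import ssl
import time

# Hypothetical values for illustration only; the linked mixin currently pings every 300s.
SERVER = "irc.libera.chat"
PORT = 6697
NICK = "example-bot"
PING_INTERVAL = 60    # send a PING every minute
PONG_TIMEOUT = 120    # reconnect after roughly two missed intervals


def connect():
    # Fresh TCP+TLS connection; a reconnect gets a new source port, so stale
    # NAT state on the old flow stops mattering.
    raw = socket.create_connection((SERVER, PORT))
    sock = ssl.create_default_context().wrap_socket(raw, server_hostname=SERVER)
    sock.settimeout(5)
    sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())
    return sock


def run():
    sock = connect()
    last_ping = 0.0
    last_pong = time.monotonic()
    while True:
        now = time.monotonic()
        if now - last_pong > PONG_TIMEOUT:
            # Nothing heard back for too long: assume the path is broken and reconnect.
            sock.close()
            sock = connect()
            last_ping, last_pong = 0.0, time.monotonic()
        try:
            if now - last_ping >= PING_INTERVAL:
                sock.sendall(f"PING :{SERVER}\r\n".encode())
                last_ping = now
            data = sock.recv(4096)
        except socket.timeout:
            continue        # no traffic within 5s; loop and re-check the timers
        except OSError:
            data = b""      # send/recv failed outright; treat as a dead socket
        if not data:
            sock.close()
            sock = connect()
            last_ping, last_pong = 0.0, time.monotonic()
            continue
        for line in data.decode(errors="replace").split("\r\n"):
            if line.startswith("PING"):
                # Answer server-initiated pings so the server keeps us connected.
                sock.sendall(line.replace("PING", "PONG", 1).encode() + b"\r\n")
            elif " PONG " in f" {line} ":
                last_pong = time.monotonic()


if __name__ == "__main__":
    run()
```

The knobs that matter are PING_INTERVAL and PONG_TIMEOUT: too small and the bot thrashes the way zppixbot reportedly did, too large and a dead NAT entry goes unnoticed for many minutes.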
[16:14:00] other than that, we could not detect any infra problem in either the k8s NAT or the cloudgw NAT
[16:14:52] my next theory is server-side throttling by libera.chat
[16:17:36] do they all use the global nat?
[16:18:05] everything uses our global nat, except if using a floating IP
[16:18:50] Everything coming out of Toolforge uses the global nat. But also know that this behavior of wikibugs dropping is new. Like within the last 5-6 weeks new.
[16:19:16] when I rejoined the team :-(
[16:19:38] the bot's code and library versions had not changed for years, but suddenly connections became unstable
[16:19:59] and not just to libera.chat, but also to gerrit
[16:20:09] do I make NAT boxes sad? I'll keep digging
[16:21:18] did some bits of the gateway setup get switched on/off recently? Like, did long-hoped-for thing X become a reality?
[16:21:58] there was a world blip https://downdetector.com/, a bunch of sites down
[16:22:38] bd808: not that I know of, but I can double-check
[16:22:46] arturo: re "do I make NAT boxes sad?", at $DAYJOB-2 I was famous for saying often that bored network engineers were the biggest threat to our uptime. ;)
[16:22:53] we upgraded all the worker nodes too
[16:23:28] dcaro: yeah, maybe we can blame containerd.io
[16:23:34] xd
[16:26:33] jokes aside, a slight network change as a result of the bookworm + containerd.io upgrade seems like a 100% possible cause
[16:29:42] * arturo was definitely nerd sniped again
[16:30:07] so, as part of this change, we are now using iptables-save with the nf_tables variant in the k8s workers
[16:30:17] that's definitely a change compared to the previous buster nodes
[16:30:44] I wonder if calico, containerd and kube-proxy are aware of this
[16:31:18] taavi: review when you have a moment: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/65
[17:03:11] andrewbogott: reference: T276327
[17:03:11] T276327: cloud: puppetmasters: adopt cinder volumes to store certs and git repos - https://phabricator.wikimedia.org/T276327
[17:21:53] something broke on the switches? there's a BFD alert
[17:23:22] T359198
[17:23:22] T359198: Icinga BFD check failing - https://phabricator.wikimedia.org/T359198
[17:24:56] 👍
[17:26:39] it's gone now
[17:27:58] * dcaro off
[17:28:01] cya tomorrow
[19:15:19] * bd808 lunch
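On the iptables backend question above: a quick way to confirm which variant a worker is actually using (run on a tools-k8s-worker node; this only shows the backend in use, it doesn't by itself answer whether calico and kube-proxy handle it correctly):

```
# the version string names the active variant, e.g. "iptables v1.8.x (nf_tables)"
# versus "(legacy)"
sudo iptables -V

# Debian selects the variant through the alternatives system
update-alternatives --display iptables

# rules written through the nf_tables backend are also visible to nft directly
sudo nft list ruleset | head
```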