[08:36:10] Raymond_Ndibe: could you give an update on the CD pipeline and if you need help to get it fixed?
[11:56:39] blancadesal: I intend to work on this today
[14:19:09] andrewbogott: morning :) you are on clinic duty, is that okay with you?
[14:20:04] yep, that's fine although now I'm behind :)
[14:24:01] afaik, there is nothing specific 'to do' at the moment :)
[14:25:29] andrewbogott: do you know anything kiwix-mirror-update? T381212
[14:25:29] T381212: SystemdUnitDown kiwix-mirror-update.service - https://phabricator.wikimedia.org/T381212
[14:25:39] *anything about
[14:26:28] I know what it is but not much more than that. Ariel used to be the keeper of the dumps, they handed off the responsibility but I can't remember who they handed it to...
[14:26:46] If you aren't already looking I will look in a few
[14:27:05] I haven't looked yet, hoping you would remember something more than me :D
[14:28:03] 'master.download.kiwix.org' so might be an upstream issue
[14:28:09] oops, paste fail
[14:28:18] 'failed to connect to master.download.kiwix.org (135.181.224.247): Connection refused'
[14:28:56] that also fits with it failing on both servers at once
[14:29:59] this page shows a 503 so maybe they do have upstream issues https://kiwix.org/en/wifi-hotspot/
[14:30:36] this one is also down https://library.kiwix.org
[14:32:40] :(
[14:32:55] I pinged a random kiwix dev on task but I wonder if they have a slack or irc channel...
[14:33:02] * andrewbogott asks in -data-persistence
[14:35:02] dhinus: anyway... I think for now we should ack and ignore that alert.
[14:40:11] andrewbogott: sgtm, or maybe silence it for 2-3 days?
[14:43:01] * dhinus adds a 3-day silence
[14:44:16] sgtm
[15:15:48] andrewbogott: not sure it's related, but now I see "Persistent high iowait" on clouddumps1002
[15:16:25] I don't think that can be related but it might still be something interesting :
[15:19:18] I think that must be some web client hammering it, iowait shows 'www-data' with overwhelming io usage
[15:20:12] andrewbogott: may I interest you in this patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091850
[16:06:42] topranks: if you are free would you mind joining us in https://meet.google.com/onc-zdxs-bok? We're trying to understand a grafana panel from T381078
[16:06:42] T381078: cloudgw: suspected network problems - https://phabricator.wikimedia.org/T381078
[16:09:02] topranks: nevermind, I think dhinus figured out that the graph was summing things that didn't make sense as sums
[16:46:29] Rook: we're looking at paws network activity. Traffic from paws notebooks will always show as originating from one of the magnum cluster VMs, right?
[16:46:47] Or the controller? Or... something else?
[16:46:48] It should, yes
[16:46:57] ok
[16:47:03] If it is coming from notebooks should only be one of the workers, not the control node
[16:47:17] cool, thanks
[16:47:18] And everything that is going on with it looks like it is coming from the workers to me
[16:48:04] yeah, topranks is trying to see what they're contacting. It looks like private/prod IPs which is weird.
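
For the "what are they contacting" question just above, a minimal sketch of how the busiest outbound destinations could be summarised from one of the paws workers, assuming shell access to the worker and that its primary interface is ens3 (both are assumptions, not details taken from the log):

  # Sample new outbound TCP connections and count the most frequent destination IPs.
  # ens3 and the 2000-packet sample size are assumptions.
  sudo tcpdump -nn -i ens3 -c 2000 'tcp[tcpflags] & tcp-syn != 0' \
    | awk '{print $5}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head -20

The resulting list could then be checked against known prod ranges to confirm the "private/prod IPs" observation.
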
[16:51:59] They're ddos scripts
[16:52:09] And proxies
[16:54:25] yeah
[16:54:46] hoping there will be commonalities to block or throttle but no luck so far
[16:57:56] Yeah, the junk that is showing up in the servers is somewhat diverse, there are some popular scripts that seem to show up, but then some random ones as well
[16:58:18] topranks: ^
[17:09:58] ooh 'Warning: The current total number of facts: 3385 exceeds the number of facts limit: 2048' I've never seen that before
[17:11:31] ^ Why does this feel like a political comment?
[17:17:51] wikipedia has too many facts
[17:29:44] In the mediawiki-quickstart Cloud VPS project the ssh ingress rule in the default security group is limited to the `172.16.8.0/22` CIDR. This does not seem to cover the project's own IPs (172.16.5.248) which makes me wonder if the CIDR is actually correct.
[17:30:58] The related rule in deployment-prep uses `172.16.0.0/21` as the allowed network CIDR which is probably overly broad in the other direction...
[17:32:23] This is blocking some work that mhurd is attempting (ssh into an instance from a Cloud VPS hosted GitLab CI job) so I am going to apply the more expansive CIDR to the mediawiki-quickstart ingress rules.
[17:35:31] bd808: looking at a new project (wikiqlever), I see the default group has two rules: 172.16.0.0/21 and 172.16.8.0/22
[17:35:39] both say 'managed by tofu-infra'
[17:36:35] I see the same in mediawiki-quickstart, is that because you just now added it?
[17:36:41] andrewbogott: you are correct. My eyes missed the 172.16.0.0/21 entry. Ok. the problem is somewhere else :)
[17:37:01] 👍
[17:37:09] Maybe they're installing ferm on their vms?
[17:37:32] possible. I am digging deeper.
[17:38:12] bd808: iirc the gitlab runners also have outbound firewall rules
[17:38:45] https://gerrit.wikimedia.org/g/operations/puppet/+/b4d4a849ef9944d41dfd7284ed04be610305340c/hieradata/cloud.yaml#189
[17:41:28] I know that the runners can talk to the OpenStack API endpoints. I assumed that meant they could also make normal inter project traffic happen. This may have been a bad assumption. And also a big documentation gap
[17:42:22] traffic to the openstack api endpoints is not inter-project traffic
[17:47:27] I think we've always just opened ssh to all cross-project IPs as a way to ensure bastion access.
[17:50:19] * bd808 goes poking around in runner-1023.gitlab-runners.eqiad1.wikimedia.cloud to see the live config instead of a cloud of hiera managed possibilities
[19:02:16] I have confirmed that the current configuration for `profile::gitlab::runner` on the WMCS hosted nodes includes `profile::gitlab::runner::restrict_firewall: true` which turns on a default REJECT iptables rule for interactions with the `172.16.0.0/21` network.
[19:02:27] So the WMCS runners functionally cannot access anything that the DO hosted runners cannot access, and because of our split-horizon DNS I think there are actually things you can do from DO that you cannot from WMCS.
[19:03:16] Hopefully Monte will make time to write up a Phab task about this discovery so we can work out if it is fully intended and get things documented somewhere
[20:04:44] andrewbogott: if you are still online, I was wondering if there is a way to get shell access to the K8s nodes for paws?
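
On the shell-access question just above: one generic option is a node debug pod, sketched below under the assumptions that an admin kubeconfig for the paws cluster is available and that its pod security settings allow node debugging; the node name is a placeholder, and this is not necessarily what the pastebin linked further down contains.

  # Start a throwaway pod on the worker with the node's root filesystem mounted at /host.
  # paws-k8s-worker-4 is a placeholder node name, not taken from the log.
  kubectl debug node/paws-k8s-worker-4 -it --image=ubuntu
  # ...then, inside the debug pod, switch into the host filesystem:
  chroot /host /bin/bash
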
[20:05:12] or - to explain what my interest is I built a temp dashboard in your grafana
[20:05:15] https://grafana.wmcloud.org/goto/UVF9SS4Nz
[20:05:41] topranks: I don't think we especially have/use shell access there but Rook would know
[20:05:51] I've found this to be easiest https://www.irccloud.com/pastebin/SC05VlNE/
[20:05:59] I'm wondering in general if we can take those interface names and work them back to specific pods
[20:06:08] lots of them are busy at times, but some are not at all
[20:06:49] Are they running now? I've been keeping regular checks on them while awake and cleaning them up as I find them. Though they could be hiding from me
[20:06:57] Rook: thanks, where would I run that from?
[20:07:27] no things are quiet right now
[20:07:44] Bastion is fine. You can generate a kubectl config file with tofu `bash deploy.sh eqiad1` Or just take one out of my directory
[20:08:07] last big spike of traffic was on node 4 at ~14:00 UTC (3Gb/sec - 350kpps)
[20:08:33] ok I'll take a look, I'm actually not that familiar with the cloud bastions, I believe I have access thanks
[20:09:08] Alrighty, let me know if you need any help
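
On the interface-to-pod question above, a rough sketch of one way to map host-side interface indexes back to pods, assuming kubectl access via the deploy.sh-generated config mentioned in the conversation and that the pod images ship cat; the node name is a placeholder.

  # Each pod's eth0 exposes the ifindex of its host-side veth peer in
  # /sys/class/net/eth0/iflink; that number can be matched against the node's
  # interface numbering (ip link shows each cali/veth interface with its index).
  NODE=paws-k8s-worker-4   # placeholder, not a real hostname from the log
  kubectl get pods -A --field-selector spec.nodeName="$NODE" \
    -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' |
  while read -r ns name; do
    idx=$(kubectl exec -n "$ns" "$name" -- cat /sys/class/net/eth0/iflink 2>/dev/null)
    echo "${idx:-?} $ns/$name"
  done | sort -n

Pods whose images lack cat simply show "?"; the rest print the host-side index that ties a busy interface in the dashboard back to a specific pod.
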