[07:38:56] Looking at scroll, do we have any docs / training on handling k8s issues? I'm on-call next week, and am not sure I'd have the first idea of where to start
[08:27:13] Emperor: wikitech has a good amount of information, but tbh as for how to get the bigger picture (and figure out what exactly to look at in wikitech) I'm not sure
[11:58:04] Emperor: there's a bunch of older training videos from Joe and Alex, those are good places to start
[11:58:27] Not sure if you mean like kubernetes “101” kinda stuff or the specifics of our clusters or somewhere in between though
[11:58:45] I am mostly learning as I go tbh 😅
[12:01:54] more the latter, I have an airily high-level understanding of k8s, but that's not the same as "how do we use it here", or more "where to start looking when it goes wrong"
[12:42:09] most such stuff is under here: https://wikitech.wikimedia.org/wiki/Kubernetes/Administration
[12:44:26] As far as specific debugging for issues, a lot of the simple stuff (e.g. something died, it should be started again) is handled by the platform itself. What's usually left is looking at specific workload logs (all are in logstash) or events (there is a specific kubernetes events dashboard, again in logstash)
[12:54:26] akosiaris: shall we have another go at otel?
[12:55:19] effie: see -serviceops ;)
[12:55:51] sigh, E_TOO_MANY_CHANNELS, but that is fair
[12:57:03] ;-)
[13:40:01] what's the difference in use between hieradata/common/profile/foo.yaml and hieradata/role/common/foo.yaml?
[13:40:39] (e.g. we have puppetserver.yaml in both and it's not obviously clear to me which things are set in one vs the other)
[13:44:15] hi effie, ok if I merge your changes?
[13:45:55] Emperor: iirc it should match a given role or profile; sometimes the roles and profiles have the same name, but a profile needs to be included from a role
[13:46:40] Emperor: the distinction is simply role vs profile: if you want the hiera to be applied to the role, it should go in the role override; if you want it for profiles (such as puppetmaster::*), it goes there
[13:46:58] Emperor: the order of matching is defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/puppetmaster/files/hiera/production.yaml
[13:51:42] stevemunene: yes please!
[13:52:09] stevemunene: sorry for the delay
[13:52:09] volans: that doesn't obviously-to-me mention hieradata/common/profile anywhere? or is that done by wmflib::expand_path somehow?
[13:52:18] great, no worries
[13:54:33] Emperor: yes, I think that's the one converting puppet class paths into hiera file paths
[15:46:21] I'd like to deploy a new shellbox chart version once this deployment window ends (adds explicit timeout settings, https://gerrit.wikimedia.org/r/1005139, should be a no-op in practice), any reason not to?
[15:46:26] I promise not to leave it to NA oncallers if it goes badly :D
[15:47:45] kamila_: SGTM -- in the unlikely event it goes sour later in the day, anything we should know about rolling it back?
[15:48:33] Is it possible to add a mediawiki-config patch I was working on to the train?
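A sketch for the morning "where to start when a workload goes wrong" question, assuming kubectl access to the relevant cluster; the namespace and pod names are illustrative, not from the log:

```
# First-look commands for a misbehaving workload.
kubectl get pods -n example-service                               # anything Pending / CrashLoopBackOff?
kubectl get events -n example-service --sort-by=.lastTimestamp    # recent scheduler/kubelet events
kubectl describe pod example-pod -n example-service               # per-pod events and restart reasons
kubectl logs example-pod -n example-service --previous            # logs from the last crashed container
```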
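For the hiera question, one way to see which of the two files wins for a given key is puppet's built-in lookup explainer; a minimal sketch, with an illustrative node and key name (run where the hieradata is available, e.g. a puppetserver):

```
sudo puppet lookup profile::puppetserver::some_key \
  --node puppetserver1001.eqiad.wmnet --explain
# --explain prints every hierarchy level consulted (role overrides,
# common/profile, etc.) and which file finally supplied the value.
```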
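Re rolling back the shellbox chart: a hedged sketch of the rough shape with plain helm, where the release name, namespace, and revision number are illustrative (WMF deploys via helmfile, where re-applying the previous chart pin is the usual equivalent):

```
helm history shellbox -n shellbox       # find the last known-good revision
helm rollback shellbox 41 -n shellbox   # roll back to it
```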
[15:48:48] The patch is this one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1029664/11
[15:49:19] rzl: just rollback, nothing depends on the new functionality yet
[15:49:28] CC: dancy andre
[15:49:46] denisse: that'll be for a backport window, not the train :) see https://wikitech.wikimedia.org/wiki/Backport_windows
[15:49:53] kamila_: 👍
[15:50:09] thanks rzl <3
[15:50:14] rzl: Thanks, I'm taking a look. :)
[15:50:34] denisse: I have no objection to deploying that. Do you need assistance doing that or will you handle it yourself?
[15:51:27] dancy: I'd greatly appreciate assistance deploying the change as I've never done it before. :D
[15:51:46] OK. I'll talk with you in #wikimedia-operations
[15:51:56] denisse: great news, you're learning how to do a mediawiki-config backport at the best possible time in history 😂
[15:52:08] cdanis: Really? :o How so?
[15:53:03] because of `scap backport` existing
[17:04:29] mutante: great pointer, thanks
[17:04:34] jhathaway: around, by any chance?
[17:04:43] yup
[17:05:12] mail issues?
[17:05:17] if I'm understanding this right it looks like mx1001 / mx2001 can no longer deliver mail *to* phabricator
[17:05:23] yeah, exim queues are backing up, looks phab-related
[17:05:52] ah, interesting
[17:05:54] I think "mail TO phabricator" was broken before, but in other ways: https://phabricator.wikimedia.org/T356077
[17:05:59] yup
[17:06:18] I'll revert the revert to clear the queues, then look into fixing the issue
[17:07:09] thanks :)
[17:07:10] appreciate it -- still here if you need any more hands
[17:07:13] thanks
[17:07:37] and yeah, same, happy to help if needed
[17:08:07] other than the queues backing up, is it causing other issues?
[17:08:28] someone reported phab slowness but I don't know if that's conclusively related or not
[17:08:38] okay, I wouldn't think so
[17:08:50] that's my instinct too but I didn't want to just sit on it :)
[17:08:56] nothing else going on afaik
[17:14:18] https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&from=now-24h&to=now&var-node=phab1004&var-port=9117
[17:14:26] Phab is, uh, a fair bit busier than normal
[17:14:43] another scrape maybe?
[17:14:49] phab email revert applied
[17:15:53] rzl: phab apache throughput hasn't been this high since May 20th
[17:16:03] I don't remember when the last one was but I would guess so
[17:32:31] moritzm: A fun mystery for you: https://phabricator.wikimedia.org/T366310
[18:28:36] jhathaway: hm, it looks like the mx2001 queue stopped growing but didn't start trending downward, and 1001 is still increasing -- is that expected with where you're at so far?
[18:28:46] https://grafana.wikimedia.org/d/000000451/mail?orgId=1&from=now-24h&to=now&refresh=5m
[18:30:13] not sure, I assumed more of the queue would clear out, but the backoff algorithm for exim may mean we need to wait longer, or force the queue to drain. I'll dig in more.
[18:30:55] nod
[18:43:29] rzl: connections are still being refused, even though the iptables rule is now present; not sure why yet
[18:45:18] ack
[18:53:09] IPv6-only?
[18:53:32] and we have 2 different firewall providers, but I doubt mx is already migrated
[18:54:15] i think in the alert it showed only the IPv6 address
[18:59:40] mutante: when did phabricator lose its public IP and go behind LVS?
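The `scap backport` flow cdanis is alluding to, as documented on the wikitech page linked above; a sketch assuming the documented CLI, with the change number taken from denisse's gerrit URL:

```
scap backport 1029664
# Roughly: scap merges the change, pulls it onto the deployment host,
# and syncs it out in stages, prompting for confirmation along the way.
```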
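For the "IPv6-only?" theory about the refused connections, a quick way to separate "v6 path blocked" from "service down"; the target host (taken from the grafana URL above) and port 25 are illustrative of the mx-to-phabricator SMTP path:

```
nc -vz -4 phab1004.eqiad.wmnet 25   # force IPv4
nc -vz -6 phab1004.eqiad.wmnet 25   # force IPv6; refused here while v4 works points at the firewall rules
```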
[19:04:50] jhathaway: it never had one, or that was before 2015
[19:05:13] and before that, bugzilla.wikimedia.org and rt.wikimedia.org were the bug trackers
[19:05:51] here is some history: how in 2015 it was on iridium.eqiad.wmnet and we wanted to expose ssh: https://phabricator.wikimedia.org/T100519
[19:06:12] thanks mutante, I see now there is a special route to phabricator.discovery.wmnet
[19:06:33] 2015: "phab: add IPv6 VCS real server IP" https://gerrit.wikimedia.org/r/c/operations/puppet/+/255164
[19:09:42] I wonder if exim is trying ipv6 only for some reason
[19:10:23] I would expect it to try that first at least before giving up and falling back
[19:10:33] nod
[19:13:01] so there is firewall::provider in hiera role data. some hosts are migrated to nftables and some aren't. but mx is still iptables, just to make sure
[19:13:34] jhathaway: oh, I just noticed mx1001 has an interface::alias straight in site.pp
[19:13:37] jhathaway: is it possible it's trying v6 first, reporting the error, falling back to v4, and then succeeding without error?
[19:13:51] so you have 2 public IPs on the same interface
[19:14:13] I recently ran into this issue when I tried to rsync to/from a host that was also like this
[19:14:17] with the second public IP
[19:14:40] I had to tell rsync that I wanted it to use the other source IP, or I would get firewalled
[19:14:52] cdanis: could be, but the queue hasn't decreased
[19:15:20] mutante: yeah, though I don't know of another change other than the one I reverted to switch phab's outbound MX
[19:15:22] jhathaway: is there a way to clear the retry backoff delay?
[19:15:48] 208.80.154.76 (mx1001) 208.80.154.91 2620:0:861:3:208:80:154:91 (wiki-mail-eqiad.wikimedia.org)
[19:16:05] allowed from/to mx1001 but not that other IP, maybe?
[19:16:07] you can force the queue, but I didn't want to overload phab
[19:18:01] I bet half of this mail is Phab receiving bounce messages
[19:19:10] I'm going to bounce exim on mx1001 to see if that changes its ability to connect; I can connect via nc to phab's wmnet IP
[19:25:00] it looks like that melted the queue
[19:25:52] yup, the old "turn it off and on again" \o/
[19:26:10] you love to see it
[19:28:00] hah :)
[19:28:52] jhathaway: sorry to interrupt your day, thanks for the work on that :)
[19:29:22] not at all! my fault in the first place, thanks for the help
[19:29:31] not sure if related here, but we had other services before where puppet first starts the service, then adds the IPv6 IP, but needs a manual restart then to make it listen on that
[19:47:17] mutante: you're making calico sound so nice by comparison
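The queue-inspection and retry-backoff tooling referenced in the thread above, as a sketch assuming the stock exim utilities on the mx hosts; the destination domain is illustrative:

```
exim -bp | exiqsumm                  # summarize what's sitting in the queue, by destination
exinext phabricator.wikimedia.org    # show the retry/backoff state for a destination
exim -qff                            # force a delivery run, ignoring retry times (frozen messages included)
```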