[07:26:02] brouberol, btullis, stevemunene - o/ there are some alerts related to data platform, https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=team%3Ddata-platform - most of them are days old, could you please check?
[07:26:13] elukey will do
[07:26:16] <3
[07:26:21] (b.tullis is OOO atm)
[07:27:19] there are also errors in config-master for search-* confd resources
[07:27:20] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DConfdResourceFailed
[07:27:43] I suspect this is related to infla*tador's current migration work, but we'd need to check
[10:09:13] I'm looking into getting an urgent puppet script for mentorship reenabled (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136970). To that end, I would like to run the script as a test against testwiki. Is there an issue with doing that now?
[10:20:11] MichaelG_WMF: this is a revert of a patch by Amir1. I think we need his +1 on that one. I see he is already added to the list of reviewers
[10:20:58] yes, sorry I posted here after posting in operations because I was not sure what the right channel for the current window was. Now we're already talking over there. Sorry for the noise!
[10:21:24] ack
[14:56:21] Could an op kickban acooper from this channel, per T392100? Thanks.
[14:56:21] T392100: Offboard Andy Cooper from the Security Team - https://phabricator.wikimedia.org/T392100
[15:13:42] sbassett: this is a publicly accessible channel, I don't see why he should be kickbanned
[15:14:15] _security should be
[15:14:46] But otherwise unless it's access restricted, I don't see why not
[15:15:06] s/not/
[15:16:25] If you wanna see what projects he's in on Phab, use https://phabricator.wikimedia.org/project/query/taMQAss.L1.l/#R too
[15:16:35] Looks like everything sensitive removed
[15:22:02] <_joe_> sbassett: no.
[15:22:09] <_joe_> this is a public channel
[15:22:40] <_joe_> anyone is welcome, in fact I count a large number of former employees here.
[15:34:43] indeed, that would not be needed
[16:25:36] taavi are you making changes to switches in CODFW? I'm having issues with a reimage/vlan move and p-apaul noticed that `user taavi` was in the output. If you wanna join us in #wikimedia-dcops we're having the discussion there
[16:26:18] inflatador: that sounds like https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1127542 not being deployed there, cc topranks
[16:30:12] taavi: yeah thanks just bad timing it seems
[17:58:20] claime: hey, what was the purpose of `Finished scap sync-world: Deploy mediawiki chart 0.8.11 (duration: 03m 02s)`
[17:58:55] I'm seeing logging issues since that deploy, and I'm curious to know what changed.
[18:07:25] cwhite: can you expand on the issues you're seeing?
[18:07:55] swfrench-wmf: major drop in throughput: https://grafana-rw.wikimedia.org/d/VCK8-FpZz/cwhite-logstash?orgId=1&refresh=1m&from=now-3h&to=now
[18:08:27] IOW, I think logstash is having trouble ingesting some logs being generated
[18:08:37] possibly really big logs
[18:09:50] interesting! that should only have picked up https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1137038, which should not affect "normal" mediawiki deployments in any way
[18:10:08] (or logs)
[18:11:33] trying to see what else that might have picked up
[18:14:58] confirmed that "normal" mediawiki deployments are unchanged since the last backport deployment (e.g., pod ages all in ~ 4h20m range)
[18:22:28] Ok, thank you. I'll keep digging on my end.
[18:29:45] cwhite: so, super naively on my part, I'm not seeing a huge increase in aggregate message count or size inbound, e.g., in https://grafana.wikimedia.org/goto/xwRzPmJNR?orgId=1
[18:30:31] if you have a way to back out what's blocking the processing side, I can try to sort out where it came from (if that's possible)
[18:30:50] I'm not suspecting a volume issue. I'm suspecting a per-object size increase.
[18:32:00] Opensearch is complaining about exceeding its field key limit (2048), which may coincide with an application logging huge json objects.
[18:58:19] Is there a task tracking the Logstash issues? (No logs from MW since 17:05 UTC it seems)
[19:05:45] kostajh: we've been using https://phabricator.wikimedia.org/T390215 for logstash issues lately. we're aware it's backed up and working on it now
[19:07:49] ok, thanks, and good luck!
[21:22:57] does anyone know if PCC is supposed to work with cumin aliases? I see it in the git log, but it doesn't seem to work for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137069
[21:27:56] inflatador: I have never seen A: as a selector
[21:29:22] hashar I see it a few times in the puppet git log, but that's no guarantee that it ever actually worked ;) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1093958 is an example
[21:29:32] it is not supported
[21:29:39] but jhathaway would know for sure
[21:30:05] hashar NP, I believe you
[21:30:33] but
[21:30:36] this is the internet
[21:30:42] AI is everywhere spreading misinformation!
[21:30:44] you need refs!
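Editorial sketch for the 18:32:00 message about the field key limit: one rough way to check whether a single JSON log event could push an index toward that limit is to count its distinct flattened key paths. The function names here are hypothetical, the 2048 figure is the one quoted in the channel, and how OpenSearch actually enforces the limit is not stated in the log, so treat this only as an estimate of how many keys one event would contribute.

```python
import json

FIELD_LIMIT = 2048  # figure quoted in the channel, not read from any config


def flatten_keys(value, prefix=""):
    """Yield a dotted key path for every leaf in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_keys(child, path)
    elif isinstance(value, list):
        # List items share the path of the list itself.
        for child in value:
            yield from flatten_keys(child, prefix)
    else:
        yield prefix


def count_field_keys(raw_event):
    """Count the distinct key paths in one JSON-encoded log event."""
    return len(set(flatten_keys(json.loads(raw_event))))


def exceeds_limit(raw_event, limit=FIELD_LIMIT):
    """True if this single event alone carries more key paths than the limit."""
    return count_field_keys(raw_event) > limit
```

Running `count_field_keys` over a sample of recent events and sorting by the result would be one way to spot the application that started logging unusually wide objects.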
[21:30:47] {{CN}}
[21:30:48] https://gerrit.wikimedia.org/g/operations/software/puppet-compiler/+/refs/heads/master/puppet_compiler/controller.py#127
[21:30:57] I believe that is the code dealing with the Hosts: meta header
[21:31:14] so re: O: P: for the one I usually come by
[21:31:28] and from that list there is a `cumin:` which I guess might be the one to use to do some cumin query
[21:31:45] so maybe `Hosts: cumin:A:elastic-eqiad`
[21:31:58] but that is purely a guess. I don't know anything more than the code I have linked above
[21:32:20] * inflatador knows even less than that!
[21:33:12] from `git log --grep 'Hosts: cumin:'` I found a bunch of changes
[21:33:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1110826
[21:33:38] and a single one used the A selector
[21:33:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/745496
[21:33:56] Hosts: cumin:A:pki
[21:34:04] so yeah that might work :]
[21:34:42] The puppet-compiler code is a fun read
[21:35:30] I am off, happy compiling!
[21:37:39] .o/
[21:37:48] thanks for the suggestion!
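Editorial sketch of the idea discussed above: a `Hosts:` line in a commit message carries a selector prefix followed by a query, as in `Hosts: cumin:A:pki`. This is not the puppet-compiler code linked in the channel. The function name, the one-selector-per-line assumption, and the prefix set (limited to the `O:`, `P:` and `cumin:` prefixes named in the conversation) are all assumptions made for illustration.

```python
# Prefixes named in the conversation; the real tool may accept others.
KNOWN_PREFIXES = {"O", "P", "cumin"}


def parse_hosts_header(commit_message):
    """Return (selector, query) pairs for each 'Hosts:' line in a message.

    Assumes one selector per 'Hosts:' line. A value without a known
    prefix is treated as a plain hostname and reported as ('host', value).
    """
    results = []
    for line in commit_message.splitlines():
        line = line.strip()
        if not line.startswith("Hosts:"):
            continue
        value = line[len("Hosts:"):].strip()
        if not value:
            continue
        # Split only on the first colon so 'cumin:A:pki' keeps 'A:pki' whole.
        prefix, sep, rest = value.partition(":")
        if sep and prefix in KNOWN_PREFIXES:
            results.append((prefix, rest))
        else:
            results.append(("host", value))
    return results
```

For the example from the channel, `parse_hosts_header("Hosts: cumin:A:pki")` yields a `cumin` selector whose query is `A:pki`. Whether that query can then be resolved is a separate question.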
[21:43:03] inflatador: I just tried running PCC on the patch h.ashar mentioned, and it couldn't resolve the alias pki, so I don't think that works unfortunately
[21:43:27] jhathaway yeah, didn't work for me either https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137069
[21:43:38] I'm not sure how cumin loads aliases, on the cumin hosts it is just in a flat file, /etc/cumin/aliases.yaml
[21:43:50] so I think we would need to inject a similar config on the pcc hosts
[21:44:39] I agree it would be nice to support them
[21:45:11] It'd be even nicer to use config mgmt that doesn't require hacking to see its output ;P
[21:45:24] woah woah, shots fired :P
[21:46:24] don't worry, I'm back on my meds today ;)
[21:47:22] :)
[21:49:27] thanks for your work on that T389932 btw...I'm trying to run PCC against all the hosts b/c inevitably I'm gonna mess up the regex and then the reimage'll hang
[21:49:27] T389932: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932
[22:22:14] yup, hopefully we will find a good solution
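Editorial sketch of what "inject a similar config on the pcc hosts" could amount to: if the aliases are available as a flat name-to-query mapping, an `A:name` token can be expanded before the query is run. This is not how cumin itself loads or resolves aliases (the channel says that part is unclear). The function name, the regex for alias names, and the alias values in the usage note are hypothetical placeholders, and the mapping is passed in as a plain dict rather than read from /etc/cumin/aliases.yaml.

```python
import re

# Hypothetical pattern for an alias reference such as 'A:pki' or 'A:elastic-eqiad'.
ALIAS_TOKEN = re.compile(r"\bA:([A-Za-z0-9_.-]+)")


def expand_aliases(query, aliases, _seen=()):
    """Replace every 'A:name' token in query with its parenthesised definition.

    aliases is a flat dict of alias name -> query string. Aliases may refer
    to other aliases. An unknown name raises KeyError and a cycle raises
    ValueError, so a failed lookup is reported instead of silently ignored.
    """
    def replace(match):
        name = match.group(1)
        if name in _seen:
            chain = " -> ".join(_seen + (name,))
            raise ValueError(f"alias cycle: {chain}")
        if name not in aliases:
            raise KeyError(f"unknown alias: {name}")
        inner = expand_aliases(aliases[name], aliases, _seen + (name,))
        return f"({inner})"

    return ALIAS_TOKEN.sub(replace, query)
```

With a placeholder mapping such as `{"pki": "host-a or host-b"}`, expanding `A:pki` gives `(host-a or host-b)`, while expanding it against an empty mapping raises `KeyError`, which matches the "couldn't resolve the alias pki" symptom described above.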