[05:41:50] <_joe_> is there any reason why admin.yaml still has data about people who left the wmf 5 years ago and are marked ensure: absent?
[05:41:57] <_joe_> admin/data.yaml
[05:42:08] <_joe_> it's becoming a tad too huge, IMHO
[05:59:08] I *think* to prevent re-use of the same usernames but not sure
[06:36:20] <_joe_> for that, it would be enough to just declare them in the absent group and check if a name is there and in that case just fail()
[06:36:38] <_joe_> that file has become unmanageable (and I wonder why it's not in hiera)
[06:36:49] <_joe_> am I the only one who thinks it's hard to manage at this point?
[06:37:11] it is hard yeah
[07:11:44] good morning! We could use a scap config change to be applied: https://gerrit.wikimedia.org/r/c/operations/puppet/+/724515/ . It drops a direct link to a Kibana dashboard which is no longer relevant ;)
[07:13:44] I don't think data.yaml is unmanageable due to absent users
[07:16:39] I think it is due to its size
[07:18:24] <_joe_> yeah the size is my only concern too
[07:20:19] fair enough, I usually search in the file and hadn't noticed
[08:02:52] as a technical workaround, maybe it could be split into 2 files, or the absent list moved to a yaml array
[08:03:53] hashar: I can deploy
[08:04:24] jynus: thanks
[08:04:39] it would only affect the deployment servers
[08:05:01] but shouldn't the original author be around? unless you take full responsibility for it?
[08:07:07] <_joe_> jynus: I can take full responsibility for that change
[08:07:38] <_joe_> I'll merge it
[08:07:56] it was more of a deference to the original uploader, I have no context on it
[08:07:56] it is really just a cosmetic change
[08:08:00] ok
[08:08:07] the variable is printed to the user when we detect some errors in logstash
[08:08:17] so I am not worried about the impact :]
[08:08:36] I am crafting a more ambitious change to let us "easily" filter errors coming from canary hosts
[08:11:32] lots of puppet errors now
[08:11:46] ema is looking
[08:12:32] what did that break :-\
[08:13:06] unrelated to the above
[08:13:56] <_joe_> yeah definitely
[08:15:06] yeah the puppet failures are due to https://gerrit.wikimedia.org/r/q/68d4584c5d79ac9bcaef41f58b770e2199362986
[08:15:09] reverting
[08:36:10] _joe_, jynus: thank you for the merge attention
[10:21:09] _joe_: re admin.yaml. we need at least one run of puppet with the user in the absent group and the user's ensure parameter set to absent to make sure the user actually gets removed. after that, the reason for the users to stick around is to make sure we don't re-use the username and uid
[10:24:12] and for hosts which didn't run Puppet
[10:24:31] like hosts which were powered down for hw maintenance until the vendor sends spare parts etc.
[10:52:26] heads up, I just caused an outage
[10:52:35] akosiaris: need a hand ?
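[Editor's note] One way to picture the approach _joe_ sketches above (keep only a list of retired names, and fail if a new entry re-uses one): a minimal Python sketch, assuming a simplified data.yaml layout with a `users` map and a `groups.absent.members` list. The file path and structure are assumptions for illustration; the real check would live in Puppet/CI, not this script.

```python
#!/usr/bin/env python3
"""Hypothetical re-use check for admin user data (illustration only)."""
import sys
import yaml


def check_no_reuse(data, new_name, new_uid):
    """Fail if new_name or new_uid collides with an existing or retired user."""
    users = data.get('users', {})
    # Retired names: either listed in the absent group or still declared
    # in users with ensure: absent (assumed layout).
    absent_names = set(data.get('groups', {}).get('absent', {}).get('members', []))
    absent_names |= {n for n, u in users.items() if u.get('ensure') == 'absent'}
    used_uids = {u.get('uid') for u in users.values() if u.get('uid') is not None}

    if new_name in users or new_name in absent_names:
        sys.exit(f"username {new_name!r} has already been used, pick another one")
    if new_uid in used_uids:
        sys.exit(f"uid {new_uid} has already been used, pick another one")


if __name__ == '__main__':
    # Path assumed; adjust to wherever admin/data.yaml lives in the repo.
    with open('modules/admin/data/data.yaml') as f:
        check_no_reuse(yaml.safe_load(f), sys.argv[1], int(sys.argv[2]))
```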
[10:52:45] it's already fixing itself, there will be some fallout as far as pages go
[10:52:52] alright cool
[10:53:17] effie: just making sure you saw this already https://phabricator.wikimedia.org/T291990
[10:58:40] ack
[12:32:40] if you are a pontoon user and your puppet server has started asking for a password over ssh please let me know, you'll need a puppet.git on the server with ecd7eda428b5f1 included
[13:07:04] https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-29_eqiad-kubernetes
[13:07:11] incident report for the outage earlier ^
[13:09:09] thanks
[13:48:00] jelto, jbond: i think we're ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/710083
[13:59:54] brennen: I would be fine with opening up GitLab. However, do you plan to announce it community-wide too? I'm currently working on the puppetisation (https://gerrit.wikimedia.org/r/c/operations/puppet/+/724430). I don't expect anything to break, but maybe it makes sense to wait until we are 100% on puppet
[14:00:28] jelto: my thought was that we should wait about a week for the announcement
[14:00:31] sort of a soft launch
[14:01:00] let any weirdness surface before we advertise it more widely.
[14:01:16] brennen: sounds good :)
[14:03:30] https://blog.sinkingpoint.com/posts/elasticsearch-logging/ interesting read. It does raise a couple of good points.
[14:06:22] akosiaris: thanks! would it also make sense to run more than one replica of typha?
[14:38:07] cdanis: possibly. We haven't even researched that yet though.
[14:52:18] ok! do you mind adding an actionable to check?
[14:52:26] in general i think one-replica services are an antipattern
[14:53:55] Sure
[15:17:13] I got a CAS error twice logging into gitlab
[15:17:17] jelto: ^
[15:33:12] godog (or anyone who knows): When I'm using the "free" log collection in the k8s cluster (stderr -> rsyslog), do I still need to add the "@cee" token to indicate json structured logs to rsyslog?
[15:34:41] bd808: mmhh IIRC no token needed, no
[15:38:09] cool. My logs don't seem to be making it into logstash, so I was wondering if that could be the issue. I'll start a ticket sometime today about it.
[15:40:17] ack
[15:50:07] RhinosF1: could you add any details to T291964
[15:50:08] T291964: Attempting to login to gitlab.wikimedia.org sometimes results in CAS 500 Internal Server Error - https://phabricator.wikimedia.org/T291964
[15:57:12] jbond: not really
[15:57:19] More details button wouldn't even load
[15:57:24] Press login, get 500
[15:58:11] RhinosF1: ack, I'm looking at the logs and it seems to be the same stack trace as the one reported in that ticket
[15:59:31] jbond: if it helps, Chrome latest on iOS 15
[15:59:40] ack thanks
[16:02:11] are debdeploy manifests stored anywhere? Just curious to compare one I've created with historical examples for the same library
[16:04:29] hnowlan: not stored anywhere, but there are a couple in my home dir on cumin1001 or cumin2002:/home/jmm/debdeploy/
[16:05:08] jbond: thanks!
[16:07:12] does debdeploy work in beta?
[16:07:59] hnowlan: AFAIK no
[16:11:19] jbond: rats. any idea if there is an official alternative policy for rolling out packages that would otherwise be deployed via it?
[16:12:22] <_joe_> hnowlan: ask in #-cloud-admin, but i think cumin is available for beta
[16:13:33] hnowlan: I'm not too familiar with beta, but in general cloud relies on unattended upgrades and manual apt invocations
[16:15:50] hnowlan: there is a "deployment-cumin.deployment-prep.eqiad1.wikimedia.cloud" instance. I'd guess that's your best bet.
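[Editor's note] For context on bd808's "@cee" question: the "@cee:" cookie is a prefix some rsyslog setups use to recognise that a syslog message payload is structured JSON. A minimal Python sketch of an app emitting structured log lines to stderr, assuming (per the reply above) that plain JSON lines are enough in this pipeline; field names and the commented-out @cee variant are illustrative only.

```python
import json
import sys
import time


def log_event(msg, **fields):
    """Emit one structured log line to stderr as a single JSON object."""
    record = {'timestamp': time.time(), 'message': msg, **fields}
    line = json.dumps(record)
    # If the collector required the CEE cookie, the line would instead be:
    # line = '@cee: ' + line
    print(line, file=sys.stderr, flush=True)


log_event('request handled', status=200, latency_ms=42)
```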
[16:16:03] thanks!
[16:16:20] and yes, unattended upgrades should be running on all Cloud VPS instances
[16:17:43] https://openstack-browser.toolforge.org/project/deployment-prep is a handy way to find things hiding in deployment-prep
[16:32:27] the kubemaster2001 latency icinga alert seems to be related to https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=18&orgId=1&from=now-1h&to=now&var-server=kubemaster2001&var-datasource=thanos&var-cluster=kubernetes
[16:32:32] allocstalls? weird
[16:37:25] the other master did not have such issues...
[16:37:36] interesting...
[16:38:27] <_joe_> it swapped
[16:39:39] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=4&orgId=1&from=now-1h&to=now&var-server=ganeti2007&var-datasource=thanos&var-cluster=ganeti
[16:39:42] found the culprit
[16:39:46] the hosting node is packed
[16:40:54] clusters might need a rebalancing
[16:42:04] Initial check done: 2 bad nodes, 24 bad instances.
[16:42:04] Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
[16:42:05] Initial score: 47.23675021
[16:42:06] yup
[16:42:10] definitely in need of rebalancing
[18:56:35] something seems to have happened with the wmf-style checks run by CI: I have one where I get "wmf-style: total violations delta 1" but also "New violations: Nothing found" at the same time. That seems impossible... to get a delta but nothing is fixed or resolved. hrmm
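[Editor's note] One hypothetical way the puzzle in the last message can happen (purely an illustration, not how the wmf-style check is actually implemented): if the "delta" is a difference of total violation counts while "new violations" is a set difference of violation identities, then a second instance of an already-present violation raises the total without producing anything "new". A small Python sketch of that mismatch:

```python
# Hypothetical before/after violation lists keyed by (file, rule);
# illustration only, not the actual wmf-style tooling.
before = [('foo.pp', 'wmf_styleguide'), ('bar.pp', 'wmf_styleguide')]
after = [('foo.pp', 'wmf_styleguide'), ('bar.pp', 'wmf_styleguide'),
         ('bar.pp', 'wmf_styleguide')]  # a second hit of an existing kind

delta = len(after) - len(before)      # 1: total count went up
new_kinds = set(after) - set(before)  # empty: no previously unseen (file, rule) pair
print(f"total violations delta {delta}; new violations: {new_kinds or 'Nothing found'}")
```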