[08:23:51] I wonder if you would have time today elukey to talk about our lord and saviour, the megacli check?
[08:25:27] jynus: o/ I have some meetings this morning, what about after lunch?
[08:25:47] 👍
[10:01:09] Hi everyone, we are investigating why the `kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}` metric stopped being exported from the codfw site: https://prometheus-codfw.wikimedia.org/ops/graph?g0.expr=kafka_burrow_partition_lag%7Bgroup%3D%22cpjobqueue-ORESFetchScoreJob%22%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=2d
[10:01:09] this is causing a recurring linting alert for the ML team, as reported in https://phabricator.wikimedia.org/T399683
[10:01:09] does anyone recall a change around that time that might have affected this job? Any pointers would be greatly appreciated.
[10:01:09] cc: elukey, isaranto ---^
[13:07:15] K8s question: where do we set quotas for individual namespaces? I've got an app that's hard down right now and it looks like a quota issue: https://phabricator.wikimedia.org/P79233
[13:12:14] nm, looks like helmfile.d/admin_ng/values/main.yaml
[13:20:23] Hello. I'm seeking a review of this namespace limitrange change on wikikube: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1170145 Thanks.
[13:21:14] btullis: I think we're good, it turned out to be a release-specific setting rather than a general quota issue
[13:21:23] and it looks like you fixed it already ;)
[13:23:08] ^^ correction on the above, we may still have an issue
[13:25:38] inflatador: o/ afaics it looks like a limitrange issue, not a quota one (pod/container level vs. overall cpu/memory used in a namespace). What errors do you see?
[13:29:25] elukey: You're right. We saw this: `maximum memory usage per Container is 3Gi, but limit is 4Gi` - it was addressed by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1170145
[13:30:04] Deploying now. Thanks.
[13:33:38] btullis: ah yes yes, I asked because I saw "^^ correction on the above, we may still have an issue"
[13:33:43] perfect then :)
[13:33:53] elukey: ACK, thanks for your help
[13:40:26] ^^ looks like we're out of the woods... alerts are clearing
[14:42:29] Is there anything wrong with Kafka at the moment? The same application listed above is getting constant disconnection errors ([Consumer clientId=cirrus-streaming-updater-producer-eqiad:mediawiki.cirrussearch.page_rerender.v1-0, groupId=cirrus-streaming-updater-producer-eqiad] Disconnecting from node 1005 due to socket connection setup timeout.)
[15:14:49] ^^ FWIW it looks like this cleared up; I guess it was a knock-on effect from backpressure created by the earlier incident?
[19:21:33] FYI, in a little while I'll briefly disable puppet on O:configcluster (etcd) hosts to merge and apply [0] under supervision. This should be fine, but given the rarity of these changes, might as well be careful.
[19:21:33] [0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1111239
[19:53:56] ^ this is done, nothing exciting to report
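
A minimal sketch of how the `kafka_burrow_partition_lag` series discussed at 10:01 could be checked programmatically, assuming the standard Prometheus HTTP API is reachable under the same /ops prefix as the graph URL shared above; the script itself and its output handling are illustrative, not an existing tool:

# Sketch: query the Burrow consumer-lag metric that stopped being exported
# from codfw. The endpoint and the consumer group name come from the
# conversation above; everything else is a placeholder.
import requests

PROM = "https://prometheus-codfw.wikimedia.org/ops"
QUERY = 'kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if not result:
    print("No series returned - the metric is not being exported right now.")
for series in result:
    labels = series["metric"]
    _, value = series["value"]
    print(f'topic={labels.get("topic")} partition={labels.get("partition")} lag={value}')

An empty result set here matches what the graph shows: the exporter stopped producing the series, rather than the lag dropping to zero.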
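
To illustrate the limitrange-vs-quota distinction from the 13:25 exchange (a LimitRange caps per-container/per-pod requests and limits, while a ResourceQuota caps aggregate usage across a namespace), here is a short sketch using the official `kubernetes` Python client. It assumes a kubeconfig with read access to the cluster; the namespace name is a placeholder, not the one from the incident:

# Sketch: list the LimitRange and ResourceQuota objects in a namespace to see
# which kind of constraint is rejecting a deployment. Assumes the `kubernetes`
# Python client and a kubeconfig with read access.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
namespace = "example-namespace"  # placeholder

for lr in v1.list_namespaced_limit_range(namespace).items:
    for limit in lr.spec.limits:
        # e.g. type=Container with max={'memory': '3Gi'} would reject a 4Gi
        # container limit, matching the "maximum memory usage per Container
        # is 3Gi, but limit is 4Gi" error seen earlier.
        print(f"LimitRange {lr.metadata.name}: type={limit.type} max={limit.max}")

for rq in v1.list_namespaced_resource_quota(namespace).items:
    # ResourceQuota limits the namespace-wide total, which was not the
    # problem in this case.
    print(f"ResourceQuota {rq.metadata.name}: hard={rq.spec.hard}")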