[08:23:51] I wonder if you would have time today elukey to talk about our lord and saviour, the megacli check?
[08:25:27] jynus: o/ I have some meetings this morning, what about after lunch?
[08:25:47] 👍
[10:01:09] Hi everyone, we are investigating why the `kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}` metric stopped being exported from the codfw site: https://prometheus-codfw.wikimedia.org/ops/graph?g0.expr=kafka_burrow_partition_lag%7Bgroup%3D%22cpjobqueue-ORESFetchScoreJob%22%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=2d
[10:01:09] this is causing a recurring linting alert for the ML team, as reported in https://phabricator.wikimedia.org/T399683
[10:01:09] does anyone recall a change around that time that might have affected this job? Any pointers would be greatly appreciated.
[10:01:09] cc: elukey, isaranto ---^
[13:07:15] K8s question: where do we set quotas for individual namespaces? I've got an app that's hard down right now and it looks like a quota issue: https://phabricator.wikimedia.org/P79233
[13:12:14] nm, looks like helmfile.d/admin_ng/values/main.yaml
[13:20:23] Hello. I'm seeking a review of this namespace limitrange change on wikikube: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1170145 Thanks.
[13:21:14] btullis: I think we're good, it turned out to be a release-specific setting rather than a general quota issue
[13:21:23] and it looks like you fixed it already ;)
[13:23:08] ^^ correction on the above, we may still have an issue
[13:25:38] inflatador: o/ afaics it looks like a limitrange issue, not a quota one (pod/container level vs. overall cpu/memory used in a namespace). What errors do you see?
[13:29:25] elukey: You're right. We saw this: `maximum memory usage per Container is 3Gi, but limit is 4Gi` - it was addressed by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1170145
[13:30:04] Deploying now. Thanks.
[13:33:38] btullis: ah yes yes, I asked because I saw "^^ correction on the above, we may still have an issue"
[13:33:43] perfect then :)
[13:33:53] elukey: ACK, thanks for your help
[13:40:26] ^^ looks like we're out of the woods... alerts are clearing
[14:42:29] Is there anything wrong with Kafka at the moment? The same application listed above is getting constant disconnection errors ([Consumer clientId=cirrus-streaming-updater-producer-eqiad:mediawiki.cirrussearch.page_rerender.v1-0, groupId=cirrus-streaming-updater-producer-eqiad] Disconnecting from node 1005 due to socket connection setup timeout.)
[15:14:49] ^^ FWIW it looks like this cleared up; I guess it was a knock-on effect from backpressure created by the earlier incident?
[19:21:33] FYI, in a little while I'll briefly disable puppet on O:configcluster (etcd) hosts to merge and apply [0] under supervision. This should be fine, but given the rarity of these changes, might as well be careful.
[19:21:33] [0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1111239
[19:53:56] ^ this is done, nothing exciting to report
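
A minimal sketch of how the `kafka_burrow_partition_lag` series discussed at 10:01 could be checked programmatically, assuming the standard Prometheus HTTP API is reachable under the same /ops prefix as the graph URL shared above; the script itself and its output handling are illustrative, not an existing tool:

# Sketch: query the Burrow consumer-lag metric that stopped being exported
# from codfw. The endpoint and the consumer group name come from the
# conversation above; everything else is a placeholder.
import requests

PROM = "https://prometheus-codfw.wikimedia.org/ops"
QUERY = 'kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if not result:
    print("No series returned - the metric is not being exported right now.")
for series in result:
    labels = series["metric"]
    _, value = series["value"]
    print(f'topic={labels.get("topic")} partition={labels.get("partition")} lag={value}')

An empty result set here matches what the graph shows: the exporter stopped producing the series, rather than the lag dropping to zero.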
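
To illustrate the limitrange-vs-quota distinction from the 13:25 exchange (a LimitRange caps per-container/per-pod requests and limits, while a ResourceQuota caps aggregate usage across a namespace), here is a short sketch using the official `kubernetes` Python client. It assumes a kubeconfig with read access to the cluster; the namespace name is a placeholder, not the one from the incident:

# Sketch: list the LimitRange and ResourceQuota objects in a namespace to see
# which kind of constraint is rejecting a deployment. Assumes the `kubernetes`
# Python client and a kubeconfig with read access.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
namespace = "example-namespace"  # placeholder

for lr in v1.list_namespaced_limit_range(namespace).items:
    for limit in lr.spec.limits:
        # e.g. type=Container with max={'memory': '3Gi'} would reject a 4Gi
        # container limit, matching the "maximum memory usage per Container
        # is 3Gi, but limit is 4Gi" error seen earlier.
        print(f"LimitRange {lr.metadata.name}: type={limit.type} max={limit.max}")

for rq in v1.list_namespaced_resource_quota(namespace).items:
    # ResourceQuota limits the namespace-wide total, which was not the
    # problem in this case.
    print(f"ResourceQuota {rq.metadata.name}: hard={rq.spec.hard}")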