[04:16:25] serviceops, LPL Essential, MinT, Community Wishlist (Translations), Community-Tech (Ezo Red Fox (July 29 - Aug 9, 2024)): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#10056502 (santhosh) @jijiki That should be ok. Our team capacity is also thin in this month....
[07:18:12] serviceops, Kubernetes: Better visibility for throttled pods - https://phabricator.wikimedia.org/T372241 (fgiunchedi) NEW
[07:19:55] serviceops, observability, Kubernetes: Alert on unscrapable pods - https://phabricator.wikimedia.org/T372242 (fgiunchedi) NEW
[07:43:40] serviceops, observability, Kubernetes: Alert on unscrapable pods - https://phabricator.wikimedia.org/T372242#10056802 (JMeybohm) With how the prometheus service discovery currently works (e.g scraping every container port by default) we do have a large number of "okay to be down" targets, so an alert...
[07:51:03] serviceops, Kubernetes: Better visibility for throttled pods - https://phabricator.wikimedia.org/T372241#10056826 (JMeybohm) Generally speaking throttling is not an issue (as long as availability/latency targets are still met) but more a measure against processes going rough (so it's very common and kind...
[09:19:52] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: cfssl-issuer: Generate Kubernetes Events - https://phabricator.wikimedia.org/T337928#10056995 (JMeybohm) a:JMeybohm
[12:02:43] serviceops, Kubernetes: Better visibility for throttled pods - https://phabricator.wikimedia.org/T372241#10057336 (fgiunchedi) That's fair, thank you for the rationale @JMeybohm !
Feel free to resolve/decline the task as you see fit
[12:11:08] serviceops, observability, Kubernetes: Alert on unscrapable pods - https://phabricator.wikimedia.org/T372242#10057363 (fgiunchedi) Indeed on the pod granularity the alert would be noisy, I checked the data in terms of "percentage of reported `up`" by namespace + app and maybe this has more signal? ht...
[12:13:16] serviceops, DC-Ops, ops-codfw, SRE: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10057365 (JMeybohm) One thing I've noticed is that kafka-main2010 seems to have a different disk than all the others (all others are 1.7T models): ` sde...
[13:06:43] serviceops, Kubernetes: Remove deprecated cloudnative-pg charts from chart-museum - https://phabricator.wikimedia.org/T371667#10057600 (brouberol) Due to a review error, we also had a chart misnaming with a chart called `cluster` instead of `cloudnative-pg-cluster`. Could you remove the `cluster` chart a...
[13:16:59] serviceops, DC-Ops, ops-codfw, SRE: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10057632 (Jhancock.wm) yes, I have a surplus of 1.7G disks and almost no 1G. so you get a bonus.
[14:09:01] serviceops, Kubernetes: Remove deprecated cloudnative-pg charts from chart-museum - https://phabricator.wikimedia.org/T371667#10057894 (JMeybohm) Just to be extra sure, you want the following to be removed: - stable/cloudnative-pg-operator-0.2.0.tgz - stable/cloudnative-pg-operator-crds-0.1.0.tgz -...
[15:02:51] serviceops, Kubernetes: Remove deprecated cloudnative-pg charts from chart-museum - https://phabricator.wikimedia.org/T371667#10058085 (brouberol) Yes, that's perfect.
[15:41:50] Hi! We were reviewing some grafana boards for parsoid and the cluster overview shows up like it has no data: https://grafana.wikimedia.org/goto/xTAvcVjIR?orgId=1
[15:41:58] Is this related to k8s migration ?
[15:42:30] Correction: it shows up metrics for the parsoid cluster, but utilization is very low
[15:43:03] nemo-yiannis: yes, all parsoid traffic has been moved to k8s, and I think the `parsoid` cluster of baremetal hosts is vestigial
[15:43:12] ok, thanks!
[15:43:12] something more useful here: https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&refresh=1m&from=now-12h&to=now
[15:43:29] or also https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&var-site=All&var-deployment=mw-parsoid&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki I think?
[15:44:41] sounds good, thanks!
[17:44:46] brouberol: you probably saw by now, but the bug was caused by the puppet change where quoting of the integer map key was missed. Fixed up in , but I didn't submit a revert of the revert to put things back.
[17:49:23] serviceops, Data Products, Data-Platform-SRE, Dumps-Generation, and 2 others: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650#10058764 (xcollazo) >>! In T352650#10051791, @dr0ptp4kt wrote: > - If I'm understanding correctly, people...
[17:49:58] I saw, that's on me! I did revert the revert, and everything seems to be working fine now
[17:54:54] excellent. glad you are back towards being on track.
[19:10:57] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10059256 (VRiley-WMF) Thanks @JMeybohm Currently, at eqiad we don't have many 960 gig SSDs. However, we do have larger sizes. As I understand, t...
[19:12:38] serviceops, Shellbox, SRE, Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10059265 (Catrope) Open→Resolved Thank you for weighing in everyone! I think we've gotten enough useful advice here that we c...
[21:05:59] serviceops, LPL Essential, MinT, Community Wishlist (Translations), Community-Tech (Fennec Fox (Aug 12-23, 2024)): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#10059861 (MusikAnimal)