[06:24:05] serviceops: Improve kafka main 's partitions usage and leaders using topicmappr's rebalance - https://phabricator.wikimedia.org/T345077 (elukey) I am trying to follow up on this: ` Broker distribution: degree [min/max/avg]: 4/4/4.00 -> 4/4/4.00 - Broker 2001 - leader: 282, follower: 564, total: 846...
[07:01:01] hello folks
[07:01:21] I am discovering more and more usages of the topicmappr tool, we can optimize for storage, partition count, etc..
[07:01:28] but of course one can affect the other
[07:01:49] (like we balance partition count but storage usage gets uneven, etc..)
[07:02:21] At this point I'd say that keeping a sane storage usage across kafka brokers may be a good primary key point
[07:02:43] having partition count not even isn't ideal but it doesn't cause issues afaics
[07:03:04] (more info added to T345077)
[07:03:10] thoughts?
[07:10:23] serviceops, Maps, Regression, Russian-Sites: Vandal attack on OpenStreetMap affected Wikimedia Maps - https://phabricator.wikimedia.org/T344753 (Jgiannelos) Unless we plan to invest some time to make ad-hoc regional tile invalidation thats relatively fast I don't think we have any method other th...
[07:13:30] elukey: considering my limited knowledge it makes sense to me to try to balance storage, especially with new topics incoming from search
[07:22:59] same thought, and it can possibly also lead to a good balance in traffic as well (not 100% correlated of course)
[07:23:22] we have a lot of topics with little traffic, they are probably the reason for that imbalance
[07:23:52] at this point I'd debianize topicmappr and metricsfetcher (the two tools) so that we have them on kafka nodes
[07:24:20] and possibly, if everybody agrees, I'd proceed with the fine-tune-rebalance in main-codfw
[07:24:42] storage is ok there but it is a good testbed
[07:24:57] sounds good to me
[07:24:59] main-eqiad still has some big imbalance in storage used across brokers
[07:25:14] if the plan works we can fine-tune-rebalance there
[07:25:19] ack perfect :)
[07:25:22] I'll write docs at the end
[07:25:25] I promise
[07:26:05] <3
[07:59:29] Great!
[08:17:58] cassandra is down on restbase1030, it apparently has an SSD problem https://phabricator.wikimedia.org/T344259
[08:23:13] the /etc/cassandra-a/service-enabled file is not present, apparently a manual intervention
[08:23:20] I wonder if it's just a downtime that expired
[08:38:18] claime: it is, see https://phabricator.wikimedia.org/T344761
[08:40:04] yeah I was just surprised I saw no mention of a downtime, didn't see one in icinga, and afaik we don't have a downtime history on alertmanager
[08:54:19] fyi grafana apparently leaked their signing key for their apt packages and had to rotate it: https://grafana.com/blog/2023/08/24/grafana-security-update-gpg-signing-key-rotation/
[08:58:13] mszabo: thanks for the pointer, we fixed that yesterday: https://gerrit.wikimedia.org/r/c/operations/puppet/+/952841
[08:58:29] and no grafana debs have been imported between the 24th and when the fix landed
[08:58:54] awesome, thank you!
[09:01:45] serviceops, Maps, Regression, Russian-Sites: Vandal attack on OpenStreetMap affected Wikimedia Maps - https://phabricator.wikimedia.org/T344753 (MSantos) Open→Resolved a:Jgiannelos Agreed, for that we already have an issue {T231885}
[09:11:31] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey) >>! In T345058#9123817, @Eevans wrote: > Spoiler alert though: For good or bad, we're not really setup to be doing repairs at all. Does it mean that t...
[10:32:06] FYI, kubetcd2006 will briefly go down for a Ganeti node reboot
[10:42:38] it's back
[11:01:02] serviceops, MW-on-K8s, Patch-For-Review: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (Clement_Goubert) CPU limits have now been removed on all mw-on-k8s deployments except mw-misc. We'll wait a few days to see how the reduced concurrenc...
[12:10:14] FYI, kubetcd1004 will briefly go down for a Ganeti node reboot
[12:43:31] serviceops, Observability-Tracing, Patch-For-Review, User-fgiunchedi: jaeger is configured to receive traces from production - https://phabricator.wikimedia.org/T344253 (fgiunchedi)
[13:37:12] serviceops, SRE, ops-eqiad: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (VRiley-WMF) Updated physical labeling as requested.
[13:40:43] kubetcd1006 will briefly go down for a Ganeti node reboot
[13:56:00] folks I am slowly rolling out https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/937432
[13:56:05] (eventgate pods)
[13:56:11] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye
[13:56:12] the change has been running on main for ages
[13:56:26] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[14:13:48] thanks luca! :)
[14:18:09] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (Eevans) >>! In T345058#9126001, @elukey wrote: >>>! In T345058#9123817, @Eevans wrote: >> Spoiler alert though: For good or bad, we're not really setup to be d...
[14:27:02] serviceops, Content-Transform-Team-WIP, Maintenance-Worktype, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 (jijiki) In progress→Stalled
[14:32:05] ok I was far too optimistic in believing envoy config can reference environment variables :( specifically for T320563 I'd like the node name or address substituted somewhere in the envoy config to talk to otel-collector
[14:32:45] * godog sobs in envoy config
[14:36:38] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye
[14:38:04] godog: There's an ugly way of doing it using envsubst in the entrypoint.sh (and now jayme will hunt me down)
[14:39:18] that won't work really as the config is coming in a configmap...well...we could read it from there and write it to someplace else
[14:39:26] claime: lol
[14:40:08] jayme: I'm correct in thinking we can't use the downward api in a configmap, right?
[14:40:31] Like reference spec.nodeIp or whatever
[14:40:36] I think you probably can
[14:40:49] but you'd still have to get that value into the envoy config
[14:43:21] godog: might be the wrong path, but maybe try to look in the direction of "file based cds" (cluster discovery service)
[14:44:40] jayme: interesting, thank you! then I'm guessing at that point we'd write a file for cds to pick up at container startup ?
[14:45:24] yeah...no idea beyond that I know this system exists :D
[14:47:14] maybe it's as shitty as envsubst in the end...
[14:47:49] heh so far it seems slightly less of a hack, at least it is all within envoy
[14:48:03] serviceops, Data-Persistence, Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) Open→In progress p:Triage→Medium...
[15:00:03] godog: there also is the envoy cli argument --config-yaml where one can specify inline yaml that gets merged with the config....
[15:00:11] sounds totally safe to use :D
[15:00:30] Totally
[15:01:04] jayme: hahah! I'd like to unsubscribe
[15:01:30] seriously though, I'll update the task with the latest findings/suggestions
[15:01:58] haha, lol https://stackoverflow.com/questions/54047568/how-can-i-use-environment-variables-in-the-envoyproxy-config-file
[15:05:45] I'm both not surprised and amused that jaeger is also involved here
[15:06:41] that's exactly what I found earlier lol
[15:37:58] eventgate changes rolled out to all instances :)
[15:56:34] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[16:17:07] elukey: <3
[16:36:59] serviceops, Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris)
[16:37:04] serviceops, CX-cxserver, Language-Team, Kubernetes: cxserver: Section Mapping Database (m5) not accessible by certain region - https://phabricator.wikimedia.org/T341117 (akosiaris) In progress→Resolved Fix merged and deployed. Some hiccups aside, it works fine across all 3 environments (s...
[16:38:46] serviceops, Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) Update: Patch was merged and deployed for cxserver. Things went ok a...
[16:40:36] serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite) It looks like current metrics in graphite are p50, p75, p95, and p99. Probably this set is the bare minimum. Any other percentiles worth gathering?
[18:55:32] serviceops, SRE, ops-eqiad: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (Jclark-ctr)
[18:55:42] serviceops, SRE, ops-eqiad: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (Jclark-ctr) Open→Resolved
[18:55:52] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (Jclark-ctr)
[19:14:38] serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (akosiaris) Those [numbers](https://github.com/prometheus/statsd_exporter#global-defaults) are for summary quantiles, not histograms buckets. Summaries [aren't aggregata...
[20:38:11] serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite)
[20:38:25] serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite) >>! In T344751#9128570, @akosiaris wrote: > Those numbers are for summary quantiles, not histograms buckets. Good catch! I've updated the task description.
[20:44:17] serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite)
[21:42:09] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)
[22:59:29] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)