[06:24:05] <wikibugs>	 serviceops: Improve kafka main 's partitions usage and leaders using topicmappr's rebalance - https://phabricator.wikimedia.org/T345077 (elukey) I am trying to follow up on this:  ` Broker distribution:   degree [min/max/avg]: 4/4/4.00 -> 4/4/4.00   -   Broker 2001 - leader: 282, follower: 564, total: 846...
[07:01:01] <elukey>	 hello folks
[07:01:21] <elukey>	 I am discovering more and more usages of the topicmappr tool, we can optimize for storage, partition count, etc..
[07:01:28] <elukey>	 but of course one can affect the other
[07:01:49] <elukey>	 (like we balance partition count but storage usage gets uneven, etc..)
[07:02:21] <elukey>	 At this point I'd say that keeping a sane storage usage across kafka brokers may be a good primary key point
[07:02:43] <elukey>	 having partition count not even isn't ideal but it doesn't cause issues afaics
[07:03:04] <elukey>	 (more info added to T345077)
[07:03:10] <elukey>	 thoughts?
[07:10:23] <wikibugs>	 serviceops, Maps, Regression, Russian-Sites: Vandal attack on OpenStreetMap affected Wikimedia Maps - https://phabricator.wikimedia.org/T344753 (Jgiannelos) Unless we plan to invest some time to make ad-hoc regional tile invalidation that's relatively fast I don't think we have any method other th...
[07:13:30] <jayme>	 elukey: considering my limited knowledge it makes sense to me to try to balance storage, especially with new topics incoming from search
[07:22:59] <elukey>	 same thought, and it can possibly also lead to a good balance in traffic as well (not 100% correlated of course)
[07:23:22] <elukey>	 we have a lot of topics with little traffic, they are probably the reason for that imbalance
[07:23:52] <elukey>	 at this point I'd debianize topicmappr and metricsfetcher (the two tools) so that we have them on kafka nodes
[07:24:20] <elukey>	 and possibly, if everybody agrees, I'd proceed with the fine-tune-rebalance in main-codfw
[07:24:42] <elukey>	 storage is ok there but it is a good testbed
[07:24:57] <jayme>	 sounds good to me
[07:24:59] <elukey>	 main-eqiad still has some big imbalance in storage used across brokers
[07:25:14] <elukey>	 if the plan works we can fine-tune-rebalance there
[07:25:19] <elukey>	 ack perfect :)
[07:25:22] <elukey>	 I'll write docs at the end
[07:25:25] <elukey>	 I promise
[07:26:05] <jayme>	 <3
[07:59:29] <claime>	 Great!
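The trade-off discussed above (an even partition count can still leave storage uneven) can be illustrated with a minimal sketch. This is not how topicmappr computes anything; the partition sizes and the broker ids other than 2001 are hypothetical values chosen only to show the effect.

```python
from statistics import mean


def spread(values):
    """Relative spread: (max - min) / mean; 0.0 means perfectly even."""
    values = list(values)
    avg = mean(values)
    return (max(values) - min(values)) / avg if avg else 0.0


# Hypothetical partition sizes in GB, keyed by broker id (illustrative only).
placement = {
    2001: [100, 1, 1],
    2002: [2, 2, 2],
    2003: [50, 40, 1],
}

# Every broker holds three partitions, so the count is perfectly balanced...
count_spread = spread(len(sizes) for sizes in placement.values())
# ...while the bytes stored per broker differ widely.
storage_spread = spread(sum(sizes) for sizes in placement.values())

print(f"partition count spread: {count_spread:.2f}")
print(f"storage spread: {storage_spread:.2f}")
```

With many small, low-traffic topics mixed with a few large ones, a count-based balance like this one says little about disk usage, which is the argument for treating storage as the primary target.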
[08:17:58] <claime>	 cassandra is down on restbase1030, it apparently has a SSD problem https://phabricator.wikimedia.org/T344259
[08:23:13] <claime>	 the /etc/cassandra-a/service-enabled file is not present, apparently a manual intervention
[08:23:20] <claime>	 I wonder if it's just a downtime that expired
[08:38:18] <moritzm>	 claime: it is, see https://phabricator.wikimedia.org/T344761
[08:40:04] <claime>	 yeah I was just surprised I saw no mention of a downtime, didn't see one in icinga, and afaik we don't have a downtime history on alertmanager
[08:54:19] <mszabo>	 fyi grafana apparently leaked their signing key for their apt packages and had to rotate it: https://grafana.com/blog/2023/08/24/grafana-security-update-gpg-signing-key-rotation/
[08:58:13] <moritzm>	 mszabo: thanks for the pointer, we fixed that yesterday: https://gerrit.wikimedia.org/r/c/operations/puppet/+/952841
[08:58:29] <moritzm>	 and no grafana debs have been imported between the 24th and when the fix landed
[08:58:54] <mszabo>	 awesome, thank you!
[09:01:45] <wikibugs>	 serviceops, Maps, Regression, Russian-Sites: Vandal attack on OpenStreetMap affected Wikimedia Maps - https://phabricator.wikimedia.org/T344753 (MSantos) Open→Resolved a:Jgiannelos Agreed, for that we already have an issue {T231885}
[09:11:31] <wikibugs>	 serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey) >>! In T345058#9123817, @Eevans wrote: > Spoiler alert though: For good or bad, we're not really setup to be doing repairs at all.  Does it mean that t...
[10:32:06] <moritzm>	 FYI, kubetcd2006 will briefly go down for a Ganeti node reboot
[10:42:38] <moritzm>	 it's back
[11:01:02] <wikibugs>	 serviceops, MW-on-K8s, Patch-For-Review: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (Clement_Goubert) CPU limits have now been removed on all mw-on-k8s deployments except mw-misc. We'll wait a few days to see how the reduced concurrenc...
[12:10:14] <moritzm>	 FYI, kubetcd1004 will briefly go down for a Ganeti node reboot
[12:43:31] <wikibugs>	 serviceops, Observability-Tracing, Patch-For-Review, User-fgiunchedi: jaeger is configured to receive traces from production - https://phabricator.wikimedia.org/T344253 (fgiunchedi)
[13:37:12] <wikibugs>	 serviceops, SRE, ops-eqiad: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (VRiley-WMF) Updated physical labeling as requested.
[13:40:43] <moritzm>	 kubetcd1006 will briefly go down for a Ganeti node reboot
[13:56:00] <elukey>	 folks I am slowly rolling out https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/937432
[13:56:05] <elukey>	 (eventgate pods)
[13:56:11] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye
[13:56:12] <elukey>	 the change has been running on main for ages
[13:56:26] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[14:13:48] <jayme>	 thanks luca! :)
[14:18:09] <wikibugs>	 serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (Eevans) >>! In T345058#9126001, @elukey wrote: >>>! In T345058#9123817, @Eevans wrote: >> Spoiler alert though: For good or bad, we're not really setup to be d...
[14:27:02] <wikibugs>	 serviceops, Content-Transform-Team-WIP, Maintenance-Worktype, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout  (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 (jijiki) In progress→Stalled
[14:32:05] <godog>	 ok I was far too optimistic in believing envoy config can reference environment variables :( specifically for T320563 I'd like the node name or address substituted somewhere in the envoy config to talk to otel-collector
[14:32:45] * godog sobs in envoy config
[14:36:38] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye
[14:38:04] <claime>	 godog: There's an ugly way of doing it by using envsubst in the entrypoint.sh (and now jayme will hunt me down)
[14:39:18] <jayme>	 that won't really work as the config is coming in a configmap...well...we could read it from there and write it to someplace else
[14:39:26] <godog>	 claime: lol
[14:40:08] <claime>	 jayme: I'm correct in thinking we can't use the downward api in a configmap, right?
[14:40:31] <claime>	 Like reference spec.nodeIp or whatever
[14:40:36] <jayme>	 I think you probably can
[14:40:49] <jayme>	 but you'd still have to get that value into the envoy config
[14:43:21] <jayme>	 godog: might be the wrong path, but maybe try to look in the direction of "file based cds" (cluster discovery service)
[14:44:40] <godog>	 jayme: interesting, thank you! then I'm guessing at that point we'd write a file for cds to pick up at container startup ?
[14:45:24] <jayme>	 yeah...no idea beyond that I know this system exists :D
[14:47:14] <jayme>	 maybe it's as shitty as envsubst in the end...
[14:47:49] <godog>	 heh so far it seems slightly less of a hack, at least it is all within envoy
[14:48:03] <wikibugs>	 serviceops, Data-Persistence, Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) Open→In progress p:Triage→Medium...
[15:00:03] <jayme>	 godog: there also is the envoy cli argument --config-yaml where one can specify inline yaml that gets merged with the config....
[15:00:11] <jayme>	 sounds totally safe to use :D
[15:00:30] <claime>	 Totally
[15:01:04] <godog>	 jayme: hahah! I'd like to unsubscribe
[15:01:30] <godog>	 seriously though, I'll update the task with the latest findings/suggestions
[15:01:58] <jayme>	 haha, lol https://stackoverflow.com/questions/54047568/how-can-i-use-environment-variables-in-the-envoyproxy-config-file
[15:05:45] <godog>	 I'm both not surprised and amused that jaeger is also involved here
[15:06:41] <claime>	 that's exactly what I found earlier lol
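The envsubst idea from the discussion above (read the config from the read-only configmap mount, substitute, write it somewhere writable) can be sketched roughly as follows. This is only an illustration of the approach, not what was deployed: the file names, the NODE_IP variable name and the address are hypothetical, and it assumes the pod exposes the node address as an environment variable, for example through the Kubernetes downward API field status.hostIP.

```python
import os
import tempfile
from pathlib import Path
from string import Template


def render_config(src: Path, dst: Path, env=os.environ) -> None:
    """Copy src to dst, replacing ${VAR} placeholders with values from env.

    Raises KeyError if a referenced variable is missing, so a misconfigured
    pod fails at startup instead of running with a broken config.
    """
    dst.write_text(Template(src.read_text()).substitute(env))


# Temporary files stand in for the configmap mount and a writable directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "envoy.yaml.tmpl"  # hypothetical template name
    dst = Path(tmp) / "envoy.yaml"       # hypothetical rendered config
    src.write_text("address: ${NODE_IP}\n")
    render_config(src, dst, {"NODE_IP": "192.0.2.10"})
    rendered = dst.read_text()

print(rendered)
```

One caveat of this approach: string.Template treats every "$" as the start of a placeholder, so a config that contains literal dollar signs would need them escaped as "$$".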
[15:37:58] <elukey>	 eventgate changes rolled out to all instances :)
[15:56:34] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[16:17:07] <claime>	 elukey: <3
[16:36:59] <wikibugs>	 serviceops, Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris)
[16:37:04] <wikibugs>	 serviceops, CX-cxserver, Language-Team, Kubernetes: cxserver: Section Mapping Database (m5) not accessible by certain region - https://phabricator.wikimedia.org/T341117 (akosiaris) In progress→Resolved Fix merged and deployed. Some hiccups aside, it works fine across all 3 environments (s...
[16:38:46] <wikibugs>	 serviceops, Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) Update:   Patch was merged and deployed for cxserver. Things went ok a...
[16:40:36] <wikibugs>	 serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite) It looks like current metrics in graphite are p50, p75, p95, and p99.  Probably this set is the bare minimum.  Any other percentiles worth gathering?
[18:55:32] <wikibugs>	 serviceops, SRE, ops-eqiad: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (Jclark-ctr)
[18:55:42] <wikibugs>	 serviceops, SRE, ops-eqiad: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (Jclark-ctr) Open→Resolved
[18:55:52] <wikibugs>	 serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (Jclark-ctr)
[19:14:38] <wikibugs>	 serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (akosiaris) Those [numbers](https://github.com/prometheus/statsd_exporter#global-defaults) are for summary quantiles, not histograms buckets. Summaries [aren't aggregata...
[20:38:11] <wikibugs>	 serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite)
[20:38:25] <wikibugs>	 serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite) >>! In T344751#9128570, @akosiaris wrote: > Those numbers are for summary quantiles, not histograms buckets.  Good catch!  I've updated the task description.
[20:44:17] <wikibugs>	 serviceops, Observability-Metrics: Decide on default histogram buckets for MediaWiki timers - https://phabricator.wikimedia.org/T344751 (colewhite)
[21:42:09] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)
[22:59:29] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm)