[07:56:43] serviceops, SRE, observability, Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutter pool, that is not something we wish for...
[09:00:49] Mornin'
[10:01:47] serviceops, MW-on-K8s, Release-Engineering-Team (Priority Backlog 📥): Provide an mwdebug functionality on kubernetes - https://phabricator.wikimedia.org/T276994 (Clement_Goubert) I created a [[ https://logstash.wikimedia.org/app/dashboards#/view/c8fa7480-6a48-11ed-83a4-ab884db3ba3b | mw-debug (k8s) ]...
[10:01:56] serviceops, MW-on-K8s, Release-Engineering-Team (Priority Backlog 📥): Provide an mwdebug functionality on kubernetes - https://phabricator.wikimedia.org/T276994 (Clement_Goubert) I created a [[ https://logstash.wikimedia.org/app/dashboards#/view/c8fa7480-6a48-11ed-83a4-ab884db3ba3b | mw-debug (k8s) ]...
[10:13:53] <_joe_> claime: cool
[10:14:21] I'm trying to create the grafana dashes for the other deployments
[10:30:07] hi -- re: confd + mw + k8s I was wondering if you had time/bandwidth to think about https://phabricator.wikimedia.org/T322523 ?
[10:49:35] elukey: new pause container works as expected with 1.23 btw
[10:52:03] \o/
[11:50:53] <_joe_> hnowlan: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/856950 (diff at https://integration.wikimedia.org/ci/job/helm-lint/8405/console
[11:51:05] <_joe_> to convert api-gateway to modules
[11:54:34] serviceops, SRE, Traffic, Patch-For-Review: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) Open→Resolved a:Joe ` vgutierrez@lvs6001:~$ ./liberica etcd --config /home/vgutierrez/config.yaml Using config file: /home/vgutier...
[11:56:49] _joe_: nice, lgtm
[11:59:32] in case this is useful for anyone, I got sick of continuously forgetting to bump Chart.yaml when I change a chart so I wrote a pre-commit hook: https://gist.github.com/nosmo/306d50581f4069958206d772b6d49176
[11:59:52] there's probably edge cases but It'll Do
[12:00:06] Noice.
[12:11:12] <_joe_> I kind-of did it in rake
[12:11:28] <_joe_> I was wondering if we wanted to add it as a pre-commit hook
[12:36:30] yeah, it's been bugging me too. And it clearly annoys users as well, we should at least have it in CI
[12:56:24] <_joe_> ideally, it should ask
[12:57:24] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (hnowlan) Open→Resolved
[12:57:26] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[14:38:38] hnowlan: o/
[14:39:26] I am checking https://istio.io/latest/docs/tasks/policy-enforcement/rate-limit as a possibility to add basic rate limits to our istio ingresses (if needed). What do we use for the api-gw?
[14:40:19] (in my case the rate limit would be needed as basic protection for the internal endpoint of the ml-serve clusters, inference.discovery.wmnet, since multiple people will query it from Hadoop etc..)
[14:40:26] (so bypassing the api-gw)
[14:50:07] elukey: we use the Envoy ratelimit implementation mentioned in those docs (https://github.com/envoyproxy/ratelimit)
[14:50:14] serviceops, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (jijiki) Open→In progress
[14:50:15] it runs as a sidecar in each apigw pod
[14:51:44] hnowlan: ah nice and does it use redis as well?
[14:52:05] or is it a local rate limit for each api-gw pod?
[14:52:15] serviceops, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (jijiki) Planet import in eqiad (on maps1009) started at 11:53 UTC
[14:54:13] <_joe_> elukey: yes it uses redis
[14:54:44] ack thanks
[14:56:04] I see the rdb nodes specified in the config
[14:57:02] would it be ok to have an instance of Redis for other clusters as well on rdb nodes? (trying to understand if I need one on ML-specific nodes or if rdb could be used for this use case)
[14:57:37] <_joe_> elukey: not sure there's still room on the rdbs
[14:57:42] <_joe_> but in theory, yes
[14:57:47] <_joe_> you could use them
[14:57:57] <_joe_> clearly though, the moment you do so
[15:00:33] (there is also nutcracker of course, sigh..)
[15:08:04] <_joe_> hnowlan: uh why do we have both cassandra-http-gateway and image-suggestion-api as charts?
[15:09:27] _joe_: afaik image-suggestion-api is an older project that never got deployed
[15:09:40] <_joe_> oh ok
[15:09:46] <_joe_> so we can maybe remove the chart you mean
[15:10:12] <_joe_> uh we also have a deployment
[15:10:39] <_joe_> but not a namespace in production
[15:11:47] <_joe_> ok, let's remove it then
[15:11:49] <_joe_> :)
[15:13:46] let me confirm
[15:13:59] <_joe_> I'll prepare a patch in the meantime
[15:26:50] https://groups.google.com/a/kubernetes.io/g/dev/c/sEVopPxKPDo/m/9ME3CzicBwA
[15:27:22] interesting. etcd 3.4 and 3.5 can croak and corrupt data
[15:28:00] the "usually there is no data loss" part sounds particularly reassuring (not)
[15:34:05] <_joe_> akosiaris: the worst part is
[15:34:08] <_joe_> "This issue only affects etcd clusters where auth is enabled."
[15:34:15] <_joe_> that smells like code rot tbh
[15:34:58] for the second issue
[15:35:07] but the first issue isn't reassuring either
[15:35:15] not that we ever do defragmentation
[15:35:24] but imagine if we found ourselves wanting to
[15:36:04] <_joe_> we're still on 3.3 :P
[15:54:27] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (Clement_Goubert)
[16:00:47] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (Clement_Goubert) Open→In progress
[16:01:03] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert)
[19:02:22] serviceops, DC-Ops, SRE, ops-eqiad: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (wiki_willy) a:Jclark-ctr
[21:29:25] serviceops, SRE, observability, Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (jijiki) >>! In T224454#8411988, @elukey wrote: > An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutte...
[21:35:54] serviceops, SRE, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (Aklapper)
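Note on the Chart.yaml pre-commit hook discussed between 11:59 and 12:56: the check could look roughly like the sketch below. This is not the linked gist or the rake task, just a minimal illustration. It assumes charts live under `charts/<chart-name>/` (the `CHARTS_ROOT` constant and the function name are hypothetical), and it only checks that `Chart.yaml` was touched, not that the version field actually changed.

```python
from pathlib import PurePosixPath

# Assumption: every chart lives in charts/<chart-name>/ with Chart.yaml at its top level.
CHARTS_ROOT = "charts"


def charts_missing_version_bump(changed_files):
    """Return the charts that have changed files but no change to their Chart.yaml."""
    touched = set()
    bumped = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        if len(parts) < 3 or parts[0] != CHARTS_ROOT:
            continue  # not a file inside a chart directory
        chart = parts[1]
        touched.add(chart)
        if parts[2:] == ("Chart.yaml",):
            bumped.add(chart)
    return sorted(touched - bumped)
```

In a hook or a CI job, the input list would typically come from `git diff --cached --name-only` (or a diff against the target branch), and a non-empty result would fail the check or, as suggested at 12:56, prompt the user.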