[00:01:10] legoktm: the `pull: always` hack unblocked what I was doing, but I'm going to think about explicit versioned pins as an alternative (https://phabricator.wikimedia.org/T291442#7367430).
[02:06:24] serviceops, Performance-Team: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (Krinkle) a: Krinkle→None Unassigning for now. I had originally intended to take this on as a side-project to toy around more with Python. But both due to an overall lack of time and because I had fo...
[04:37:39] serviceops, Infrastructure-Foundations, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (Marostegui) p: Triage→Medium
[05:14:32] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) >>! In T280497#7365189, @jijiki wrote: > @ssastry we have done some benchmarks, but none of those were parsoid urls, it would grea...
[05:15:26] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) Also, parsoid *might* need more memory and we might need to adapt mediawiki-config so that we can raise php's memory limit in k8s...
[07:08:58] serviceops, Infrastructure-Foundations, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (jijiki) >>! In T291385#7365671, @cmooney wrote: We have been living with this for quite a long time, we can wait a little longer :) > Should we de-pool those two boxes...
[08:13:06] serviceops, SRE: Rebuild production Stretch images with GNUTLS/OpenSSL updates for LE issue chain update - https://phabricator.wikimedia.org/T291458 (MoritzMuehlenhoff)
[09:02:05] serviceops, SRE: Rebuild production Stretch images with GNUTLS/OpenSSL updates for LE issue chain update - https://phabricator.wikimedia.org/T291458 (Joe) Open→In progress a: Joe
[09:25:54] serviceops, SRE: Rebuild production Stretch images with GNUTLS/OpenSSL updates for LE issue chain update - https://phabricator.wikimedia.org/T291458 (Joe) The debmonitor query for [[ https://debmonitor.wikimedia.org/packages/libssl1.0.2 | libssl 1.0.2 ]] tells us it's mostly images under the `/releng` pref...
[09:36:51] serviceops, SRE: Rebuild production Stretch images with GNUTLS/OpenSSL updates for LE issue chain update - https://phabricator.wikimedia.org/T291458 (Joe) The debmonitor query for [[ https://debmonitor.wikimedia.org/packages/libgnutls30 | libgnutls30 ]] tells us again it's mostly releng images, plus: *...
[12:15:20] akosiaris: o/
[12:16:21] I am working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/720048 to refactor tokens/secrets for the k8s services (and introduce the ml ones), but IIUC you are working on something related to tokens so lemme know if you are ok with my patch :)
[12:17:17] (the follow-up patches for deployment-charts start from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/722276)
[12:18:01] if everybody agrees I'll merge the puppet one and change private hiera according to the new version, and then I'll deploy the deployment-charts patch (that should be a no-op)
[13:08:49] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (ssastry) >>! In T280497#7367757, @Joe wrote: >>>! In T280497#7365189, @jijiki wrote: >> @ssastry we have done some benchmarks, but non...
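On the `pull: always` vs. explicit versioned pins note at the start of this log (T291442): expressed in generic Kubernetes terms, the trade-off looks roughly like the sketch below. This is purely illustrative; the image name, tags and pull policies are invented and may not match how T291442 is actually being handled.

    # Hypothetical sketch only; image name and tags are made up.
    # Option A: floating tag plus an always-pull policy; new builds are picked up
    # on every pod (re)start, but what "latest" points at can change under you.
    containers:
      - name: example-app
        image: docker-registry.wikimedia.org/example/app:latest
        imagePullPolicy: Always
    ---
    # Option B: an explicit versioned pin; the tag is immutable, so node-side
    # caching is safe and upgrades become deliberate, reviewable bumps of the tag.
    containers:
      - name: example-app
        image: docker-registry.wikimedia.org/example/app:2021-09-21-000000-production
        imagePullPolicy: IfNotPresent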
In T280497#7365189, @jijiki wrote: >> @ssastry we have done some benchmarks, but non...
[13:27:50] elukey: Still trying to catch up. Do proceed and I'll rebase on top of yours
[13:32:18] serviceops, Infrastructure-Foundations, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (joanna_borun) Open→In progress
[13:33:37] akosiaris: ack thanks! Basically I am trying to split tokens and secrets between main and ml-serve clusters, so that we'll eventually separate them etc.
[13:49:24] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (ssastry) Times from scandium.eqiad.wmnet: * http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Hospet/1043074958 takes b...
[13:58:06] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.0 - https://phabricator.wikimedia.org/T291095 (jijiki) Thanks @dancy, I will try to get it done this week with @Arnoldokoth
[14:27:49] hypothetically speaking: if Platform wanted to eliminate service-runner from service-template-node (and by extension, from node services moving forward), would anyone in here object?
[14:28:03] Rationale: service-runner was created during a different time, when we deployed services to bare metal. Concurrency and supervision can be handled by k8s now, rate limiting would be provided by the API gateway... it doesn't really leave anything left
[14:31:40] urandom: sounds interesting, I think it deserves a task to have a brief discussion
[14:32:08] yup, I'm mostly just taking the temperature
[14:32:39] in case it's come up before, and/or someone feels passionately about it
[14:44:48] <_joe_> urandom: rate-limiting should indeed be handled differently, but ofc we need a path of migration
[14:45:18] _joe_: definitely.
[14:45:21] <_joe_> but circling back to your question, I don't think a single-worker pod is a great idea *always*
[14:46:06] <_joe_> dunno what akosiaris thinks about it, but I guess we would need to re-think the helm charts too, quite significantly
[14:46:36] honest question: when is a single worker pod not a great idea?
[14:49:56] <_joe_> when you starve a resource different than CPU cycles for instance; but more importantly if you need sidecars (e.g. for TLS termination, service-to-service communication) the overhead becomes larger
[14:50:17] I see
[14:50:26] <_joe_> if before you had the overhead of those sidecars for every N workers (say 8)
[14:50:32] <_joe_> now you have it for each one
[14:51:26] <_joe_> now, if we want to remove that complexity from *our* software and move it to the frontend envoy, we can just run N containers running a single worker; or we can find an off-the-shelf coordinator
[14:51:50] <_joe_> but the model of having multiple workers running and a dispatching queue is very popular for a reason - it's a good model :)
[14:52:17] heh, it sounds like an argument against using Node, to me
[14:52:24] ;-)
[14:52:33] <_joe_> *cough cough*
[14:53:18] but service-runner does other things too though, right? it exposes easily reused APIs for metrics, structured logging
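The pieces being debated here (worker supervision, heartbeats, metrics, structured logging) all hang off the service-runner config. A rough sketch of the relevant part of a service-runner config.yaml follows; the values and the service name are illustrative, not taken from any production chart.

    # Illustrative sketch of a service-runner config.yaml; values are made up.
    num_workers: 1                   # 0 = run the app in the master process (no forking),
                                     # N = forked workers supervised by the master,
                                     # "ncpu" = one worker per CPU
    worker_heartbeat_timeout: 7500   # ms of missed heartbeats before the master
                                     # kills and re-forks a worker
    logging:
      name: example-service          # hypothetical service name
      level: warn
    metrics:
      type: statsd
      host: localhost
      port: 9125
    services:
      - name: example-service
        module: ./app.js
        conf:
          port: 8080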
[14:53:36] <_joe_> that stuff would stay in the form of a library AIUI
[14:53:48] <_joe_> basically from a runner to just a framework for the single worker
[14:54:21] <_joe_> urandom: I never made it a mystery we wouldn't ever have used node if I had to make the choice
[14:54:54] no, not really
[14:54:54] by "not really", i mean, those things are injected into the application via service-runner, but that's more an example of poor encapsulation than anything else
[14:55:12] btw, rate limiting isn't being used in k8s anywhere. We never made it work cause we really couldn't.
[14:55:32] re non-single-worker pods - currently we have none afaik.
[14:55:47] <_joe_> Pchelolo: what do you mean?
[14:55:49] urandom: yeah agreed. It's a mixing from what I remember. It just imports the library
[14:55:56] Pchelolo: is that a double-negative?
[14:56:02] like, in all node services we have num_workers: 1
[14:56:06] * urandom loves double (or greater) negatives
[14:56:09] <_joe_> oh do we?
[14:56:25] <_joe_> that explains a few things
[14:56:25] no, we do have some with num_workers: 2
[14:56:41] mobileapps, chromium-render and push-notifications
[14:57:02] ah, push-notifications and wikifeeds too I see
[14:57:12] mmm, ok. I've grepped helmfile.d, thought it was all templated
[14:57:24] N=2 isn't a very big N
[14:57:37] IIRC the reasoning for that was that service-runner was actually doing a better job at supervising the workers than just having num_workers: 0
[14:57:44] though I guess that's x2 here
[14:58:37] I had run some benchmarks back then and it turned out num_workers: 0 was the worst regarding reliability. Under heavy load mathoid would return the most errors when num_workers was 0, while 1 and 2 would perform better
[14:59:07] * akosiaris searching task
[15:00:17] num_workers: 0 is also not exactly the same as no service-runner btw
[15:01:16] for example, num_workers: 0 doesn't hard-suicide when it really should. it was created for running tests
[15:03:17] ah it was in a commit message: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/464504
[15:03:33] 0 vs 1
[15:03:58] I clearly remember discussing this with Marko and talking about the pros and cons
[15:04:55] interestingly it seems I misremembered. It was better with ncpu = 0? I see more sustained concurrent requests in that test.
[15:05:36] there's overhead on interprocess communication with num_workers: 1
[15:06:01] sure, but it was the missed heartbeat, death of worker and restart; that's those 3 requests
[15:06:20] the worker just couldn't keep up and reply to the heartbeats from the master. Now I remember
[15:08:35] There is also https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/464579
[15:08:47] where I am NOT that forthcoming
[15:09:22] but the gist is that service-runner would restart things faster than the kubelet (which is what I was talking about earlier)
[15:09:40] which is not surprising.
[15:10:06] our question is: do we care enough to keep maintaining the pretty complex logic of service-runner
[15:10:23] yup. Even if we tune the livenessProbe to reach service-runner-level detection of faults, the restart is never going to be as fast
[15:10:45] akosiaris: (finish your thoughts with the node thread first) l.egoktm said you were looking at/thinking about T290357 (kubectl exec or similar for attaching to a container). I was wondering if you have a guess yet about how long it may be before something like that would be possible. I'm trying to figure out how much to focus on workarounds vs just waiting.
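On the livenessProbe point above: a kubelet-driven restart is bounded below by roughly periodSeconds times failureThreshold of probing, plus the time to restart the container, whereas service-runner just re-forks a worker in-process after a missed heartbeat. A minimal sketch of what a tightened probe could look like follows; the endpoint, port and numbers are assumptions, not the actual chart values.

    # Minimal sketch, not the real chart defaults; path, port and timings are assumed.
    livenessProbe:
      httpGet:
        path: /_info          # assuming the service exposes an _info endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 2     # ~10s just to declare the container dead,
                              # before any restart overhead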
[15:12:57] Pchelolo: my gut says no. We can definitely survive normal traffic levels without it for most workloads and just rely on kubernetes. It would work. We might have increased 5xx and increased latencies in the extreme cases of course. We'll also need to pay the cost of rethinking pod sizes for a lot of services, but that's a one-off down payment.
[15:13:42] It will also increase the memory footprint a bit, but that's mostly negligible. We kept pods small anyway
[15:13:58] and we get -1 moving piece. and service-runner is not a trivial moving piece as it's VERY hard to test changes
[15:14:57] akosiaris: thank you for all the background info. cc urandom
[15:14:57] It complicates the services it uses, too. Lots of indirection
[15:15:29] Services that use *it*, rather
[15:16:00] btw, I have one more question - did you ever look into why a lot of node services running in k8s respond with 503 fairly often and get depooled from LVS by pybal and then added back?
[15:16:03] like this https://logstash.wikimedia.org/goto/ec674ba48261703db9dc2eb181f8db0a
[15:16:24] serviceops, Observability-Logging, Patch-For-Review, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (lmata)
[15:16:25] it's not like a constant thing, but it happens quite a lot
[15:16:40] serviceops, Observability-Metrics, Patch-For-Review, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (lmata)
[15:18:31] That's almost exclusively citoid
[15:19:25] and it's kind of tacit knowledge to not care for pybal failing to reach it cause it errors out often
[15:19:33] that's pybal btw. The health checks for it
[15:19:45] it's eventgate too
[15:20:15] all instances of it. and sometimes it seems like we lose events due to these 503s, even with 2 retries...
[15:21:24] but from the eventgate service perspective everything is totally fine. like, no logs, no abnormal metrics, nothing
[15:22:00] envoy is returning that 503, lemme check
[15:23:21] yeah, j.o.e said we log 503s sent by envoy, and if you could tell me where these logs are, that would be a great help in the investigation
[15:24:03] bd808: I don't have a very good answer. It's high on the priority of things next to fixing all the helm3 tokens and corresponding RBAC rights. If you can wait till the end of the week, I'll have an answer on whether we'll just go with a workaround for now or try to solve this well.
[15:26:56] akosiaris: *nod* I will keep doing small things to move myself forward and hope to hear more later this week. I'm happy to hear that it is high on your list. I assume that the list is a long one. :)
[15:32:02] Pchelolo: interestingly, I can't find them either. They should have been in e.g. https://logstash.wikimedia.org/goto/e07719f49b9495a3f73e7d9f69323b21 but they are not.
[15:34:34] yup, no logs from tls-proxy-container at all :(
[15:37:37] hm, for wikifeeds there are logs from tls-proxy about 503
[15:41:22] so, this one: Sep 21, 2021 @ 13:50:42.306 syslog lvs1016 [eventgate-main_4492 ProxyFetch] WARN: kubernetes1013.eqiad.wmnet (enabled/up/pooled): Fetch failed (https://localhost/_info), 0.266 s
[15:41:24] * ottomata following
[15:41:31] can be correlated to a deployment from what I see
[15:41:45] i did deploy today.
[15:41:53] but some other logs Pchelolo was looking at were from i think Aug 17
[15:42:00] the eventgate-main pods had been up for 55 days
[15:42:05] so no corresponding deployment
[15:42:35] these logs are all over for all different instances of eventgate
[15:50:15] ok, no wonder eventgate doesn't log 50x from envoy
[15:51:07] there's no access_log filter in the eventgate envoy template
[15:57:28] yeah, it's using its own eventgate/templates/_tls_helpers.tpl file and not using the common one all other services do
[15:57:52] probably just a historical accident
[15:57:56] all other services symlink to the common_templates/ versioned directory
[15:57:56] will file a task! :)
[15:58:41] yeah, let's fix that so we can have some visibility into what's going on
[15:58:49] then we'll be able to solve that last part
[15:59:06] serviceops, Analytics, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata)
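For context on T291504: what eventgate's copy of _tls_helpers.tpl is missing is an access log on the TLS-terminating envoy listener. Below is a hedged sketch of a minimal Envoy (v3 API) access log restricted to 5xx responses; it is a generic example of the mechanism, not a claim about what the common_templates version actually renders, and the runtime key name is invented.

    # Generic Envoy v3 sketch; not the actual common_templates output.
    # Fragment of the http_connection_manager config for the TLS listener:
    access_log:
      - name: envoy.access_loggers.file
        filter:
          status_code_filter:
            comparison:
              op: GE                  # log only responses with status >= 500
              value:
                default_value: 500
                runtime_key: tls_terminator_min_log_code   # hypothetical key name
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
          path: /dev/stdout           # so the 503s show up in the tls-proxy container logs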
[16:30:25] ottomata: akosiaris if you have any spare time https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/722654
[16:31:37] <_joe_> Pchelolo: actually I think you need 0.3
[16:31:54] <_joe_> because that's where the canary was introduced in the common templates
[16:32:02] <_joe_> but I'll take a look tomorrow, I promise
[16:32:13] ok. will try 0.3
[16:32:16] thank you
[16:32:29] <_joe_> https://integration.wikimedia.org/ci/job/helm-lint/5432/console btw has the manifest diffs
[16:33:21] <_joe_> there are a few things to verify
[16:33:31] <_joe_> I think we need to tweak values.yaml too
[16:35:01] oh sweet. Will look
[16:40:51] <_joe_> yeah I clearly didn't publicize that thing from CI enough :)
[17:02:11] question
[17:02:22] is our envoy container listening to ::1 ?
[18:29:38] Pchelolo: is there a ticket about the 50s errors?
[18:29:42] 503*
[18:33:40] legoktm: I'm investigating T215001 and T249745 and this is where the rabbit hole led so far
[18:39:43] oh, that monster of a ticket :/
[18:40:29] has it gotten worse since the "24 errors in 7 days" back in April?
[18:54:45] legoktm: no, it's still pretty low. but I want to figure out why it's happening at all, eventgate is really a super-thin proxy, it shouldn't need 3 retries with a 20s timeout each
[18:55:14] * legoktm nods
[19:15:27] Pchelolo: thanks for the helmfile patches!
[19:15:42] i was the first user of canary releases so there was tons of back and forth between me and alex
[19:16:04] ottomata: feel free to update that if you want. not really sure about routing_tag vs routed_via...
[19:16:38] lemme make sure it does the same thing, but based on the CI diff it looks like it
[19:20:25] hmm no it won't work without more changes i think Pchelolo
[19:20:40] mm?
[19:20:44] hmmm wait...
[19:20:53] i'm looking at the diff of eventgate-release-name-tls-service
[19:23:53] i think this is where alex and I differed in opinion. the common _tls_helpers does routing via these label selectors:
[19:23:59] app: {{ template "wmf.chartname" . }}
[19:24:00] routed_via: {{ .Release.Name }}
[19:24:20] which I found strange, since app main_app.name != wmf.chartname
[19:24:48] eventgate chart currently uses
[19:24:52] chart: {{ template "wmf.chartname" . }}
[19:24:52] app: {{ .Values.main_app.name }}
[19:24:52] routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}
[19:26:20] i actually don't know how common tls_helpers works with canaries... will keep reading
[19:26:45] thank you. I'm honestly at the limit of my understanding of this as well
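To make the two selector schemes quoted above concrete, here is roughly what each could render to for an eventgate-main "production" release. The rendered values are inferred from the templates quoted in the discussion (assuming wmf.chartname is "eventgate", main_app.name is "eventgate", and routing_tag is set per deployment, e.g. to something like "eventgate-main"); they are not copied from the actual manifests.

    # Inferred rendering, for illustration only.
    # common_templates _tls_helpers.tpl tls-service selector:
    selector:
      app: eventgate            # wmf.chartname; shared by all eventgate deployments
      routed_via: production    # .Release.Name; also shared across those deployments
    ---
    # eventgate's own _tls_helpers.tpl tls-service selector:
    selector:
      app: eventgate                # .Values.main_app.name (assumed value)
      routing_tag: eventgate-main   # .Values.service.routing_tag, assumed to be set
                                    # per deployment so the eventgate deployments stay apart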
[19:26:47] if there are multiple 'deployments(?)' (e.g. eventgate-main) that use the same chart
[19:27:30] the common tls_helpers looks like it would break
[19:27:45] the tls-service would select based on chart name and either 'production' or 'canary' (release name)
[19:27:51] which exists in all deployments
[19:28:10] e.g. chartname == eventgate, .Release.Name == production
[19:29:31] wait
[19:29:48] the selector in default _tls has 2 parts:
[19:30:02] oh, yeah, you are right
[19:31:02] why wouldn't we just do selector: routed_via: {{ template "wmf.chartname" . }} - that should work
[19:31:27] chartname == eventgate
[19:31:28] right?
[19:31:49] which is true for 4 different eventgate deployments
[19:32:01] ok maybe
[19:32:04] we are missing this
[19:32:05] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/685748/11/_scaffold/templates/service.yaml
[19:32:13] it's in the scaffolding, so would be set for new charts
[19:32:18] oh no, routed_via: {{ template "wmf.releasename" . }} I mean
[19:33:07] we'd need a conditional
[19:33:14] wmf.releasename == e.g. eventgate-main-production
[19:33:20] or eventgate-main-canary
[19:35:20] ok. I'm entirely lost now :) will have lunch and read all this yaml magic
[19:35:27] i'm a little lost too
[19:36:02] as i stated a few times in that old ticket, i think the names are very confusing, if we could get them all straight it might be a little easier
[19:36:15] app, chart, service, deployment, instance, cluster, release
[19:36:24] serviceops, Data-Persistence-Backup, GitLab (Infrastructure), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (brennen)
[19:37:21] serviceops, GitLab (Infrastructure), Patch-For-Review: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (brennen)
[21:20:24] Is there a "clean up old images that are not needed" process for docker-registry.wikimedia.org? Toolhub is creating a growing number of images that are just eating up storage space (and yes, I may want/need to rethink my pipelinelib setup).
[21:30:12] and if we do that it would be great if we could also remove them from Debmonitor as now it does GC based on last update time :/
[21:49:49] ottomata: do we still need the http service for eventgate?
[22:16:10] serviceops, Toolhub: Get mcrouter & prometheus-mcrouter-exporter tags for helmfile.d from upstream config - https://phabricator.wikimedia.org/T291530 (bd808)
[23:01:40] bd808: I believe all deletion is manual right now (per https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images)
[23:01:57] eventually we'll need some GC process
[23:04:02] legoktm: thanks. sounds like this is a "SREs can do it if they want to" thing right now?
[23:05:11] yes, and "we'll worry about it once it's closer to being an issue" :)
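On the registry cleanup question: in the upstream docker/distribution registry, deleting a manifest by digest only works if deletion is enabled in the registry config, and disk space is only reclaimed afterwards by a separate garbage-collect run (roughly `registry garbage-collect /etc/docker/registry/config.yml`). A generic sketch of that config knob follows; whether docker-registry.wikimedia.org is actually configured this way (or uses the filesystem driver at all) is an assumption, not something stated in this log.

    # Generic docker/distribution registry config fragment; illustrative only.
    storage:
      filesystem:
        rootdirectory: /var/lib/registry   # assumed storage driver, for illustration
      delete:
        enabled: true   # allow DELETE of manifests by digest via the v2 API;
                        # blobs are only freed by a later garbage-collect run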