[08:34:49] hello folks
[08:35:54] yesterday (after a chat with Hugh, thanks!) I've read https://wikitech.wikimedia.org/wiki/Changeprop#Testing and to make changeprop work in staging we need to manually send msgs to staging kafka topics
[08:36:11] of course my rules are not working atm (namely I don't see traffic on lift wing)
[08:36:16] but progress :D
[08:39:34] elukey: I'm planning on switching staging to codfw today (to update eqiad), will that interfere with your tests?
[08:46:52] jayme: yesterday I didn't see changeprop's pods in staging though, but if there are any for me it is fine!
[08:48:48] we'll see :D I haven't deployed anything to staging-codfw yet
[08:49:51] should we do it before switching the other dc to 1.23? I mean my tests can be paused anytime, it is just to make sure that something works on 1.23 :D
[08:50:43] there is a note in https://phabricator.wikimedia.org/T327664 okok, perfect :)
[08:51:09] yes, will do in a minute
[09:02:36] serviceops, Icinga, SRE, SRE Observability: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (Joe) p: Triage→Medium a: Joe
[09:03:16] do we have a restbase-async staging endpoint?
[09:03:52] I am checking /etc/changeprop/config.yaml in changeprop's staging and in theory we have the prod endpoints in there
[09:04:56] (I am worried now that if I send a msg in the staging revision-create topic then I trigger some prod changes)
[09:27:31] hnowlan: ^ maybe?
[09:30:33] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883117/ to switch the kafka topic, so it should be less problematic
[09:30:54] I'll create another liftwing test topic in main so I can freely test it without side effects
[09:39:33] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc2041.codfw.wmnet with OS bullseye
[09:46:53] btullis: I'm unable to deploy datahub to the staging cluster in codfw (it fails readiness probes), do you have a minute to take a look?
[09:47:13] Sure do.
[09:48:04] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: k8s-api for k8s-staging down in codfw - https://phabricator.wikimedia.org/T327751 (fgiunchedi)
[09:48:45] Would you like to jump on a call, or shall I just try a deploy and see what the logs say?
[09:50:04] btullis: async would be better for me tbh. Deploy is currently running, feel free to take a look at the logs.
[09:50:43] "deploy is running" in the sense that it will probably break down in about 5min (helm atomic timeout)
[09:51:22] Great, will look now. Is this blocking you from anything else?
[09:52:31] not exactly. I'm in the process of deploying everything to staging-codfw (k8s 1.23) before switching the active staging cluster to codfw (and doing the update in eqiad)
[09:52:59] so if you're fine with datahub not being deployed in staging this is not blocking me
[09:53:36] might have something to do with k8s 1.23, though. That should be figured out before the prod clusters are updated
[09:53:49] but that won't happen this week for sure
[09:54:25] Yeah, it's not a problem if datahub isn't running in staging, but I will look at it right now.
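A minimal sketch of how the failing datahub release could be inspected from the deployment host, assuming kubectl accepts the same staging-codfw kubeconfig quoted later in the log ([10:09:06]); the pod and container names below are placeholders to be read off `kubectl get pods`, not taken from the conversation.

```bash
# Kubeconfig for the (non-active) staging-codfw cluster, per the helmfile override below.
KCONF=/etc/kubernetes/datahub-deploy-staging-codfw.config

# List the datahub pods and spot the one failing its readiness probe.
kubectl --kubeconfig "$KCONF" -n datahub get pods

# Show probe failures, restart counts and termination reasons (pod name is a placeholder).
kubectl --kubeconfig "$KCONF" -n datahub describe pod datahub-main-xxxxxxxxxx-yyyyy

# Tail the logs of the suspect container (container name is a placeholder).
kubectl --kubeconfig "$KCONF" -n datahub logs datahub-main-xxxxxxxxxx-yyyyy -c datahub-frontend --tail=100
```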
[09:55:31] looks to me like it's just the frontend that's failing fwiw
[09:55:56] serviceops, Observability-Tracing: Rename the service to k8s-ingress-aux - https://phabricator.wikimedia.org/T327756 (Clement_Goubert)
[09:57:22] It's saying that the gms pod was OOMKilled too, which is unusual.
[09:57:30] https://www.irccloud.com/pastebin/3TfvHE9m/
[10:03:07] are frontend and gms coupled in such a way?
[10:06:17] Sorry, stupid question but how are you deploying to staging-codfw? If I do `helmfile -e staging status` it's still querying staging-eqiad, isn't it?
[10:07:14] frontend is a client application of the gms, but I've never seen the oomkiller come into play for datahub to date.
[10:09:04] you can deploy to the non-active staging cluster by overriding the kubectl config file to use:
[10:09:06] helmfile --state-values-set kubeConfig=/etc/kubernetes/datahub-deploy-staging-codfw.config -e staging -i apply
[10:09:22] (there is a note about that in every helmfile.yaml)
[10:10:15] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: k8s-api for k8s-staging down in codfw - https://phabricator.wikimedia.org/T327751 (fgiunchedi)
[10:10:25] Ah, thanks.
[10:13:59] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc2041.codfw.wmnet with OS bullseye completed: - mc2041 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
[10:14:35] serviceops, Observability-Tracing: Rename aux-k8s-ingress service to k8s-ingress-aux - https://phabricator.wikimedia.org/T327756 (Clement_Goubert)
[10:14:56] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: k8s-api for k8s-staging down in codfw - https://phabricator.wikimedia.org/T327751 (fgiunchedi)
[10:54:21] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[10:54:34] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: k8s-api for k8s-staging down in codfw - https://phabricator.wikimedia.org/T327751 (JMeybohm) Open→Resolved Thanks for the heads up, I forgot to apply https://gerrit.wikimedia.org/r/c...
[11:18:09] So far I think it's related to the oomkilling of the datahub-gms container. The memory request is 1G with a limit of 2G, but it seems to be getting killed just at the point where it starts to reach out to kafka and karapace (schema registry). I'm trying to increase the limits temporarily to see if it behaves differently.
[11:33:54] It's oomkilled even when I lift the memory limit to 3G
[11:45:59] https://usercontent.irccloud-cdn.com/file/XQLO8DBN/image.png
[11:47:29] Confirmed memory exhaustion with `docker stats`; it seems to be a runaway memory leak. The same container is happy with 1G on staging-eqiad. https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-staging&var-namespace=datahub&var-pod=All&viewPanel=34
[12:10:46] jayme: I'm thinking of deleting the datahub deployment from staging-eqiad and then redeploying it, to try to ascertain whether it's related to k8s version 1.23. Are you OK with that?
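A hedged sketch of the two checks mentioned above: confirming the OOM kill from the kube API and watching memory climb on the node with `docker stats`. The pod name and the `datahub-gms` container naming pattern are assumptions, not taken from the log.

```bash
KCONF=/etc/kubernetes/datahub-deploy-staging-codfw.config

# A pod whose container was killed by the kernel shows "Reason: OOMKilled"
# under "Last State" (pod name is a placeholder).
kubectl --kubeconfig "$KCONF" -n datahub describe pod datahub-main-xxxxxxxxxx-yyyyy | grep -A 5 'Last State'

# From the kubestage node itself, find the gms container and watch its memory
# usage grow in real time, as done above with docker stats.
sudo docker ps --format '{{.ID}} {{.Names}}' | grep datahub-gms
sudo docker stats <container-id>
```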
[12:12:00] <_joe_> btullis: I think there's a bigger chance it's related to configuration differences between eqiad/codfw though
[12:12:59] <_joe_> I mean, it could just be that something in the chart isn't working as expected in k8s 1.23, ofc
[12:13:16] <_joe_> but both are equally probable :/
[12:14:15] _joe_: OK, do you mean configuration differences in the cluster, or some difference in the helm chart? I confess I'm a bit baffled by this.
[12:14:37] <_joe_> either, really :)
[12:17:50] OK, well it's safe enough for me to do that test anyway, I reckon. Delete and redeploy from staging-eqiad. I'd expect to see the memory usage stable when it's redeployed to staging-eqiad, but if it isn't then it'll tell us /something/.
[12:22:53] btullis: I'm fine with you deleting it from staging-eqiad
[12:23:01] (was out for lunch, sorry)
[12:24:04] I'm not going to do the staging-eqiad update until this is figured out, though. I've another chart failing (for totally different reasons, but still)
[12:42:03] <_joe_> elukey, effie can either of you remove https://grafana.wikimedia.org/d/000000586/memcache-historic-data?orgId=1 if it's indeed not needed as I understand it?
[12:42:32] we can delete it yes
[13:37:24] jayme: Just to let you know, the memory use for datahub is stable on staging-eqiad as expected. I'll raise a ticket for this and carry on looking into it.
[13:42:48] serviceops, Toolhub: toolhub is undeployable since introduction of the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm)
[13:44:00] serviceops, Toolhub: toolhub is undeployable since introduction of the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm)
[13:45:27] btullis: hmm...interesting. Could it be that it is unable to access "something" from codfw and then goes into a memory-eating loop?
[13:51:00] serviceops, Toolhub: toolhub is undeployable since introduction of the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm)
[13:51:12] jayme: That does seem quite likely, but I'm looking for more evidence. I see the following at the start of the logs for the gms service, which suggests that mariadb, elasticsearch, and kafka are all reachable.
[13:51:20] https://www.irccloud.com/pastebin/JokFi4KL/
[13:51:58] That doesn't include karapace, so I'll check that now.
[13:56:52] serviceops, Toolhub: toolhub is undeployable since introduction of the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm) I'm going to fix this short term by changing just the references to the data structure but ultimately toolhub should migrate to the cache.mcrouter modul...
[14:20:01] _joe_: elukey I am deleting it, I have never used it
[14:52:33] qq - is there a way in deployment-charts' helmfile config to differentiate between staging-eqiad and codfw? I mean from the values.yaml perspective
[14:54:06] changeprop's staging config points to kafka-main eqiad for both clusters, this is why I am asking
[14:56:41] I think you'd have to create two environments, staging-codfw and staging-eqiad
[14:58:03] ok so there is no easy way now as I suspected, sigh
[15:00:28] serviceops, Observability-Alerting: Port k8s cache alerts from icinga to alerts.git - https://phabricator.wikimedia.org/T327792 (fgiunchedi)
[15:01:35] elukey: I think it's just a matter of changing your one staging environment in the helmfile to two
[15:01:53] /etc/helmfile-defaults/general-staging-codfw.yaml and /etc/helmfile-defaults/general-staging-eqiad.yaml exist
[15:02:19] but I may be wrong, jayme may have a better answer
[15:02:56] claime: I blame Janis
[15:03:15] lol
[15:16:16] hmm, I don't recall exactly but I think we don't differentiate between the two staging clusters in helmfile.d usually
[15:16:26] <_joe_> we don't, yes
[15:16:38] <_joe_> and you can blame me for that actually :P
[15:16:58] <_joe_> you can otoh blame jayme and akosiaris for adding staging-codfw and not changing that :P
[15:18:11] I think we did not change that more or less on purpose, to kind of hide the existence of the second staging cluster
[15:19:56] another secret involuntarily divulged
[15:19:58] smh
[15:20:08] ahahaha
[15:21:28] elukey: is that an actual problem or does it just look odd?
[15:21:44] I think that the issue with datahub might be related to reverse DNS. Not sure yet, but perhaps the staging-codfw entries in the reverse DNS maps aren't up to date?
[15:22:44] jayme: it is not a big issue, it is annoying when testing since it took me a bit to realize that changeprop in staging-eqiad wasn't showing any vital signs because staging-codfw pods took over the kafka consumer group :D (they pull/push to the same kafka cluster, main eqiad)
[15:23:31] ah, I see
[15:23:53] btullis: can you elaborate?
[15:25:21] I've been capturing database traffic from datahub to an-test-coord1001 like this: `sudo tcpdump -i any port 3306 and not host an-test-client1001.eqiad.wmnet and not host localhost and not host an-test-druid1001.eqiad.wmnet and not host 10.64.75.83`
[15:26:20] The `not host 10.64.75.83` filters out the gms service address from the staging-eqiad deployment.
[15:26:56] serviceops, Content-Transform-Team-WIP, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (jijiki)
[15:33:37] Comparing the `tcpdump` output of the staging-eqiad and staging-codfw deployments, one difference is that the trace shows a reverse DNS name for the pod's IP in staging-eqiad. On staging-codfw the same trace shows bare IPs.
[15:33:43] https://usercontent.irccloud-cdn.com/file/A7b3yZTn/image.png
[15:36:35] I can't really see why that should be a problem. Could be that the an-test-coord just tried to look up the IPs in .eqiad.wmnet as that is the local domain...
[15:41:13] serviceops, Toolhub, Patch-For-Review: Update toolhub helm chart to use the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm)
[15:42:18] No, but it would sort of fit with the kind of failures I have seen before. The TLS setup was pretty fussy and the errors indecipherable. If it's an easy thing to try, I'd like to give it a go.
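A quick way to test the reverse-DNS theory above: check whether PTR records exist for the pod IPs seen in the traces. The eqiad address is the one from the tcpdump filter; the codfw address is a placeholder for whatever bare IP shows up in the staging-codfw trace.

```bash
# PTR lookup for the staging-eqiad gms pod IP used in the tcpdump filter above;
# this should return an .eqiad.wmnet name if the reverse zone is populated.
dig +short -x 10.64.75.83

# Same check for a staging-codfw pod IP (placeholder address, substitute the bare
# IP from the codfw trace). An empty answer would support the theory that the
# staging-codfw reverse DNS maps are missing entries.
dig +short -x 10.192.0.123
```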
[15:46:05] serviceops, MW-on-K8s, SRE, observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) Resolved→Stalled
[15:46:11] serviceops, MW-on-K8s, SRE, observability: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Clement_Goubert)
[15:52:28] As said, I think it's just a matter of the search domain configured in resolv.conf of an-test-coord
[15:52:46] if you do the same thing on a codfw host, it will be the other way around
[15:53:33] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc2042.codfw.wmnet with OS bullseye
[15:53:57] serviceops, MW-on-K8s, SRE, observability: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Clement_Goubert)
[15:54:28] serviceops, MW-on-K8s, SRE, observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) Stalled→Resolved Retention updated for `mediawiki.httpd.accesslog` in `codfw`
[16:04:09] https://gerrit.wikimedia.org/r/c/operations/dns/+/883226
[16:06:58] I think this adds them to dns though, doesn't it? --^ Apologies if I'm missing something.
[16:11:33] Created the datahub ticket: https://phabricator.wikimedia.org/T327799
[16:12:27] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) @daniel Here you go, you (and other deployers) should now be able to disable (and enable) puppet on med...
[16:13:22] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) Open→Resolved a: Dzahn
[16:14:07] btullis: your change seems right but separate from the entry in resolv.conf
[16:14:31] check if on the affected host that file has a line like "search eqiad.wmnet wikimedia.org codfw.wmnet"
[16:14:42] or maybe only eqiad or none
[16:16:19] those are created by "profile::resolving::domain_search" keys in Hiera
[16:19:34] serviceops, Content-Transform-Team-WIP, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (jijiki) We are happily serving maps from codfw, and both datacentres are up to date 🎉
[16:20:27] mutante: Thanks. I suppose the affected hosts in this case would be the kubestage200[1-2], which is where the pod is running and failing. They have `search codfw.wmnet` which seems about right.
[16:21:57] hmm yea, though most hosts have not only their local DC in there
[16:22:18] maybe it has to be edited inside the container?
[16:24:22] it's also possible you need firewall changes to allow talking to DNS servers from inside pods but I haven't followed the full story
[16:28:34] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc2042.codfw.wmnet with OS bullseye completed: - mc2042 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
[16:29:55] Thanks. (In meeting now) Firewall all looks ok so far. I'll try looking more closely inside the container too.
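A sketch of the two checks under discussion here: the resolv.conf search domains on the capturing host, and a lookup run from inside the pod (as the next message suggests). The pod name is a placeholder, and `getent` may not be present in a slim image, so treat this as an outline under those assumptions rather than a recipe.

```bash
# Search domains on the host doing the tcpdump captures (mutante's suggestion above).
ssh an-test-coord1001.eqiad.wmnet cat /etc/resolv.conf

# Run a one-off lookup from inside the running gms container to see DNS from the
# pod's point of view. Pod name is a placeholder; swap getent for whatever resolver
# tool the image actually ships.
KCONF=/etc/kubernetes/datahub-deploy-staging-codfw.config
kubectl --kubeconfig "$KCONF" -n datahub exec -it datahub-main-xxxxxxxxxx-yyyyy -c datahub-gms -- \
    getent hosts an-test-coord1001.eqiad.wmnet
```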
[16:40:17] yea, get a shell inside the running container and try the DNS lookup from there manually
[17:49:26] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (JMeybohm) p: Medium→High
[17:54:32] serviceops, Prod-Kubernetes, Kubernetes: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm)
[17:55:48] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[18:01:04] serviceops, Content-Transform-Team-WIP, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (TheDJ) Let's hope it's finally more predictable from this point forward. Thank you for the perseverance team!
[18:07:27] serviceops, Prod-Kubernetes, Kubernetes: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm)
[18:07:29] serviceops, Toolhub: Update toolhub helm chart to use the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm)
[18:07:34] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[18:10:28] serviceops, Toolhub: Update toolhub helm chart to use the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm) My ragged patch does create a diff removing the proxies from the mcrouter config.json (because they are no longer present in the backing data structure I guess). too...
[19:24:49] serviceops, Toolhub: Update toolhub helm chart to use the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (bd808) >>! In T327786#8554414, @JMeybohm wrote: > My ragged patch does create a diff removing the proxies from the mcrouter config.json (because they are no longer present in...
[19:53:08] serviceops, Parsoid: Request for additional disk space on testreduce1001 - https://phabricator.wikimedia.org/T296051 (Dzahn) These machines are now managed by the serviceops-core team.
[21:27:35] inflatador: how goes?
[21:30:38] ottomata greetings! I think I addressed the namespace PR, anything else I can help with?
[21:31:18] how'd the operator image building/testing go?
[21:31:30] did it ever build locally?
[21:33:22] Nope, it was still churning when I logged in this morning. I meant to try it on a Linux VM but forgot. Let me try that now
[21:58:49] hmm, I guess we don't have docker-ce in our default repos on WMCS?
[22:08:14] OK, now we're cooking again. Let's see if it finishes
[22:09:50] inflatador: https://phabricator.wikimedia.org/P43318
[22:11:08] mutante thanks, but maybe the docker.io package is sufficient? I'm just used to using docker-ce, but any docker will do
[22:12:46] maybe, maybe not https://stackoverflow.com/questions/45023363/what-is-docker-io-in-relation-to-docker-ce-and-docker-ee-now-called-mirantis-k
[22:13:06] I just had the part from the pastebin handy because for some reason we wanted docker-ce at that time
[22:13:30] but yea, just use the Debian package then (io)
[22:16:41] it's also not immediately clear to me why we can't find docker-ce, since reprepro on our APT repo has: docker-ce | 5:20.10.18~3-0~debian-bullseye | bullseye-wikimedia | amd64
[22:17:02] and the wikimedia.list is in sources.list.d .. so maybe it's just about APT pinning that one
[22:21:00] mutante: it's in the thirdparty/ci component. See profile::ci::docker
[22:25:20] bd808: thanks. it may have moved to that component. vaguely remember that
[22:52:42] ah, this just isn't gonna work on WMCS, probably a network access issue. Local VM doesn't work because of my M1 CPU. Giving UTM a shot
[23:11:52] nope, just needed sudo ...image she is building
[23:19:56] it's working! Will pick back up tomorrow
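For reference, a sketch of the Docker setup the last few messages converge on: install the stock Debian `docker.io` package (per the discussion, `docker-ce` on apt.wikimedia.org lives in the thirdparty/ci component and is not visible by default) and run the build via sudo. The image name and build context are placeholders; the log never shows the actual build command.

```bash
# On a WMCS Debian (bullseye) VM, install Docker from the stock Debian repos.
sudo apt-get update
sudo apt-get install -y docker.io

# The docker daemon socket is root-owned, hence the "just needed sudo" above.
# Image tag and build context are placeholders for the operator image being tested.
sudo docker build -t operator-test:latest .
```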