[06:30:16] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:43:10] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:49:23] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:49:58] serviceops, DBA, Data-Persistence, Discovery-Search, and 8 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi)
[06:50:26] serviceops, DBA, Data-Persistence, Discovery-Search, and 8 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi)
[06:51:44] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:53:07] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) Adding Jaime for the backup related hosts
[07:28:35] serviceops, DBA, Data-Engineering, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:35:28] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[07:45:30] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:04:10] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:21:55] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:22:55] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:23:45] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:37:27] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:38:29] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:40:18] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:44:39] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:05:21] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:05:59] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:30:29] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Vgutierrez)
[09:35:07] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:37:07] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[09:38:10] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jcrespo)
[09:39:10] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jcrespo)
[10:03:52] serviceops, Toolhub: Update toolhub helm chart to use the mcrouter helm chart module - https://phabricator.wikimedia.org/T327786 (JMeybohm) >>! In T327786#8554640, @bd808 wrote: > [...] > In any case, the helmfile.d config for Toolhub only uses the "eqiad-servers" and "codfw-servers" pools so I am not wo...
[10:07:33] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MoritzMuehlenhoff) We can't migrate the puppetdb2002 VM (it's being moved to baremetal, but that is unlikely completed by then), so we'll need to disable P...
[10:24:03] serviceops, Sustainability (Incident Followup): sessionstore: alert on rate of status 500 responses - https://phabricator.wikimedia.org/T327960 (Clement_Goubert)
[11:28:29] pooling k8s thumbor for a bit
[11:51:57] <_joe_> hnowlan: it looks great from the dashboards
[11:59:14] _joe_: not too bad - still plenty of throttling but we'll learn to live with that. I'll wait an hour or so before I pass judgement on the overall throughput, still relatively low on k8s comparatively
[11:59:34] should be 50/50, but the haproxy queues on k8s nodes are filling up
[11:59:53] <_joe_> huh
[12:00:17] probably need more replicas
[12:43:12] btullis: from what I understand datahub GMS in staging-codfw is able and does connect to all external services and datastores correctly, right?
[12:44:52] jayme: Yes, correct. I can't find anywhere that traffic is getting blocked between staging-codfw and any of the external services.
[12:52:53] hmm..okay. Is there a way to increase verbosity for logs?
[13:00:33] It's a bit fiddly.
[13:00:47] > Right now we print all WARN and above. For datahub-frontend and datahub-gms, You could change this by changing the JAVA_OPTS for the pod and providing your own logback configuration xml (just like normal Java web app)
[13:04:41] have you seen that codfw logs:
[13:04:44] ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.7.2
[13:04:45] ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.7.2
[13:04:51] eqiad does not
[13:07:30] Seen it, yes. Understood it, no.
[13:08:56] following up on your earlier question, I was just checking and we currently don't configure JAVA_OPTS but if we added it, then it would be included at the right point in the process startup.
[13:08:58] https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/datahub/+/refs/heads/wmf/docker/datahub-gms/start.sh#71
[13:09:46] btw. I did deploy datahub to staging-codfw without helm --atomic. So the helm release will stick around now (pod still failing ofc)
[13:09:46] ...and from the running container in staging-codfw, here is the command it was passed: `+ exec ./dockerize -wait http://datahubsearch.svc.eqiad.wmnet:9200 -wait-http-header 'Accept: */*' -wait tcp://an-test-coord1001.eqiad.wmnet:3306 -wait tcp://kafka-test1006.eqiad.wmnet:9092 -timeout 240s java -jar ./jetty-runner.jar --jar jetty-util.jar --jar jetty-jmx.jar --config /datahub/datahub-gms/scripts/jetty.xml
[13:09:46] /datahub/datahub-gms/bin/war.war`
[13:09:59] Oh handy, thanks.
[13:12:06] By the way, we're likely to have a new datahub chart to deploy shortly, for an unrelated issue. I don't think that the image will need to be rebuilt, it's just different env variables.
[13:12:46] I'm confused. Wasn't it the frontend that failed when I initially reported this?
[13:13:26] or is frontend failing a side effect of gms failing and I did not see that?
[13:13:32] Frontend fails if it can't talk to the gms. GMS blows up in memory and is oomkilled.
[13:13:46] aah, okay
[13:14:51] serviceops, Observability-Tracing: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178 (Clement_Goubert)
[13:14:55] serviceops, Observability-Tracing, Patch-For-Review: Rename aux-k8s-ingress service to k8s-ingress-aux - https://phabricator.wikimedia.org/T327756 (Clement_Goubert) In progress→Resolved
[13:15:27] the ANTLR error/warning that you indicated might well be related, but I can't understand how this could be different between eqiad and codfw.
[13:15:30] serviceops, Observability-Tracing: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178 (Clement_Goubert)
[13:17:42] also I found them in the frontend logs
[13:17:50] so they might not be relevant
[13:30:04] btullis: do you happen to recall if this thing ever ran before in staging-codfw?
[13:31:01] I do not recall it ever working in staging-codfw, sorry.
[13:31:44] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui) Adding Jaime for the backup hosts.
[13:35:47] well..maybe it never did
[13:36:30] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:39:00] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:40:36] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:43:15] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[13:44:46] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[14:05:33] folks I have a brainbounce for anybody that worked with nodejs and HTTPS calls.. I am currently testing changeprop calling lift wing (via https) and I think I am hitting https://github.com/nodejs/node/issues/37104
[14:05:59] changeprop calls liftwing-staging.svc.codfw.wmnet, but the HTTP Host: header is set based on the pod/backend to hit
[14:06:33] (like enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org)
[14:06:49] this is due to how istio/knative work, and up to now no TLS validation error appeared
[14:07:37] afaics it seems that there is no way to change node's behavior, so the only alternative that I am thinking is to add new SANs (maybe with wildcards, is it possible?)
[14:07:50] or am I missing something obvious?
[14:08:56] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[14:17:19] elukey: There's a note here about using options from tls.connect and in particular setting the `servername` parameter to enable SNI. https://nodejs.org/api/tls.html#tlsconnectoptions-callback
[14:17:29] https://nodejs.org/api/https.html#httpsrequesturl-options-callback
[14:17:53] https://usercontent.irccloud-cdn.com/file/55HN5Fs4/image.png
[14:20:32] ...or using `servername ''` to disable SNI support?
[14:26:28] You could alternatively use a wildcard certificate `*.revscoring-editquality-goodfaith.wikimedia.org` but I don't think that a double-wildcard certificate is possible.
[14:28:34] btullis: thanks will check! Ideally it would be awesome to avoid any change to changeprop in this case, I'd prefer to avoid any consequence with other systems :(
[14:28:59] btullis: what do you mean with double-wildcard? Two sans with wildcards?
[14:33:47] elukey: I meant having one certificate including a SAN of `*.*.wikimedia.org` for two levels of wildcard support. I was wondering if this would be a way for you to have one certificate support the naming convention you used above. But I don't think it is.
[14:37:01] btullis: ah okok, I'd be a little scared in having a san so generic though
[14:37:59] I think I may need to opt for the first option (the single * SAN) and update it when new namespaces are added
[14:40:09] btullis: thanks for the brainbounce :)
[14:41:33] Always a pleasure. Yeah, I can't think of any other ways, given that you're constrained such that you can't disable the SNI option from changeprop's HTTPS call.
[14:44:17] yeah it seems something heavy to introduce, not sure about the effects for the rest..
[15:09:44] jayme: Yes, sorry for being vague. I don't recall datahub ever running in staging-codfw before.
[15:14:57] multi-level wildcards aren't a thing last time I checked
[15:15:07] so no *.*.example.org
[15:15:16] it would be a security nightmare and too easy to abuse
[15:15:23] so nobody issues and nobody supports them
[15:15:23] btullis: yeah, no problem. We'll figure it out eventually :)
[15:16:05] sorry for not being able to help on the nodejs thing, it doesn't ring a bell
[15:18:26] akosiaris: thanks! I opted to have a wildcard only for the leftmost part, it should in theory work fine.
[15:20:21] basically https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883964
[15:23:31] out of curiosity: why are those directly under wikimedia.org? will the endpoints be public?
[15:28:00] they are already IIRC
[15:28:22] taavi: nono we chose the endpoint initially while testing and left it, maybe we could migrate to .wmnet or similar, it seemed fine for the mlserve k8s domain.. the pods are exposed by the istio ingress, a discovery.wmnet endpoint, and then by the api gateway
[15:28:30] it is just a way to target pods basically
[15:28:39] it can also be batman.com
[15:28:48] or anything that we want
[15:28:56] elukey: still, what does the api-gateway expose them under?
[15:29:15] some /prefix?
[15:29:30] akosiaris: api.wikimedia.org/inference/etc.. (or something similar, I don't recall exactly, still WIP)
[15:30:01] the api gateway knows how to set the host headers but it is not exposed to the clients
[15:30:07] ok, cool. So I remembered correctly they are exposed publicly already as APIs, just not under their own domain
[15:30:12] and +1 to batman.com :P
[15:30:23] next one please ... robin.eu :P
[15:30:32] * elukey takes notes
[15:30:43] Hmm, I just tried to rebuild production-images following https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/881876
[15:31:01] and 2023-01-26 15:27:47,568 [docker-pkg-build] INFO - Successfully tagged docker-registry.discovery.wmnet/httpd-fcgi:2.4.38-10 (image.py:210)
[15:31:03] 2023-01-26 15:27:47,569 [docker-pkg-build] INFO - Removing build context /tmp/docker-pkg-httpd-fcgi6jo1j1ww (image.py:491)
[15:31:05] 2023-01-26 15:27:47,943 [docker-pkg-build] ERROR - Verification of image docker-registry.discovery.wmnet/httpd-fcgi failed with return code 1 (image.py:447)
[15:31:07] 2023-01-26 15:27:47,944 [docker-pkg-build] ERROR - -- output: None (image.py:450)
[15:31:30] It still built and pushed the image though
[15:31:40] claime: I am not sure what you complain about. There is None errors
[15:31:48] 👅
[15:31:53] x)
[15:32:06] jokes aside, interesting
[15:32:33] I'll dig into docker-pkg to get a sense of what's happening
[15:32:51] But I'd rather not be the only one who knows I may have broken the production images :')
[15:36:49] claime: very stupid question but have you retried the build/publish? Maybe it was a temporary glitch with the registry?
[15:36:58] Not yet
[15:37:03] Although there is something weird
[15:37:16] docker-pkg code on build2001 doesn't match what I have in the git repo
[15:43:21] we agree that there shouldn't be a container running on build2001 right?
[15:45:50] yes
[15:46:06] and exposing port 8080 (which it does)
[15:47:06] yep
[15:47:10] My thinking is
[15:47:22] docker-pkg leaves the test container up if it fails the test.sh
[15:47:32] _joe_: is that you ^ ?
[15:47:34] It did, and we forgot to clean up
[15:47:44] container runtime and output of last correlates
[15:47:52] container start time*
[15:48:02] There's nothing in the containers' logs
[15:48:04] <_joe_> meeting, sorry
[15:48:05] I'm gonna stop it
[15:48:20] <_joe_> but yes stop it
[15:48:31] it's almost certainly some leftover. judging from timestamps and the content of the container
[15:48:47] Yep
[15:48:55] Stopped and running test manually
[15:49:04] All PASS
[15:49:07] Awesome
[15:49:11] I didn't break anything
[15:56:09] <_joe_> there's an issue with running test.sh on build servers that I didn't get to fix
[15:56:50] <_joe_> it's not docker-pkg that leaves the container up, it's test.sh to allow you to go inspect what's wrong
[16:01:26] I'm debugging the test.sh script right now
[16:01:39] It was querying 9181 without proxying it
[16:04:27] <_joe_> <3
[16:04:41] <_joe_> so we're at that point where i do meetings and you fix my nonsense?
[16:04:46] <_joe_> cool, achievement unlocked
[16:04:55] Ready to be a manager _joe_
[16:05:13] <_joe_> hey that escalated fast
[16:05:19] You could become a Great Influencer
[16:05:21] GI _joe_
[16:05:42] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/883978/ That should do it
[16:05:49] <_joe_> this reminds me: I need to ask "do you like puns?" In interviews
[16:06:06] This is a discriminatory question and I won't stand for it
[16:06:12] In fact I'm vehemently sitting
[16:08:47] Successfully published image docker-registry.discovery.wmnet/httpd-fcgi:2.4.38-10
[16:08:50] Yay
[16:14:14] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[16:16:54] claime: ٩(^‿^)۶
[16:17:19] :D
[16:39:10] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (colewhite)
[16:45:10] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jcrespo)
[16:52:48] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (herron)
[17:14:55] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Eevans)
[17:17:46] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Eevans)
[17:23:23] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Eevans)
[17:39:19] serviceops, Abstract Wikipedia team (Phase θ – Throttling), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (Jdforrester-WMF)
[18:16:50] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (herron)
[18:17:27] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (herron)
[20:40:35] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (RKemper)
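
Note on the 14:05-15:18 wildcard SAN discussion: the conclusion there was that a wildcard only covers the leftmost label, so `*.revscoring-editquality-goodfaith.wikimedia.org` can work while `*.*.wikimedia.org` cannot. The sketch below illustrates that rule only. It is a simplified approximation of the usual leftmost-label matching, not the code Node.js or any TLS library actually runs, and the function name `wildcard_san_matches` is made up for illustration.

```python
def wildcard_san_matches(san: str, hostname: str) -> bool:
    """Simplified check: does a certificate SAN cover this hostname?

    Assumption: a "*" is honoured only as the entire leftmost label and
    stands for exactly one label. Partial wildcards such as "f*.example.org"
    are not handled here.
    """
    san_labels = san.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")

    # A wildcard replaces exactly one label, so the label counts must agree,
    # and the hostname must not contain empty labels.
    if len(san_labels) != len(host_labels) or not all(host_labels):
        return False

    # A "*" anywhere except the leftmost position is rejected outright,
    # which rules out multi-level patterns like "*.*.example.org".
    if "*" in san_labels[1:]:
        return False

    if san_labels[0] == "*":
        return san_labels[1:] == host_labels[1:]
    return san_labels == host_labels


host = "enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"
print(wildcard_san_matches("*.revscoring-editquality-goodfaith.wikimedia.org", host))
print(wildcard_san_matches("*.*.wikimedia.org", host))
print(wildcard_san_matches("*.wikimedia.org", host))
```

With this rule, one SAN per namespace is needed, which matches the plan above of updating the certificate when new namespaces are added.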