[13:30:04] \o
[13:31:14] gehel my google meet window is spinning...not sure what's going on but be there soon (I hope)
[14:16:10] o/
[14:16:23] errands, back in 40’
[16:03:15] ebernhardson: SUP CI is fixed again, replaced the personal password with a project access token with maintainer role (which is required to create tags)
[16:04:07] pfischer: thanks!
[16:11:25] ebernhardson: do you know who is in charge of image-related weighted tags?
[16:48:50] pfischer: hmm, image suggestions are cormac's team
[16:49:38] they come from the platform_eng airflow dags
[17:35:26] dr0ptp4kt: i realized i haven't been adding you to patches to review Cirrus related changes. Do you want to be included?
[17:51:03] * ebernhardson wonders why phpcs in CI seems to have different warnings than my local :S
[17:55:12] `wdqs-main` and `wdqs-scholarly` are fully up and in production now! (cc gehel)
[17:55:15] * ryankemper uncorks champagne bottle
[17:55:18] nice!
[17:58:38] ( ^_^)o自自o(^_^ ) CHEERS!
[18:00:24] workout/lunch, back in ~90
[18:01:25] ebernhardson I declined SRE pairing but we could do 2 PM PDT today or just catch up Thursday
[18:01:51] Next up: various cleanup tasks like https://phabricator.wikimedia.org/T372816 and https://phabricator.wikimedia.org/T373391. And then we need to outline a migration plan (i.e. communicate with community, start moving over remaining wdqs hosts as usage ramps up, eventually kick people off the old wdqs once sufficient time has passed, and then finally tear down all the old wdqs lvs stuff)
[18:02:48] ebernhardson: jfyi I'm taking an early lunch as well so we can go ahead and cancel pairing today
[18:12:23] kk
[18:42:39] hmm, shipped new sup container. Started and runs fine in staging. Shipped to codfw and it seems stuck at `ContainerCreating` ...
[18:45:02] ahh, looks like maybe it just took a long time to pull the containers? Looks like it took ~5m between the 'Pulling' and 'Pulled' events
[18:52:00] but the producer keeps failing in codfw with the new container :( Looking into it
[18:53:06] UnknownHostException: flink-zk2001.codfw.wmnet: Temporary failure in name resolution
[19:09:23] * ebernhardson is not finding how the consumer started and found zk, but the producer did not :S
[19:33:58] .moti wave
[19:34:00] still no luck :S It seems like the producer is unable to make connections to the dns server, but no clue why...also not finding where in helmfile that is configured (or is it elsewhere?)
[19:34:02] back
[19:34:30] ebernhardson my best guess is changes to network policies? Can take a look
[19:34:57] inflatador: curiously though i deployed the consumer and producer from the same helmfile command, but only the producer is failing. Would appreciate some looking :)
[19:35:18] :eyes
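(A rough sketch of how the name-resolution failure above could be narrowed down from the kubectl side. The namespace and pod name below are placeholders rather than the actual values from this deploy, and the exec step only works if the image ships getent or a similar resolver tool.)

```
# Placeholder names -- adjust to the real namespace/pod of the failing producer.
NS=cirrus-streaming-updater          # assumed namespace, not confirmed in the log
POD=flink-app-producer-xxxxx-yyyyy   # assumed pod name

# Pod events show scheduling, image pulls (Pulling/Pulled), and restart reasons.
kubectl -n "$NS" describe pod "$POD"

# Compare the NetworkPolicy objects applied to producer vs consumer,
# in particular whether any DNS/UDP 53 egress rule shows up.
kubectl -n "$NS" get networkpolicies -o yaml

# Test resolution from inside the pod (only if the image has getent).
kubectl -n "$NS" exec "$POD" -- getent hosts flink-zk2001.codfw.wmnet
```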
[19:35:47] it's not super critical, since it's the codfw producer and there should be almost no events in codfw. But still need to understand before moving on to the eqiad deploy :)
[19:37:24] still not 100% sure that calico is the issue, but these are the tickets: T373195 T359423
[19:37:24] T373195: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195
[19:37:24] T359423: Migrate charts to Calico Network Policies - https://phabricator.wikimedia.org/T359423
[19:43:30] not seeing anything interesting in `kubectl get networkpolicies`, except maybe that dns is unmentioned
[19:43:38] but it's also not mentioned in the eqiad pods that haven't been redeployed yet
[19:49:44] * inflatador also wonders why it would be different between producer and consumer
[19:53:56] they do have different networkpolicies, but the difference looks to be limited to consumer having egress for the elastic servers
[19:56:02] i suppose i could try rolling back the container version, but doubtful that changes anything
[19:56:27] (the only change deployed was a new container version). Also worried that might then break the consumer which is more important :P
[19:56:46] ebernhardson assuming you didn't have the same problem in staging?
[19:56:55] inflatador: right, staging deployed just fine
[19:57:27] the codfw deploy was a little weird in that it took ~5 minutes to pull the containers, never seen that before but seems unrelated
[19:57:56] Maybe delete the producer and see it get scheduled somewhere else? Doubt it would fix anything but could potentially rule out a broken kube worker
[19:58:11] they have been aggressively renaming/reimaging the kube workers lately
[19:58:59] hmm, i guess can issue a re-deploy. sec
[20:02:14] re-deployed producer in codfw with app.restartNonce=2
[20:03:47] inflatador: looks to have worked, albeit is a bit concerning that it will randomly fail depending on which host it lands on
[20:03:57] it's not fully up yet though
[20:04:05] but logs show it getting past where it was before
[20:05:34] ebernhardson ACK, let's keep an eye out. Did you get the hostname of the "bad host" by any chance?
[20:06:05] inflatador: didn't think to, i suppose i also don't know how to find which host a container was assigned to
[20:07:03] i guess it's in the complete pods output, but i never look at that (kubectl get -o yaml pods)
[20:07:32] maybe it's in logstash somewhere, looking
[20:08:22] if you can find it cool, if not nbd.
[20:10:00] inflatador: maybe wikikube-worker2043.codfw.wmnet
[20:10:56] i think the broken pod was flink-app-producer-b65b89bf5-5hff9, and it has logs with kubernetes.host=wikikube-worker2043.codfw.wmnet for that pod
[20:12:01] ebernhardson ACK.../me used to do a lot of this kinda troubleshooting, albeit for openstack nova workers
[20:13:04] i'm mildly surprised that when backing off and then restarting the pod it kept going to the same host? I guess i didn't force a restart of the pod before because i expected it was already doing that
[20:13:53] but i suppose i also noticed it was only restarting the one container, not the whole pod, just didn't put 1+1 together :P
[20:15:35] I guess that's the default behavior for kubernetes? Seems like it would be better to try different hosts
[20:23:07] codfw looks to be running now, deploying in eqiad
[20:31:36] looks to have deployed normally in eqiad at least
[20:45:17] * ebernhardson ponders how long it might take to run a regex query to find every page in every index that has a : in the redirect.titles array...
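(For illustration, a rough curl sketch of what such a per-index regexp count might look like. The endpoint is a placeholder and the field path `redirect.title.keyword` is an assumption about the mapping, not taken from the actual CirrusSearch config; the exact field/subfield would need checking against the real index mapping.)

```
# Count pages in one index whose redirect titles contain a colon.
# Endpoint and field path are assumptions, not the production values.
ES=https://search.example.wmnet:9243   # placeholder endpoint

curl -s -H 'Content-Type: application/json' "$ES/enwiki_content/_search" -d '
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "regexp": { "redirect.title.keyword": ".*:.*" }
  }
}'
```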
[20:48:22] probably better off to get it from hadoop :P
[20:58:12] hmm, elastic took just over 5 minutes to find 68k pages in enwiki_content...actually not as terrible. But that's only 1 index
[21:35:50] hmm, turns out we haven't been pruning the cirrus_index_without_content table, and have ~14tb of indexes going back to 20230521
[21:36:20] also something is wrong with how it's done, because the source has eqiad/codfw/cloudelastic, but this table only has codfw partitions :(
[21:38:52] ?!?!
[21:43:14] not sure why, it looks like both cirrus_index and cirrus_index_without_content are configured in the script that drops from snapshot partitioned tables...
[21:49:26] ebernhardson just curious, what would you lose if stat1008 died? Just looking at https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Clients and trying to get a better idea of the impact of losing a host
[21:50:32] guessing a lot more stuff would have to be pulled back down from HDFS?
[23:20:29] inflatador: hmm, i don't think we would lose anything in particular. I probably have random notebooks there but nothing critical
[23:20:47] it's mostly used as a gateway into the hadoop cluster where actual things are
[23:21:29] i rarely pull bulk data to stat machines directly, the compute happens in the cluster and results stored to the cluster
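(A sketch of double-checking the retention/partition problem above from a stat host, which is exactly the gateway-into-the-cluster usage described in the last messages. The `discovery` database name, the `spark3-sql` wrapper, and the HDFS path are all assumptions rather than confirmed values.)

```
# List partitions of the table -- should make the codfw-only partitions and the
# snapshots going back to 20230521 visible. Database name is assumed.
spark3-sql -e "SHOW PARTITIONS discovery.cirrus_index_without_content"

# Rough size of the underlying data; the HDFS path is a placeholder.
hdfs dfs -du -s -h /wmf/data/discovery/cirrus_index_without_content
```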