[06:22:14] bah... poor timing on my side to have started a full reindex yesterday with T330165 on the horizon :/
[06:22:14] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[06:23:27] errand, back for wdqs office hours
[08:00:45] dcausse, pfischer: we're starting in a minute in https://meet.jit.si/WDQSOfficeHour
[09:58:18] Lunch
[10:24:48] lunch
[13:13:36] Hello. I'm just checking with you that you're aware of the HDFS safe mode that's coming up shortly.
[13:17:51] I see from here that you have a WDQS streaming service running in YARN: https://yarn.wikimedia.org/cluster/app/application_1678266962370_5659
[13:18:08] btullis: thanks for the heads up, looking
[13:18:12] Does this need to be stopped or anything? Is it writing to HDFS?
[13:18:26] dcausse: A pleasure.
[13:18:38] btullis: stopping it, it does nothing at the moment
[13:19:35] Great, thanks. So for reference, we're going to be setting the YARN queues to STOPPED in about 10 minutes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903627
[13:19:52] ok, thanks!
[13:20:06] Then we're going to be putting HDFS into read-only mode shortly after that: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Safe_Mode
[13:20:51] btullis: do you have a rough estimate of how long it will last?
[13:20:58] The switch upgrade happens at 14:00 UTC, which is when we lose access to Hive, Superset, Turnilo, Phab, Druid, etc.
[13:21:38] oh ok, it's related to the switch upgrade, I'll follow along
[13:21:40] btullis do y'all do this for all switch maintenances?
[13:21:43] The sre-infra-foundations team are doing the switch upgrade. Last time it was about 30 minutes for the work, plus 10 minutes of checking, before we got the all-clear to restart stuff.
[13:22:32] inflatador: Yes. Mainly because it's knocking out a significant portion of the Hadoop worker nodes, but in this case also a pretty important MariaDB database for us, which we can't easily migrate to another host.
[13:24:32] btullis understood, I'll hit my team list for awareness
[13:25:50] Thanks. If the communication from our side could have been better in any way, let us know.
[13:31:24] No worries on that, just want to make sure we're ready for the next round and beyond
[14:50:48] \o
[14:52:08] o/
[15:32:20] pfischer: given the last emails from Ververica, it seems that we won't get any training from them. I'll look around to see if I find other opportunities. Let me know if you have any other ideas!
[15:42:05] inflatador: https://ververica.zendesk.com/hc/en-us/articles/4413642980498-Direct-buffer-OutOfMemoryError-when-using-Kafka-Connector-in-Flink might explain the error we've seen
[15:42:35] uploaded https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/903689 to use PLAINTEXT instead of the SSL port
[15:42:54] dcausse: eyes on your patch
[15:43:45] and filed T333373 to add support for SSL at the job level
[15:43:46] thanks!
[15:46:20] will deploy as soon as Jenkins finishes merging
[15:52:11] dcausse OK, it's deployed
[15:52:18] inflatador: thanks!
[15:55:26] o/ dcausse: I think I understand how the build + deployment process for flink is supposed to work. Are you currently working on the update pipeline or could we move it to GitLab?
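(For context on the PLAINTEXT-vs-SSL change above: it is about which Kafka listener the job connects to, and the actual change lives in the linked deployment-charts patch, not in application code. As a rough, hypothetical illustration of the client-side difference only, using kafka-python as a stand-in for the Flink Kafka connector, with made-up broker and topic names:)

    # Illustration only: broker host, ports, topic and CA path are assumptions,
    # and kafka-python stands in for the Flink Kafka connector used by the job.
    from kafka import KafkaConsumer

    # Plaintext listener (conventionally port 9092): no TLS handshake required.
    plaintext_consumer = KafkaConsumer(
        "example-topic",
        bootstrap_servers="kafka-broker.example.wmnet:9092",
        security_protocol="PLAINTEXT",
    )

    # TLS listener (conventionally port 9093): needs a trusted CA bundle,
    # which is what T333373 tracks adding support for at the job level.
    ssl_consumer = KafkaConsumer(
        "example-topic",
        bootstrap_servers="kafka-broker.example.wmnet:9093",
        security_protocol="SSL",
        ssl_cafile="/etc/ssl/certs/ca-certificates.crt",  # assumed CA path
    )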
[15:56:02] pfischer: I don't have any patch in progress there
[15:56:44] pfischer: there might be some bits you could steal from https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater
[15:57:25] it does not have anything regarding the maven build, but it has some info about how to use our flink base image and the deployment
[15:57:34] to the docker registry
[15:58:37] Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:6500, I guess the task manager pods do not have the envoy sidecar containers
[16:01:51] ryankemper the maintenance is over, I unbanned the elastic and cloudelastic nodes but if you have a chance, could you repool the elastic and w[cd]qs nodes? Otherwise I'll do them when I get back from my workout in ~40
[16:04:59] yes, the service mesh sidecars are not materialized by the flink-app chart...
[16:18:08] inflatador: sure, will get those repooled
[16:20:23] if someone has a couple minutes for a +1: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/903696/ (it's a test service running on the dse-k8s cluster)
[16:52:27] dcausse: done
[16:53:25] ryankemper: thanks!
[17:03:58] back
[17:05:16] spark is so fickle...it doesn't want to read and write to the same table (but different partitions) with the way glent writes tables, but i mucked around with things a bit and plugged the java dataframe into our python partition writer...and it works fine
[17:06:09] i guess i'll port that logic over to the java side, or maybe i can reuse the rdf spark bits
[17:50:46] lunch, back in time for pairing
[18:18:29] dinner
[18:21:26] back
[18:32:31] ryankemper: are you busy with the SLO dashboards in #wikimedia-observability ? Or are you joining the pairing session?
[18:33:01] gehel: yeah I'll join the meet in 5m after I wrap up this discussion
[18:33:09] ryankemper: take your time!
[18:33:43] kk
[19:44:57] hmm, odd. cirrus index import failed this week with ` Number of dynamic partitions created is 1001, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.' I wonder what's causing more dynamic partitions than before
[19:46:31] oh, all.dblist now has 1002 entries
[21:11:31] all.dblist having 1002 entries is a cause for celebration! Maybe with a hint of trepidation
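(For context on the 17:05 Spark complaint: refusing to read from and write to the same table is a generic Spark restriction, and the workaround actually used, handing the Java dataframe to the existing Python partition writer, is not shown in the log. A minimal PySpark sketch of one common generic workaround, with made-up table, column and staging-path names, is to stage the result elsewhere first so the final write no longer reads from its own destination:)

    # Sketch only: "glent.suggestions", "partition_date" and the staging path are
    # hypothetical; this shows a generic workaround, not the fix described above.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    staging = "hdfs:///tmp/glent_staging"  # hypothetical scratch location

    src = spark.table("glent.suggestions").where("partition_date = '20230320'")
    result = src.withColumn("partition_date", F.lit("20230327"))  # target a new partition

    # Materialize first, so the final write no longer depends on the source table...
    result.write.mode("overwrite").parquet(staging)

    # ...then overwrite only the matching partition of the same table.
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    spark.read.parquet(staging).write.mode("overwrite").insertInto("glent.suggestions")

(Similarly hedged, for the 19:44 dynamic-partition failure: the error message already names the knob, and all.dblist growing to 1002 entries explains why the default limit overflowed. Assuming the import runs through Spark SQL with Hive support, which the log does not show, raising the caps with some headroom would look roughly like:)

    # Assumption: the import creates one partition per wiki via Spark's Hive support.
    # The limits below are arbitrary headroom over the ~1002 wikis, not tuned values.
    spark.sql("SET hive.exec.max.dynamic.partitions=2000")
    spark.sql("SET hive.exec.max.dynamic.partitions.pernode=2000")
    # If the job instead runs through Hive directly, the same properties can be
    # set in the Hive session before the INSERT.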