[06:22:14] bah... poor timing on my side to have started a full reindex yesterday with T330165 on the horizon :/
[06:22:14] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[06:23:27] errand, back for wdqs office hours
[08:00:45] dcausse, pfischer: we're starting in a minute in https://meet.jit.si/WDQSOfficeHour
[09:58:18] Lunch
[10:24:48] lunch
[13:13:36] Hello. I'm just checking with you that you're aware of the HDFS safe mode that's coming up shortly.
[13:17:51] I see from here that you have a WDQS streaming service running in YARN: https://yarn.wikimedia.org/cluster/app/application_1678266962370_5659
[13:18:08] btullis: thanks for the heads up, looking
[13:18:12] Does this need to be stopped or anything? Is it writing to HDFS?
[13:18:26] dcausse: A pleasure.
[13:18:38] btullis: stopping it, it does nothing at the moment
[13:19:35] Great, thanks. So for reference, we're going to be setting the YARN queues to STOPPED in about 10 minutes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903627
[13:19:52] ok, thanks!
[13:20:06] Then we're going to be putting HDFS into read-only mode shortly after that: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Safe_Mode
[13:20:51] btullis: do you have a rough estimate of how long it will last?
[13:20:58] The switch upgrade happens at 14:00 UTC, which is when we lose access to Hive, Superset, Turnilo, Phab, Druid, etc.
[13:21:38] oh ok, it's related to the switch upgrade, I'll follow along
[13:21:40] btullis do y'all do this for all switch maintenances?
[13:21:43] The sre-infra-foundations team are doing the switch upgrade. Last time it was about 30 minutes for the work, plus 10 minutes of checking, before we got the all-clear to restart stuff.
[13:22:32] inflatador: Yes. Mainly because it's knocking out a significant portion of the Hadoop worker nodes, but in this case also a pretty important MariaDB database for us, which we can't easily migrate to another host.
[13:24:32] btullis understood, I'll hit my team list for awareness
[13:25:50] Thanks. If the communication from our side could have been better in any way, let us know.
[13:31:24] No worries on that, just want to make sure we're ready for the next round and beyond
[14:50:48] \o
[14:52:08] o/
[15:32:20] pfischer: given the last emails from Ververica, it seems that we won't get any training from them. I'll look around to see if I find other opportunities. Let me know if you have any other ideas!
[15:42:05] inflatador: https://ververica.zendesk.com/hc/en-us/articles/4413642980498-Direct-buffer-OutOfMemoryError-when-using-Kafka-Connector-in-Flink might explain the error we've seen
[15:42:35] uploaded https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/903689 to use PLAINTEXT instead of the SSL port
[15:42:54] dcausse: eyes on your patch
[15:43:45] and filed T333373 to add support for SSL at the job level
[15:43:46] thanks!
[15:46:20] will deploy as soon as Jenkins finishes merging
[15:52:11] dcausse OK, it's deployed
[15:52:18] inflatador: thanks!
[15:55:26] o/ dcausse: I think I understand how the build + deployment process for flink is supposed to work. Are you currently working on the update pipeline or could we move it to GitLab?
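(For context on the PLAINTEXT-vs-SSL change above: it is about which Kafka listener the job connects to, and the actual change lives in the linked deployment-charts patch, not in application code. As a rough, hypothetical illustration of the client-side difference only, using kafka-python as a stand-in for the Flink Kafka connector, with made-up broker and topic names:)

    # Illustration only: broker host, ports, topic and CA path are assumptions,
    # and kafka-python stands in for the Flink Kafka connector used by the job.
    from kafka import KafkaConsumer

    # Plaintext listener (conventionally port 9092): no TLS handshake required.
    plaintext_consumer = KafkaConsumer(
        "example-topic",
        bootstrap_servers="kafka-broker.example.wmnet:9092",
        security_protocol="PLAINTEXT",
    )

    # TLS listener (conventionally port 9093): needs a trusted CA bundle,
    # which is what T333373 tracks adding support for at the job level.
    ssl_consumer = KafkaConsumer(
        "example-topic",
        bootstrap_servers="kafka-broker.example.wmnet:9093",
        security_protocol="SSL",
        ssl_cafile="/etc/ssl/certs/ca-certificates.crt",  # assumed CA path
    )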
[15:56:02] pfischer: I don't have any patch in progress there
[15:56:44] pfischer: there might be some bits you could steal from https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater
[15:57:25] it does not have anything regarding the maven build, but it has some info about how to use our flink base image and the deployment
[15:57:34] to the docker registry
[15:58:37] Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:6500, I guess the task manager pods do not have the envoy sidecar containers
[16:01:51] ryankemper the maintenance is over, I unbanned the elastic and cloudelastic nodes but if you have a chance, could you repool the elastic and w[cd]qs nodes? Otherwise I'll do them when I get back from my workout in ~40
[16:04:59] yes, the service mesh sidecars are not materialized by the flink-app chart...
[16:18:08] inflatador: sure, will get those repooled
[16:20:23] if someone has a couple minutes for a +1: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/903696/ (it's a test service running on the dse-k8s cluster)
[16:52:27] dcausse: done
[16:53:25] ryankemper: thanks!
[17:03:58] back
[17:05:16] spark is so fickle...it doesn't want to read and write to the same table (but different partitions) with the way glent writes tables, but i mucked around with things a bit and plugged the java dataframe into our python partition writer...and it works fine
[17:06:09] i guess i'll port that logic over to the java side, or maybe i can reuse the rdf spark bits
[17:50:46] lunch, back in time for pairing
[18:18:29] dinner
[18:21:26] back
[18:32:31] ryankemper: are you busy with the SLO dashboards in #wikimedia-observability ? Or are you joining the pairing session?
[18:33:01] gehel: yeah I'll join the meet in 5m after I wrap up this discussion
[18:33:09] ryankemper: take your time!
[18:33:43] kk
[19:44:57] hmm, odd. cirrus index import failed this week with ` Number of dynamic partitions created is 1001, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.' I wonder what's causing more dynamic partitions than before
[19:46:31] oh, all.dblist now has 1002 entries
[21:11:31] all.dblist having 1002 entries is a cause for celebration! Maybe with a hint of trepidation
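(For context on the 17:05 Spark complaint: refusing to read from and write to the same table is a generic Spark restriction, and the workaround actually used, handing the Java dataframe to the existing Python partition writer, is not shown in the log. A minimal PySpark sketch of one common generic workaround, with made-up table, column and staging-path names, is to stage the result elsewhere first so the final write no longer reads from its own destination:)

    # Sketch only: "glent.suggestions", "partition_date" and the staging path are
    # hypothetical; this shows a generic workaround, not the fix described above.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    staging = "hdfs:///tmp/glent_staging"  # hypothetical scratch location

    src = spark.table("glent.suggestions").where("partition_date = '20230320'")
    result = src.withColumn("partition_date", F.lit("20230327"))  # target a new partition

    # Materialize first, so the final write no longer depends on the source table...
    result.write.mode("overwrite").parquet(staging)

    # ...then overwrite only the matching partition of the same table.
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    spark.read.parquet(staging).write.mode("overwrite").insertInto("glent.suggestions")

(Similarly hedged, for the 19:44 dynamic-partition failure: the error message already names the knob, and all.dblist growing to 1002 entries explains why the default limit overflowed. Assuming the import runs through Spark SQL with Hive support, which the log does not show, raising the caps with some headroom would look roughly like:)

    # Assumption: the import creates one partition per wiki via Spark's Hive support.
    # The limits below are arbitrary headroom over the ~1002 wikis, not tuned values.
    spark.sql("SET hive.exec.max.dynamic.partitions=2000")
    spark.sql("SET hive.exec.max.dynamic.partitions.pernode=2000")
    # If the job instead runs through Hive directly, the same properties can be
    # set in the Hive session before the INSERT.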