[07:05:16] o/
[07:07:00] \o
[07:07:13] welcome back :)
[07:07:29] want to do a quick catchup on meet?
[07:11:49] zpapierski: yes please :)
[07:12:27] zpapierski: https://meet.google.com/jjf-cyyx-cpz?authuser=0
[08:00:56] errand
[08:44:31] zpapierski, dcausse: would you be around to talk SWE recruiting?
[08:44:49] sure
[08:45:27] Not for a few hours - on my way to the bank :(
[08:45:43] let's start with David, and we can update you later on
[08:45:52] Sure
[08:46:02] meet.google.com/vby-xrgw-kdk
[08:46:06] meeting in French!
[09:44:54] lunch
[10:00:31] lunch 2
[11:19:45] gehel: fyi, I'm back
[11:20:11] first day back to school for Oscar, I'll be off in a few minutes.
[11:20:15] I'll ping you when back
[11:20:18] sure
[12:20:18] dcausse: want me to do this wide port selection patch? also - perhaps you know where I should file a ticket for +2 rights on deployment_charts?
[12:20:19] dcausse: I'm back just in time for the 14:30 meeting. No need to cover for me!
[12:22:39] gehel: great, removing myself
[12:24:09] zpapierski: please go ahead with the patch; for the +2 perms I don't know, I would tag "operations" and let them triage
[12:24:18] ok
[12:26:28] zpapierski: you should just ask someone to add you to the wmf-deployment gerrit group, it shouldn't require any approvals as you're in the deployment admin group
[12:26:53] thanks, will do that
[13:00:57] dcausse: if you have a moment https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/714355/
[13:01:10] looking
[13:01:12] I did (I think) put an accept-all policy on those pods
[13:01:31] only for the testing, we need to eliminate (or confirm) this as being the cause
[13:03:05] I worry that egress rules might still be in the way
[13:03:30] right, I can probably do accept-all on them too, w8
[13:03:59] can I just remove egress from the policies?
[13:04:17] nvm, I'll just do the same
[13:04:45] maybe? set egress.enabled to false?
[13:05:47] I can try that later, the documentation mentions accept-all on egress similarly to what I did
[13:06:33] pushed, let's see what helm-lint has to say
[13:07:40] "both the egress policy on the source pod and the ingress policy on the destination pod need to allow the traffic"
[13:07:44] from https://kubernetes.io/docs/concepts/services-networking/network-policies/
[13:07:52] I wonder how it works today
[13:08:24] since I see no egress rules allowing the jobmanager to go to the tms and vice-versa
[13:21:21] it has to, I can clearly see heartbeats from the tms on the jm
[13:21:47] anyway - is that ok now?
[13:22:42] there seem to be some subnets allowed from default-network-policy-conf.yaml (e.g. 10.64.64.0/21) that might be it
[13:23:25] right, they are added to the egress rules, if enabled
[13:29:32] anyway - want to merge the patch (I created a ticket for my rights, but I don't have them yet)
[13:29:34] ?
[13:29:54] yes, was checking a couple of things first
[13:30:02] ah, ok
[13:35:34] I think we allow ingress on the taskmanager_data_port from the jobmanager, not the taskmanagers (the podSelector of the ingress rule)
[13:35:45] so this had no effect I suppose
[13:36:34] that's not entirely impossible, I barely understand how k8s network policies work
[13:37:11] trying something
[13:37:50] I think I know what you mean
[13:38:07] I should add the data port under:
[13:38:12] https://www.irccloud.com/pastebin/yiGJVEWk/
[13:38:33] I only put it under
[13:38:36] https://www.irccloud.com/pastebin/UrzDRpk4/
[13:38:45] which is the jobmanager, I guess?
[13:40:32] zpapierski: I mean https://gist.github.com/nomoa/f55450c8012fae69844ed4f707ded740
[13:40:47] there are two pod selectors
[13:41:13] that's what I meant as well, the data port should be in the second one (or both)
[13:41:15] the one selecting the pods the rules apply to, and the one selecting the pods from which the ingress applies
[13:41:37] give me a sec, I'll prepare a patch
[13:41:57] no, there should be a new "from" section in the taskmanager spec
[13:42:23] not just a new port in the taskmanager match?
[13:42:54] ah, no - you're correct
[13:42:55] no, because here we only have tm -> jm and jm -> tm (reading the pod selectors), we need a tm -> tm
[13:43:02] yeah, I understand
[13:43:05] a sec
[13:44:00] maybe they need other ports as well?
[13:44:06] I'll leave them all there for now
[13:46:35] something like this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/714364
[13:50:13] looking
[13:51:20] yes
[13:52:55] ok, let's try it
[13:53:18] I'll deploy that to codfw
[13:53:31] +2ed, you take care of the deploy?
[13:53:50] yep
[13:54:11] deploy only to codfw, eqiad has to be started with the right savepoint
[13:54:16] I know
[13:56:58] huh, I did a sync and still got 0.0.18?
[13:57:04] something's not right
[13:58:02] ah, ok, I just needed to wait longer
[13:58:28] dcausse, zpapierski: have time for a quick update on SWE hiring?
[13:58:40] in 10 min? I'm doing the flink deployment now
[13:58:46] sure
[13:58:51] meet.google.com/xcv-jyrn-cso
[13:58:55] jump in when ready!
[14:00:14] ok, pods recreated with the new ingress rules, waiting a bit for cluster init
[14:00:37] I'm going to be so happy if this is it
[14:03:14] checkpoints are completing, is that it?
[14:03:48] I'm not seeing any data yet on grafana though
[14:05:01] seeing data in the topic
[14:05:08] I am as well :D
[14:05:18] such a small thing, but it works :)
[14:05:43] dcausse: do you have time to join us?
[14:05:49] yes, joining
[16:08:04] dinner
[16:21:00] zpapierski: i saw your thing about seeing the local state of a yarn executor instance (rocksdb). For the dumbest idea ever... i have a thing that spins up a python repl on a yarn executor that you can telnet into. From there you can poke around the local system easy-ish
[16:21:02] https://wikitech.wikimedia.org/wiki/User:EBernhardson/pyspark
[16:21:21] i've used it before in a while loop that keeps trying until os.hostname() returns the host i want
[16:21:54] i guess it's socket.gethostname(), same idea though :)
[17:18:48] ryankemper: FYI I'm back from vacation; in case you were waiting for some reviews on the cookbooks, I plan to have a look at them in the next couple of days tops, I'm catching up with stuff ;)
[17:19:15] volans: great! welcome back :)
[17:19:35] thanks
[17:29:59] flink@codfw failed because it started from an old offset (basically backfilling) but without a state, causing all events to be buffered and filling up many timers that cannot be fired within the allowed checkpoint time
[17:31:18] will wait for the run to fail and set the offsets manually to something recent
[17:31:22] dinner
[17:47:33] ryankemper: easy patch for you: https://gerrit.wikimedia.org/r/714391
[18:17:40] I got an interesting question on my talk page and I figured I would follow up here. Given the apparent scaling issues with Wikidata Query Service, has it reached a point where bots adding statements (but not new items) should no longer run?
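
The network-policy fix discussed above (13:35-14:05) boils down to adding a taskmanager-to-taskmanager ingress rule alongside the existing jobmanager-to-taskmanager one. A minimal sketch of what that looks like as a Kubernetes NetworkPolicy follows; the label names and the 6122/6121 RPC and data ports are illustrative placeholders, not the actual values from operations/deployment-charts (the real change is the Gerrit patch linked at 13:46:35).

    # Illustrative sketch only - labels and ports are placeholders.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: flink-taskmanager-ingress
    spec:
      # The rules below apply to the taskmanager pods.
      podSelector:
        matchLabels:
          component: taskmanager
      policyTypes:
        - Ingress
      ingress:
        # jm -> tm: the jobmanager may reach the taskmanagers (this rule existed already).
        - from:
            - podSelector:
                matchLabels:
                  component: jobmanager
          ports:
            - protocol: TCP
              port: 6122   # RPC port, placeholder
        # tm -> tm: the missing rule - taskmanagers must reach each other's data port
        # so tasks can exchange network buffers between pods.
        - from:
            - podSelector:
                matchLabels:
                  component: taskmanager
          ports:
            - protocol: TCP
              port: 6121   # taskmanager data port, placeholder

The "accept all on egress" option mentioned earlier in the log would, in the same notation, be a single empty egress rule (egress: [{}]) with Egress listed under policyTypes, which allows all outbound traffic from the selected pods.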
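
The wikitech page linked at 16:21:02 has the actual script; purely as a sketch of the idea (and not that script), the pattern looks roughly like the following, with the target host, port, and partition count as made-up placeholders: spread tasks across the cluster, and the one that lands on the host you care about serves a crude Python prompt over TCP that you can telnet into.

    # Rough sketch of the idea only - NOT the script from the wikitech page above.
    # TARGET_HOST, REPL_PORT and the partition count are made-up placeholders.
    import socket

    from pyspark.sql import SparkSession

    TARGET_HOST = "an-worker1099.eqiad.wmnet"  # hypothetical YARN worker to inspect
    REPL_PORT = 4444                           # arbitrary port to telnet into

    def maybe_serve_repl(_):
        # Every task checks which host it landed on; only a task on the target
        # host opens the prompt, the rest return immediately.
        if socket.gethostname() != TARGET_HOST:
            return []
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            srv.bind(("0.0.0.0", REPL_PORT))
        except OSError:
            return []  # another task on the same host already holds the port
        srv.listen(1)
        conn, _addr = srv.accept()
        f = conn.makefile("rw")
        env = {}
        f.write(">>> ")
        f.flush()
        for line in f:  # crude read-eval-print loop over the socket
            try:
                try:
                    result = eval(line, globals(), env)   # expressions
                    if result is not None:
                        f.write(repr(result) + "\n")
                except SyntaxError:
                    exec(line, globals(), env)            # statements
            except Exception as exc:
                f.write(f"{type(exc).__name__}: {exc}\n")
            f.write(">>> ")
            f.flush()
        return []

    spark = SparkSession.builder.getOrCreate()
    # Spread enough tasks around that at least one lands on the target host; the job
    # then hangs while the prompt is being served, which is the point.
    spark.sparkContext.parallelize(range(200), 200).flatMap(maybe_serve_repl).count()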
[18:28:01] hare: no, we're not at the point of taking drastic actions like that
[18:57:21] Thank you for confirming
[18:58:59] ebernhardson: thanks, catching up on those patches now
[19:03:27] ryankemper: thanks for the merge. in that series of patches i'm trying to fix airflow complaining about disk space issues by shipping task logs to hdfs
[19:03:52] ebernhardson: gotcha, the implication being the task logs are what's filling up the disk?
[19:04:12] ryankemper: usually what happens is something fails and generates a spectacular amount of logs. And then airflow keeps repeating it :)
[19:04:46] ah yes, that's a pattern any operator is intimately familiar with :P
[19:07:43] I do hope you find a solution to Blazegraph. If there are any leading replacements, I am happy to try them out on my own hardware. I wonder what people here think of "dgraph", since they seem to think highly of themselves.
[20:00:23] hare: thanks for the offer! I don't think the team is at the stage where we are seriously considering any alternatives yet, but I have also seen some suggestions crop up from the community (with no personal opinion on or insight into them myself). At this stage, we are trying to determine which general priorities are most important for users, which will help us better understand which compromises/optimizations we'll want from an alternative
[21:44:11] :q
[21:45:18] this isn't vim
[21:50:05] :}