[06:55:50] o/
[07:31:42] dcausse: welcome back!
[07:31:54] dcausse: if you're already up and running, I'm in https://meet.google.com/jkx-vjrx-kbn
[08:36:33] o/ pfischer, btullis, if you have a minute: https://gerrit.wikimedia.org/r/917820
[08:38:45] context is a flink job running in codfw that struggles to load its state, I share Erik's assumption that it's mem related, so that's why I'd like to try to increase the heap size; the other change relates to collecting rocksdb metrics
[08:39:44] if this is not enough I'll run the job from yarn with more resources
[09:17:19] dcausse: sure, looking at it
[09:17:26] thanks!
[09:18:57] +1
[09:24:37] dcausse: would you have a minute to discuss https://phabricator.wikimedia.org/T325315 ? I’m somewhat stuck/not sure how to proceed considering ott
[09:24:51] ottomata’s https://phabricator.wikimedia.org/T331399
[09:27:15] pfischer: sure, can we schedule something for later this afternoon? because I probably need to get up to speed on what was discussed so far (might be interesting to include Erik in the discussion as well)
[09:27:57] Sure, I’ll schedule a meeting
[09:28:19] thanks!
[09:43:53] Trey314159: probably another question for you: https://www.mediawiki.org/wiki/Topic:Xhex7tacfvvzc9ov
[10:01:50] lunch
[12:41:32] pfischer: if you haven't seen it, this comment (on another ticket) best describes the need for a common event data model for links: https://phabricator.wikimedia.org/T333497#8772933
[12:41:43] and it looks like redirect target would fit in there too
[12:41:57] they don't all need to be in the same stream, but we should make a consistent data model if we can
[12:47:23] o/
[12:53:59] dcausse: I'll reply to that stop word question
[12:54:11] thanks!
[14:23:24] inflatador, ebernhardson: any feedback on the cluster expansion for elasticsearch?
[15:54:42] gehel: looking. I thought there was a related phab ticket but sadly I'm coming up with nothing
[16:07:34] ebernhardson maybe this one? https://phabricator.wikimedia.org/T334210 . We can go over this in more depth at pairing today
[16:07:54] workout, back in ~40
[16:08:35] yup, that's the one, thanks! Seems I managed to never subscribe to it
[16:50:40] the job resumed properly from yarn... generated checkpoint sizes are back to something normal (~between 1G and 2G), going to assume that it was failing due to resource constraints preventing rocksdb compaction...
[16:51:14] back
[16:51:39] will wait for the lag to catch up a bit and will resume from k8s
[16:52:20] nice!
[16:53:03] that means we perhaps need to provision a bit more resources on k8s for when such situations happen
[16:53:50] yea seems like it
[16:53:55] it's currently 2G?
[16:54:13] yes 2G * 3 nodes
[16:54:31] 4 cpus per container
[16:54:31] my intuition is that's really low for anything jvm.
[16:54:51] yes...
[16:55:23] can we just double it? 6G of memory extra has to be about nothing
[16:56:42] How much RAM did you give it in yarn?
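[Editor's aside: a rough back-of-the-envelope sketch of the TaskManager memory layouts being compared just above and in the exchange that follows. Only the 3 × 2G, doubled 3 × 4G, and 12 × 5G figures come from the conversation; the per-JVM overhead constant and the layout labels are illustrative assumptions, not measurements.]

```python
# Compare total memory and rough usable-for-state memory across the Flink
# TaskManager layouts mentioned in the chat. Purely illustrative arithmetic.
JVM_OVERHEAD_GB = 0.5  # assumed fixed non-heap overhead per TaskManager JVM

layouts = {
    "current k8s": {"taskmanagers": 3, "mem_gb": 2},   # 2G * 3 nodes
    "doubled k8s": {"taskmanagers": 3, "mem_gb": 4},   # "can we just double it?"
    "yarn run":    {"taskmanagers": 12, "mem_gb": 5},  # 5g * 12
}

for name, layout in layouts.items():
    total = layout["taskmanagers"] * layout["mem_gb"]
    overhead = layout["taskmanagers"] * JVM_OVERHEAD_GB
    usable = total - overhead
    print(f"{name:12s}: total {total:5.1f}G, ~{overhead:.1f}G JVM overhead, "
          f"~{usable:.1f}G left for heap/state")
```

[The point made at 17:01 below falls out of this arithmetic: the fixed per-JVM overhead scales with the number of TaskManagers, so fewer, fatter JVMs leave more of the same total memory for RocksDB state.]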
[16:56:45] yes we can probably ask to double, not sure about the layout (do we want fat jvms or more jvms)
[16:57:01] in yarn I gave a lot
[16:57:13] 5g * 12 :P
[16:58:00] I would probably lean toward fatter for now, but I suppose I don't have a great reason
[16:58:33] I kept the old checkpoint; I might do some testing to try to guess what's a reasonable value. 5g * 12 seems too much, it ought to run with less
[16:59:06] I'm OK with throwing memory at the problem if that stabilizes it
[17:01:09] not sure about the tradeoff, more jvms will split the state in smaller chunks but you definitely pay a bunch of overhead for the jvm itself... so yes "fatter" jvm seems generally preferable if the state remains reasonable
[17:14:45] going offline, will check things back later tonight
[17:45:44] lunch, back in time for pairing
[18:25:20] Few ' late to pairing
[19:15:26] Updated the service restart documentation per g-ehel's request to document what needs to happen for DC switch maintenances. Feel free to look it over and add/change anything https://wikitech.wikimedia.org/wiki/Service_restarts#Elasticsearch
[19:26:10] inflatador: my ask was more specifically to ensure that T335042 is up to date
[19:26:10] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[19:26:32] If you could add a note in there in the "Search Platform" section. Thanks!
[19:28:03] gehel I'll update it now, but FWIW Ryan and I typically ban/update the task the day before
[19:28:47] yep, you should note in that task that you will ban the nodes the day before and unban them once the maintenance is completed.
[19:53:32] ACK, done
[20:31:08] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/917944 small change to the ban cookbook if anyone has time to look
[21:38:54] hmm, wcqs2002 is down according to icinga
[21:39:20] err...1002 that is
[21:40:23] huh
[21:41:44] SSH seems down, just rebooted from DRAC
[21:41:52] it let me ssh in
[21:42:05] oh, 1002
[21:47:19] nothing particularly clear in logs, it went down May 4 ~ 4:58 UTC. It looks like it was shut down at least somewhat cleanly, jetty reports a little bit of shutdown
[22:04:02] Hmm. Out for today, but I might dig around more tomorrow.
[22:06:58] lag is still close to 6 days, but it's decreasing
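[Editor's aside: for readers unfamiliar with the "ban" mentioned around 19:28–20:31, it generally means excluding the affected hosts from shard allocation ahead of the switch maintenance. The real work is done by the SRE cookbook linked above; the sketch below is only a minimal illustration of the underlying Elasticsearch cluster-settings call such a ban typically maps to. The endpoint URL and host names are made-up assumptions.]

```python
# Minimal sketch: exclude hosts from shard allocation before maintenance,
# then clear the exclusion afterwards. Not the actual ban cookbook.
import requests

CLUSTER_URL = "https://search.example.internal:9200"  # illustrative endpoint
BANNED_HOSTS = "elastic2037,elastic2038"              # hypothetical row D hosts


def set_allocation_exclusion(hosts):
    """Exclude the given node names from shard allocation; pass None to clear."""
    resp = requests.put(
        f"{CLUSTER_URL}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": hosts}},
        timeout=30,
    )
    resp.raise_for_status()


# The day before the maintenance:
set_allocation_exclusion(BANNED_HOSTS)
# Once the switch upgrade is completed:
set_allocation_exclusion(None)
```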