[09:57:43] with new jvm settings applied cloudelastic1006 is complaining on the old GC, probably because of -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly; the ratio between old and young/survivor pool sizes changed a lot though.
[09:58:25] wondering if we should also consider bumping the heap to 10g on the small clusters running in cloudelastic too (https://gerrit.wikimedia.org/r/c/operations/puppet/+/855673)
[11:14:16] lunch
[11:28:57] lunch + errands
[14:03:58] dcausse good timing since I'm planning on rebooting the main clusters today, will get that merged shortly
[14:05:28] inflatador: thanks!
[14:06:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/855673 would have to be adapted if we want to bump cloudelastic too
[14:07:06] but feel free to go ahead with this patch if you want to start rebooting the small clusters before Erik wakes up
[14:07:13] it's easy to make another patch
[14:08:22] sounds good, I'll get another patch started once I merge this one
[14:49:22] ryankemper: fyi, I just repooled elastic2052 following T320482
[14:49:23] T320482: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482
[15:10:37] dcausse: we're in the k8s meeting if you want to join: https://meet.google.com/wjt-srdx-cgq
[15:10:50] gehel: sure
[16:40:54] https://gerrit.wikimedia.org/r/859094 <- memory upgrade for cloudelastic patch
[17:58:44] lunch/errands, back in ~1h
[17:58:47] hmm, it seems like cirrus docs with file_text=[] come from an old bug with js/css (maybe other code embedded in wiki pages) that doesn't affect newly created documents. Wondering if it's worth fixing up the existing indices
[18:01:00] ryankemper I'm in the middle of rebooting codfw to apply the JVM changes; it's in a tmux window on eqiad cumin if you need to check it while I'm gone. I just restarted it as it was stuck on the batching step for over an hour
[18:21:38] i guess i'll just do a one-off cleanup: collect the set of titles with file_text=[] and issue a manual update
[18:21:48] verify it's really all .css/.js
[18:43:11] inflatador: ack
[18:58:27] back
[18:59:47] Cookbook is up to group 12, no signs of stalling since I restarted it
[19:01:20] hmm, hold that thought, we are red in CODFW. Probably a reindex thing, checking now
[19:01:42] nm...back to yellow
[19:04:55] * ebernhardson thought he could speed this up by writing a painless query that filters for `params._source.file_text instanceof List`, but somehow we can only access _source in a function_score script, which can't filter, and not in a script query, which can filter
[19:06:31] oh i spoke too soon... there is a min_score option we can hax with
[19:07:03] but perhaps asking the cluster to decode all those json docs is a bit much
[19:08:26] a small test against kkwiki_general, with 310k docs, has 14.7k matching docs. might as well stick with the thing that's already running a search_after by page_id over all indices
[19:10:43] wow i can't write things properly :P The actual answer is 79 docs
[20:01:15] hmm, somehow the incoming_links update came up with 0 via the automation... even though my manual run last week found many :(
[20:01:29] something wrong somewhere...
[20:03:03] nm... yet again i can't write things properly :P 2022113 != 20221113. Maybe dashes would have been better :P
[20:27:52] * ebernhardson sighs... need some sort of additional consistency checks... the esbulk conversion expects a field named 'wikiid', and i called it 'wiki' in the incoming_links_update schema
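A minimal sketch of the kind of search_after scan described above, for illustration only: it assumes an elasticsearch-py client, a placeholder host, and that docs carry sortable `page_id` plus `title` and `file_text` fields. Since `file_text: []` indexes no tokens, it can't be matched with an ordinary term query, so the sketch fetches _source and filters client-side; the real one-off cleanup script may differ.

```python
# Hedged sketch, not the actual cleanup script: walk an index with
# search_after ordered by page_id and yield titles whose file_text in
# _source is an empty list. Host, index, and field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

def titles_with_empty_file_text(index):
    search_after = None
    while True:
        body = {
            "size": 1000,
            "sort": [{"page_id": "asc"}],
            "_source": ["title", "file_text"],
            "query": {"match_all": {}},
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = es.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            # [] can't be distinguished server-side without a script,
            # so check the decoded _source here instead.
            if hit["_source"].get("file_text") == []:
                yield hit["_source"]["title"]
        search_after = hits[-1]["sort"]

for title in titles_with_empty_file_text("kkwiki_general"):
    print(title)
```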
[22:24:57] looks like we're getting more GC alerts in cloudelastic, does anyone have time to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/859094
[22:24:59] ?
[22:27:50] that's basically what dcausse said, the adjusted gc settings shrank the old pool and caused more alerts. Bigger heap instead of fine tuning gc settings should be fine
[22:28:06] fine tuning gc settings = lots of guess work and hopium :P
[22:28:24] more like roughtuning :D
[22:28:32] inflatador: looks good to me. if there's a phab ticket then stick that bug in the commit msg, otherwise it's all good
[22:29:49] ryankemper nah, I just ripped off ebernhardson's commit message from https://gerrit.wikimedia.org/r/c/operations/puppet/+/855673, no task that I know of
[22:30:09] Phab is down for me; based on the scrollback in operations it looks like they just failed it over
[22:30:29] inflatador: +1 from me
[22:30:40] ACK, will merge
[22:33:52] OK, merged and restarting cloudelastic to apply now
[22:34:31] Also: the CODFW restarts are finished, we can do eqiad tomorrow if that works for everyone else
[22:36:34] +1
[22:39:34] aww, phabricator being out also breaks ci. will have to wait a bit
[22:39:53] fatal: unable to access 'https://phabricator.wikimedia.org/diffusion/NLSP/new-lexeme-special-page.git/': The requested URL returned error: 500
[23:03:40] OK, cloudelastic's back up. Heading out for the day
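For reference, a hedged sketch of the jvm.options shape being discussed: the CMS occupancy flags quoted at 09:57 assume a heap/pool sizing that the new settings changed, and the patch above raises the heap to 10g rather than retuning them. Exact values and file layout in the actual puppet change may differ.

```
# hypothetical jvm.options fragment, not the actual puppet patch
-Xms10g
-Xmx10g
# CMS tuning flags that triggered the old-GC alerts under the new sizing:
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
```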