[03:25:00] codfw upgrade is *almost* done. just a handful of nodes left. looks like we're hitting a bug of some sort in the new spicerack do-the-masters-at-the-end logic though:
[03:25:08] https://www.irccloud.com/pastebin/HnR6p1do/
[03:26:01] That error message comes from here: https://github.com/wikimedia/operations-software-spicerack/blob/07d9eb94db21e7682ec1bc1e0988d305237041e7/spicerack/elasticsearch_cluster.py#L839-L840
[06:45:40] * ebernhardson should have printed more debug info in that error message. To think all the times i've complained about 'File not found' not including the file name :P
[06:54:30] hmm, so it's completely correct. There is a cycle in the restart graph, it's not possible to restart all the nodes in a cluster prior to the masters. But I suppose that means it needs to be more flexible
[06:59:05] once we get to the point where all remaining nodes are masters of some clusters it just needs to restart whatever and call it good enough
[07:00:44] would it work to have the same master nodes for the three clusters (not even sure that would solve this situation, still trying to understand this graph :P)
[07:01:23] the graph essentially has an edge from all non-master nodes to all master nodes in a cluster
[07:01:40] but, for example, elastic2025 is master of chi and child in omega. And elastic2047 is master of omega and child of chi
[07:03:15] I see, since we do the restart at the node level that does not work
[07:04:57] yea, if it could restart one cluster on a node this would work, but as is i think we need to skip this algo once we get to the point where all remaining nodes are a master somewhere. By then enough nodes are already on 7 that we shouldn't have to worry anymore.
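A minimal sketch of the cycle described above, not the spicerack implementation. It assumes a toy layout in which elastic2025 is the only master of chi and elastic2047 the only master of omega; `elastic2050` and all function names are hypothetical. Each non-master node gets an edge to every master of the same cluster, and once every remaining node is a master somewhere the sketch simply restarts what is left.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: cluster name -> (all nodes, master-eligible nodes).
# elastic2050 is an invented non-master node added for illustration.
CLUSTERS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "chi": ({"elastic2025", "elastic2047", "elastic2050"}, {"elastic2025"}),
    "omega": ({"elastic2025", "elastic2047", "elastic2050"}, {"elastic2047"}),
}


def build_edges(clusters: Dict[str, Tuple[Set[str], Set[str]]]) -> Dict[str, Set[str]]:
    """Edge a -> b means: a should restart before b, because b is a master
    of a cluster in which a is not a master."""
    edges: Dict[str, Set[str]] = {}
    for nodes, masters in clusters.values():
        for node in nodes:
            edges.setdefault(node, set())
            if node not in masters:
                edges[node] |= masters
    return edges


def restart_order(clusters: Dict[str, Tuple[Set[str], Set[str]]]) -> List[str]:
    """Restart nodes that nothing else is waiting on; if none exist,
    the remaining nodes form a cycle and are restarted as-is."""
    edges = build_edges(clusters)
    remaining = set(edges)
    order: List[str] = []
    while remaining:
        # A node is blocked while some unrestarted node points at it.
        blocked: Set[str] = set()
        for node in remaining:
            blocked |= edges[node] & remaining
        ready = sorted(remaining - blocked)
        if not ready:
            # Cycle: every remaining node is a master somewhere.
            ready = sorted(remaining)
        order.extend(ready)
        remaining -= set(ready)
    return order


print(restart_order(CLUSTERS))
```

With this layout elastic2025 and elastic2047 point at each other, so no strict "all non-masters of a cluster before its masters" order exists at node level; the fallback branch is what lets the loop finish.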
[07:05:24] sure
[07:05:57] i suppose that also means this is overcomplicated, it could simply be remove all the masters from the restart list :)
[07:06:11] oh indeed :)
[07:07:02] hm seeing ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. but I can see them interrogating the cluster, the metric collection might be wrong?
[07:09:08] hmm, must be. 9243 clearly reports master capable instances
[07:09:08] hm the code for master eligible seems to have changed
[07:09:21] .startswith('m') should == 'dim'
[07:09:50] i suppose it should check `'m' in value` instead of a specific position
[07:09:57] yes
[07:11:51] it looks like the reason my unit test passed, but prod failed here is that in my unit test it assumed one node would be master in multiple clusters, in which case it ends up working. But with each node only master in a single cluster it doesn't work
[07:14:15] I suppose we could enforce this layout in puppet (haven't thought about the drawbacks tho)
[07:15:20] if we were going to change masters, would maybe be better to have 9 small Ganeti instances and each is a master to 1 cluster, keeping it all separate. But probably not worth it, i can simplify this spicerack stuff
[07:15:45] sure
[07:21:00] I think https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/828406 is what we need. That means we have to poke someone for a spicerack deploy
[07:31:28] heading out, everything looks happy enough (except that airflow drop data thing, will check tomorrow :P)
[07:31:41] good night!
[07:55:58] I did a pass on the above patch, feel free to poke me when a release is needed
[08:32:38] volans: thanks! if the release takes some time I guess it might make sense to do it relatively soon so that Ryan can finish the restart of codfw elastic nodes when he starts his day
[08:33:21] doesn't take that long, but sure I can review, merge and release before his day starts :D
[08:33:35] thanks!
:)
[09:58:38] lunch
[12:00:58] Lunch
[12:45:17] hi search platform team! Any idea if i might need to reindex blazegraph if going between 0.3.97 and whatever the latest release is?
[12:58:03] addshore: no I don't think so, (latest should be 0.3.115)
[14:33:22] dcausse: WMF/WMDE sync: https://meet.google.com/bfe-uzwh-ytj
[14:55:41] looks like pcm.wikipedia.org does not have its index created
[14:59:35] might be that addWiki is only creating the index in eqiad?
[16:50:15] ryankemper: if you're around I can upgrade spicerack to the latest version that includes Erik's fix
[16:50:40] volans: around
[16:51:43] great
[16:52:25] volans: FWIW the previous version jbond deployed to `cumin2002` and then I don't remember if we ever circled back to tell him he could do the full deploy
[16:53:17] no, he didn't, I noticed earlier that 1001 is on a previous version, I'll most likely do that one too right now if you agree
[16:53:21] cumin2002 is updated
[16:54:13] so all yours ryankemper :)
[16:56:13] I'll upgrade also cumin1001, as there are no other changes (apart from a new module not yet used) and the changes for elasticsearch can be "reverted" going back to v3.2.0 in the worst case scenario
[16:56:24] unless you have any concern
[17:05:10] volans: no concerns
[17:06:20] ok installed on both
[17:08:07] I have an errand to run, but I'll be available later if you encounter any issue.
[17:08:12] hopefully all will be smooth
[17:08:52] volans: thanks, will kick off the remainder of the es7 codfw upgrade after I'm out of my current meeting
[17:09:30] volans: quick question about reverting to `v3.2.0` if necessary - can I literally just hard reset to that git commit before running the cookbook?
(I'm aware future puppet runs will bring it back ofc)
[17:11:07] you need to install the previous debian package
[17:11:45] for example from /var/cache/apt/archives/spicerack_3.2.0-1+deb11u1_amd64.deb
[17:11:50] on the cumin hosts
[17:11:52] themselves
[17:12:19] that's for the spicerack change, not sure if you have cookbook changes
[17:14:45] no cookbook changes
[17:15:03] ah right, what I said above would be for cookbooks not spicerack
[17:15:05] ack, thanks
[17:38:23] some new errors from elastic instances in codfw: uncaught exception in thread [Log4j2-TF-2-ConfiguratonFileWatcher-108]
[17:38:32] probably not super important, but something
[18:04:00] Trey314159: ahh! that makes total sense. foobar~2 says to search within an edit distance, soo foobar~ is invalid. We have a bit in cirrus that escapes the ~ if the suffix is unusable, but it must not work in this case. Perhaps it thinks it's `foobar~ ` which takes the default edit distance of
[18:04:02] 2
[18:04:23] too many space-like things :P
[18:06:19] Doesn't tilde by itself do a fuzzy match? Like `fred~` matches free? So I don't know how the tilde is being interpreted. But `fred~fred` is fine
[18:06:39] in cirrus `fred~fred` gets transformed into `fred\~fred` before being passed on to elasticsearch
[18:07:09] oh wait.. "edit distance" and "fuzzy" are the same thing. Sorry, had a brain fart
[18:07:56] so `fred~` must confuse the parser.. probably one bit of code is treating it as a space and another is not.
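A rough sketch of the tilde handling being discussed, not the CirrusSearch code. It assumes a tilde is usable as a fuzzy operator only when followed by optional digits and then a space or the end of the query, and that the stray "space-like thing" is something like U+00A0; the function name and the rule itself are assumptions made for illustration. Normalising whitespace before deciding means both steps agree on what counts as a space.

```python
import re

# Assumed rule: "~" is a usable fuzzy operator when followed by optional
# digits and then a plain space or the end of the query.
_USABLE_SUFFIX = re.compile(r"[0-9]*(?: |$)")


def escape_unusable_tildes(query: str) -> str:
    """Escape tildes that cannot act as a fuzzy operator (hypothetical helper)."""
    # \s on a str pattern also matches Unicode whitespace such as U+00A0,
    # so every space-like character becomes a plain space first.
    normalized = re.sub(r"\s", " ", query)
    out = []
    for i, ch in enumerate(normalized):
        already_escaped = i > 0 and normalized[i - 1] == "\\"
        if ch == "~" and not already_escaped and not _USABLE_SUFFIX.match(normalized, i + 1):
            out.append("\\~")
        else:
            out.append(ch)
    return "".join(out)


for sample in ["fred~fred", "fred~ ", "fred~\u00a0", "foobar~2"]:
    print(repr(sample), "->", repr(escape_unusable_tildes(sample)))
```

The alternative mentioned in the discussion, escaping the tilde instead of normalising the space, would replace the `re.sub` step with a stricter suffix check.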
[18:08:18] `fred~fred` becomes `fred\~fred`, `fred~ ` is left as `fred~ `, but it seems `fred~` either needs nbsp converted to space, or the ~ escaped
[18:08:28] yea
[18:09:13] at least we have a line on the problem and it doesn't involve CJK characters, so that's an improvement
[18:09:37] yea that's an edge case that we can clean up, but I imagine only affects bots that source their text from random places
[18:10:37] yeah, this looks bot-like because the query has carriage returns in it, so it wasn't typed in or even copied and pasted
[19:04:59] ryankemper: we should probably try and ship https://gerrit.wikimedia.org/r/c/operations/puppet/+/828403 so the master alert stops flapping
[19:05:36] ebernhardson: shipping it now
[19:27:33] codfw's completely done!
[19:27:37] https://www.irccloud.com/pastebin/vkTFv1l8/
[19:30:04] nice!
[19:37:33] ryankemper awesome work
[19:38:34] inflatador: in typical fashion, us wrangling deployment-prep back into a good state was much harder than the actual codfw upgrade itself :P
[19:40:03] don't ever change, deployment-prep ;)
[20:17:13] Thank you for looking after deployment-prep :)
[20:17:28] It's quite nice to see someone actually care
[20:18:04] * ebernhardson clearly doesn't understand elastica config wrt Connection / Transport and needs to play with it in shell.php as well
[20:18:46] ebernhardson: saw your last changes and yes I believe it's not right, doing some checks
[20:19:12] dcausse: yea i read your comments before deploying the patch and delayed the patch till tomorrow so i can play with this and figure out how it's supposed to work
[20:19:55] i think i'm confused that Connection holds the host/port, where i seem to be expecting that as a part of the transport.
It seems connection shouldn't have a baked in tcp/ip abstraction and that should wholly be the transport, but that might not be the case
[20:19:58] this is what I told Niklas to use on translatewiki: https://phabricator.wikimedia.org/P33718
[20:20:20] yes it's confusing
[20:23:48] fun unrelated thing, we aren't the only ones who spell things crazy, in AbstractTransport: public function sanityzeQueryStringBool
[20:24:16] :)
[21:33:18] meh, poked around a bit but no good ideas on the error messages log4j creates every 10 minutes on each instance about SecurityManager rejecting the setContextClassLoader call inside of a ConfigurationFileWatcher thread
[21:33:22] seems harmless, but still spammy
[21:33:40] other than that codfw hosts seem quiet
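Going back to the master-eligible alert from the morning (07:09), here is a small sketch of the check that was suggested, not the actual monitoring code. It assumes the role value is a string of single-letter abbreviations in which `m` marks a master-eligible node, that the newer version reports `dim`, and that the older version reported something starting with `m` such as `mdi`; the function names are hypothetical.

```python
def is_master_eligible_by_position(node_role: str) -> bool:
    """Position-based check, as described in the chat: only works
    while the role string happens to start with 'm'."""
    return node_role.startswith("m")


def is_master_eligible(node_role: str) -> bool:
    """Position-independent check: 'm' anywhere in the role string."""
    return "m" in node_role


# "mdi" is an assumed older ordering; "dim" is the value quoted in the chat.
for role in ["mdi", "dim", "di"]:
    print(role, is_master_eligible_by_position(role), is_master_eligible(role))
```

With a role of `dim` the position-based check reports no eligible masters, which matches the "Found 0 eligible masters" alert, while the membership check still counts the node.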