[10:51:56] dcausse: did compiling the lucene library for hebmorph give you any issues, like test failures?
[10:52:07] unit tests are failing on that project and I'm wondering if I did that
[10:52:50] zpapierski: can't remember, lemme check if I have some modifications locally
[10:55:15] yes I have a fix
[10:56:13] zpapierski: this is what I did: https://github.com/nomoa/HebMorph/tree/wmf_6.6.1
[10:56:22] not sure if it's the same kind of error
[11:00:07] thx, I'll look at it
[11:12:16] WCQS data load progress: ~575/724
[11:13:41] getting close!
[11:13:49] lunch
[11:13:52] lunch 2
[11:21:36] it seems I should deploy the streaming updater
[11:41:40] break + errand
[12:58:09] I'm surprised that we have 64 shards for enwiki_content, I thought we were at 8 shards * 4 replicas
[13:01:20] sigh... my memory is failing me, I +2ed the patch to increase that in 2020
[13:18:40] inflatador: o/ (for when you're around) we've received an alert today: "search.svc.codfw.wmnet/ElasticSearch unassigned shard check"; looking at 'curl -s https://search.svc.codfw.wmnet:9243/_cat/shards | grep UNASSIGNED', it's caused by 2 enwiki_content shards. I think running the sre.elasticsearch.force-shard-allocation cookbook might help with that (c.f.
[13:18:42] https://wikitech.wikimedia.org/wiki/Search)
[13:20:54] hm wait, https://search.svc.codfw.wmnet:9243/_cluster/allocation/explain?pretty says that the recovery attempt to elastic2059 failed with IOException[No space left on device]
[13:23:38] no sorry, it's elastic2035
[13:24:45] ha, elastic2035 is T298853
[13:24:45] T298853: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853
[13:26:47] so if we're not going to fix this node we should ban it and possibly shut it down
[14:32:25] Greetings!
[14:32:31] o/
[14:32:34] dcausse will take a look shortly
[14:32:41] o/
[14:33:09] inflatador: thanks! :)
[15:16:34] dcausse yeah, it looks like we need to decommission this node. In the meantime, I can ban it from omega/chi/whatever clusters if you like
[15:17:52] inflatador: there is some doc about decommissioning servers: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Failed_-%3E_Decommissioned
[15:18:21] You can probably go through those with ryankemper, and get rid of this server for real
[15:19:21] yeah, thanks for the docs, was planning on doing just that
[15:19:47] inflatador: after decommissioning, running sre.elasticsearch.force-shard-allocation might resolve the alert, hopefully
[15:22:55] inflatador: I think you were in cultural orientation during yesterday's Monthly Tech Dept Updates. You should watch the presentation from Zbyszko and David: https://drive.google.com/file/d/1PbUcfnGyqU9UtJAE2n737bLVlLwizZ5z/view
[15:23:16] That's about the Flink Updater. It's really good and will give you some additional context
[15:26:28] ACK on both counts. Also trying to reimage elastic2051.codfw.wmnet as it looks like it failed yesterday
[15:26:50] do you know what the failure was?
[15:34:18] it looks like something timed out. I just ran it again; if it fails again I will get on the mgmt console and try to watch it as it goes
[15:42:10] there are also logs on the cumin host, probably under /var/log/spicerack/sre/hosts/reimage
[15:42:26] time to go for me, see you tomorrow!
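
For reference, a minimal sketch of the unassigned-shard check discussed above (13:18-13:26), assuming direct access to the cluster endpoint quoted in the conversation; this is illustrative, not the actual alert implementation.

```python
# Sketch of the unassigned-shard check, assuming the endpoint quoted above;
# not the actual alert implementation.
import requests

BASE = "https://search.svc.codfw.wmnet:9243"

# Equivalent of: curl -s .../_cat/shards | grep UNASSIGNED
shards = requests.get(f"{BASE}/_cat/shards", params={"format": "json"}).json()
unassigned = [s for s in shards if s["state"] == "UNASSIGNED"]
for s in unassigned:
    print(s["index"], s["shard"], s["prirep"])

# With no request body, _cluster/allocation/explain reports on one unassigned
# shard; this is the call that surfaced IOException[No space left on device]
# for elastic2035 above. It returns an error when no shard is unassigned.
if unassigned:
    explain = requests.get(f"{BASE}/_cluster/allocation/explain").json()
    print(explain.get("allocate_explanation", explain))
```
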
[15:44:17] \o
[16:02:02] going AFK, back in ~30
[16:22:37] meh, the new integration instance gets stuck at the end of a run too :( guess i have to figure out why
[16:23:10] o/
[16:33:13] I wonder how to do [sensor1, sensor2] >> [task1, task2] >> complete
[16:33:46] tried chain(sensors, tasks) but it's not that
[16:35:55] dcausse: has to be two separate calls, never found another way
[16:36:11] dcausse: [sensor1, sensor2] >> task1 >> complete, then again with task2
[16:36:20] ah ok
[16:36:31] or perhaps a dummy task in between?
[16:36:38] it would make sense to have a way, maybe airflow 2 has something, but didn't find anything in 1
[16:36:48] ok thanks
[16:37:08] a dummy in between would work too
[16:43:07] back with new glasses
[16:44:26] hm... not sure what it does...: Relationships can only be set between Operators; received list
[16:44:41] anyways, will reshape all that
[16:44:45] dcausse: that's really odd :)
[16:45:33] code suggests it iterated over the provided list and found a list inside it
[16:47:22] yeah... created a dummy sensor_sync task and did
[16:47:36] sensor_sync << sensors
[16:47:44] and then sensor_sync >> tasks
[16:48:01] huh, i would expect that to work
[16:48:20] also tried overwriting sensor_sync = sensor_sync << sensors but that did not work either...
[16:48:34] not sure if << >> returns something tho
[16:50:21] looks like it should always return the right hand side
[16:50:49] but i guess i'm not certain on ordering, a << b should turn into a.__lshift__(b), which returns b
[16:51:44] just called complete << task << sensors multiple times and it seems to have worked
[16:51:50] good enough :)
[16:52:04] also I have no clue if the resulting dag is what I have in mind :)
[16:52:05] i guess i never use <<, always >>, but no particular reason why :)
[16:52:15] open the airflow ui and look at the dag view graph
[16:52:25] yes
[16:52:25] will tell you if the result is what you wanted
[16:53:04] I have to use << because sensors is an array here
[16:55:07] hm.. it works too...
[16:55:48] hmm, `[a, b] >> c` should call c.__rrshift__([a, b])
[16:56:12] yes, not sure why I thought it would not work...
[16:56:24] i dunno if the extra r is for 'reversed' or maybe 'right hand', but it has the implementation for both ways
[16:57:10] yes... too much magic for me here I guess :)
[16:58:58] yea, i kinda go back and forth on operator overloading. Something about the resulting code seems more elegant and cleaner, but then you have all the problems of redefining an operator and not having the semantics people have known for decades
[17:02:37] yes... it works well when you stay in the designed DSL, but if you mix this with other things it gets messy
[17:53:13] dinner
[18:20:59] OK, I think elastic2051 is finally going to reimage properly (at least it's going further than before), back to dcausse's request to ban elastic2035 from clusters
[18:59:00] inflatador: feel free to ping me if you need any help with banning the host
[19:01:44] ryankemper thanks, will let you know. Headed to lunch now but will be back in ~30
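
To make the fan-in/fan-out pattern from the Airflow exchange above concrete, here is a minimal sketch, assuming Airflow 1.x; the DAG and task names are hypothetical stand-ins for the real ones.

```python
# Minimal sketch of the dummy-sync pattern discussed above, assuming
# Airflow 1.x; dag and task names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("fan_in_out_example", start_date=datetime(2022, 1, 1),
          schedule_interval=None)

sensors = [DummyOperator(task_id=f"sensor{i}", dag=dag) for i in (1, 2)]
tasks = [DummyOperator(task_id=f"task{i}", dag=dag) for i in (1, 2)]
complete = DummyOperator(task_id="complete", dag=dag)

# [sensor1, sensor2] >> [task1, task2] raises "Relationships can only be set
# between Operators; received list" because list >> list is unsupported, so a
# dummy sensor_sync task joins the two lists instead.
sensor_sync = DummyOperator(task_id="sensor_sync", dag=dag)

sensors >> sensor_sync  # list >> op works via BaseOperator.__rrshift__
sensor_sync >> tasks    # op >> list works via BaseOperator.__rshift__
tasks >> complete       # fan back in on the final task
```

Rendering the graph in the Airflow UI, as suggested above, confirms whether the resulting DAG shape matches the intent.
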
[19:02:33] meantime, here are the errors I'm getting when I try to 'run-puppet-agent' on elastic2051: https://phabricator.wikimedia.org/P18732
[19:28:10] and back
[19:29:55] added an update to the paste, not sure on a fix
[19:31:39] ebernhardson: yeah the java thing is very bizarre, i noticed that in the output of my puppet run
[19:32:17] I'm tempted to manually install `java-common` and see if that unsticks things but I dislike the idea that there might be something wrong with our automation
[19:32:22] Could very well just be a one-off though
[19:33:00] problem is our package, it only declares Depends: bash, libc6, adduser, coreutils
[19:33:02] I guess I would expect `elasticsearch-oss` to install java as a dependency, so that probably wouldn't fix things anyway
[19:33:09] Ah
[19:33:11] Interesting
[19:33:24] yea we need to add a dependency on something, maybe elasticsearch-oss
[19:33:49] heh, that doesn't depend on java either
[19:35:30] we could probably depend on java8-runtime
[19:35:54] When I do a reverse depend on `java-common`, I see that `2054` has `openjdk-8-jre-headless` as a package that depends on `java-common`
[19:35:57] Whereas elastic2051 lacks that
[19:38:30] hmm, i'm seeing openjdk-8-jre-headless in rdepends for both hosts, `apt-cache rdepends java-common | md5sum` gives the same value
[19:39:16] ebernhardson: I just started mucking around manually, before the mucking the rdepends was empty
[19:39:16] it seems like puppet happened to pick a different order to install the packages in, but unclear
[19:39:25] ahh :)
[19:39:30] Interestingly though, the mucking failed to install anything
[19:39:39] For the same reason, java not being present etc
[19:40:11] I think the best option now is to do a fresh re-image and see if we end up in the same failure mode
[19:40:25] It seems likely we're hitting an order-of-operations race type thingy
[19:40:31] oh, you were looking at the installed dependencies. I was mostly assuming puppet tried to install our plugins in a round that didn't yet include java
[19:40:33] yea
[19:41:18] inflatador: wanna go ahead and kick off the re-image of 2051 for *hopefully* one last time? :P
[19:41:49] ryankemper sure, will kick it off now
[21:14:01] inflatador, ryankemper: I think that the elasticsearch package we had before might have had a dependency on openjdk in some form. We migrated to elasticsearch-oss a while back, but since it was on servers that already had openjdk installed, we never saw the issue
[21:14:16] https://github.com/wikimedia/puppet/blob/production/modules/elasticsearch/manifests/packages.pp should have a dependency on openjdk
[21:18:21] gehel: ah thanks, that context is very helpful, we'll try adding a dependency before this elasticsearch dependency https://github.com/wikimedia/puppet/blob/9f57c572bdb457f7556a666effe4edb89f7587d8/modules/elasticsearch/manifests/packages.pp#L11
[22:33:11] ryankemper: note that we have a profile::java::java_8 that installs openjdk8 and takes care of the apt component for buster backports
[22:34:18] we also have a java class in the java module that takes care of some additional configuration (cacerts and entropy source)
[22:46:34] gehel: ack, thanks, brian and i are working on a patch now
[22:46:54] the additional config is slightly less straightforward but will see if anything needs to be done there
[22:46:56] Definitely a good patch to show how the puppet compiler works
[22:47:18] I'm happy to review the patch tomorrow!
[22:47:23] on that note: good night!
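
To illustrate the kind of change discussed above, here is a minimal Puppet sketch of an ordering fix for modules/elasticsearch/manifests/packages.pp, assuming the package names from the conversation; the resource layout is hypothetical, and the real patch may instead pull in the existing profile::java::java_8 that gehel mentions.

```puppet
# Hypothetical sketch of the ordering fix discussed above; package names come
# from the conversation, everything else is an assumption, not the merged patch.
class elasticsearch::packages {
    # elasticsearch-oss declares no java dependency, so puppet can try to
    # install the elasticsearch plugins in a round where no JRE is present.
    # Install a JRE explicitly and order the elasticsearch package after it.
    package { 'openjdk-8-jre-headless':
        ensure => present,
    }

    package { 'elasticsearch-oss':
        ensure  => present,
        require => Package['openjdk-8-jre-headless'],
    }
}
```

In the Wikimedia tree the cleaner route is likely the profile::java::java_8 mentioned at 22:33, which also handles the apt component for buster backports and the extra java configuration (cacerts, entropy source).
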