[14:25:11] \o
[17:34:15] if we wanna do puppet deploy window, i have https://gerrit.wikimedia.org/r/c/operations/puppet/+/856654
[17:51:52] meh, feature collection returned a bunch of nans which were unexpected, and since it's been awhile since this ran we don't have any previous history to compare to :(
[17:52:50] it's always for mean or stddev features, i suppose those could all end up in a div-by-zero or some such which perhaps elastic 7 treats differently
[17:53:19] ebernhardson :eyes on your puppet patch
[17:56:42] inflatador: there is also an easier patch as its parent that has to merge too
[17:59:48] inflatador: while it does hard code, the set of database shards that exist is relatively constant, on the span of years
[18:01:56] ebernhardson ACK, sounds like it's not much of a problem then
[18:03:16] I merged the parent patch just now
[18:05:09] will merge the patch linked above after Jenkins runs
[18:05:17] hmm, this can't be right: 'title_min_raw_df': 3.4028235e+38
[18:05:31] document frequency of a erm of 3.4e38?
[18:05:35] s/erm/term/
[18:06:09] and a max of 0, so yea something bad happening there :)
[18:07:28] ahh, 3.4e38 is the largest positive value of a float
[18:08:37] because the way the StatisticsHelper in ltr is implemented, it starts min at Float.MAX_VALUE, and then makes it smaller every time it sees a smaller value. If it sees no values, the min is MAX_VALUE
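A minimal Java sketch of the min-tracking pattern described at 18:08:37 follows. It is an assumed illustration, not the actual StatisticsHelper code from the ltr plugin; the class and method names are invented.

```java
// Hypothetical sketch, not the real ltr StatisticsHelper: shows why an empty
// set of observed values can surface as 3.4028235e+38.
class MinTracker {
    // Float.MAX_VALUE == 3.4028235e+38
    private float min = Float.MAX_VALUE;

    void accept(float value) {
        // the minimum only ever shrinks when a value is actually seen
        if (value < min) {
            min = value;
        }
    }

    float min() {
        // with zero observed values (terms.size() == 0 upstream) this still
        // reports Float.MAX_VALUE, matching the bogus title_min_raw_df above
        return min;
    }
}
```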
[18:08:38] hmm, snapshot hosts. That's a new one on me
[18:09:11] inflatador: this should only run on snapshot1007, as an 'other' dump. There are also 100[89] used for mediawiki dumps iirc
[18:10:13] ebernhardson OK, interesting. I'll go ahead and run puppet there, then (I merged the first 2 in that relation chain)
[18:10:54] inflatador: kk. tbh not 100% sure what it will do with the process that is currently running. Perhaps should wait until it finishes before merging the third
[18:11:29] oh i guess i have my numbers mixed, based on manifests/site.pp snapshot1008 is now the misc_crons dumper
[18:11:58] ah, was just about to ask
[18:14:28] OK, duppet is happy on 1008. Do you want me to check in on the current dump process tomorrow or whenever you think it'll finish?
[18:14:43] err...puppet not 'duppet'
[18:16:59] it takes like 10 days, that's why we needed this patch :P
[18:17:47] it started nov 23, it's going alphabetical and is up to ptwiki
[18:19:16] so i guess it might finish in the next day or two, hard to say exactly :)
[18:20:14] Yeah, no worries. If you want me to check in Monday I can, otherwise I'll leave it in your capable hands ;)
[18:26:57] d'oh, looks like one of the smaller cloudelastic instances is alerting for GC again
[18:31:00] :(
[18:31:01] added to https://phabricator.wikimedia.org/T323646
[18:32:07] doesn't look terrible to me based on the grafana https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-exported_cluster=cloudelastic-psi-eqiad&var-instance=cloudelastic1006&var-top5_fielddata=All&var-top10_terms=All&var-top5_completion=All&from=1669680000000&to=1669939199000
[18:32:19] but I don't really know what we're aiming for ;)
[18:33:11] hmm, the psi graphs for cloudelastic1006 look fine
[18:33:31] but it is running the old gc a lot, hmm
[18:33:49] oh, i wonder if the reported max on the old pool graph is just plain wrong...
[18:34:30] yeah, we bumped it to 10 GB didn't we? Maybe we just need to update the dashboard. Not sure how that ties into alerts though
[18:35:00] hmm, 1-4 report a max of 1G for young pool, 5 and 6 report 2G for young pool. That means 5 and 6 have 1GB less available to the old pool. wonder why
[18:35:54] I guess the old GC is still churning, not sure how often we sample that but will take a look
[18:37:19] yea we recently increased it to 10G, but also changed some gc options
[18:38:07] looks like 1004 and 1006 are using the same set of options on the java commandline
[18:38:58] the general answer is restart the instance, see if it does something different :P
[18:40:29] but there is something certainly odd to understand, while the old pool is struggling the jvm heap isn't maxed out. it's not using what's available :S This is how we end up tuning random GC settings :P
[19:16:01] can't make SRE pairing today, will be back in ~45 so maybe can join later
[19:59:46] back
[20:16:54] hmm, not finding great reasons for the bad feature values in elastic 7.10 :S i suspect it's a problem with the plugin, but nothing about these particular bits seems to have changed in the ltr plugin side of things. It should be returning 0's when the terms.size() == 0 in both 6.8.23 and 7.10.2, but in 7.10.2 we are getting results that look like that didn't happen :S
[20:18:24] can't really hax it up from the mjolnir side, it would be wrong at query time...i suppose will create a ticket and hope david has ideas
[20:59:30] Hi! I'd like to deploy the cirrus update pipeline flink applications. What would be an elasticsearch host I could use for testing?
[21:00:19] pfischer: relforge100[34].eqiad.wmnet, we would need to set up some indexes there for it to run against. Those are test hosts that have no prod traffic
[21:00:33] we could load a dump of some wiki you want to test with
[21:01:03] or maybe even use the snapshotting functionality now that it's deployed and previously tested, i suppose, to bring a live index over
[21:07:56] Wouldn't an empty index suffice? My plan was to create/update pages on test.wikipedia.org and see if the documents make it to ES
[21:08:32] I'd have to make sure the ES cluster is not connected to Cirrus search though
[21:08:54] hmm, i suppose it could use an index without any mappings, but you wouldn't be able to query it other than by page id
[21:09:15] That's fine for now
[21:09:39] I'd just like to see the ingestion working
[21:09:57] in that case you can simply PUT to :9200 on relforge to an arbitrary index name and go for it. CirrusSearch isn't configured with connections to them at all
[21:10:13] the PUT to an index name creates an empty index
[21:10:45] relforge is the host name?
[21:11:04] there's no load balancer like prod though, so you have to point it at the servers directly. relforge1003.eqiad.wmnet and relforge1004.eqiad.wmnet, they are in a cluster together but have no load balancer so you have to talk to hosts directly
[21:11:16] meh, repeating myself at beginning and end of sentence again :P
[21:11:56] No worries. Better twice than sorry (sth like that)
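To make the "PUT to :9200" step above concrete, here is a hedged sketch of creating an empty index on one of the relforge hosts. The index name cirrus_flink_test is only a placeholder, and plain HTTP without auth is assumed based on the ":9200" above; this is an illustration, not a documented procedure.

```java
// Hedged sketch: create an empty index on a relforge host by PUTting to an
// arbitrary index name. Index name and plain-HTTP assumption are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateEmptyRelforgeIndex {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // no load balancer in front of relforge, so talk to one host directly
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://relforge1003.eqiad.wmnet:9200/cirrus_flink_test"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // Elasticsearch answers with {"acknowledged":true,...} on success
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```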
[21:13:18] Is there any preferred Kafka broker for suck tests?
[21:15:08] hmm, not sure :S I don't think we have test kafka brokers like we do with the test elasticsearch instance. Probably kafka-main, but not 100% sure. I often ask ottomata
[21:16:03] oh, there is a test-eqiad listed in puppet
[21:16:26] kafka-test100(6,7,8,9,10).eqiad.wmnet. I'd still suggest checking with #analytics / ottomata though
[21:17:07] Alright, thanks Eric! I'll do that tomorrow.
[21:18:45] FWiW, there are kafka brokers in deployment-prep, not sure if that is an appropriate place to test though
[21:29:14] kafka-test is for random testing of things in prod,
[21:29:17] 'suck tests'? :)
[21:31:06] lol, i didn't even notice that. corrected it while reading i guess :P
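If the kafka-test brokers listed above (kafka-test100[6-10].eqiad.wmnet) end up being used, a quick connectivity check could look like the sketch below. The port 9092, the topic name, and the use of the plain Java client are assumptions, not anything confirmed in the conversation; checking with #analytics first, as suggested above, still applies.

```java
// Hedged sketch: push a single throwaway record at one kafka-test broker to
// verify connectivity. Broker port (9092) and topic name are assumptions.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaTestSmoke {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-test1006.eqiad.wmnet:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // .get() blocks until the broker acknowledges (or the send fails)
            producer.send(new ProducerRecord<>("smoke_test_topic", "key", "hello")).get();
            System.out.println("record acknowledged by kafka-test");
        }
    }
}
```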